Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gprestore/gpbackup hanging in case the helper goes down #88

Merged
merged 30 commits into from
Aug 14, 2024

Conversation

whitehawk
Copy link

@whitehawk whitehawk commented Jun 5, 2024

Fix gprestore/gpbackup hanging in case the helper goes down

Problem description:
In case restore/backup helper encountered an error, gprestore or
gpbackup could hang forever.

Root cause:
gprestore was hanging, because COPY command was waiting for data from the pipe
file (via 'cat <pipe_file>'), which was removed in helper's DoCleanup() function
before any data was placed to the pipe by the restore helper (as the restore
helper finished its work due to some error at plugin start).
gpbackup was hanging for the similar reasons - backup helper finished execution
before it opened the pipe for reading.

Fix:
gprestore and gpbackup start a goroutine, which polls if helper ended with an
error and cancels the pending COPY commands via the execution context.
Execution context is added to gpbackup (it is already presented for gprestore).

Test is added only for restore. Test for backup case is not presented, because
with current backup helper logic any error from the plugin is handled after the
pipe file is opened, and it is not possible to simulate early backup helper
termination in the test without hacking the backup helper code.

@whitehawk whitehawk changed the title Fix gprestore hanging on copy in case the helper goes down at start Fix gprestore hanging on COPY in case the helper goes down at start Jun 5, 2024
@whitehawk
Copy link
Author

Note: in some cases (for ex. when doing larger-to-smaller cluster resize restore), if restore helper has already shut down, gprestore can be still hanging on a COPY command for 5 minutes. That is expected behavior (refer to code).

Please be aware during testing.

@whitehawk whitehawk marked this pull request as ready for review June 10, 2024 02:46
@RekGRpth

This comment was marked as resolved.

@whitehawk
Copy link
Author

Where is the fix for gpbackup, as described in the task?

According to description in the parent task (ADBDEV-5685) it is all about gprestore.

@RekGRpth

This comment was marked as resolved.

@whitehawk whitehawk changed the title Fix gprestore hanging on COPY in case the helper goes down at start Fix gprestore/gpbackup hanging in case the helper goes down at start Jun 11, 2024
@whitehawk
Copy link
Author

Where is the fix for gpbackup, as described in the task?

According to description in the parent task (ADBDEV-5685) it is all about gprestore.

1) ... Problem affects gpbackup and gprestore
2) Possible solution is to close all pipes (including opened by gpbackup/gprestore) before exit.

I've updated the case for gpbackup.

It is not possible to reproduce a problem with gpbackup by injecting an error from the plugin. It is so because gpbackup helper checks stderr and return code of plugin after it has already started reading from the pipe.

So, to simulate a problem, I had to inject an error directly in the code - refer to test patch
backup_early_error_adbdev-5693.patch

@Stolb27 Stolb27 requested a review from dkovalev1 June 18, 2024 13:40
helper/helper.go Outdated Show resolved Hide resolved
restore/data_test.go Outdated Show resolved Hide resolved
restore/data_test.go Outdated Show resolved Hide resolved
@RekGRpth

This comment was marked as resolved.

@whitehawk whitehawk marked this pull request as draft August 5, 2024 06:48
Plus add checker goroutine into backup.
@whitehawk whitehawk marked this pull request as ready for review August 7, 2024 08:15
@RekGRpth
Copy link
Member

RekGRpth commented Aug 8, 2024

start

Why in start only? It may hang when helper goes down at any time.

end_to_end/plugin_test.go Outdated Show resolved Hide resolved
@whitehawk
Copy link
Author

start

Why in start only? It may hang when helper goes down at any time.

If the helper goes down after it had already opened the pipes, the gpbackup/gprestore will not hang. So the problem is mostly related to cases when the helper fails at its early stage.

@RekGRpth
Copy link
Member

RekGRpth commented Aug 9, 2024

start

Why in start only? It may hang when helper goes down at any time.

If the helper goes down after it had already opened the pipes, the gpbackup/gprestore will not hang. So the problem is mostly related to cases when the helper fails at its early stage.

Why then do we need endless checking that the helpers are still alive?

@whitehawk
Copy link
Author

Why then do we need endless checking that the helpers are still alive?

To be on a safe side, I think we should keep the checker all the time the helper works.

@RekGRpth
Copy link
Member

RekGRpth commented Aug 9, 2024

Why then do we need endless checking that the helpers are still alive?

To be on a safe side, I think we should keep the checker all the time the helper works.

Do you leave five-second launches on all segments during the several-hour restore just in case?

@whitehawk
Copy link
Author

Do you leave five-second launches on all segments during the several-hour restore just in case?

In case of non single data file, the helper will start plugin process for every table. So if the plugin fails to start somewhen in the middle of restore, we need the checker to be running.
And yes, in this case it is better to adjust the summary and the description...

@RekGRpth
Copy link
Member

RekGRpth commented Aug 9, 2024

So if the plugin fails to start somewhen in the middle of restore, we need the checker to be running.

Could you add test for such case?

@whitehawk whitehawk changed the title Fix gprestore/gpbackup hanging in case the helper goes down at start Fix gprestore/gpbackup hanging in case the helper goes down Aug 9, 2024
@whitehawk
Copy link
Author

So if the plugin fails to start somewhen in the middle of restore, we need the checker to be running.

Could you add test for such case?

The test was added

@Stolb27 Stolb27 enabled auto-merge (squash) August 14, 2024 07:07
@Stolb27 Stolb27 merged commit 8060b4e into master Aug 14, 2024
2 checks passed
@Stolb27 Stolb27 deleted the ADBDEV-5693 branch August 14, 2024 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants