-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix gprestore/gpbackup hanging in case the helper goes down #88
Conversation
Note: in some cases (for ex. when doing larger-to-smaller cluster resize restore), if restore helper has already shut down, gprestore can be still hanging on a COPY command for 5 minutes. That is expected behavior (refer to code). Please be aware during testing. |
This comment was marked as resolved.
This comment was marked as resolved.
According to description in the parent task (ADBDEV-5685) it is all about gprestore. |
This comment was marked as resolved.
This comment was marked as resolved.
I've updated the case for gpbackup. It is not possible to reproduce a problem with gpbackup by injecting an error from the plugin. It is so because gpbackup helper checks stderr and return code of plugin after it has already started reading from the pipe. So, to simulate a problem, I had to inject an error directly in the code - refer to test patch |
This comment was marked as resolved.
This comment was marked as resolved.
Plus add checker goroutine into backup.
Why in start only? It may hang when helper goes down at any time. |
If the helper goes down after it had already opened the pipes, the gpbackup/gprestore will not hang. So the problem is mostly related to cases when the helper fails at its early stage. |
Why then do we need endless checking that the helpers are still alive? |
To be on a safe side, I think we should keep the checker all the time the helper works. |
Do you leave five-second launches on all segments during the several-hour restore just in case? |
In case of non single data file, the helper will start plugin process for every table. So if the plugin fails to start somewhen in the middle of restore, we need the checker to be running. |
Could you add test for such case? |
The test was added |
Fix gprestore/gpbackup hanging in case the helper goes down
Problem description:
In case restore/backup helper encountered an error, gprestore or
gpbackup could hang forever.
Root cause:
gprestore was hanging, because COPY command was waiting for data from the pipe
file (via 'cat <pipe_file>'), which was removed in helper's DoCleanup() function
before any data was placed to the pipe by the restore helper (as the restore
helper finished its work due to some error at plugin start).
gpbackup was hanging for the similar reasons - backup helper finished execution
before it opened the pipe for reading.
Fix:
gprestore and gpbackup start a goroutine, which polls if helper ended with an
error and cancels the pending COPY commands via the execution context.
Execution context is added to gpbackup (it is already presented for gprestore).
Test is added only for restore. Test for backup case is not presented, because
with current backup helper logic any error from the plugin is handled after the
pipe file is opened, and it is not possible to simulate early backup helper
termination in the test without hacking the backup helper code.