Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid using stderr to detect plugin failures, wait for plugin processes #99

Merged
merged 1 commit into from
Aug 1, 2024

Conversation

whitehawk
Copy link

@whitehawk whitehawk commented Jul 29, 2024

Avoid using stderr to detect plugin failures, wait for plugin processes (#89)

Previously, gpbackup_helper would error out and abort restore operations if any
plugin wrote anything to stderr. Additionally, when using the adb_ddp_plugin to
restore data, gpbackup_helper did not wait for plugin processes, leading to a
large number of zombie processes when restoring with the --resize-cluster flag,
causing the process to stop.

This patch removes the requirement for stderr to be empty. Now, messages
directed to stderr are logged as warnings, allowing the process to continue
without interruption. The helper can still detect when a plugin process has
exited because the exit of a plugin process closes the associated reader
handles, causing an error during subsequent read attempts.

The patch also adds logic to wait and reap plugin processes. Instead of turning
plugin processes into zombies, gpbackup_helper now calls Wait() on them. This
action is performed every time a reader finishes copying its content. Wait() is
not done in case of --single-data-file, because Wait() closes pipes immediately,
but helper will reuse the same reader and read from its stdout pipe multiple
times.

Two new tests are introduced: the first one verifies that gpbackup_helper does
not fail when a plugin writes something to stderr during the restore operation.
The second test ensures that gpbackup_helper errors out when a plugin process
terminates in the middle of the restore operation.

Changes comparing to the original commit:

  1. logWarning() is replaced with already existing logWarn(), that has the same
    functionality.
  2. One of the calls to waitForPlugin() is removed as no more necessary, because
    there is no more nested loop over batches, and we can leave only one call for
    waitForPlugin() after 'LoopEnd' label.
  3. Several variable names in the test were updated as old names do not exist
    anymore. Plus the pipefile name in the test was updated, as now it includes
    batch number.
  4. log() doesn't exist anymore and is replaced with logVerbose().
  5. Unreachable call to logPlugin() is removed.
  6. New tests are added to cover the case with cluster resize.
  7. logPlugin() is merged into waitForPlugin().
  8. Tests are reworked to avoid goroutines.
  9. Cleanup of plugin's test control files in the test is now done from a defer
    function in order not to leave them if test failed (otherwise a failed test
    could affect the subsequent tests).
  10. SpecTimeout is added to some new tests to ensure that if the delta from
    this commit is broken, the test will not hang and will provide more useful
    output.

(cherry picked from commit bb75d5a)


Note: do not squash the commit to preserve authorship.

@whitehawk whitehawk changed the title Avoid using stderr to detect plugin failures, wait for plugin process… Avoid using stderr to detect plugin failures, wait for plugin processes Jul 29, 2024
@whitehawk whitehawk marked this pull request as ready for review July 29, 2024 07:51
helper/restore_helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
@RekGRpth

This comment was marked as resolved.

@whitehawk whitehawk force-pushed the ADBDEV-5966 branch 2 times, most recently from b595bb6 to 68da53c Compare August 1, 2024 01:09
helper/restore_helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
…es (#89)

Previously, gpbackup_helper would error out and abort restore operations if any
plugin wrote anything to stderr. Additionally, when using the adb_ddp_plugin to
restore data, gpbackup_helper did not wait for plugin processes, leading to a
large number of zombie processes when restoring with the --resize-cluster flag,
causing the process to stop.

This patch removes the requirement for stderr to be empty. Now, messages
directed to stderr are logged as warnings, allowing the process to continue
without interruption. The helper can still detect when a plugin process has
exited because the exit of a plugin process closes the associated reader
handles, causing an error during subsequent read attempts.

The patch also adds logic to wait and reap plugin processes. Instead of turning
plugin processes into zombies, gpbackup_helper now calls Wait() on them. This
action is performed every time a reader finishes copying its content. Wait() is
not done in case of --single-data-file, because Wait() closes pipes immediately,
but helper will reuse the same reader and read from its stdout pipe multiple
times.

Two new tests are introduced: the first one verifies that gpbackup_helper does
not fail when a plugin writes something to stderr during the restore operation.
The second test ensures that gpbackup_helper errors out when a plugin process
terminates in the middle of the restore operation.

Changes comparing to the original commit:
1. logWarning() is replaced with already existing logWarn(), that has the same
functionality.
2. One of the calls to waitForPlugin() is removed as no more necessary, because
there is no more nested loop over batches, and we can leave only one call for
waitForPlugin() after 'LoopEnd' label.
3. Several variable names in the test were updated as old names do not exist
anymore. Plus the pipefile name in the test was updated, as now it includes
batch number.
4. log() doesn't exist anymore and is replaced with logVerbose().
5. Unreachable call to logPlugin() is removed.
6. New tests are added to cover the case with cluster resize.
7. logPlugin() is merged into waitForPlugin().
8. Tests are reworked to avoid goroutines.
9. Cleanup of plugin's test control files in the test is now done from a defer
function in order not to leave them if test failed (otherwise a failed test
could affect the subsequent tests).
10. SpecTimeout is added to some new tests to ensure that if the delta from
this commit is broken, the test will not hang and will provide more useful
output.

(cherry picked from commit bb75d5a)

Co-authored-by: Roman Eskin <[email protected]>
@Stolb27 Stolb27 merged commit 80a85ae into 1.30.5-sync Aug 1, 2024
1 check passed
@Stolb27 Stolb27 deleted the ADBDEV-5966 branch August 1, 2024 07:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants