
Backup Job Fails to Abort When libvirtd Daemon Restarts or Disconnects #205

Closed
inode64 opened this issue Aug 21, 2024 · 8 comments
Labels
bug (Something isn't working) · enhancement (New feature or request) · fixed-in-master (Fixed in master) · Needs-testing (Feature needs testing)

Comments

@inode64
Contributor

inode64 commented Aug 21, 2024

Version used
2.11

Describe the bug
If the libvirtd daemon restarts or disconnects unexpectedly while a backup is running, the backup job fails to complete cleanly. The process cannot stop the backup task over the broken connection and hits a "Broken pipe" error. The application does not handle this scenario, so an unhandled exception terminates the process.

Steps to Reproduce:

  • Start a backup task using the application.
  • While the backup is in progress, restart or disconnect the libvirtd daemon.
  • Observe that the backup job fails to stop properly and raises the error shown in the logs below.

Expected behavior

The backup process should detect the disconnection from the libvirtd daemon and handle it gracefully: ideally it should restore the connection to libvirtd and properly abort the backup task, instead of throwing an unhandled exception (see the sketch below).
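
For illustration only, a minimal sketch of the kind of handling meant here, using the public libvirt Python bindings. The function and variable names are hypothetical, not virtnbdbackup's actual code:

```python
import libvirt

def stop_backup(conn: libvirt.virConnect, uri: str, name: str) -> None:
    """Hypothetical sketch: abort the backup job, reconnecting first
    if the original libvirtd connection has died."""
    try:
        conn.lookupByName(name).abortJob()
    except libvirt.libvirtError:
        if conn.isAlive() == 1:
            raise  # a genuine error, not a dropped connection
        conn = libvirt.open(uri)            # libvirtd came back: reconnect
        conn.lookupByName(name).abortJob()  # retry the abort once
```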

Hypervisor information:

  • OS: Gentoo
  • HV type: libvirt 10.3.0

Logfiles:
[2024-08-20 01:33:55] INFO lib common - printVersion [main]: Version: 2.11 Arguments: /usr/lib/python-exec/python3.11/virtnbdbackup -d svrdynamics --noprogress -q --compress -l auto -o /srv/backup/svrdynamics/2024-33
[2024-08-20 01:33:55] INFO root virtnbdbackup - main [main]: Backup level: [auto]
[2024-08-20 01:33:55] INFO root virtnbdbackup - main [main]: Compression enabled, level [2]
[2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: Libvirt library version: [10003000]
[2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: NBD library version: [1.18.1]
[2024-08-20 01:33:56] INFO root check - targetDir [main]: Backup mode auto, target folder is empty: executing full backup.
[2024-08-20 01:33:56] INFO root disktype - Optical [main]: Skipping attached [cdrom] device: [sdb].
[2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: Backup will save [4] attached disks.
[2024-08-20 01:33:56] INFO root virtnbdbackup - main [main]: Concurrent backup processes: [4]
[2024-08-20 01:33:56] INFO root checkpoint - create [main]: Loading checkpoints from: [/srv/backup/svrdynamics/2024-33/svrdynamics.cpt]
[2024-08-20 01:33:56] INFO root checkpoint - redefine [main]: Loading checkpoint list from: [/srv/backup/svrdynamics/2024-33/checkpoints]
[2024-08-20 01:33:56] INFO root checkpoint - create [main]: Checkpoint handling.
[2024-08-20 01:34:02] INFO root checkpoint - create [main]: Using checkpoint name: [virtnbdbackup.0].
[2024-08-20 01:34:02] INFO root virtnbdbackup - main [main]: Local NBD Endpoint socket: [/var/tmp/virtnbdbackup.2045415]
[2024-08-20 01:34:02] INFO root virtnbdbackup - main [main]: Temporary scratch file target directory: [/var/tmp]
[2024-08-20 01:34:02] INFO root job - start [main]: Starting backup job.
[2024-08-20 01:34:33] INFO fs fs - freeze [main]: Freezed [5] filesystems.
[2024-08-20 01:34:34] INFO fs fs - thaw [main]: Thawed [5] filesystems.
[2024-08-20 01:34:34] INFO root virtnbdbackup - main [main]: Started backup job with checkpoint, saving information.
[2024-08-20 01:34:34] INFO root checkpoint - backup [main]: Saving checkpoint config to: [/srv/backup/svrdynamics/2024-33/checkpoints/virtnbdbackup.0.xml]
[2024-08-20 01:34:34] INFO nbd client - connect [vda]: Waiting until NBD server at [nbd+unix:///vda?socket=/var/tmp/virtnbdbackup.2045415] is up.
[2024-08-20 01:34:34] INFO nbd client - connect [vdb]: Waiting until NBD server at [nbd+unix:///vdb?socket=/var/tmp/virtnbdbackup.2045415] is up.
[2024-08-20 01:34:34] INFO nbd client - connect [vdc]: Waiting until NBD server at [nbd+unix:///vdc?socket=/var/tmp/virtnbdbackup.2045415] is up.
[2024-08-20 01:34:34] INFO nbd client - connect [sda]: Waiting until NBD server at [nbd+unix:///sda?socket=/var/tmp/virtnbdbackup.2045415] is up.
[2024-08-20 01:34:35] INFO nbd client - connect [vda]: Connection to NBD backend succeeded.
[2024-08-20 01:34:35] INFO nbd client - connect [vdb]: Connection to NBD backend succeeded.
[2024-08-20 01:34:35] INFO root disk - _getExtentHandler [vda]: Using qemu tools to query extents
[2024-08-20 01:34:35] INFO root disk - _getExtentHandler [vdb]: Using qemu tools to query extents
[2024-08-20 01:34:35] INFO nbd client - connect [vdc]: Connection to NBD backend succeeded.
[2024-08-20 01:34:35] INFO nbd client - connect [sda]: Connection to NBD backend succeeded.
[2024-08-20 01:34:35] INFO root disk - _getExtentHandler [vdc]: Using qemu tools to query extents
[2024-08-20 01:34:35] INFO root disk - _getExtentHandler [sda]: Using qemu tools to query extents
[2024-08-20 01:34:35] INFO root disk - backup [vda]: Got 31 extents to backup.
[2024-08-20 01:34:35] INFO root disk - backup [vda]: 53687091200 bytes disk size
[2024-08-20 01:34:35] INFO root disk - backup [vda]: 823189504 bytes of data extents to backup
[2024-08-20 01:34:35] INFO root target - get [vda]: Write data to target file: [/srv/backup/svrdynamics/2024-33/vda.full.data.partial].
[2024-08-20 01:34:35] INFO root disk - backup [vda]: Creating thin provisioned stream backup image
[2024-08-20 01:34:36] INFO root disk - backup [sda]: Got 357 extents to backup.
[2024-08-20 01:34:36] INFO root disk - backup [sda]: 128849018880 bytes disk size
[2024-08-20 01:34:36] INFO root disk - backup [sda]: 128540803072 bytes of data extents to backup
[2024-08-20 01:34:36] INFO root target - get [sda]: Write data to target file: [/srv/backup/svrdynamics/2024-33/sda.full.data.partial].
[2024-08-20 01:34:36] INFO root disk - backup [sda]: Creating thin provisioned stream backup image
[2024-08-20 01:34:36] INFO root disk - backup [vdc]: Got 123 extents to backup.
[2024-08-20 01:34:36] INFO root disk - backup [vdc]: 107374182400 bytes disk size
[2024-08-20 01:34:36] INFO root disk - backup [vdc]: 65519288320 bytes of data extents to backup
[2024-08-20 01:34:36] INFO root target - get [vdc]: Write data to target file: [/srv/backup/svrdynamics/2024-33/vdc.full.data.partial].
[2024-08-20 01:34:36] INFO root disk - backup [vdc]: Creating thin provisioned stream backup image
[2024-08-20 01:34:37] INFO root disk - backup [vdb]: Got 17 extents to backup.
[2024-08-20 01:34:37] INFO root disk - backup [vdb]: 214748364800 bytes disk size
[2024-08-20 01:34:37] INFO root disk - backup [vdb]: 214353313792 bytes of data extents to backup
[2024-08-20 01:34:37] INFO root target - get [vdb]: Write data to target file: [/srv/backup/svrdynamics/2024-33/vdb.full.data.partial].
[2024-08-20 01:34:37] INFO root disk - backup [vdb]: Creating thin provisioned stream backup image
[2024-08-20 01:34:42] INFO root disk - backup [vda]: Backup of disk [vda] finished, file: [/srv/backup/svrdynamics/2024-33/vda.full.data]
[2024-08-20 01:34:42] INFO root metadata - backupChecksum [vda]: Checksum for file: [/srv/backup/svrdynamics/2024-33/vda.full.data]:[948353423]
[2024-08-20 01:34:42] INFO root metadata - backupChecksum [vda]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/vda.full.data.chksum]
[2024-08-20 01:49:32] INFO root disk - backup [vdc]: Backup of disk [vdc] finished, file: [/srv/backup/svrdynamics/2024-33/vdc.full.data]
[2024-08-20 01:49:32] INFO root metadata - backupChecksum [vdc]: Checksum for file: [/srv/backup/svrdynamics/2024-33/vdc.full.data]:[1331928550]
[2024-08-20 01:49:32] INFO root metadata - backupChecksum [vdc]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/vdc.full.data.chksum]
[2024-08-20 02:02:42] INFO root disk - backup [sda]: Backup of disk [sda] finished, file: [/srv/backup/svrdynamics/2024-33/sda.full.data]
[2024-08-20 02:02:42] INFO root metadata - backupChecksum [sda]: Checksum for file: [/srv/backup/svrdynamics/2024-33/sda.full.data]:[496616561]
[2024-08-20 02:02:46] INFO root metadata - backupChecksum [sda]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/sda.full.data.chksum]
[2024-08-20 02:14:24] INFO root disk - backup [vdb]: Backup of disk [vdb] finished, file: [/srv/backup/svrdynamics/2024-33/vdb.full.data]
[2024-08-20 02:14:24] INFO root metadata - backupChecksum [vdb]: Checksum for file: [/srv/backup/svrdynamics/2024-33/vdb.full.data]:[3927655927]
[2024-08-20 02:14:24] INFO root metadata - backupChecksum [vdb]: Saving checksum to: [/srv/backup/svrdynamics/2024-33/vdb.full.data.chksum]
[2024-08-20 02:14:24] INFO root virtnbdbackup - main [main]: Backup jobs finished, stopping backup task.
[2024-08-20 02:14:24] WARNING virt client - stopBackup [main]: Failed to stop backup job: [Cannot write data: Broken pipe]
[2024-08-20 02:14:24] INFO root metadata - backupConfig [main]: Saving VM config to: [/srv/backup/svrdynamics/2024-33/vmconfig.virtnbdbackup.0.xml]
[2024-08-20 02:14:24] INFO root metadata - backupBootConfig [main]: Save additional boot config [loader] to: [/srv/backup/svrdynamics/2024-33/edk2-x86_64-code.fd.virtnbdbackup.0]
[2024-08-20 02:14:24] INFO root metadata - backupBootConfig [main]: Save additional boot config [nvram] to: [/srv/backup/svrdynamics/2024-33/svrdynamics_VARS.fd.virtnbdbackup.0]
[2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/vda.virtnbdbackup.0.qcow.json]
[2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/vdb.virtnbdbackup.0.qcow.json]
[2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/vdc.virtnbdbackup.0.qcow.json]
[2024-08-20 02:14:24] INFO root metadata - backupDiskInfo [main]: Saved qcow image config to: [/srv/backup/svrdynamics/2024-33/sda.virtnbdbackup.0.qcow.json]
Traceback (most recent call last):
  File "/usr/lib/python-exec/python3.11/virtnbdbackup", line 493, in <module>
    main()
  File "/usr/lib/python-exec/python3.11/virtnbdbackup", line 471, in main
    if domObj.autostart() == 1:
       ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/libvirt.py", line 741, in autostart
    raise libvirtError('virDomainGetAutostart() failed')
libvirt.libvirtError: internal error: client socket is closed

inode64 added the bug label Aug 21, 2024
@abbbi
Owner

abbbi commented Aug 21, 2024

What happens to the backup job anyway if the libvirtd daemon is restarted? It seems the Qemu process is still keeping the NBD handle open, otherwise the data transfer would fail. Is the backup job still listed in the domjobs output after a libvirtd restart? This is really a corner case; I don't know if I want to invest time into fixing it.

The problem with fixing this is that the libvirt connection is initialized at application startup; there is probably no easy way to reconnect without redesigning a large part of the code base. One way to solve it would be a second thread that checks the connection state and re-initializes it if it's broken, but that is still prone to race conditions.

Another option would be to put a decorator on each function within virt/client.py that re-initializes the connection if it is broken (see the sketch below), or to redesign the module so that each operation opens a new connection (which isn't that great either). But even then this would need graceful handling: what if libvirtd doesn't come up for several minutes, or even hours?

I don't see any option among the libvirt connection-open flags that would provide an automatic reconnect.
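
A rough sketch of the decorator idea, assuming a hypothetical client class that holds its connection in `self._conn` and its URI in `self._uri`; this is not the module's actual structure:

```python
import functools
import time

import libvirt

def reconnects(retries: int = 3, delay: int = 5):
    """Hypothetical decorator: retry a call after re-opening the
    libvirt connection if the underlying socket has broken."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(self, *args, **kwargs)
                except libvirt.libvirtError:
                    try:
                        # isAlive() drops to 0 once a call has failed on
                        # a dead connection (see the REPL session below).
                        alive = self._conn.isAlive() == 1
                    except libvirt.libvirtError:
                        alive = False
                    if alive or attempt == retries:
                        raise
                    time.sleep(delay)
                    self._conn = libvirt.open(self._uri)  # reconnect
                    # NOTE: domain objects obtained from the old
                    # connection are stale and would need to be
                    # looked up again as well.
        return wrapper
    return decorator
```

As noted above, the open question remains how long such a wrapper should keep retrying if libvirtd stays down.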

@abbbi
Owner

abbbi commented Aug 21, 2024

>>> import libvirt
>>> foo = libvirt.open("qemu:///session")              # restart libvirt daemon at this point
>>> foo.isAlive()
1                                                      # wouldn't expect that
>>> foo.getSysinfo()
libvirt: XML-RPC error : Cannot write data: Broken pipe
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/libvirt.py", line 4742, in getSysinfo
    raise libvirtError('virConnectGetSysinfo() failed')
libvirt.libvirtError: Cannot write data: Broken pipe
>>> foo.isAlive()
0                                                   # the connection only reports dead after an error occurs.

abbbi added the enhancement label Aug 21, 2024
@abbbi
Owner

abbbi commented Aug 21, 2024

registerCloseCallback looks interesting.

@abbbi
Owner

abbbi commented Aug 21, 2024

>>> def callback(foo,bar,baz):
...     print("help..")
... 
>>> 
>>> 
>>> foo = libvirt.open("qemu:///session")
>>> foo.registerCloseCallback(callback, None)
0
>>> foo.isAlive()
1
>>> foo.getSysinfo()
help..
libvirt: XML-RPC error : Cannot write data: Broken pipe
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/libvirt.py", line 4742, in getSysinfo
    raise libvirtError('virConnectGetSysinfo() failed')
libvirt.libvirtError: Cannot write data: Broken pipe
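
One caveat worth noting (an assumption based on the libvirt event API, not something verified in this thread): close callbacks are normally dispatched through libvirt's event loop, so for the callback to fire asynchronously, rather than only during a failing API call as above, the default event loop usually has to be registered and driven:

```python
import threading

import libvirt

# Assumption: register and run the default event loop so that
# registered close callbacks can be delivered asynchronously.
libvirt.virEventRegisterDefaultImpl()

def _run_event_loop():
    while True:
        libvirt.virEventRunDefaultImpl()  # blocks until an event fires

threading.Thread(target=_run_event_loop, daemon=True).start()

conn = libvirt.open("qemu:///session")
conn.registerCloseCallback(
    lambda c, reason, opaque: print("connection closed, reason:", reason),
    None,
)
```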

@abbbi
Owner

abbbi commented Aug 21, 2024

Hi,

can you check branch issue_205 with its commits? In my tests it then attempts to re-connect if the connection error handler is called due to issues with the libvirt backend:

[2024-08-21 11:55:23] ERROR root virtnbdbackup - connectionError [main]:  Libvirt connection error detected, reconnecting
[2024-08-21 11:55:23] ERROR root virtnbdbackup - connectionError [main]:  Attempting to stop backup task
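
Judging from the log output, the handler reconnects and then tries to stop the backup task. A minimal sketch of that pattern, with illustrative names rather than the actual branch code:

```python
import logging

import libvirt

def connection_error(uri: str, domain_name: str) -> None:
    """Illustrative sketch of a reconnect-and-abort handler."""
    logging.error("Libvirt connection error detected, reconnecting")
    conn = libvirt.open(uri)  # fresh connection to the restarted libvirtd
    logging.error("Attempting to stop backup task")
    try:
        conn.lookupByName(domain_name).abortJob()  # abort the stale job
    except libvirt.libvirtError as err:
        logging.warning("Failed to stop backup job: [%s]", err)
```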

abbbi added the fixed-in-master and Needs-testing labels Aug 21, 2024
@inode64
Contributor Author

inode64 commented Aug 21, 2024

OK, thank you very much. All I can do is try it in a test installation and let you know.

abbbi added a commit that referenced this issue Aug 21, 2024
abbbi closed this as completed Aug 21, 2024
@inode64
Contributor Author

inode64 commented Aug 22, 2024

It runs fine for me!
Thanks!

I have created a pull request to include retries: #206
