Restoring a snapshot causes snapshot delete to block until drop_caches is called #736

Open
ticpu opened this issue Aug 27, 2024 · 3 comments


ticpu commented Aug 27, 2024

Details

Setup

  1. Kernel: 6.11 / bcachefs-testing / fcd6549.
  2. bcachefs is mounted at /mnt/bcachefs.
  3. /opt is a bind mount of /mnt/bcachefs/opt.
  4. Daily snapshots of opt are taken in /mnt/bcachefs/snapshots/opt.
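
The daily snapshot job itself is not shown in this report. A minimal sketch of what it might look like, assuming GNU date, UTC timestamps, and the @GMT-... naming that appears in the rollback step; the function names and the SRC and SNAP_DIR variables are hypothetical:

```shell
#!/bin/sh
# Hypothetical daily snapshot helper; paths follow the setup described above.
SRC=/mnt/bcachefs/opt
SNAP_DIR=/mnt/bcachefs/snapshots/opt

# Build a snapshot name such as @GMT-2024.08.26-05.00.18 from an epoch time.
# Assumes GNU date (-d @EPOCH) and that the timestamps are UTC.
snapshot_name() {
    date -u -d "@$1" '+@GMT-%Y.%m.%d-%H.%M.%S'
}

# Print the command a daily job would run; remove the echo to execute it.
daily_snapshot_cmd() {
    echo bcachefs subvolume snapshot "$SRC" "$SNAP_DIR/$(snapshot_name "$1")"
}

daily_snapshot_cmd "$(date +%s)"
```

The helper only prints the command, so it can be checked safely before being wired into cron or a systemd timer.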

Rollback operation:

  1. My computer crashed while jetbrains-toolbox was updating all applications, and I wanted to start from a clean state.
  2. Confirm with fuser -vm /opt that no application is using /opt; nothing is configured to use /mnt/bcachefs/opt directly.
  3. umount /opt
  4. cd /mnt/bcachefs
  5. mv opt opt.dead
  6. bcachefs subvolume snapshot snapshots/opt/@GMT-2024.08.26-05.00.18/ opt
  7. ls opt: all files are present.
  8. mount --bind /mnt/bcachefs/opt /opt
  9. Restart services.
  10. bcachefs subvolume del ./opt.dead
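
For anyone trying to reproduce this, the rollback steps above can be sketched as a script. This is a sketch, not the exact procedure: it uses absolute paths instead of cd /mnt/bcachefs, skips the fuser check and the service restart, and the run and rollback helpers and the DRY_RUN variable are hypothetical names. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the rollback steps above. DRY_RUN defaults to 1, so commands are
# only printed; set DRY_RUN=0 to execute them (requires root).
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rollback FS_ROOT NAME SNAPSHOT MOUNTPOINT
rollback() {
    fs=$1 name=$2 snap=$3 mnt=$4
    run umount "$mnt" || return 1
    run mv "$fs/$name" "$fs/$name.dead" || return 1
    run bcachefs subvolume snapshot "$fs/$snap" "$fs/$name" || return 1
    run mount --bind "$fs/$name" "$mnt" || return 1
    # Deleting the old subvolume is the step that later blocks in this report.
    run bcachefs subvolume del "$fs/$name.dead"
}

rollback /mnt/bcachefs opt 'snapshots/opt/@GMT-2024.08.26-05.00.18' /opt
```

Stopping at the first failing step keeps the old subvolume in place as opt.dead if the snapshot or bind mount fails.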

Problem

  1. Everything starts as expected; I/O is happening on all devices.
  2. I restart the updates from jetbrains-toolbox.
  3. Repeating stack traces appear in dmesg: stack1.txt
  4. Unmounting the filesystem clears the hang and ends the deletion with:
bch2_delete_dead_snapshots: error deleting keys from dying snapshots erofs_trans_commit
bch2_delete_dead_snapshots: error erofs_trans_commit
shutdown complete, journal seq 34783919.

Reproducing the problem with more debug information.

  1. Repeat the rollback operation, then steps 1 and 2 of the Problem section above.
  2. A new debug message appears repeatedly: bch2_evict_subvolume_inodes() waited 10 seconds for inode 671283974:6768 to go away: ref 1 state 65536
  3. echo w shows the same stack for the blocked delete: stack2.txt
  4. After waiting about half an hour, run echo 3 > /proc/sys/vm/drop_caches
  5. The repeating log message stops and multiple gigabytes of discard operations start on both NVMe devices.
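
To see which inodes the delete keeps waiting on, the repeated debug message can be reduced to its fields. A minimal sketch, assuming the message format is exactly the one quoted in step 2; parse_evict_line is a hypothetical helper name:

```shell
#!/bin/sh
# Extract "inode ref state" from a bch2_evict_subvolume_inodes() message.
# The pattern is written against the single message quoted above and may
# need adjusting if the kernel prints it differently.
parse_evict_line() {
    sed -n 's/.*for inode \([0-9]*:[0-9]*\) to go away: ref \([0-9]*\) state \([0-9]*\).*/\1 \2 \3/p'
}

# Example with the line from this report:
echo 'bch2_evict_subvolume_inodes() waited 10 seconds for inode 671283974:6768 to go away: ref 1 state 65536' \
    | parse_evict_line
# prints: 671283974:6768 1 65536
```

On an affected system, dmesg | parse_evict_line | sort | uniq -c would show whether it is always the same inode. The echo w in step 3 presumably means echo w > /proc/sysrq-trigger, the SysRq request that dumps blocked tasks to the kernel log; that reading is an assumption.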