Allow force deletion of resources that have specific cleanup annotation #13869

Closed
travisn opened this issue Mar 4, 2024 · 22 comments

@travisn
Member

travisn commented Mar 4, 2024

Is this a bug report or feature request?

  • Feature Request

What should the feature do:
Rook has always been designed to preserve the cluster and the data, adding finalizers to the critical resources and CRs so that the cluster will not be accidentally deleted. When the resources are deleted, the operator will refuse to remove the finalizer if underlying Ceph resources related to that CR still exist. This is good for preserving the data and the cluster, but it is also difficult to work with when the user wants to force cleanup of some resource types, or of the entire cluster.

Let's consider force deleting a Rook resource and its underlying Ceph resources when both of these conditions are met:

  1. The CR is marked for deletion
  2. A specific annotation is found on the CR: ceph.rook.io/force-deletion

Resources to consider cleaning up:

  • RADOS namespaces for the CephRadosNamespace CR
  • Buckets for the CephObjectStore CR
  • Subvolumes for the CephFilesystemSubvolumeGroup CR

For example, if a CephFilesystemSubVolumeGroup CR has the ceph.rook.io/force-deletion annotation and the CR is marked for deletion, the operator will receive the deletion event. Rook will proceed to delete all subvolumes belonging to the subvolume group. When that cleanup is completed in Ceph, Rook will remove the finalizer and allow the CR to be deleted.
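
For illustration, the user-facing flow could look like the following (the resource name, namespace, and annotation value are placeholders; under this proposal only the presence of the annotation key would matter):

    kubectl -n rook-ceph annotate cephfilesystemsubvolumegroup my-group ceph.rook.io/force-deletion="true"
    kubectl -n rook-ceph delete cephfilesystemsubvolumegroup my-group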

If there are many Ceph resources to clean up, we may want to consider either a goroutine or launching a K8s job for the long-running operation.

The cluster CR currently implements a different approach to force deletion of the entire Ceph cluster. There is a cleanup policy setting that must be added to the cluster CR. If that setting is found, Rook will force delete the cluster and kick off a job on each node to clean up the mon data on disk and wipe the disks. This scenario is still valid and would still be supported.

What is the use case behind this feature:
Enable better design for cleaning up resources.

@travisn travisn added the feature label Mar 4, 2024
@BlaineEXE
Member

I think this sounds great. Users occasionally (and occasionally very loudly) complain about not having an easy way to wipe away CRs like these when they no longer want the data. I think using an annotation is a good choice (compared to the CephCluster's cleanupPolicy) for these simpler CRs.

The only concern I have is related to a more broad discussion about cleaning up CSI OMAP artifacts that are associated with some of these CRs. If CSI's OMAP resources are not stored in the same RADOS namespaces and/or CephFS SubvolGroups that go along with each CSI tenant, those cleanup operations will have to be managed by some other process.

@travisn
Member Author

travisn commented Mar 4, 2024

There is currently a PR in progress in the kubectl plugin to clean up the OMAP details. If the plugin can do it, I would expect Rook could as well, but we certainly need to understand that more...

@nb-ohad

nb-ohad commented Mar 5, 2024

@BlaineEXE
According to my understanding, the CSI OMAP table is stored as metadata on the block pool supporting the RADOS namespace / subvolumegroup. A cleanup process will need to list the images on said resource, use the image names to construct a list of OMAP table entry keys, and then remove these entries from the OMAP table on the relevant block pool's metadata.
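
As a rough sketch of that process with stock Ceph tooling (the pool, namespace, object, and key names below are placeholders and assumptions about the CSI OMAP layout, not verified against ceph-csi):

    rbd ls --pool <pool> --namespace <rados-namespace>
    rados -p <pool> listomapkeys csi.volumes.default
    rados -p <pool> rmomapkey csi.volumes.default csi.volume.<image-uuid>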

One thing we can do is create a job that takes care of the entire cleanup process and have Rook reconcile that job for each delete operation on annotated resources, continuing with the existing cleanup procedure only after the job completes successfully.

@parth-gr
Member

parth-gr commented Mar 5, 2024

An annotation can still be a safety issue: what if it is added by mistake? We don't want the Rook operator to act on it in that case, as that could be critical and delete user data.

We could have a job like the osd-purge-job that runs the delete commands externally through the toolbox and automates them, and we could invoke it anywhere in the product that cleanup is needed.

@BlaineEXE
Member

An annotation can still be a safety issue: what if it is added by mistake? We don't want the Rook operator to act on it in that case, as that could be critical and delete user data.

We could have a job like the osd-purge-job that runs the delete commands externally through the toolbox and automates them, and we could invoke it anywhere in the product that cleanup is needed.

This is a workable suggestion and I think it is always good to discuss alternate implementations for features. This helps us have clarity about what we want our user workflows to look like and how we want to balance different factors.

Personally, I think that what Travis proposes is likely to be better for end users for a few reasons that are somewhat interrelated:

  1. Users often complain about how hard it is to forcefully delete resources that they no longer want. This happens very frequently for users who are developing POCs where they are creating and deleting resources to understand how they work and how to develop good configurations. I think this makes the feature valuable. (I see this very often for CephObjectStore.)

  2. Because the annotation is not in the CRD, it is unlikely that a user will accidentally add the annotation (they need to find it in the docs first). However, as you mention, it is not impossible for users to add it by mistake, and we should keep this in mind in the general plan.

  3. The user workflow to "add an annotation and then delete the CR" is simple. It is easy for end users to comprehend and execute. The alternative of "configure the deletion job carefully, run it, and then delete the CR" is harder to do correctly. In that case, the deletion job's complexity may lead to user confusion that ultimately leads to users accidentally deleting the wrong data.

  4. Creating automation for deletion/cleanup using an annotation is similarly straightforward. Automation for this pattern is unlikely to accidentally delete data for the wrong CR. Automating job configuration would be much more complex, and there would be greater risk for automation accidentally deleting the wrong data.

When taking all of these considerations into account, I think the annotation method has better clarity and simplicity for users, which enables them to make fewer mistakes if/when they decide to use it.

@travisn
Member Author

travisn commented Mar 5, 2024

Agreed on all those points, thanks @BlaineEXE.

I'd propose we go with the approach to force deletion based on the annotation. The question then just becomes how to implement the forced cleanup for each type of resource.

For subvolumegroups, we already have a PR in progress in the plugin for deleting stale subvolumes: rook/kubectl-rook-ceph#237. Finding stale subvolumes is very similar to this scenario of deleting all subvolumes in the subvolumegroup to be deleted. This should give a base implementation and we can look at sharing code in the operator for the force deletion scenario.

@Rakshith-R
Member

Agreed on all those points, thanks @BlaineEXE.

I'd propose we go with the approach to force deletion based on the annotation. The question then just becomes how to implement the forced cleanup for each type of resource.

+1 for this.

For subvolumegroups, we already have a PR in progress in the plugin for deleting stale subvolumes: rook/kubectl-rook-ceph#237. Finding stale subvolumes is very similar to this scenario of deleting all subvolumes in the subvolumegroup to be deleted. This should give a base implementation and we can look at sharing code in the operator for the force deletion scenario.

I think we should extend the subvolumegroup CR to have an option to also create a radosnamespace which will be used by cephcsi to maintain OMAP data. With isolation now at both the subvolume level and the OMAP level, we don't have to worry about identifying and cleaning up the OMAP entries of a specific subvolume within a subvolumegroup (which is rather tedious).

This approach and its assumptions need to be tested and verified to work before we extend the subvolumegroup CR.

@parth-gr
Member

parth-gr commented Mar 7, 2024

to also create a radosnamespace which will be used by cephcsi to maintain OMAP data

Why not use the radosnamespace CR? https://github.com/rook/rook/blob/master/deploy/examples/radosnamespace.yaml

@Rakshith-R
Member

Rakshith-R commented Mar 7, 2024

to also create a radosnamespace which will be used by cephcsi to maintain OMAP data

Why not use the radosnamespace CR? https://github.com/rook/rook/blob/master/deploy/examples/radosnamespace.yaml

You need the rbd.radosnamespace section specified in the config entry created by the svg CR for cephcsi to use the radosnamespace.
https://github.com/rook/rook/blob/master/pkg/operator/ceph/file/subvolumegroup/controller.go#L279-L297

similar to https://github.com/rook/rook/blob/master/pkg/operator/ceph/pool/radosnamespace/controller.go#L265-L286

@travisn
Member Author

travisn commented Mar 7, 2024

I think we should extend the subvolumegroup CR to have an option to also create a radosnamespace which will be used by cephcsi to maintain OMAP data. With isolation now at both the subvolume level and the OMAP level, we don't have to worry about identifying and cleaning up the OMAP entries of a specific subvolume within a subvolumegroup (which is rather tedious).

This approach and its assumptions need to be tested and verified to work before we extend the subvolumegroup CR.

While using a rados namespace sounds like the right approach, note that we will also need the ability to clean up existing clusters that don't have that change, so we shouldn't block the initial implementation of the cleanup on that feature.

@sp98
Contributor

sp98 commented Mar 12, 2024

@travisn One suggestion would be to use the existing cleanupPolicy confirmation, that is, yes-really-destroy-data, to perform the cleanup of the resources required by this feature.

That way, we can extend the existing cleanup policy to delete the required resources and don't have to deal with a new ceph.rook.io/force-deletion annotation.

Moreover, I don't think (I may be wrong) that users would want to delete only specific resources by adding the ceph.rook.io/force-deletion annotation to them. So it will either be "delete all data" or "don't delete the data".

We would end up with a single option/user confirmation to clean up data.

The current flow of cluster teardown looks roughly like this:

1. Get dependent resources.
2. Block the deletion of the CephCluster until all the dependents are removed. For example: CephCluster "rook-ceph/rook-ceph" will not be deleted until all dependents are removed: CephBlockPool: [replicapool]
3. If the data cleanup confirmation is present:
	Wait for the daemons to be cleaned up.
	Start the cleanup jobs; these jobs delete the data on the disks.
4. Perform the following deletions:
	Delete the NetworkFence and remove its finalizers.
	Stop the monitoring goroutines.
	Delete the OSD encryption keys from the KMS (if any)

@nb-ohad

nb-ohad commented Mar 12, 2024

Moreover, I don't think (I may be wrong) that users would want to delete only specific resources by adding the ceph.rook.io/force-deletion annotation to them. So it will either be "delete all data" or "don't delete the data".

@sp98 This is exactly the requirement: allow a customer to select a specific resource to force-delete, and to mark it for force deletion only a moment before actually deleting it. For the majority of the time, the customer's data remains protected from deletion.

@travisn
Member Author

travisn commented Mar 12, 2024

@sp98 This scenario came up for cleaning up the CephFilesystemSubvolumeGroup and CephRadosNamespace CRs. We want to force delete these resources instead of blocking on the finalizer. The entire cluster is not being cleaned up, just those resources. The same pattern could apply to a CephObjectStore that has buckets, or a CephFilesystem with subvolume groups, etc. We don't want to use a CRD setting to control the forced cleanup. The annotation seems best. Let's discuss.

@travisn
Member Author

travisn commented Mar 14, 2024

To capture our conversation on the implementation for forced deletion of the CephFilesystemSubvolumeGroup CR...

  1. To force deletion:
    a. Add the annotation to the CR: ceph.rook.io/force-deletion
    b. Delete the CR
  2. The Rook operator will be notified of the deletion and process the event:
    a. If there are no subvolumes in the group, remove the finalizer and return
    b. If the ceph.rook.io/force-deletion annotation does not exist, requeue the event and wait for the user to delete the subvolumes
    c. If forcing deletion and a cleanup job is already running, just requeue the deletion event
    d. Otherwise, create a job to force delete the subvolumegroup, then requeue the deletion event
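
A minimal sketch of that deletion flow in Go, just to make the reconcile branch concrete (the cleaner struct and its hooks are hypothetical placeholders, not Rook's actual types or APIs):

    // Sketch of the deletion handling described above. Everything here is
    // illustrative; the helper hooks would be backed by real Ceph/Kubernetes calls.
    package subvolumegroup

    import (
        "context"
        "fmt"
        "time"
    )

    const forceDeletionAnnotation = "ceph.rook.io/force-deletion"

    // cleaner abstracts the operations the operator needs; each field is a
    // hypothetical hook, not a real Rook API.
    type cleaner struct {
        subvolumeGroupIsEmpty func(ctx context.Context) (bool, error)
        cleanupJobRunning     func(ctx context.Context) (bool, error)
        startCleanupJob       func(ctx context.Context) error
        removeFinalizer       func(ctx context.Context) error
    }

    // result mirrors controller-runtime's reconcile.Result for this sketch.
    type result struct {
        requeueAfter time.Duration
    }

    func reconcileDelete(ctx context.Context, annotations map[string]string, c cleaner) (result, error) {
        // a. No subvolumes left: remove the finalizer and let the CR be deleted.
        empty, err := c.subvolumeGroupIsEmpty(ctx)
        if err != nil {
            return result{}, err
        }
        if empty {
            return result{}, c.removeFinalizer(ctx)
        }

        // b. No force-deletion annotation: requeue and wait for the user to delete the subvolumes.
        if _, ok := annotations[forceDeletionAnnotation]; !ok {
            return result{requeueAfter: time.Minute}, nil
        }

        // c./d. Forcing deletion: start the cleanup job if it is not already running, then requeue.
        running, err := c.cleanupJobRunning(ctx)
        if err != nil {
            return result{}, err
        }
        if !running {
            if err := c.startCleanupJob(ctx); err != nil {
                return result{}, fmt.Errorf("failed to start cleanup job: %w", err)
            }
        }
        return result{requeueAfter: time.Minute}, nil
    }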

The force deletion job will call a CLI command on the rook binary (similar to the osd purge command):

rook ceph subvolumegroup delete <subvolumegroup> <filesystem>
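
Internally, such a command would presumably enumerate and remove the subvolumes with the standard Ceph CLI, roughly along these lines (illustrative only; the exact invocation is an assumption):

    ceph fs subvolume ls <filesystem> --group_name <subvolumegroup>
    ceph fs subvolume rm <filesystem> <subvolume> --group_name <subvolumegroup>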

@travisn
Member Author

travisn commented Mar 14, 2024

Also worth considering: moving the OMAP cleanup code from the kubectl-rook-ceph repo (in progress with rook/kubectl-rook-ceph#237) into the core rook repo, so that the kubectl-rook-ceph repo can reference the rook code for OMAP cleanup.

@Madhu-1
Member

Madhu-1 commented Mar 15, 2024

The force deletion job will call a CLI command on the rook binary (similar to the osd purge command):
rook ceph subvolumegroup delete <subvolumegroup> <filesystem>

Can we use rook ceph subvolumegroup delete <filesystem> <subvolumegroup>?

Currently, a few things are not considered in the cleanup code:

  • The snapshots
  • The shallow volumes
  • The NFS volumes
  • Pending clones

More details are covered in rook/kubectl-rook-ceph#251 and rook/kubectl-rook-ceph#211. Once this is addressed we are good; I will check if I am missing anything else.

@travisn
Member Author

travisn commented Mar 15, 2024

The force deletion job will call a CLI command on the rook binary (similar to the osd purge command):
rook ceph subvolumegroup delete <subvolumegroup> <filesystem>

Can we use rook ceph subvolumegroup delete <filesystem> <subvolumegroup>?

This is proposed as a new command inside Rook (not kubectl-rook-ceph plugin) that could be called as a job.

Currently, a few things are not considered in the cleanup code:

  • The snapshots
  • The shallow volumes
  • The NFS volumes
  • Pending clones

More details are covered in rook/kubectl-rook-ceph#251 and rook/kubectl-rook-ceph#211. Once this is addressed we are good; I will check if I am missing anything else.

If we are force deleting the subvolumegroup, I assume that means we should force delete all of the resources owned by the svg, including snapshots and other volumes. It might just take multiple PRs to get these changes done, if that makes sense.

@nb-ohad

nb-ohad commented Mar 18, 2024

Can we use rook ceph subvolumegroup delete <filesystem> <subvolumegroup>?

If everything is in the Rook codebase, why should we call a CLI command? Why can't the job just invoke the Go code itself?
Also, the above command suggests that we are only going to handle subvolumegroups, but the requirement is to also force delete RADOS namespaces for block pools.

@travisn
Member Author

travisn commented Mar 18, 2024

Can we use rook ceph subvolumegroup delete <filesystem> <subvolumegroup>?

If everything is in the Rook codebase, why should we call a CLI command? Why can't the job just invoke the Go code itself?

A K8s job needs a CLI entry point.

Also, the above command suggests that we are only going to handle subvolumegroups, but the requirement is to also force delete RADOS namespaces for block pools.

Yes, the plan also covers force deleting RADOS namespaces; that comment just captured the discussion specific to subvolumegroups.
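
For illustration, a minimal sketch of what such a CLI entry point could look like with cobra, which Rook already uses for its internal commands (the command layout, flag, and output here are assumptions, not the final interface):

    // Hypothetical entry point for the cleanup job; not Rook's actual command.
    package main

    import (
        "fmt"
        "os"

        "github.com/spf13/cobra"
    )

    func main() {
        var namespace string

        cmd := &cobra.Command{
            Use:   "subvolumegroup-delete <filesystem> <subvolumegroup>",
            Short: "force delete all subvolumes in a subvolume group (illustrative)",
            Args:  cobra.ExactArgs(2),
            RunE: func(cmd *cobra.Command, args []string) error {
                filesystem, group := args[0], args[1]
                // A real implementation would connect to the Ceph cluster and delete
                // every subvolume in the group before the operator removes the finalizer.
                fmt.Printf("would delete all subvolumes in %s/%s (namespace %q)\n", filesystem, group, namespace)
                return nil
            },
        }
        cmd.Flags().StringVar(&namespace, "namespace", "rook-ceph", "namespace of the CephCluster")

        if err := cmd.Execute(); err != nil {
            os.Exit(1)
        }
    }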

@nb-ohad

nb-ohad commented Mar 18, 2024

A K8s job needs a CLI entry point.

I understand, @travisn.
But is the suggestion to use the CLI directly as the entry point meant to avoid adding another entry point to the operator's executable? Isn't it a mistake to couple the operator and the CLI together?

In my eyes, the job and the operator can be coupled, since the job is an extension of the operator's reconciliation responsibilities and just an implementation detail of the reconcile execution flow.

@travisn
Member Author

travisn commented Mar 18, 2024

Agreed, it is an implementation detail. Rook already has a number of internal CLI commands; for example, the OSD purge job calls such a command. Internally, the Go packages are factored so that the operator code is separate from the ceph and other commands.

@travisn
Member Author

travisn commented Apr 16, 2024

Closing with the completion of forced cleanup for the filesystem subvolumegroups in #14026 and the RADOS namespaces in #14052. Forced cleanup of other resource types can be implemented separately.

@travisn travisn closed this as completed Apr 16, 2024