Better handling of unsupported combos in dynamic configuration #4725

Open
twz123 opened this issue Jul 5, 2024 · 1 comment

twz123 commented Jul 5, 2024

Is your feature request related to a problem? Please describe.

Some k0s configuration options are mutually exclusive, others cannot be changed after cluster creation. Currently, there is minimal sanity checking for dynamic configuration. As a result, unsupported configuration combinations can potentially break a cluster completely. Debugging such a problem is complex, as helpful error messages only appear during cluster creation. See #4721 for an example.

Describe the solution you would like

  1. Prevent invalid configurations from being stored in the cluster:
    Add CEL (Common Expression Language) validation rule markers to the various ClusterConfig structs. This lets the API server itself perform the validation and reject invalid configurations before they are ever stored in the cluster.

  2. Graceful handling of unsupported configuration values. Depending on the effectiveness of point 1, there are several options:

    1. If the configuration validation fails, k0s doesn't reconcile the configuration at all, and the components remain on the last valid configuration. This is easy and straightforward to implement. While tempting, this could undermine the reconciliation of other valid and safe configuration parts, and is therefore only a good choice if point 1 is effective, and invalid configurations stored within a cluster can be considered a pathological edge case.

    2. Try to get as close to the desired configuration as possible without breaking the cluster. This might involve "resolving" an invalid desired configuration by comparing it to the last valid one, into a "fixed" target configuration that passes validation and can be safely reconciled. This is a more elaborate and error-prone approach, and may not be necessary if point 1 proves effective.
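As a rough illustration of option 2.1, the sketch below keeps the components on the last valid configuration whenever validation fails. It is a minimal sketch, not k0s code: `Config`, `validate`, `Reconciler` and the mutually exclusive `FeatureA`/`FeatureB` pair are all hypothetical names made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the dynamic cluster configuration.
type Config struct {
	FeatureA bool
	FeatureB bool
}

// validate is a hypothetical check. For the sake of the example, FeatureA
// and FeatureB are assumed to be mutually exclusive.
func validate(c Config) error {
	if c.FeatureA && c.FeatureB {
		return errors.New("featureA and featureB are mutually exclusive")
	}
	return nil
}

// Reconciler remembers the last configuration that passed validation
// and was applied successfully.
type Reconciler struct {
	lastValid Config
	apply     func(Config) error
}

// Reconcile applies desired only if it passes validation. Otherwise nothing
// is applied, the components stay on the last valid configuration, and the
// validation error is returned so that it can be surfaced to the user.
func (r *Reconciler) Reconcile(desired Config) error {
	if err := validate(desired); err != nil {
		return fmt.Errorf("keeping last valid configuration: %w", err)
	}
	if err := r.apply(desired); err != nil {
		return err
	}
	r.lastValid = desired
	return nil
}

func main() {
	r := &Reconciler{apply: func(c Config) error {
		fmt.Printf("applying %+v\n", c)
		return nil
	}}
	fmt.Println(r.Reconcile(Config{FeatureA: true}))
	fmt.Println(r.Reconcile(Config{FeatureA: true, FeatureB: true}))
	fmt.Printf("still on %+v\n", r.lastValid)
}
```

Note that this all-or-nothing check is exactly what blocks the reconciliation of unrelated, valid parts of the configuration. Option 2.2 would replace the early return with a step that derives a "fixed" target configuration from `desired` and `lastValid`.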

Describe alternatives you've considered

A validating webhook is a more powerful approach to preventing invalid configurations from being stored in the cluster. At the same time, it's much heavier and more complex. It does not currently add significant value over the simpler CEL approach.
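For comparison, here is a sketch of how lightweight the CEL approach could be. It assumes that the CRDs are generated with controller-gen, so that `+kubebuilder:validation:XValidation` markers end up as `x-kubernetes-validations` rules in the schema. `ExampleSpec` and its fields are hypothetical and not taken from the actual ClusterConfig. The first rule covers a mutually exclusive pair, the second uses a transition rule (`oldSelf`) for a value that cannot be changed after creation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExampleSpec is a hypothetical config struct, not an actual k0s type.
//
// +kubebuilder:validation:XValidation:rule="!(has(self.featureA) && has(self.featureB))",message="featureA and featureB are mutually exclusive"
type ExampleSpec struct {
	// Mode is assumed to be immutable after cluster creation.
	//
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="mode cannot be changed after cluster creation"
	Mode string `json:"mode"`

	// +optional
	FeatureA *string `json:"featureA,omitempty"`

	// +optional
	FeatureB *string `json:"featureB,omitempty"`
}

// render serializes a spec the way it would be sent to the API server.
func render(spec ExampleSpec) string {
	out, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	// CEL rules address fields by their serialized names, which is why the
	// rules above refer to self.featureA rather than self.FeatureA.
	a := "on"
	fmt.Println(render(ExampleSpec{Mode: "x", FeatureA: &a}))
}
```

The markers are plain comments as far as the Go compiler is concerned. They only take effect once the CRD manifests are regenerated and the API server evaluates the resulting rules on create and update.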

"Let it crash". Terminate the process and wait for a restart. This is not suitable as it will bring down the local API server, making it difficult to fix invalid configurations, and could also harm the entire etcd cluster in an HA scenario.

Additional context

A particular challenge is the stack applier, which is used by almost all k0s components to manage resources in the cluster. Currently, there is no good way to suspend a stack to prevent it from being applied. This may become necessary to keep an unsupported configuration from being rolled out to the cluster. Suspending stack reconciliation could be achieved by suspending leader election globally, but this might prevent partial reconciliation as discussed above.

twz123 commented Jul 5, 2024

#4674 already addresses some non-CEL validation parts for the CRDs.
