
StatusManager: consolidate status updates from different zones #3750

Merged
7 commits merged on Nov 29, 2023

Conversation

npinaeva
Member

@npinaeva npinaeva commented Jul 6, 2023

Create StatusManager, a centralized component responsible for updating
the consolidated status of an object based on zone-specific statuses.
Created as part of the cluster manager, it handles only apbroutepolicy
objects for now.

Update AdminPolicyBasedRouteStatus.Messages to allow patching with
the merge strategy. Update the update-codegen script to always install
the latest controller-gen, so that controller-gen.kubebuilder.io/version
on the generated objects doesn't decrease.

Remove the timestamp-based updated-policy check: LastTransitionTime
precision is in seconds and the whole test completes in under a second,
so all timestamps will be the same across multiple updates. Checking
the expected policy state is enough for that test.

Update unit tests to check Status.Messages instead of Status.Status.

Add distributed status management for EgressFirewall.
Add a Status.Messages field to record statuses from zones, and
make the egressfirewall status a subresource.

Be careful: this requires CRD and status-subresource permission changes.

@coveralls

coveralls commented Sep 30, 2023

Coverage Status

coverage: 50.718% (+0.2%) from 50.475%
when pulling 65ab2b2 on npinaeva:apbroute-status
into ac6820d on ovn-org:master.

@npinaeva
Member Author

caught #3924

@npinaeva npinaeva closed this Oct 13, 2023
@npinaeva npinaeva reopened this Oct 13, 2023
@npinaeva npinaeva changed the title Apbroute merge status StatusManager: consolidate status updates from different zones Oct 18, 2023
@jcaamano
Contributor

This is missing a bit of documentation somewhere on the general approach, to guide future implementations for other resource statuses. It might be confusing that, while we could be doing different things for apb and ef because the code structure supports it, we are doing basically the same thing for both.

@npinaeva npinaeva force-pushed the apbroute-status branch 3 times, most recently from d09c153 to 299ca33 Compare November 2, 2023 10:28
// This label is needed to allow nodes some time to get/restore their zone label without all their
// status messages being removed.
UnknownZone = "unknown"
unknownZoneTimeout = 30 * time.Second
Contributor

Does this account for the total time we would potentially spend retrying to annotate the zone on the node? Would it make sense for it to account for that total time? I guess we can be generous here? If it does, or should, should it be tied to a global constant?

go-controller/pkg/ovn/egressfirewall.go (review thread resolved, outdated)
go-controller/go.mod (review thread resolved)
test/e2e/status_manager.go (review thread resolved)
@npinaeva npinaeva force-pushed the apbroute-status branch 3 times, most recently from d6c5c9b to 325bb04 Compare November 23, 2023 15:16
Contributor

@jcaamano jcaamano left a comment

Mostly an initial pass on the new level-driven controller. Ongoing...

@npinaeva npinaeva force-pushed the apbroute-status branch 2 times, most recently from 7d8d811 to be43a46 Compare November 24, 2023 11:00
go-controller/pkg/types/resource_status.go (review thread resolved, outdated)
}

// now calculate the accumulated status.
// if not all zones reported status, clean it up, since the status is considered unknown until all zones report results.
Contributor

Are we sure about this?
Let's say we have 2 zones: one has reported an error and the other has not reported anything.
Can't we conclude that the overall status is an error, regardless of what the missing zone ends up reporting?
If not, could we explain further why we would like to wait for all zones to report status?

Member Author

Yeah, we can set a failed status without having all zone messages; I just didn't think it was important enough to reconcile for that case, but I can add it.

Member Author

@npinaeva npinaeva Nov 24, 2023

I also didn't want to make resourceManagers zone-aware, so I will need to pass an extra flag to only apply the status if it is a failure. Does that sound fine?

Contributor

Yeah, since it is private to the package it is not a big deal. I guess it wouldn't be a big deal either to pass in the zones that the resourceManagers should expect a message from, so they can filter out messages that should not be taken into account.
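The consolidation rule this thread converges on (report a failure as soon as any zone reports one, otherwise treat the status as unknown until every zone has reported) can be sketched as follows. All names here are hypothetical and simplified; the PR's actual code deals in typed status messages:

```go
package main

import "fmt"

// aggregateStatus consolidates per-zone status messages into one
// object-level status. A failure in any zone is reported immediately;
// success is only reported once every expected zone has checked in.
// (Hypothetical sketch; names are not from the PR.)
func aggregateStatus(zoneMessages map[string]string, expectedZones int) string {
	for _, msg := range zoneMessages {
		if msg == "failure" {
			return "failure" // no need to wait for the remaining zones
		}
	}
	if len(zoneMessages) < expectedZones {
		return "unknown" // some zones have not reported yet
	}
	return "success"
}

func main() {
	fmt.Println(aggregateStatus(map[string]string{"zone-a": "failure"}, 2))
	fmt.Println(aggregateStatus(map[string]string{"zone-a": "success"}, 2))
	fmt.Println(aggregateStatus(map[string]string{"zone-a": "success", "zone-b": "success"}, 2))
}
```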

@npinaeva
Copy link
Member Author

npinaeva commented Nov 27, 2023

Gist of the changes:

  • renamed EgressFirewallApplyError to EgressFirewallErrorMsg
  • renamed level_driven_controller to controller
  • left only one Start() method for Controller and an interface
  • unified Config and HandlerFuncs in the controller pkg into one generic type that is passed to the constructor
  • added timestamp-based tracking for unknown zones in the zone tracker

Latest diff:

  • update status on failure without waiting for all zones

Comment on lines +173 to +183
go func() {
	select {
	case <-zt.stopChan:
		return
	case <-time.After(zt.unknownZoneTimeout):
		zt.checkUnknownNodeTimeout(nodeName)
	}
}()
Contributor

I associated the idea of a timestamp with the idea of having a single persistent thread checking zt.unknownZoneNodes as long as it wasn't empty.

I hope that with your approach time.After(zt.unknownZoneTimeout) is really precise, and not off by a few nanoseconds, so that time.Since(timestamp) >= zt.unknownZoneTimeout is guaranteed after it triggers.

Member Author

Yeah, I figured a timer per node is the only way to have a very predictable timeout for removing an unknown zone.
I think taking the timestamp before starting the timer should guarantee that, when the timer triggers, the elapsed time exceeds the timeout, but I can add an extra second to be sure?

Contributor

I read around that timers might not be that accurate depending on the environment:
https://stackoverflow.com/questions/51415965/about-the-accuracy-of-the-time-timer
I haven't had time to find a formal reference.

The timeout for that persistent thread to start evaluating zt.unknownZoneNodes again could be based on the difference between the current time and the earliest expiration found in zt.unknownZoneNodes on the last evaluation.

You could add that extra second as well.

Member Author

I think the timer-inaccuracy problem will be present in both cases, so I just added a small delta to the condition check; I think that should be sufficient.

Contributor

Yeah, it is something to consider in both alternatives.

I guess what my alternative has going for it is a single thread versus multiple threads, as well as a more reassuring way of keeping track that we eventually process everything we need to from zt.unknownZoneNodes.

But I will take your approach as well.

	ReconcileAll()
}

type Config[T any] struct {
Contributor

nit: It would be more consistent if InitialSync were kept defined in this struct

Member Author

Right, I should add a comment for that. Considering this controller can be extended to handle multiple resources (each resource will be created with its own Config), InitialSync should be called only once (not per resource) before the workers are started (e.g. https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/egressservice/egressservice_zone.go#L229).
So I kept it separate to make fewer changes in the future, but I can also make it part of the Config for now if you'd like that more?

Contributor

I think the future controller is bound to have a global config plus some other config per resource. So we could just have InitialSync in Config, which is where it would currently make more sense, and then decide how to split that Config in the future.

But also no problem if it stays where it is; it's just that in the current form of the controller it looks like it should be placed in Config.
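For illustration, a generic Config carrying a per-resource InitialSync, as suggested above, might be sketched like this (hypothetical field and constructor names; not the PR's actual definition):

```go
package main

import "fmt"

// Config bundles everything a level-driven controller needs for one
// resource type; T is the resource being reconciled.
// (Sketch only; the PR's actual Config fields differ.)
type Config[T any] struct {
	Name        string
	InitialSync func() error       // runs once before workers start
	Reconcile   func(key string) error
}

type controller[T any] struct {
	cfg Config[T]
}

// NewController takes all handler funcs and options through one
// generic Config, as discussed in the thread.
func NewController[T any](cfg Config[T]) *controller[T] {
	return &controller[T]{cfg: cfg}
}

func (c *controller[T]) Start(threadiness int) error {
	if c.cfg.InitialSync != nil {
		if err := c.cfg.InitialSync(); err != nil {
			return err
		}
	}
	// worker startup elided
	return nil
}

type egressFirewall struct{}

func main() {
	c := NewController(Config[egressFirewall]{
		Name:        "egressfirewall-status",
		InitialSync: func() error { return nil },
		Reconcile:   func(key string) error { return nil },
	})
	fmt.Println(c.Start(1) == nil)
}
```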

type Controller interface {
	Start(threadiness int) error
	Stop()
	ReconcileAll()
Contributor

nit: ReconcileAll looks a bit out of place now. But I guess this should evolve to have a Reconcile as well as some other methods to enable watching different types of resources, so it's not bad from that perspective.

Member Author

I expected ReconcileAll() to be moved to the resource handler interface once one controller allows multiple resources, so that you can specify which resource should be reconciled, if that makes sense.

jcaamano previously approved these changes Nov 29, 2023
Update `update-codegen` script to always install the
latest controller-gen, so that `controller-gen.kubebuilder.io/version`
on the generated objects doesn't decrease. Also overwrite the v1/apis
folder for every CRD to ensure the latest version is applied
(deleting is required to ensure stale files are removed).

Signed-off-by: Nadia Pinaeva <[email protected]>
the centralized status of an object, based on zone-specific statuses.
Created as a part of cluster manager, handles only apbroutepolicy
objects for now.

Update AdminPolicyBasedRouteStatus.Messages to allow patching with
merge strategy.

Remove updated policy check based on timestamp,
since LastTransitionTime precision is in seconds, and the whole test
takes less than a second to complete, therefore all timestamps will
be the same for multiple updates. Just checking expected policy state
is enough for that test.

Update unit tests to check Status.Messages instead of Status.Status

Signed-off-by: Nadia Pinaeva <[email protected]>
Add Status.Messages field to record statuses from zones,
make egressfirewall status a subresource.

Signed-off-by: Nadia Pinaeva <[email protected]>
all zones have reported their statuses.
To do so, ZoneTracker was added to StatusManager, which tracks
existing zones and notifies its subscriber about zone changes.
StatusManager will also clean up status messages left by deleted zones.

Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
@jcaamano jcaamano merged commit 3d5a949 into ovn-org:master Nov 29, 2023
29 checks passed
@npinaeva npinaeva deleted the apbroute-status branch November 29, 2023 15:48
5 participants