
StatusManager: consolidate status updates from different zones #3750

Merged
7 commits merged on Nov 29, 2023

Conversation

npinaeva
Member

@npinaeva npinaeva commented Jul 6, 2023

Create StatusManager, a centralized component responsible for updating
the consolidated status of an object based on zone-specific statuses.
Created as part of the cluster manager, it handles only apbroutepolicy
objects for now.

Update AdminPolicyBasedRouteStatus.Messages to allow patching with
the merge strategy. Update the update-codegen script to always install
the latest controller-gen, so that controller-gen.kubebuilder.io/version
on the generated objects doesn't decrease.

Remove the timestamp-based updated-policy check: LastTransitionTime
precision is in seconds and the whole test completes in under a second,
so all timestamps will be the same across multiple updates. Checking
the expected policy state is enough for that test.

Update unit tests to check Status.Messages instead of Status.Status.

Add distributed status management for EgressFirewall.
Add a Status.Messages field to record statuses from zones, and
make the egressfirewall status a subresource.

Be careful: this requires CRD and status-subresource permission changes.

@coveralls

coveralls commented Sep 30, 2023

Coverage Status

coverage: 50.718% (+0.2%) from 50.475%
when pulling 65ab2b2 on npinaeva:apbroute-status
into ac6820d on ovn-org:master.

@npinaeva
Member Author

caught #3924

@npinaeva npinaeva closed this Oct 13, 2023
@npinaeva npinaeva reopened this Oct 13, 2023
@npinaeva npinaeva changed the title Apbroute merge status StatusManager: consolidate status updates from different zones Oct 18, 2023
@jcaamano
Contributor

This is missing a bit of documentation somewhere on the general approach, to guide future implementations for other resource statuses. It might be confusing that, while we could be doing different things for apb and ef because the code structure supports it, we are doing basically the same thing for both.

@npinaeva npinaeva force-pushed the apbroute-status branch 3 times, most recently from d09c153 to 299ca33 Compare November 2, 2023 10:28
// This label is needed to allow nodes some time to get/restore their zone label without all their
// status messages being removed.
UnknownZone = "unknown"
unknownZoneTimeout = 30 * time.Second
Contributor

Does this account for the total time we would potentially spend retrying to annotate the zone on the node? Would it make sense for it to account for that total time? I guess we can be generous here? If it does, or should, should it be tied to a global constant?

go-controller/pkg/ovn/egressfirewall.go (review thread resolved, outdated)
go-controller/go.mod (review thread resolved)
test/e2e/status_manager.go (review thread resolved)
@npinaeva npinaeva force-pushed the apbroute-status branch 3 times, most recently from d6c5c9b to 325bb04 Compare November 23, 2023 15:16
Contributor

@jcaamano jcaamano left a comment

Mostly an initial pass on the new level-driven controller. Ongoing...

@npinaeva npinaeva force-pushed the apbroute-status branch 2 times, most recently from 7d8d811 to be43a46 Compare November 24, 2023 11:00
go-controller/pkg/types/resource_status.go (review thread resolved, outdated)
}

// now calculate the accumulated status.
// if not all zones reported status, clean it up, since the status is considered unknown until all zones report results.
Contributor

Are we sure about this?
Let's say we have 2 zones: one has reported an error and the other has not reported anything.
Can't we conclude that the overall status is an error, regardless of what the missing zone ends up reporting?
If not, could we explain further why we would like to wait for all zones to report status?

Member Author

Yeah, we can set a failed status without having all zone messages; I just didn't think it was important enough to reconcile for that case, but I can add it.

Member Author

@npinaeva npinaeva Nov 24, 2023

I also didn't want to make resourceManagers zone-aware, so I will need to pass an extra flag to only apply the status if it is a failure. Does that sound fine?

Contributor

Yeah, since it is private to the package it is not a big deal. I guess it wouldn't be a big deal either to pass in the zones that the resourceManagers should expect a message from, so they can filter out messages that should not be taken into account.
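The consolidation rule this thread converges on (report a failure as soon as any zone reports one, otherwise treat the status as unknown until every zone has reported) can be sketched as follows. All names here are hypothetical and simplified; the PR's actual code deals in typed status messages:

```go
package main

import "fmt"

// aggregateStatus consolidates per-zone status messages into one
// object-level status. A failure in any zone is reported immediately;
// success is only reported once every expected zone has checked in.
// (Hypothetical sketch; names are not from the PR.)
func aggregateStatus(zoneMessages map[string]string, expectedZones int) string {
	for _, msg := range zoneMessages {
		if msg == "failure" {
			return "failure" // no need to wait for the remaining zones
		}
	}
	if len(zoneMessages) < expectedZones {
		return "unknown" // some zones have not reported yet
	}
	return "success"
}

func main() {
	fmt.Println(aggregateStatus(map[string]string{"zone-a": "failure"}, 2))
	fmt.Println(aggregateStatus(map[string]string{"zone-a": "success"}, 2))
	fmt.Println(aggregateStatus(map[string]string{"zone-a": "success", "zone-b": "success"}, 2))
}
```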

@npinaeva
Copy link
Member Author

npinaeva commented Nov 27, 2023

Gist of the changes:

  • renamed EgressFirewallApplyError to EgressFirewallErrorMsg
  • renamed level_driven_controller to controller
  • left only one Start() method for Controller and an interface
  • unified Config and HandlerFuncs in the controller pkg into one generic type that is passed to the constructor
  • added timestamp-based tracking for unknown zones in the zone tracker

Latest diff:

  • update status on failure without waiting for all zones

Comment on lines +173 to +183
go func() {
	select {
	case <-zt.stopChan:
		return
	case <-time.After(zt.unknownZoneTimeout):
		zt.checkUnknownNodeTimeout(nodeName)
	}
}()
Contributor

I associated the idea of a timestamp with the idea of having a single persistent thread checking zt.unknownZoneNodes as long as it wasn't empty.

I hope that with your approach time.After(zt.unknownZoneTimeout) is really precise, and not off by a few nanoseconds, so that time.Since(timestamp) >= zt.unknownZoneTimeout is guaranteed after it triggers.

Member Author

Yeah, I figured a timer per node is the only way to have a very predictable timeout for removing an unknown zone.
I think taking the timestamp before starting the timer should guarantee that, when the timer triggers, the elapsed time exceeds the timeout, but I can add an extra second to be sure?

Contributor

I read around that timers might not be that accurate depending on the environment:
https://stackoverflow.com/questions/51415965/about-the-accuracy-of-the-time-timer
I haven't had time to find a formal reference.

The timeout for that persistent thread to start evaluating zt.unknownZoneNodes again could be based on the difference between the current time and the earliest expiration found in zt.unknownZoneNodes on the last evaluation.

You could add that extra second as well.

Member Author

I think the timer-inaccuracy problem will be present in both cases, so I just added a small delta to the condition check; I think that should be sufficient.

Contributor

Yeah, it is something to consider in both alternatives.

I guess what my alternative has going for it is a single thread versus multiple threads, as well as a more reassuring way of keeping track that we eventually process everything we need to from zt.unknownZoneNodes.

But I will take your approach as well.

	ReconcileAll()
}

type Config[T any] struct {
Contributor

nit: It would be more consistent if InitialSync were kept defined in this struct

Member Author

Right, I should add a comment for that. Considering this controller can be extended to handle multiple resources (each resource will be created with its own Config), InitialSync should be called only once (not per resource) before the workers are started (e.g. https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/egressservice/egressservice_zone.go#L229).
So I kept it separate to make fewer changes in the future, but I can also make it part of the Config for now if you'd like that more?

Contributor

I think the future controller is bound to have a global config plus some other config per resource. So we could just have InitialSync in Config, which is where it would currently make more sense, and then decide how to split that Config in the future.

But also no problem if it stays where it is; it's just that in the current form of the controller it looks like it should be placed in Config.
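For illustration, a generic Config carrying a per-resource InitialSync, as suggested above, might be sketched like this (hypothetical field and constructor names; not the PR's actual definition):

```go
package main

import "fmt"

// Config bundles everything a level-driven controller needs for one
// resource type; T is the resource being reconciled.
// (Sketch only; the PR's actual Config fields differ.)
type Config[T any] struct {
	Name        string
	InitialSync func() error       // runs once before workers start
	Reconcile   func(key string) error
}

type controller[T any] struct {
	cfg Config[T]
}

// NewController takes all handler funcs and options through one
// generic Config, as discussed in the thread.
func NewController[T any](cfg Config[T]) *controller[T] {
	return &controller[T]{cfg: cfg}
}

func (c *controller[T]) Start(threadiness int) error {
	if c.cfg.InitialSync != nil {
		if err := c.cfg.InitialSync(); err != nil {
			return err
		}
	}
	// worker startup elided
	return nil
}

type egressFirewall struct{}

func main() {
	c := NewController(Config[egressFirewall]{
		Name:        "egressfirewall-status",
		InitialSync: func() error { return nil },
		Reconcile:   func(key string) error { return nil },
	})
	fmt.Println(c.Start(1) == nil)
}
```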

type Controller interface {
	Start(threadiness int) error
	Stop()
	ReconcileAll()
Contributor

nit: ReconcileAll looks a bit out of place now. But I guess this should evolve to have a Reconcile as well as some other methods to enable watching different types of resources, so it's not bad from that perspective.

Member Author

I expected ReconcileAll() to be moved to the resource handler interface once one controller allows multiple resources, so that you can specify which resource should be reconciled, if that makes sense.

jcaamano previously approved these changes Nov 29, 2023
Update `update-codegen` script to always install the
latest controller-gen, so that `controller-gen.kubebuilder.io/version`
on the generated objects doesn't decrease. Also overwrite the v1/apis
folder for every CRD to ensure the latest version is applied
(deleting is required to ensure stale files are removed).

Signed-off-by: Nadia Pinaeva <[email protected]>
the centralized status of an object, based on zone-specific statuses.
Created as a part of cluster manager, handles only apbroutepolicy
objects for now.

Update AdminPolicyBasedRouteStatus.Messages to allow patching with
merge strategy.

Remove updated policy check based on timestamp,
since LastTransitionTime precision is in seconds, and the whole test
takes less than a second to complete, therefore all timestamps will
be the same for multiple updates. Just checking expected policy state
is enough for that test.

Update unit tests to check Status.Messages instead of Status.Status

Signed-off-by: Nadia Pinaeva <[email protected]>
Add Status.Messages field to record statuses from zones,
make egressfirewall status a subresource.

Signed-off-by: Nadia Pinaeva <[email protected]>
all zones have reported their statuses.
To do so, ZoneTracker was added to StatusManager, which tracks
existing zones and notifies its subscriber about zone changes.
StatusManager will also clean up status messages left by deleted zones.

Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
@jcaamano jcaamano merged commit 3d5a949 into ovn-org:master Nov 29, 2023
29 checks passed
@npinaeva npinaeva deleted the apbroute-status branch November 29, 2023 15:48
5 participants