StatusManager: consolidate status updates from different zones #3750
Conversation
caught #3924
Missing a bit of documentation somewhere on the general approach that guides future implementations for other resource statuses. It might be confusing that, while we could be doing different things for apb and ef because the code structure supports it, we are doing basically the same thing for both.
// This label is needed to allow nodes some time to get/restore their zone label without all their
// status messages being removed.
UnknownZone = "unknown"
unknownZoneTimeout = 30 * time.Second
Does this account for the total time we would potentially be retrying to annotate the zone on the node? Would it make sense for it to account for that total time? I guess we can be generous here. If it does, or if it should, should it be tied to a global constant?
Mostly an initial pass on the new level-driven controller. Ongoing...
}

// now calculate accumulated status.
// if not all zones reported status, clean it up, since the status is considered unknown until all zones report results.
Are we sure about this?
Let's say we have 2 zones, one has reported an error and the other has not reported anything.
Can't we conclude that the overall status is an error regardless of what the missing zone ends up reporting?
If not, could we explain further why we would like to wait for all zones to report status?
yeah, we can set a failed status without having all zone messages; I just didn't think it was important enough to reconcile for that case, but I can add it
I also didn't want to make resourceManagers zone-aware, so I will need to pass an extra flag to only apply the status if it is a failure; does that sound fine?
Yeah, since it is private to the package it is not a big deal. I guess it also wouldn't be a big deal to pass the resourceManagers the zones they should expect a message from, so they can filter out messages that should not be taken into account.
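A minimal sketch of the idea discussed in this thread, using hypothetical names (accumulateStatus, expectedZones) rather than the PR's actual resourceManager code: a failure from any expected zone decides the overall status immediately, while success is only applied once every zone has reported.

```go
package main

import (
	"fmt"
	"strings"
)

// accumulateStatus is an illustrative sketch, not the PR's code: a failure
// from any expected zone is applied right away, while a success status is
// only applied once all expected zones have reported.
func accumulateStatus(expectedZones []string, zoneMessages map[string]string) (status string, apply bool) {
	reported := 0
	for _, zone := range expectedZones {
		msg, ok := zoneMessages[zone]
		if !ok {
			continue
		}
		reported++
		if strings.Contains(msg, "error") { // stand-in for a real failure check
			return msg, true // one failing zone decides the overall status
		}
	}
	if reported < len(expectedZones) {
		return "", false // success is unknown until all zones report
	}
	return "Success", true
}

func main() {
	zones := []string{"zone-a", "zone-b"}
	fmt.Println(accumulateStatus(zones, map[string]string{"zone-a": "error: no route"}))    // failure applied early
	fmt.Println(accumulateStatus(zones, map[string]string{"zone-a": "ok"}))                 // success still unknown
	fmt.Println(accumulateStatus(zones, map[string]string{"zone-a": "ok", "zone-b": "ok"})) // success applied
}
```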
Changes gist:
Last Diff:
go func() {
	select {
	case <-zt.stopChan:
		return
	case <-time.After(zt.unknownZoneTimeout):
		zt.checkUnknownNodeTimeout(nodeName)
	}
}()
I associated the idea of a timestamp with the idea of having a single persistent thread checking zt.unknownZoneNodes as long as it wasn't empty.
I hope that with your approach time.After(zt.unknownZoneTimeout) is really precise, and not off by a few nanoseconds, so that it is guaranteed that time.Since(timestamp) >= zt.unknownZoneTimeout after it triggers.
yeah, I figured having a timer per node is the only way to have a very predictable timeout for the unknown zone to be removed.
I think getting the timestamp before running the timer should guarantee that when the timer is triggered, time.Since(timestamp) > timeout, but I can add an extra second to be sure?
I read around that timers might not be that accurate depending on the environment:
https://stackoverflow.com/questions/51415965/about-the-accuracy-of-the-time-timer
I haven't had time to find a formal reference.
The timeout for that persistent thread to start evaluating zt.unknownZoneNodes again could be based on the diff between the current time and the earliest expiration found in zt.unknownZoneNodes on the last evaluation.
You can add that extra second as well.
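A minimal sketch of that single-goroutine alternative, with illustrative names (wakeChan, runUnknownZoneChecker) that are not part of the PR: one persistent goroutine expires timed-out nodes and then sleeps until the earliest remaining expiration, rather than starting a timer per node.

```go
package main

import (
	"sync"
	"time"
)

// zoneTracker here is a stand-in with only the fields this sketch needs.
type zoneTracker struct {
	sync.Mutex
	unknownZoneNodes   map[string]time.Time // node name -> time it entered the unknown zone
	unknownZoneTimeout time.Duration
	stopChan           chan struct{}
	wakeChan           chan struct{} // poked when a new unknown-zone node is added
}

// runUnknownZoneChecker is a single persistent goroutine: on every wake-up it
// expires nodes whose timeout has passed, then sleeps until the earliest
// remaining expiration, or until it is poked or stopped.
func (zt *zoneTracker) runUnknownZoneChecker() {
	for {
		zt.Lock()
		now := time.Now()
		wait := zt.unknownZoneTimeout // upper bound when nothing is pending
		for node, since := range zt.unknownZoneNodes {
			remaining := zt.unknownZoneTimeout - now.Sub(since)
			if remaining <= 0 {
				// timed out: treat the node's zone as gone and clean up its statuses
				delete(zt.unknownZoneNodes, node)
				continue
			}
			if remaining < wait {
				wait = remaining
			}
		}
		zt.Unlock()
		select {
		case <-zt.stopChan:
			return
		case <-zt.wakeChan: // a new unknown-zone node was added, re-evaluate
		case <-time.After(wait):
		}
	}
}

func main() {
	zt := &zoneTracker{
		unknownZoneNodes:   map[string]time.Time{"node-1": time.Now()},
		unknownZoneTimeout: 2 * time.Second,
		stopChan:           make(chan struct{}),
		wakeChan:           make(chan struct{}, 1),
	}
	go zt.runUnknownZoneChecker()
	time.Sleep(3 * time.Second) // give the checker time to expire node-1
	close(zt.stopChan)
}
```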
I think the timer inaccuracy problem will be present in both cases, so I just added a little delta to the condition check; I think that should be sufficient.
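A small sketch of what such a delta in the condition check could look like; the helper name and the one-second value are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// unknownZoneExpired tolerates small timer drift when checking whether a node
// has spent the full timeout in the unknown zone.
func unknownZoneExpired(enteredUnknownAt time.Time, unknownZoneTimeout time.Duration) bool {
	const timerInaccuracyDelta = time.Second // assumed value, the real delta may differ
	return time.Since(enteredUnknownAt)+timerInaccuracyDelta >= unknownZoneTimeout
}

func main() {
	entered := time.Now().Add(-30 * time.Second)
	fmt.Println(unknownZoneExpired(entered, 30*time.Second)) // true, even if the timer fired slightly early
}
```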
Yeah, it is something to consider in both alternatives.
I guess what my alternative has going for it is a single thread vs multiple threads, as well as a more reassuring way to keep track that we eventually process everything we need to from zt.unknownZoneNodes.
But I will take your approach as well.
	ReconcileAll()
}

type Config[T any] struct {
nit: It would be more consistent if InitialSync was kept defined in this struct.
right, I should add a comment for that. Considering this controller can be extended to handle multiple resources (each resource will be created with a Config), InitialSync should be called only once (not per-resource) before workers are started (e.g. https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/controller/egressservice/egressservice_zone.go#L229).
So I kept it separate to make fewer changes in the future, but I can also make it a part of the Config for now if you'd like that more?
I think the future controller is bound to have a global config and some other config per resource. So we could just have InitialSync in Config, which is where it currently makes more sense, and then decide how to split that Config in the future.
But also no problem if it stays where it is, just that in the current form of the controller it looks like it should be placed in Config.
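For illustration, a hypothetical sketch of Config[T] with InitialSync kept inside it; apart from InitialSync, the fields are invented placeholders, not the controller's real definition.

```go
package leveldrivencontroller

// Config is a sketch of a per-resource configuration that also carries the
// one-time sync hook; field names other than InitialSync are illustrative.
type Config[T any] struct {
	// ObjNeedsUpdate decides whether a changed object should be queued.
	ObjNeedsUpdate func(oldObj, newObj T) bool
	// Reconcile is called for every queued key.
	Reconcile func(key string) error
	// InitialSync runs once before the workers start; once a controller
	// handles several resources it would likely move to a shared, global config.
	InitialSync func() error
}
```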
type Controller interface {
	Start(threadiness int) error
	Stop()
	ReconcileAll()
nit: ReconcileAll looks a bit out of place now. But I guess this should evolve to have a Reconcile as well as some other methods to enable watching different types of resources, so not bad from that perspective.
I expected ReconcileAll() to be moved to the resource handler interface once one controller allows multiple resources, so that you can specify which resource should be reconciled, if that makes sense.
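A hypothetical sketch of that possible evolution, with invented interface names: the controller keeps only lifecycle methods, while reconciliation moves to a per-resource handler.

```go
package leveldrivencontroller

// ResourceHandler is an illustrative per-resource interface, not the PR's code.
type ResourceHandler interface {
	// ReconcileAll re-queues every object of this resource type.
	ReconcileAll()
	// Reconcile re-queues a single object by key.
	Reconcile(key string)
}

// Controller keeps only lifecycle methods in this sketch.
type Controller interface {
	Start(threadiness int) error
	Stop()
}
```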
Update `update-codegen` script to always install the latest controller-gen, so that `controller-gen.kubebuilder.io/version` on the generated objects doesn't decrease. Also overwrite the v1/apis folder for every CRD to ensure the latest version is applied (deleting is required to ensure stale files are removed). Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
Create StatusManager - a centralized component responsible for updating the centralized status of an object, based on zone-specific statuses. Created as a part of cluster manager, handles only apbroutepolicy objects for now. Update AdminPolicyBasedRouteStatus.Messages to allow patching with merge strategy. Remove the updated-policy check based on timestamp, since LastTransitionTime precision is in seconds and the whole test takes less than a second to complete, therefore all timestamps will be the same for multiple updates; just checking the expected policy state is enough for that test. Update unit tests to check Status.Messages instead of Status.Status. Signed-off-by: Nadia Pinaeva <[email protected]>
Add Status.Messages field to record statuses from zones, make egressfirewall status a subresource. Signed-off-by: Nadia Pinaeva <[email protected]>
all zones have reported their statuses. To do so, ZoneTracker was added to StatusManager, which tracks existing zones and notifies its subscriber about zone changes. StatusManager will also clean up status messages left by deleted zones. Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
Create StatusManager - a centralized component responsible for updating
the centralized status of an object, based on zone-specific statuses.
Created as a part of cluster manager, handles only apbroutepolicy
objects for now.
Update AdminPolicyBasedRouteStatus.Messages to allow patching with
merge strategy. Update `update-codegen` script to always install the
latest controller-gen, so that `controller-gen.kubebuilder.io/version`
on the generated object doesn't decrease.
Remove updated policy check based on timestamp,
since LastTransitionTime precision is in seconds, and the whole test
takes less than a second to complete, therefore all timestamps will
be the same for multiple updates. Just checking expected policy state
is enough for that test.
Update unit tests to check Status.Messages instead of Status.Status
Add distributed status management for EgressFirewall.
Add Status.Messages field to record statuses from zones,
make egressfirewall status a subresource
Be careful: requires CRD and status subresource permission change
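For readers unfamiliar with the CRD side, a rough sketch of the general kubebuilder pattern this refers to; the type below is a simplified stand-in, and the markers shown are one common way to get a status subresource and a mergeable list, not necessarily exactly what the PR uses.

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EgressFirewall here is a simplified stand-in, only to show the markers involved.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type EgressFirewall struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Spec omitted for brevity.
	Status EgressFirewallStatus `json:"status,omitempty"`
}

type EgressFirewallStatus struct {
	// Status is the overall, accumulated status.
	Status string `json:"status,omitempty"`
	// Messages collects one status entry per zone. Declaring the list as a
	// set lets each zone patch in its own entry without overwriting others.
	// +listType=set
	Messages []string `json:"messages,omitempty"`
}
```

The `+kubebuilder:subresource:status` marker is what makes status a subresource in the generated CRD, which is also why the description warns about CRD and status-subresource permission changes.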