Apbroute fix cache #3733
Conversation
Force-pushed from 3295d6a to 1c4fcec.
The only failure is https://github.com/ovn-org/ovn-kubernetes/actions/runs/5446728116/jobs/9908304517?pr=3733; not sure if it is related.
https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447602617/jobs/9910201194?pr=3733
https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447602617/jobs/9910199851?pr=3733
Force-pushed from 9940f0a to d3589d8.
Force-pushed from d3589d8 to 125f194.
Retest: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5455237516/jobs/9926404074?pr=3733
Force-pushed from 125f194 to cce672f.
dual-conversion for interconnect is flaking; logs look fine but the service connection fails. Everything else is green: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5461347981/jobs/9939728109?pr=3733
Force-pushed from cce672f to 7da3b38.
The only failure: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5462694377/jobs/9943389996?pr=3733
Resolved review threads (now outdated):
go-controller/pkg/ovn/controller/apbroute/external_controller_policy_test.go
go-controller/pkg/ovn/controller/apbroute/external_controller.go (4 threads)
go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go
I did not review all of the test case changes. I will leave that to @jordigilh.
Eventually(func() []string { return listNamespaceInfo() }, 5).Should(HaveLen(1))
Eventually(func() *namespaceInfo { return getNamespaceInfo(namespaceTest.Name) }, 5).Should(BeComparableTo(expected, cmpOpts...))
It("deletes an existing namespace with one policy and then creates it again and validates the policy has been applied to the new one with equal values", func() {
maybe the original one should have been defined in the delete context section instead; what do you think about moving it there?
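For illustration, here is a minimal Ginkgo sketch of what grouping the spec under a dedicated delete context could look like; the suite scaffolding and spec wording are assumptions, not the PR's actual test code:

```go
package apbroute_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestAPBRouteSketch(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "apbroute sketch suite")
}

var _ = Describe("external controller policy", func() {
	// Hypothetical grouping: delete-related specs live in their own Context,
	// so the create/recreate spec discussed above would move here.
	Context("when a namespace with one policy is deleted", func() {
		It("applies the policy with equal values once the namespace is recreated", func() {
			// Placeholder assertion; the real spec would reuse the suite's
			// listNamespaceInfo()/getNamespaceInfo() helpers quoted above.
			Expect(true).To(BeTrue())
		})
	})
})
```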
}
return diffStatic
}
if refObjs.targetNamespaces.Intersection(targetNsNames).Len() > 0 {
So the idea is that if a namespace is targeted by multiple policies, the event to update the policy will be requeued with no guarantee of success until it reaches the maximum retries and is then discarded.
Does it make sense to keep retrying when we know it will fail? Perhaps we can skip retrying in this case by defining a unique error type and comparing against it before requeuing?
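A minimal sketch of that suggestion, assuming a sentinel error type; all names here are hypothetical, not the PR's API:

```go
package apbroute

import "errors"

// errConflictingPolicy is a hypothetical sentinel marking "this namespace is
// already targeted by another policy", i.e. a failure that retrying alone
// cannot fix.
var errConflictingPolicy = errors.New("namespace is already targeted by another policy")

// shouldRequeue reports whether a failed key is worth retrying: anything but
// the conflicting-policy case gets requeued as usual.
func shouldRequeue(err error) bool {
	return err != nil && !errors.Is(err, errConflictingPolicy)
}
```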
it has a chance to succeed if another policy targeting this namespace is deleted or updated, and I think we even have a test that checks that case. It may be useful, e.g., if another policy was deleted but that delete event wasn't handled yet. wdyt?
If the current policy that is targeting the namespace changes, then the reconciliation in syncNamespace() should queue both policies. But now I realize that if the queue processes the failed policy first, the logic I just suggested will lead to a namespace without any policy, since there will be no retry and the second policy will be removed afterwards.
I guess we have to keep retrying for these kinds of situations.
it's even a bit worse :)
The namespace doesn't change in this scenario, so there will be no namespace events; and if the second policy targeting the same namespace can't be configured within 15 retries, it will be removed from the queue. Then even if another policy targeting that namespace is deleted (which means the second policy can now be configured), there will be no event to trigger the second policy update. But we can just document this, I guess, advising users e.g. to recreate the second policy based on its status message.
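For context, the retry behavior being discussed follows the standard client-go workqueue pattern; this is the generic pattern, not this PR's exact code. After maxRetries failures the key is forgotten, and only a fresh watch event re-enqueues it:

```go
package apbroute

import "k8s.io/client-go/util/workqueue"

const maxRetries = 15

// handleErr implements the usual controller retry loop: requeue with backoff
// until maxRetries is reached, then drop the key for good.
func handleErr(queue workqueue.RateLimitingInterface, key string, err error) {
	if err == nil {
		queue.Forget(key)
		return
	}
	if queue.NumRequeues(key) < maxRetries {
		// Retry with rate-limited backoff.
		queue.AddRateLimited(key)
		return
	}
	// Give up: after this, only a new add/update/delete event for the policy
	// will put it back on the queue, which is exactly the gap described above.
	queue.Forget(key)
}
```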
Test cases reviewed. I added two new comments.
Force-pushed from ab73aa6 to 9aa5be3.
First push addresses the comments; second push is a master rebase.
Apparently not waiting for apbroute.Run to return causes a unit test failure, but if we merge #3763 it should go away.
This reverts commit 926a1dc. Signed-off-by: Nadia Pinaeva <[email protected]>
Force-pushed from 9aa5be3 to 3d806e0.
and races with namespace and pod handlers. For the cache fix, add routePolicySyncCache that stores the latest state for every target pod and allows retries. For the races, add a policyReferencedObjects cache to let the policy handler share the referenced objects it used for the latest config. Signed-off-by: Nadia Pinaeva <[email protected]>
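As a rough sketch of the two caches this commit describes; every type and field name below is an assumption for illustration, not the PR's actual definitions:

```go
package apbroute

import (
	"sync"

	"k8s.io/apimachinery/pkg/util/sets"
)

// routePolicySyncCache (hypothetical shape): the latest desired gateway state
// per target pod, kept so failed updates can be retried from known state.
type routePolicySyncCache struct {
	sync.Mutex
	podState map[string]*podGatewayState // keyed by pod namespace/name
}

type podGatewayState struct {
	staticGateways  sets.Set[string] // gateway IPs from static hops
	dynamicGateways sets.Set[string] // gateway IPs from gateway pods
}

// policyReferencedObjects (hypothetical shape): the objects a policy's latest
// config referenced, shared with the namespace and pod handlers so they can
// tell which policies an event affects (cf. the Intersection check quoted
// earlier in the review).
type policyReferencedObjects struct {
	targetNamespaces  sets.Set[string]
	gatewayNamespaces sets.Set[string]
	gatewayPods       sets.Set[string]
}
```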
Now repair will initialize the policies cache by handling every existing policy, and return existing routes for future cleanup. Make sure Repair can return an error, because if it fails, some stale routes may be left in the system, and that can't be fixed by any controller. Fix the buildExternalIPGatewaysFromAnnotations function to only set dynamic gateway ips for pods in the target namespace instead of all pods. Signed-off-by: Nadia Pinaeva <[email protected]>
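A minimal sketch of the control flow that commit describes, with assumed function and parameter names: seed the cache by handling every existing policy, collect the routes seen for later cleanup, and propagate failures instead of swallowing them:

```go
package apbroute

import "fmt"

// repair is a hypothetical illustration: process every existing policy to
// initialize the cache, returning the routes observed so stale ones can be
// cleaned up later. Returning an error (rather than logging and moving on)
// matters because a failed repair can leave stale routes that no controller
// will ever fix.
func repair(policies []string, sync func(policy string) (routes []string, err error)) ([]string, error) {
	var existing []string
	for _, p := range policies {
		routes, err := sync(p)
		if err != nil {
			return nil, fmt.Errorf("repair failed for policy %s: %w", p, err)
		}
		existing = append(existing, routes...)
	}
	return existing, nil
}
```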
Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
Signed-off-by: Nadia Pinaeva <[email protected]>
For an ip without double quotes, the following error occurred: unable to unmarshall annotation on pod e2e-gateway-pod1 k8s.v1.cni.cncf.io/network-status '[{"name":"foo","interface":"net1", "ips":[172.18.0.5],"mac":"01:23:45:67:89:10"}]': invalid character '.' after array element. That caused all reconciliation on the apb route and external gateway side to fail, because they couldn't extract the ip for the pod. The checkAPBExternalRouteStatus call should help in that case to signal that no errors occurred (especially when we expect the config to stay the same). Signed-off-by: Nadia Pinaeva <[email protected]>
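A small runnable demonstration of that parse failure; the two annotation payloads come from the quoted error, the harness itself is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Unquoted IP: 172.18.0.5 is not a valid JSON value, so decoding stops
	// with `invalid character '.' after array element`.
	bad := `[{"name":"foo","interface":"net1","ips":[172.18.0.5],"mac":"01:23:45:67:89:10"}]`
	// Quoting the IP makes the annotation valid JSON.
	good := `[{"name":"foo","interface":"net1","ips":["172.18.0.5"],"mac":"01:23:45:67:89:10"}]`

	var v []map[string]any
	fmt.Println("bad: ", json.Unmarshal([]byte(bad), &v))  // prints the parse error
	fmt.Println("good:", json.Unmarshal([]byte(good), &v)) // prints <nil>
}
```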
gateway pod ip should be cleaned up. Signed-off-by: Nadia Pinaeva <[email protected]>
It used to fail parsing namespace annotations, because they were empty. Add RunAPBExternalPolicyController call to handle apbroutes. Signed-off-by: Nadia Pinaeva <[email protected]>
db state, simplify policy creation. Signed-off-by: Nadia Pinaeva <[email protected]>
init. Signed-off-by: Nadia Pinaeva <[email protected]>
[error]Artifact path is not valid: /ovn-control-plane/e2e-dbs/should_provide_Internet_connection_continuously_when_pod_running_master_instance_of_ovnkube-control-plane_is_killed"-nettest-9947/ovn-control-plane-conf.db. Contains the following character: Double quote " Signed-off-by: Nadia Pinaeva <[email protected]>
Force-pushed from 3d806e0 to f4822af.
1st commit reverts disabling unit tests.
There are 2 main fixes:
2nd commit fixes cache inconsistencies and races with namespace and pod handlers.
3rd commit fixes the repair logic.
All the other commits are minor/test/log fixes with separate descriptions.
The first commit is the biggest one; it brings a lot of changes to the controller logic, even though the final networkController logic should be the same.
TODO: