Apbroute fix cache #3733

npinaeva · 2023-06-30T16:41:36Z

1st commit reverts disabling unit tests

there are 2 main fixes:
2nd commit fixes cache inconsistencies and races with namespace and pod handlers
3rd commit fixes repair logic
all the other commits are minor/test/log fixes with separate descriptions.

The first of these fixes (2nd commit) is the biggest one; it brings a lot of changes to the controller logic, even though the final networkController logic should be the same.

TODO:

decide how to merge statuses from multiple ovnkube-controllers (StatusManager: consolidate status updates from different zones #3750)

coveralls · 2023-07-03T16:56:23Z

coverage: 52.657% (-0.3%) from 52.979% when pulling f4822af on npinaeva:apbroute-fix-cache into 86f9e1f on ovn-org:master.

npinaeva · 2023-07-03T18:33:22Z

the only failure is https://github.com/ovn-org/ovn-kubernetes/actions/runs/5446728116/jobs/9908304517?pr=3733

Summarizing 2 Failures:

[Fail] e2e IGMP validation [It] can retrieve multicast IGMP query 
/home/runner/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:113

[Fail] e2e IGMP validation [It] can retrieve multicast IGMP query 
/home/runner/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:113

not sure if it is related

npinaeva · 2023-07-04T08:42:11Z

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447602617/jobs/9910201194?pr=3733

Summarizing 3 Failures:

[Fail] Load Balancer Service Tests with MetalLB [It] Should ensure load balancer service works with pmtu 
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:885

[Fail] Services when a nodePort service targeting a pod with hostNetwork:false is created when tests are run towards the agnhost echo service [It] queries to the nodePort service shall work for TCP 
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:254

[Fail] Services when a nodePort service targeting a pod with hostNetwork:false is created when tests are run towards the agnhost echo service [It] queries to the nodePort service shall work for TCP 
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:254

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447602617/jobs/9910199851?pr=3733
e2e dual conversion timed out

npinaeva · 2023-07-04T13:23:28Z

1 green test run https://github.com/ovn-org/ovn-kubernetes/actions/runs/5454263510/jobs/9924146564?pr=3733

npinaeva · 2023-07-04T13:50:00Z

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5455237516/jobs/9926404074?pr=3733

Informer Event Handler Tests [It] adds existing pod and processes an update event

retest

npinaeva · 2023-07-04T16:04:45Z

2nd green run https://github.com/ovn-org/ovn-kubernetes/actions/runs/5455492186/jobs/9927010102?pr=3733

npinaeva · 2023-07-05T07:00:53Z

3rd green run https://github.com/ovn-org/ovn-kubernetes/actions/runs/5456669268/jobs/9930191049?pr=3733

npinaeva · 2023-07-05T09:31:19Z

dual-conversion for interconnect is flaking, logs look fine but service connection fails, everything else green https://github.com/ovn-org/ovn-kubernetes/actions/runs/5461347981/jobs/9939728109?pr=3733
will count this as 4th green run

npinaeva · 2023-07-05T13:42:59Z

only

 e2e control plane should provide Internet connection continuously when pod running master instance of ovnkube-control-plane is killed

failed https://github.com/ovn-org/ovn-kubernetes/actions/runs/5462694377/jobs/9943389996?pr=3733
no external gateway failures

go-controller/pkg/ovn/controller/apbroute/external_controller_policy_test.go

go-controller/pkg/ovn/controller/apbroute/external_controller.go

go-controller/pkg/ovn/controller/apbroute/master_controller.go

go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go

go-controller/pkg/ovn/controller/apbroute/external_controller.go

trozet · 2023-07-12T19:26:45Z

I did not review all of the test case changes. I will leave that to @jordigilh

jordigilh · 2023-07-13T08:20:39Z

go-controller/pkg/ovn/controller/apbroute/external_controller_namespace_test.go

-
-				Eventually(func() []string { return listNamespaceInfo() }, 5).Should(HaveLen(1))
-				Eventually(func() *namespaceInfo { return getNamespaceInfo(namespaceTest.Name) }, 5).Should(BeComparableTo(expected, cmpOpts...))
+			It("deletes an existing namespace with one policy and then creates it again and validates the policy has been applied to the new one with equal values", func() {


maybe the original one should have been defined in the delete context section instead, what do you think about moving it there?

jordigilh · 2023-07-13T08:30:19Z

go-controller/pkg/ovn/controller/apbroute/external_controller_policy.go

-	}
-	return diffStatic
-}
+		if refObjs.targetNamespaces.Intersection(targetNsNames).Len() > 0 {


So the idea is that if a namespace is being targeted by multiple policies, the event to update the policy will be requeued with no guarantee of success until it reaches the maximum number of retries, and it is then discarded.
Does it make sense to keep trying when we know that it will fail? Perhaps we can skip retrying in this case by defining a unique error type and checking for it before requeuing?

It has a chance to succeed if another policy targeting this namespace is deleted or updated, and I think we even have a test that checks that case. It may be useful e.g. if another policy was deleted but that delete event hasn't been handled yet, wdyt?

If the current policy that is targeting the namespace changes, then the reconciliation in the syncNamespace() should queue both policies. But now I realize that if the processing order in the queue is to first tackle the failed policy, using the logic I just suggested will lead to a namespace without any policy, since there will not be any retry and the second policy will be removed afterwards.

I guess we have to keep retrying for this kind of situation.

it's even a bit worse :)
The namespace doesn't change in this scenario, so there will be no namespace events. If the second policy targeting the same namespace can't be configured in 15 retries, it will be removed from the queue. Then, even if another policy targeting that namespace is deleted (which means the second policy can be configured now), there will be no events to trigger the second policy update. But we can just document this I guess, advising users e.g. to recreate the second policy based on the status message.
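
The trade-off discussed in this thread can be sketched in Go as follows. This is a minimal, hypothetical illustration, not the controller's actual code: `errNamespaceConflict`, `maxRetries`, `syncPolicy` and `shouldRequeue` are names invented here, and the limit of 15 is taken from the comment above. With `skipOnConflict` set, a conflicting policy is dropped immediately (the suggestion that was withdrawn); without it, the policy keeps retrying until the limit is reached.

```go
package main

import (
	"errors"
	"fmt"
)

// errNamespaceConflict is a hypothetical sentinel error: the namespace is
// already targeted by another policy.
var errNamespaceConflict = errors.New("namespace is already targeted by another policy")

// maxRetries is an assumed limit, matching the 15 retries mentioned in the thread.
const maxRetries = 15

// syncPolicy stands in for the policy handler; it wraps the sentinel error so
// callers can still detect it with errors.Is.
func syncPolicy(conflict bool) error {
	if conflict {
		return fmt.Errorf("failed to configure policy: %w", errNamespaceConflict)
	}
	return nil
}

// shouldRequeue reports whether a failed policy event goes back on the queue.
// skipOnConflict models the proposal to stop retrying on a known conflict.
func shouldRequeue(err error, retries int, skipOnConflict bool) bool {
	if err == nil || retries >= maxRetries {
		return false
	}
	if skipOnConflict && errors.Is(err, errNamespaceConflict) {
		return false
	}
	return true
}

func main() {
	err := syncPolicy(true)
	fmt.Println("skip on conflict:", shouldRequeue(err, 0, true))
	fmt.Println("keep retrying:", shouldRequeue(err, 0, false))
	fmt.Println("retries exhausted:", shouldRequeue(err, maxRetries, false))
}
```

In both variants the event is eventually dropped while the conflict persists, which is why the thread ends with documenting the behaviour rather than changing the retry logic.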

jordigilh · 2023-07-13T10:06:55Z

I did not review all of the test case changes. I will leave that to @jordigilh

Tests cases reviewed. I added two new comments.

npinaeva · 2023-07-13T13:40:53Z

first push addresses comment, second push is master rebase

npinaeva · 2023-07-13T16:59:04Z

apparently not waiting for apbroute.Run to return causes unit test failure, but if we merge this #3763 it should go away

and races with namespace and pod handlers. For the cache fix, add routePolicySyncCache, which stores the latest state for every target pod and allows retries. For races, add the policyReferencedObjects cache to allow the policy handler to share the referenced objects it used for the latest config. Signed-off-by: Nadia Pinaeva <[email protected]>

Now repair will initialize policies cache by handling every existing policy, and return existing routes for future cleanup. Make sure Repair can return error, because if it fails, some stale routes may be left in the system, and it can't be fixed by any controller. Fix buildExternalIPGatewaysFromAnnotations function to only set dynamic gateway ips for pod in the target namespace instead of all pods. Signed-off-by: Nadia Pinaeva <[email protected]>

For an IP without double quotes, the following error occurred: unable to unmarshall annotation on pod e2e-gateway-pod1 k8s.v1.cni.cncf.io/network-status '[{"name":"foo","interface":"net1", "ips":[172.18.0.5],"mac":"01:23:45:67:89:10"}]': invalid character '.' after array element. That caused all reconciliation on the apb route and external gateway side to fail, because they couldn't extract the IP for the pod. The checkAPBExternalRouteStatus call should help in that case to signal that no errors occurred (especially when we expect the config to stay the same). Signed-off-by: Nadia Pinaeva <[email protected]>
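
The unmarshalling failure described in this commit message can be reproduced with Go's standard `encoding/json` package. The sketch below is illustrative only: `networkStatus` and `parseNetworkStatus` are hypothetical names that mirror just the fields visible in the quoted annotation, not the types used by ovn-kubernetes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// networkStatus is a hypothetical struct covering only the fields that appear
// in the annotation quoted in the commit message.
type networkStatus struct {
	Name      string   `json:"name"`
	Interface string   `json:"interface"`
	IPs       []string `json:"ips"`
	Mac       string   `json:"mac"`
}

// parseNetworkStatus decodes the JSON value of a network-status annotation.
func parseNetworkStatus(annotation string) ([]networkStatus, error) {
	var statuses []networkStatus
	if err := json.Unmarshal([]byte(annotation), &statuses); err != nil {
		return nil, fmt.Errorf("unable to unmarshal network-status annotation: %w", err)
	}
	return statuses, nil
}

func main() {
	// The IP is not quoted, so the input is not valid JSON.
	unquoted := `[{"name":"foo","interface":"net1", "ips":[172.18.0.5],"mac":"01:23:45:67:89:10"}]`
	// The same annotation with the IP written as a JSON string.
	quoted := `[{"name":"foo","interface":"net1", "ips":["172.18.0.5"],"mac":"01:23:45:67:89:10"}]`

	if _, err := parseNetworkStatus(unquoted); err != nil {
		fmt.Println("unquoted:", err)
	}
	statuses, err := parseNetworkStatus(quoted)
	if err != nil {
		fmt.Println("quoted:", err)
		return
	}
	fmt.Println("quoted:", statuses[0].IPs)
}
```

Because the whole annotation fails to parse, no IP can be extracted for the pod at all, which matches the reconciliation failures described above.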

[error]Artifact path is not valid: /ovn-control-plane/e2e-dbs/should_provide_Internet_connection_continuously_when_pod_running_master_instance_of_ovnkube-control-plane_is_killed"-nettest-9947/ovn-control-plane-conf.db. Contains the following character: Double quote " Signed-off-by: Nadia Pinaeva <[email protected]>

npinaeva · 2023-07-15T11:35:08Z

OVN for APB External Route Operations on setting namespace gateway static hop reconciles deleting a pod with namespace double exgw static gateway already set IPV6 [It] BFD IPV6
failed https://github.com/ovn-org/ovn-kubernetes/actions/runs/5551758745/jobs/10138247045?pr=3733, can't reproduce it locally

npinaeva force-pushed the apbroute-fix-cache branch from 3295d6a to 1c4fcec Compare July 3, 2023 16:32

npinaeva closed this Jul 3, 2023

npinaeva reopened this Jul 3, 2023

npinaeva force-pushed the apbroute-fix-cache branch 2 times, most recently from 9940f0a to d3589d8 Compare July 4, 2023 11:39

npinaeva force-pushed the apbroute-fix-cache branch from d3589d8 to 125f194 Compare July 4, 2023 13:23

npinaeva closed this Jul 4, 2023

npinaeva reopened this Jul 4, 2023

npinaeva force-pushed the apbroute-fix-cache branch from 125f194 to cce672f Compare July 4, 2023 16:05

npinaeva marked this pull request as ready for review July 4, 2023 16:09

npinaeva requested review from trozet, dcbw, girishmg and jcaamano as code owners July 4, 2023 16:09

npinaeva changed the title ~~[WIP] Apbroute fix cache~~ Apbroute fix cache Jul 4, 2023

npinaeva closed this Jul 5, 2023

npinaeva reopened this Jul 5, 2023

npinaeva closed this Jul 5, 2023

npinaeva reopened this Jul 5, 2023

npinaeva force-pushed the apbroute-fix-cache branch from cce672f to 7da3b38 Compare July 5, 2023 13:34

jordigilh reviewed Jul 5, 2023

jordigilh reviewed Jul 13, 2023

npinaeva force-pushed the apbroute-fix-cache branch 2 times, most recently from ab73aa6 to 9aa5be3 Compare July 13, 2023 13:40

Revert "Disable UTs for APB Temporarily"

11f93d1

This reverts commit 926a1dc. Signed-off-by: Nadia Pinaeva <[email protected]>

npinaeva force-pushed the apbroute-fix-cache branch from 9aa5be3 to 3d806e0 Compare July 14, 2023 07:26

npinaeva closed this Jul 14, 2023

npinaeva reopened this Jul 14, 2023

npinaeva added 11 commits July 14, 2023 15:57

move gatewayInfoList to a sub-package to ensure correct usage

72ac3a5

Signed-off-by: Nadia Pinaeva <[email protected]>

allow External Gateway tests with more detailed focus.

105fb86

Signed-off-by: Nadia Pinaeva <[email protected]>

Update apbroute status with retry.

de79d4d

Signed-off-by: Nadia Pinaeva <[email protected]>

Fix deleting gateway pod with both CR and annotations. In this case

b07cae5

gateway pod ip should be cleaned up. Signed-off-by: Nadia Pinaeva <[email protected]>

fix external gateway test that cleans up namespace annotation.

e64f04d

It used to fail parsing namespace annotations, because they were empty. Add RunAPBExternalPolicyController call to handle apbroutes. Signed-off-by: Nadia Pinaeva <[email protected]>

Update external_gateway_apb_test.go to check policy status together with

65c2310

db state, simplify policy creation. Signed-off-by: Nadia Pinaeva <[email protected]>

Fix unit test cleanup: shutdown WatchFactory, call shutdown after every

4fef43f

init. Signed-off-by: Nadia Pinaeva <[email protected]>

npinaeva closed this Jul 15, 2023

npinaeva reopened this Jul 15, 2023

npinaeva force-pushed the apbroute-fix-cache branch from 3d806e0 to f4822af Compare July 17, 2023 08:29

trozet approved these changes Jul 20, 2023

trozet merged commit 5906681 into ovn-org:master Jul 20, 2023
24 of 25 checks passed

npinaeva deleted the apbroute-fix-cache branch July 20, 2023 14:48
