Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apbroute fix cache #3733

Merged
merged 12 commits into from
Jul 20, 2023
Merged

Apbroute fix cache #3733

merged 12 commits into from
Jul 20, 2023

Conversation

npinaeva
Copy link
Member

@npinaeva npinaeva commented Jun 30, 2023

1st commit reverts disabling unit tests

there are 2 main fixes:
2nd commit fixes cache inconsistencies and races with namespace and pod handlers
3rd commit fixes repair logic
all the other commit are minor/test/log fixes with separate descriptions.

The first commit is the biggest one, it brings a lot of changes to the controller logic, even though the final networkController logic should be the same.

TODO:

@coveralls
Copy link

coveralls commented Jul 3, 2023

Coverage Status

coverage: 52.657% (-0.3%) from 52.979% when pulling f4822af on npinaeva:apbroute-fix-cache into 86f9e1f on ovn-org:master.

@npinaeva
Copy link
Member Author

npinaeva commented Jul 3, 2023

the only failure is https://github.com/ovn-org/ovn-kubernetes/actions/runs/5446728116/jobs/9908304517?pr=3733

Summarizing 2 Failures:

[Fail] e2e IGMP validation [It] can retrieve multicast IGMP query 
/home/runner/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:113

[Fail] e2e IGMP validation [It] can retrieve multicast IGMP query 
/home/runner/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:113

not sure if it is related

@npinaeva npinaeva closed this Jul 3, 2023
@npinaeva npinaeva reopened this Jul 3, 2023
@npinaeva
Copy link
Member Author

npinaeva commented Jul 4, 2023

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447602617/jobs/9910201194?pr=3733

Summarizing 3 Failures:

[Fail] Load Balancer Service Tests with MetalLB [It] Should ensure load balancer service works with pmtu 
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:885

[Fail] Services when a nodePort service targeting a pod with hostNetwork:false is created when tests are run towards the agnhost echo service [It] queries to the nodePort service shall work for TCP 
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:254

[Fail] Services when a nodePort service targeting a pod with hostNetwork:false is created when tests are run towards the agnhost echo service [It] queries to the nodePort service shall work for TCP 
/home/runner/work/ovn-kubernetes/ovn-kubernetes/test/e2e/service.go:254

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5447602617/jobs/9910199851?pr=3733
e2e dual conversion timed out

@npinaeva npinaeva force-pushed the apbroute-fix-cache branch 2 times, most recently from 9940f0a to d3589d8 Compare July 4, 2023 11:39
@npinaeva
Copy link
Member Author

npinaeva commented Jul 4, 2023

@npinaeva
Copy link
Member Author

npinaeva commented Jul 4, 2023

https://github.com/ovn-org/ovn-kubernetes/actions/runs/5455237516/jobs/9926404074?pr=3733

Informer Event Handler Tests �[0m�[91m�[1m[It] adds existing pod and processes an update event

retest

@npinaeva npinaeva closed this Jul 4, 2023
@npinaeva npinaeva reopened this Jul 4, 2023
@npinaeva
Copy link
Member Author

npinaeva commented Jul 4, 2023

@npinaeva npinaeva marked this pull request as ready for review July 4, 2023 16:09
@npinaeva npinaeva changed the title [WIP] Apbroute fix cache Apbroute fix cache Jul 4, 2023
@npinaeva
Copy link
Member Author

npinaeva commented Jul 5, 2023

@npinaeva npinaeva closed this Jul 5, 2023
@npinaeva npinaeva reopened this Jul 5, 2023
@npinaeva
Copy link
Member Author

npinaeva commented Jul 5, 2023

dual-conversion for interconnect is flaking, logs look fine but service connection fails, everything else green https://github.com/ovn-org/ovn-kubernetes/actions/runs/5461347981/jobs/9939728109?pr=3733
will count this as 4th green run

@npinaeva
Copy link
Member Author

npinaeva commented Jul 5, 2023

only

 e2e control plane should provide Internet connection continuously when pod running master instance of ovnkube-control-plane is killed

failed https://github.com/ovn-org/ovn-kubernetes/actions/runs/5462694377/jobs/9943389996?pr=3733
no external gateway failures

@trozet
Copy link
Contributor

trozet commented Jul 12, 2023

I did not review all of the test case changes. I will leave that to @jordigilh


Eventually(func() []string { return listNamespaceInfo() }, 5).Should(HaveLen(1))
Eventually(func() *namespaceInfo { return getNamespaceInfo(namespaceTest.Name) }, 5).Should(BeComparableTo(expected, cmpOpts...))
It("deletes an existing namespace with one policy and then creates it again and validates the policy has been applied to the new one with equal values", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe the original one should have been defined in the delete context section instead, what do you think about moving it there?

}
return diffStatic
}
if refObjs.targetNamespaces.Intersection(targetNsNames).Len() > 0 {
Copy link
Contributor

@jordigilh jordigilh Jul 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is that if a namespace is being targeted by multiple namespaces, the event to update the policy will be requeued with no guarantee of success until it reaches the maximum retry and it is then discarded.
Does it make sense to keep it trying when we know that it will fail? Perhaps we can skip retrying in this case by defining a unique error type and comparing it against before requeuing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it has a chance to succeed if another policy targeting this namespace is deleted or updated, and I think we even have a test that checks that case. May be useful e.g. if another policy was deleted, but that delete event wasn't handled yet, wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the current policy that is targeting the namespace changes, then the reconciliation in the syncNamespace() should queue both policies. But now I realize that if the processing order in the queue is to first tackle the failed policy, using the logic I just suggested will lead to a namespace without any policy, since there will not be any retry and the second policy will be removed afterwards.

I guess we have to keep retrying for this kind of situations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's even a bit worse :)
namespace doesn't change in this scenario, so there will be no namespace events, and if second policy targeting the same namespace can't be configured in 15 retries, it will be removed from the queue, so then even if another policy targeting that namespace is deleted (which means the second policy can be configured now) there will be no events to trigger second policy update. But we can just document this I guess, advising users e.g. to recreate second policy based on status message

@jordigilh
Copy link
Contributor

I did not review all of the test case changes. I will leave that to @jordigilh

Tests cases reviewed. I added two new comments.

@npinaeva npinaeva force-pushed the apbroute-fix-cache branch 2 times, most recently from ab73aa6 to 9aa5be3 Compare July 13, 2023 13:40
@npinaeva
Copy link
Member Author

first push addresses comment, second push is master rebase

@npinaeva
Copy link
Member Author

apparently not waiting for apbroute.Run to return causes unit test failure, but if we merge this #3763 it should go away

This reverts commit 926a1dc.

Signed-off-by: Nadia Pinaeva <[email protected]>
and races with namespace and pod handlers.
For cache fix, add routePolicySyncCache that stores the lates
state for every target pod, and allows retries.
For races add policyReferencedObjects cache to allow
policy handler share the references objects it used for the
latest config.

Signed-off-by: Nadia Pinaeva <[email protected]>
Now repair will initialize policies cache by handling
every existing policy, and return existing routes for
future cleanup.
Make sure Repair can return error, because if it fails,
some stale routes may be left in the system, and it can't be
fixed by any controller.
Fix buildExternalIPGatewaysFromAnnotations function to only set
dynamic gateway ips for pod in the target namespace instead of all pods.

Signed-off-by: Nadia Pinaeva <[email protected]>
for ip without double quotes, the following error occurred:

unable to unmarshall annotation on pod e2e-gateway-pod1
k8s.v1.cni.cncf.io/network-status '[{"name":"foo","interface":"net1",
"ips":[172.18.0.5],"mac":"01:23:45:67:89:10"}]': invalid character '.'
after array element

that caused all reconciliation on apb route and external gateway side to
fail because they couldn't extract ip for pod.
checkAPBExternalRouteStatus call should help in that case to signal that
no errors occurred (especially when we expect config to stay the same)

Signed-off-by: Nadia Pinaeva <[email protected]>
gateway pod ip should be cleaned up.

Signed-off-by: Nadia Pinaeva <[email protected]>
It used to fail parsing namespace annotations, because they were empty.
Add RunAPBExternalPolicyController call to handle apbroutes.

Signed-off-by: Nadia Pinaeva <[email protected]>
db state, simplify policy creation.

Signed-off-by: Nadia Pinaeva <[email protected]>
[error]Artifact path is not valid: /ovn-control-plane/e2e-dbs/should_provide_Internet_connection_continuously_when_pod_running_master_instance_of_ovnkube-control-plane_is_killed"-nettest-9947/ovn-control-plane-conf.db. Contains the following character:  Double quote "

Signed-off-by: Nadia Pinaeva <[email protected]>
@npinaeva
Copy link
Member Author

OVN for APB External Route Operations �[0m�[0mon setting namespace gateway static hop �[0m�[90mreconciles deleting a pod with namespace double exgw static gateway already set IPV6 �[0m�[91m�[1m[It] BFD IPV6 �[0m
failed https://github.com/ovn-org/ovn-kubernetes/actions/runs/5551758745/jobs/10138247045?pr=3733, can't reproduce it locally

@trozet trozet merged commit 5906681 into ovn-org:master Jul 20, 2023
24 of 25 checks passed
@npinaeva npinaeva deleted the apbroute-fix-cache branch July 20, 2023 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants