[wip]: Add retry parameter to ovs-vsctl and ensure stale interfaces are removed #3784
base: master
Conversation
@trozet @dcbw @girishmg @jcaamano before I continue fixing the other ~1000 unit tests that are failing because of the added retry parameter, I would like to ask: what do you think about the code changes (2 commits)?
Signed-off-by: Dan Williams <[email protected]>
Used when we absolutely do not want to create the object (like a bridge) if it doesn't exist, because that should be a hard error, but just want to update it. Mostly to make sure testcases do the right thing, but we never actually want to create 'br-int' either, so it works for both testcases and real code. Signed-off-by: Dan Williams <[email protected]>
Complement to Lookup() for chained operations. Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
cmdAdd/cmdDel use similar setup code; consolidate it. Also make cmdAdd use the same error logging mechanism as cmdDel does. Signed-off-by: Dan Williams <[email protected]>
@rchicoli note that I'm trying to remove ovs-vsctl usage via #3616, but I assume similar logic could be done with the new libovsdb-based workflow? Equivalent code would be 6a0b7ca#diff-01a2ef868473804cd3d9b894b093e5be0735cfb21bf27f375ece3a33cc07edc9R388. The port lookup shouldn't need a retry since it's coming directly from the internal cache, but the […] Bonus: you don't have to fix unit tests at all if you rebase onto my PR :)
That is great news @dcbw, and these changes are very promising. Give me a shout once it is ready to release, then I could run a stress-test tool again. Btw, I didn't get the idea behind the rebase onto your branch node-libovsdb, because you removed all the ovs-vsctl commands. So is it about adding the return on the DeleteInterfacesWithPredicate to ensure it succeeds before continuing to configure the OVS? If so, I could do that, but on which branch?
My PR is in the pipeline to get merged soon-ish (a week or two); so when that happens you'd have to rebase your changes on top of mine anyway :) I guess it was "rebase" not in a literal "git rebase" sense, but rather redoing the functionality of the patches against the libovsdb-based bits instead. […] will get your master branch based on my PR. Then you can do your changes as a new commit on top, and when my PR merges you can "git rebase" onto the actual upstream (I typically add a remote called "upstream", in this case https://github.com/ovn-org/ovn-kubernetes.git, and then I can […]). Anyway, my suggestion would be to wrap the […]
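The git commands dropped from the comment above were presumably along these lines; this is a sketch only, where the remote URL for `dcbw` is an assumption and the branch name `node-libovsdb` is taken from the surrounding discussion:

```sh
# Hypothetical reconstruction of the suggested workflow; remote and branch
# names are assumptions, not preserved from the original comment.
git remote add upstream https://github.com/ovn-org/ovn-kubernetes.git
git remote add dcbw https://github.com/dcbw/ovn-kubernetes.git
git fetch dcbw
git checkout -B master dcbw/node-libovsdb   # base local master on the PR branch
# ...redo the changes as new commits on top...
git fetch upstream
git rebase upstream/master                  # once the PR has merged upstream
```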
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
This is handled the same for all networks so it makes sense for it to be at the base handler Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
We will introduce a specific layer2 event handler and the current name for the base layer 2 event handler (shared between layer2 and localnet controllers) would collide, so rename it. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Instead of having common Start/Stop entry points in the base layer2 controller, let the layer2 and localnet controllers have their own, so that they are in full control of how they start or stop. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
The layer2 controller needs to be aware of the node's zones, so it needs to handle node events through its own event handler. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Currently in use by the default & layer 3 network controllers, the zoneICHandler will also be used by the layer2 network controller in sections of the code that are shared across all controllers. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
All network IDs are annotated on all nodes, and the layer2 network controller will need them outside the context of any particular node. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
There are three aspects worth highlighting:
* Support is achieved through specific configuration of the logical switch and logical ports that the layer2 network controller was already the owner of. As such, the IC handler does provide this specific configuration, but does not add, delete or clean up those entries, which is still done by the network controller.
* The base layer 2 controller no longer handles local and remote pod events in different code flows. This commit brings them together, as the only difference between them is whether they create the remote port (layer2) or not (localnet). I did try different things, but this was the easiest way forward at this time, by at least an order of magnitude in effort.
* This commit also brings together layer3 and layer2 pod synchronization, as there were not that many differences between them.
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Commit 5d6b136 added child stop channels to stop the network policy handlers independently from the network controller when the policy is deleted while also stopping them if the network controller is stopped. Unfortunately when both things happen at the same time, one of those events will end up attempting to close a closed channel which will panic. Introduce a CancelableContext utility that will wrap a cancelable context that can be chained to achieve the same effect. Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
This removes the node name comparison between the cloud private IP config and the egress IP status, as this comparison is not valid when the egress IP fails over to another node during upgrade or reboot scenarios. Signed-off-by: Periyasamy Palanisamy <[email protected]>
After live migrating a VM, the original node can be shut down and a new one can appear, taking over the node subnet. There were some unit tests for that, but they exercised this part for non-live-migration scenarios, creating a race condition between cleanup and addLogicalPort. This change runs that part of the tests only for live-migration scenarios. Signed-off-by: Enrique Llorente <[email protected]>
…ed before creating a new pod networking Do not try to configure OVS if the interface matching the iface-id cannot be garbage-collected. Otherwise pods might enter the CrashLoopBackOff state, because the related ct-zone disappears from the bridge interface. Signed-off-by: Rafael Chicoli <[email protected]>
Thanks for helping out with git rebase, now it is all clear. I am really looking forward to testing these changes, because they might solve some of the issues we were having with ovnkube while trying to connect to the OVN database. Next week I will be on vacation, so I would have time to rebase once again by the end of next week. Thanks again.
Moving to draft until the node side libovsdb PR is complete.
This PR is stale because it has been open 90 days with no activity. Remove the stale label or comment, reach out to maintainers for code reviews, or consider closing this if you do not plan to work on it.
- What this PR does and why is it needed
Over the past weeks I've been running intensive stress tests on a Kubernetes platform to check the system limits, in order to fix some instability problems on other clusters. This PR addresses the following topics: adding a retry parameter to the ovs-vsctl command, and ensuring stale interfaces are removed before creating new pod networking.
- How to verify it
For the record, the tests were based on a cluster with 3 master and 4 worker nodes, each of them with 64 cores and 1 TB of memory.
To reproduce this problem, use an image with readiness and liveness checks:
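The original manifest was not preserved; a minimal sketch of such a reproduction setup, where the image, replica count, and probe settings are all my assumptions, could look like:

```sh
# Hypothetical reproduction: many pods with readiness/liveness probes, so that
# kubelet restarts any container whose networking ends up broken.
kubectl create deployment stress-test --image=nginx --replicas=500
kubectl patch deployment stress-test --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/readinessProbe",
   "value":{"httpGet":{"path":"/","port":80},"periodSeconds":5}},
  {"op":"add","path":"/spec/template/spec/containers/0/livenessProbe",
   "value":{"httpGet":{"path":"/","port":80},"periodSeconds":5,"failureThreshold":3}}
]'
```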
There should be lots of `FailedCreatePodSandBox` events when the cluster is recreating all pods. More errors can be found in the ovnkube-node log files:
Worth highlighting are the `exit status 1` errors, which happen when the OVN default database is briefly unavailable due to too many calls; see the strace output:
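The strace output itself is omitted above; a way to capture the failing connection attempt yourself could be (a sketch, assuming the default OVSDB socket):

```sh
# Trace connect() syscalls of a single ovs-vsctl call; when ovsdb-server is
# overloaded, the connection attempt fails and ovs-vsctl exits with status 1.
strace -f -e trace=connect ovs-vsctl --timeout=1 list-br; echo "exit: $?"
```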
The man page of `ovs-vsctl` explains the use of the `timeout` parameter together with `retry`. To summarize: without `--retry`, "ovs-vsctl will try to connect once and exit with an error if the connection fails". This was also confirmed with a simple bash script.
First run it without the `--retry` parameter: the command fails whenever the database is unreachable. After appending the `--retry` parameter, the `ovs-vsctl` command returns OK and no error messages are displayed. It is also noticeable that some requests take longer than 1 second to finish, depending on the load on the OVN database.
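The original script and its output are not reproduced here; a minimal sketch of the experiment, with loop counts and timeouts as assumptions, might be:

```sh
# Hammer ovsdb-server and compare behavior with and without --retry.
for i in $(seq 1 100); do
    # Without --retry: one failed connection attempt -> exit status 1.
    ovs-vsctl --timeout=1 list-br >/dev/null 2>&1 || echo "run $i failed: $?"
done
for i in $(seq 1 100); do
    # With --retry: the connection is retried until --timeout expires, so
    # transient unavailability no longer surfaces as an error (just latency).
    ovs-vsctl --timeout=5 --retry list-br >/dev/null 2>&1 || echo "run $i failed: $?"
done
```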
Let me try to explain the reason for the CrashLoopBackOff now.
At first the `ADD starting CNI request` went through, and the MAC address and IP address listed above were reserved. This data can be found in the Open vSwitch `conf.db` file.
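The `conf.db` excerpt is not preserved; the same data can also be queried from the live database, for example (default file path assumed):

```sh
# conf.db (typically /etc/openvswitch/conf.db) is a JSON transaction log; the
# reserved MAC/IP live in the Interface record's external_ids. Querying the
# live database shows the same fields:
ovs-vsctl --columns=name,external_ids,mac_in_use list Interface
```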
There was an internal error and the interface got the `ofport` value set to -1, which means that the interface could not be created. Very often, `UnconfigureInterface` fails in `clearPodBandwidth` due to database connection errors, and no retries are in place (see `ovn-kubernetes/go-controller/pkg/cni/helper_linux.go`, lines 603 to 604 at 86f9e1f).
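Such half-created interfaces can be located with a query like this (a sketch using standard `ovs-vsctl` syntax):

```sh
# Interfaces with ofport == -1 were never wired up by OVS.
ovs-vsctl --columns=name,ofport,external_ids find Interface ofport=-1
```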
In the ovn-controller logs, you can see the claiming and releasing of the logical port (lport) reserved with the iface-id; here it looks good:
Now the new pod is coming up:
In the ovn-controller logs, the same MAC address is claimed, the same IP address is reserved and the same namespace_podName (iface-id) is used:
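The log excerpt is omitted above; those events can be pulled out with something like this (the log path is an assumption and varies by deployment):

```sh
# Follow lport claim/release events for the pod's iface-id in ovn-controller logs.
grep -E 'Claiming lport|Releasing lport' /var/log/ovn/ovn-controller.log \
    | grep stress-test-density-706_nginx-1-5745dddb7c-ldjkl
```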
In the Open vSwitch `conf.db` file, we can see the newly created interface and port, and an `ofport` assigned to it, which possibly means that this time the interface was created successfully. The interface `uuid` and the `external_ids[ct-zone]` can be found on the bridge (see `conf.db`). After all, it seems that the real problem happens when the stale resources are removed:
(`ovn-kubernetes/go-controller/pkg/network-controller-manager/node_network_controller_manager.go`, line 236 at 86f9e1f)
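At the CLI level, the stale-interface cleanup referenced above is roughly equivalent to deleting the pod's OVS port (a sketch; `br-int` is the standard integration bridge, and the port name is one from the stale-interface logs below):

```sh
# Rough manual equivalent of the stale-interface cleanup: remove the OVS port
# whose pod sandbox no longer exists. ct-zone bookkeeping on the bridge itself
# is ovn-controller's job, not ovnkube's.
ovs-vsctl --timeout=15 --if-exists --with-iface del-port br-int 8824323d5ff74ba
```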
A few milliseconds later, ovn-controller releases the logical port `stress-test-density-706_nginx-1-5745dddb7c-ldjkl`. Uh, this is BAD:
Then the same pre-existing ct-zone will be "re-added" to the bridge. Something went wrong: if you run `ovs-vsctl list bridge`, this ct-zone is non-existent. Only the port ID of the corresponding interface can be listed inside the `ports` array:

```
sh-4.4# ovs-vsctl list bridge | grep -c 5502f07d-d1bb-42ee-b090-00af944a5f88
1
```

The problem is that the required ct-zone is missing from the `external_ids` dictionary, so the logical flows with priority=120 related to tables 7 and 12 will be missing too:

```
sh-4.4# ovs-vsctl list bridge | grep -c stress-test-density-706_nginx-1-5745dddb7c-ldjkl
0
```
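Whether those flows are actually gone can be checked directly on the node (a sketch; `br-int` assumed):

```sh
# Dump flows in tables 7 and 12 and look for the priority=120 entries; for the
# broken pod there will be no matching flows.
ovs-ofctl dump-flows br-int table=7  | grep 'priority=120'
ovs-ofctl dump-flows br-int table=12 | grep 'priority=120'
```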
At this moment I was thinking: we saw just one pod crashing, so why didn't the others crash too?
```
W0717 07:36:22.729412 1987398 node.go:989] Found stale interface b2d117d5fc0c4ec, so queuing it to be deleted
W0717 07:36:22.729452 1987398 node.go:989] Found stale interface 63cfedb3828707c, so queuing it to be deleted
W0717 07:36:22.729469 1987398 node.go:989] Found stale interface 8824323d5ff74ba, so queuing it to be deleted  # --> this is the one
W0717 07:36:22.729483 1987398 node.go:989] Found stale interface 58bdfe3ae2dacc8, so queuing it to be deleted
W0717 07:36:22.729498 1987398 node.go:989] Found stale interface 3743fd52d7a73b4, so queuing it to be deleted
W0717 07:36:22.729511 1987398 node.go:989] Found stale interface de100db60e6b221, so queuing it to be deleted
W0717 07:36:22.729524 1987398 node.go:989] Found stale interface fe7b9333cb2f5ed, so queuing it to be deleted
W0717 07:36:22.729539 1987398 node.go:989] Found stale interface ddbeccc99518185, so queuing it to be deleted
```
It is interesting that some interfaces couldn't be created (tagged with `ofport=-1`), although they could be found. Very importantly, one interface was not found (exactly the one we had a problem with), due to the same database connection errors described above:
The newly created pod has internal networking, but the network packets cannot be routed properly and the health checks fail, causing the container to enter the CrashLoopBackOff state.
If I am not mistaken, this seems to be a bug in ovn-controller, because I believe the ct-zone should not be deleted if there is a corresponding interface attached to the bridge. I will raise an issue and ask the contributors to double-check this.
- Special notes for reviewers
These changes affect ovnkube and its ovs-vsctl command invocations.
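Concretely, the invocations ovnkube issues gain a `--retry` flag next to the existing timeout; a sketch (the interface name and iface-id are hypothetical, and the exact arguments in the patch may differ):

```sh
# Before: a transient OVSDB outage makes this fail immediately with exit status 1.
ovs-vsctl --timeout=30 set Interface veth0 external_ids:iface-id=ns_pod
# After: the connection is retried until the timeout expires.
ovs-vsctl --timeout=30 --retry set Interface veth0 external_ids:iface-id=ns_pod
```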
- Conclusion
I am also convinced that this problem affects lots of people.
Furthermore, I know that some functions would eventually retry by themselves after the next backoff is reached. So we should ensure everything is rolled back if something goes wrong, before configuring OVS for the next pod in the same namespace.
- Description for the changelog
Add a retry parameter to the ovs-vsctl command and ensure stale interfaces are removed before creating new pod networking.