
[wip]: Add retry parameter to ovs-vsctl and ensure stale interfaces are removed #3784

Draft
rchicoli wants to merge 22 commits into ovn-org:master from rchicoli:master

Conversation


@rchicoli rchicoli commented Jul 19, 2023

- What this PR does and why is it needed

Over the past few weeks I've been running intensive stress tests on a Kubernetes platform to check the system limits, in order to fix instability problems seen on other clusters. This PR addresses the following topics:

  • UNSTABLE connection to the OVN database during system overload
  • NO retry configured if the ovs-vsctl command fails
  • NO removal of a STALE interface with the same iface-id before configuring new pod networking, if the ovs-vsctl command fails once

- How to verify it

For the record, the tests were based on a cluster with 3 master and 4 worker nodes, each of them with 64 cores and 1 TB of memory.
To reproduce this problem, use an image with readiness and liveness checks:

  1. Deploy the same deployment to ~880 different namespaces (220 per node)
  2. Wait for all pods to be running and the healthchecks to be passing
  3. Delete all of them filtering by labels
  4. Wait for all pods to be recreated and the healthchecks to be passing
  5. Most likely after the first deletion iteration, some pods might enter the CrashLoopBackOff state
  6. If not, repeat the deletion of all pods a few more times

There should be lots of FailedCreatePodSandBox events while the cluster is recreating all pods, e.g.:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox... error adding container to network "ovn-kubernetes": CNI request failed with status 400...
- failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for...
- failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int 7eee0d827eb862b': exit status 1...
- failed to configure pod interface: failed to configure pod interface: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface external-ids:sandbox... : exit status 1...

More errors can be found in the ovnkube-node log files:

helper_linux.go:588] Failed to delete pod "..." OVS port 8824323d5ff74ba: failed to run 'ovs-vsctl --timeout=30 del-port br-int 8824323d5ff74ba': exit status 1
helper_linux.go:549] Failed to clearPodBandwidth sandbox 8824323d5ff74bab767ac... for pod ...: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface ...: exit status 1

Worth highlighting are the exit status 1 errors, which happen when the OVS default database is briefly unavailable due to too many calls; see the strace output:

connect(3, {sa_family=AF_UNIX, sun_path="/var/run/openvswitch/db.sock"}, 31) = -1 EAGAIN (Resource temporarily unavailable)

The ovs-vsctl man page explains the use of the --timeout parameter together with --retry:

**--timeout=**secs
       By default, or with a secs of **0**, **ovs-vsctl** waits forever for a  response  from  the
       database.   This  option  limits  runtime  to  approximately  secs seconds.  If the
       timeout expires, **ovs-vsctl** will exit with  a  **SIGALRM**  signal.   (A  timeout  would
       normally  happen  only  if  the  database  cannot be contacted, or if the system is
       overloaded.)

**--retry**
       Without this option, if **ovs-vsctl** connects outward  to  the  database  server  (the
       default)  then  **ovs-vsctl**  will  try  to connect once and exit with an error if the
       connection fails (which usually means that **ovsdb-server** is not running).

       With this option, or if **--db** specifies that **ovs-vsctl** should listen for an incoming
       connection  from  the database server, then **ovs-vsctl** will wait for a connection to
       the database forever.

       Regardless of this setting, **--timeout** always limits how long **ovs-vsctl** will wait.

To summarize: without this option, "ovs-vsctl will try to connect once and exit with an error if the connection fails". This was also confirmed with a simple bash loop.

First, run it without the retry parameter:

sh-4.4# while true; do ovs-vsctl --timeout=30 show >/dev/null && date; echo ok || echo fail; sleep 1; done
Thu Jul  6 12:22:02 UTC 2023
ok
Thu Jul  6 12:22:03 UTC 2023
ok
Thu Jul  6 12:22:04 UTC 2023
ok
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)
ok
Thu Jul  6 12:22:07 UTC 2023
ok
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)
ok
Thu Jul  6 12:22:09 UTC 2023

After appending the retry parameter, the ovs-vsctl command succeeds and no error messages are displayed. It is also noticeable that some requests take longer than one second to finish, depending on the load on the OVN database:

sh-4.4# while true; do ovs-vsctl --timeout=30 --retry show >/dev/null && date; echo ok || echo fail; sleep 1; done
Thu Jul  6 12:22:00 UTC 2023
ok
Thu Jul  6 12:22:01 UTC 2023
ok
...
Thu Jul  6 12:34:59 UTC 2023
ok
Thu Jul  6 12:35:00 UTC 2023
ok
Thu Jul  6 12:35:01 UTC 2023
ok
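
To show where that flag would be wired in on the ovnkube side, here is a minimal, hedged Go sketch of an ovs-vsctl wrapper that prepends --timeout and --retry. It is not the actual helper from this repository; runOVSVsctl and the hard-coded timeout below are assumptions for illustration only:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// ovsVsctlTimeout is an assumed value; ovnkube uses its own configured timeout.
const ovsVsctlTimeout = 30

// runOVSVsctl prepends --timeout and --retry so that a briefly unavailable
// ovsdb-server results in a wait/reconnect instead of an immediate exit status 1.
func runOVSVsctl(args ...string) (string, error) {
	cmdArgs := append([]string{
		fmt.Sprintf("--timeout=%d", ovsVsctlTimeout),
		"--retry",
	}, args...)
	out, err := exec.Command("ovs-vsctl", cmdArgs...).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

func main() {
	if out, err := runOVSVsctl("show"); err != nil {
		fmt.Printf("ovs-vsctl failed: %v (%s)\n", err, out)
	} else {
		fmt.Println(out)
	}
}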

Now let me try to explain the reason for the CrashLoopBackOff.

At first, the ADD starting CNI request went through and the MAC and IP addresses shown in the logs below were reserved:

I0717 07:35:21.094518 1987398 cni.go:227] [stress-test-density-706/nginx-1-5745dddb7c-ldjkl 8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3] ADD starting CNI request [stress-test-density-706/nginx-1-5745dddb7c-ldjkl 8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3]
I0717 07:35:21.713583 1987398 helper_linux.go:334] ConfigureOVS: namespace: stress-test-density-706, podName: nginx-1-5745dddb7c-ldjkl, SandboxID: "8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3", UID: "01c5fa41-83d0-439d-9289-b27abe137c87", MAC: 01:02:03:04:05:06, IPs: [172.16.0.99/22]

This data can be found in the Open vSwitch conf.db file:

// Monday, July 17, 2023 7:35:21.982 AM
{"Interface":{"39fa6f4e-c67d-449b-aa88-c7586cf6ab9c":{"name":"8824323d5ff74ba","external_ids":["map",[["attached_mac","01:02:03:04:05:06"],["iface-id","stress-test-density-706_nginx-1-5745dddb7c-ldjkl"],["iface-id-ver","01c5fa41-83d0-439d-9289-b27abe137c87"],["ip_addresses","172.16.0.99/22"],["sandbox","8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3"]]]}},"Port":{"db765055-6020-43d2-8a22-35d9d18d9b1f":{"name":"8824323d5ff74ba","interfaces":["uuid","39fa6f4e-c67d-449b-aa88-c7586cf6ab9c"],"other_config":["map",[["transient","true"]]]}},"_date":1689579321982,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"ports":["uuid","db765055-6020-43d2-8a22-35d9d18d9b1f"]}},"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 add-port br-int 8824323d5ff74ba other_config:transient=true -- set interface 8824323d5ff74ba external_ids:attached_mac=01:02:03:04:05:06 external_ids:iface-id=stress-test-density-706_nginx-1-5745dddb7c-ldjkl external_ids:iface-id-ver=01c5fa41-83d0-439d-9289-b27abe137c87 external_ids:ip_addresses=172.16.0.99/22 external_ids:sandbox=8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3","Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"next_cfg":142523}}}

Then an internal error occurred and the interface got its **ofport** value set to -1, which means the interface could not be created:

// Monday, July 17, 2023 7:35:25.957 AM
{"Interface":{"0fe370c5-d6a7-4aa4-9243-e818fdf629ba":{"ofport":22553},"38791b57-6f42-42c0-bf79-92ce8d7bc11f":{"ofport":22556},"41619eb1-ed6a-4365-8b48-dbeb0214e31a":{"ofport":22557},"c6b90d74-c38a-4ac5-8904-c43ff6a753db":{"ofport":22554},"28c13ef7-2277-4d4e-88ce-f4ae30a3b610":{"ofport":22555},"39fa6f4e-c67d-449b-aa88-c7586cf6ab9c":{"ofport":-1,"error":"could not open network device 8824323d5ff74ba (No such device)"}},"_date":1689579325957,"_is_diff":true,"Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"cur_cfg":142555}}}

Very often UnconfigureInterface fails in clearPodBandwidth due to database connection errors, and no retries are in place:

if err := clearPodBandwidth(pr.SandboxID); err != nil {
	klog.Warningf("Failed to clearPodBandwidth sandbox %v %s: %v", pr.SandboxID, podDesc, err)
}

W0717 07:35:25.787314 1987398 helper_linux.go:588] Failed to delete pod "stress-test-density-706/nginx-1-5745dddb7c-ldjkl" OVS port 8824323d5ff74ba: failed to run 'ovs-vsctl --timeout=30 del-port br-int 8824323d5ff74ba': exit status 1
W0717 07:35:25.787778 1987398 helper_linux.go:593] Failed to delete pod "stress-test-density-706/nginx-1-5745dddb7c-ldjkl" interface 8824323d5ff74ba: failed to lookup link 8824323d5ff74ba: Link not found
W0717 07:35:25.815372 1987398 helper_linux.go:549] Failed to clearPodBandwidth sandbox 8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3 for pod stress-test-density-706/nginx-1-5745dddb7c-ldjkl: failed to run 'ovs-vsctl --timeout=30 --no-heading --format=csv --data=bare --columns=name find interface external-ids:sandbox=8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3': exit status 1
I0717 07:35:25.815422 1987398 cni.go:248] [stress-test-density-706/nginx-1-5745dddb7c-ldjkl 8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3] DEL finished CNI request [stress-test-density-706/nginx-1-5745dddb7c-ldjkl 8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3], result "{\"dns\":{}}", err <nil>
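
Purely as an illustration of what "retries in place" could mean at the caller level (the PR itself goes the ovs-vsctl --retry route), here is a small self-contained sketch; the stub, attempt count and delay are assumptions:

package main

import (
	"errors"
	"fmt"
	"time"
)

// clearPodBandwidth stands in for the real helper quoted above; it is a stub
// that always fails, used only to keep this retry sketch self-contained.
func clearPodBandwidth(sandboxID string) error {
	return errors.New("ovs-vsctl: database connection failed")
}

func main() {
	const sandboxID = "8824323d5ff74bab767ac"
	// Bounded retry so a transient ovsdb connection error does not immediately
	// surface as a warning and leave stale state behind.
	var lastErr error
	for attempt := 1; attempt <= 3; attempt++ {
		if lastErr = clearPodBandwidth(sandboxID); lastErr == nil {
			break
		}
		time.Sleep(time.Second)
	}
	if lastErr != nil {
		fmt.Printf("Failed to clearPodBandwidth sandbox %s after retries: %v\n", sandboxID, lastErr)
	}
}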

In the ovn-controller logs, you can see the claiming and releasing of the logical port (lport) reserved with the iface-id; here it still looks fine:

2023-07-17T07:35:25.294Z|36242|binding|INFO|Claiming lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl for this chassis.
2023-07-17T07:35:25.294Z|36243|binding|INFO|stress-test-density-706_nginx-1-5745dddb7c-ldjkl: Claiming 01:02:03:04:05:06 172.16.0.99
2023-07-17T07:35:25.961Z|36252|binding|INFO|Releasing lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl from this chassis (sb_readonly=0)

Now the new pod is coming up:

I0717 07:35:27.731860 1987398 cni.go:227] [stress-test-density-706/nginx-1-5745dddb7c-ldjkl 300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50] ADD starting CNI request [stress-test-density-706/nginx-1-5745dddb7c-ldjkl 300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50]
I0717 07:35:27.927250 1987398 helper_linux.go:334] ConfigureOVS: namespace: stress-test-density-706, podName: nginx-1-5745dddb7c-ldjkl, SandboxID: "300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50", UID: "01c5fa41-83d0-439d-9289-b27abe137c87", MAC: 01:02:03:04:05:06, IPs: [172.16.0.99/22]

In the ovn-controller logs, the same MAC address is claimed, the same IP address is reserved and the same namespace_podName (iface-id) is used:

2023-07-17T07:35:28.353Z|36280|binding|INFO|Claiming lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl for this chassis.
2023-07-17T07:35:28.353Z|36281|binding|INFO|stress-test-density-706_nginx-1-5745dddb7c-ldjkl: Claiming 01:02:03:04:05:06 172.16.0.99
2023-07-17T07:35:49.878Z|36729|binding|INFO|Setting lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl ovn-installed in OVS
2023-07-17T07:35:49.878Z|36730|binding|INFO|Setting lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl up in Southbound

In the Open vSwitch conf.db file, we can see the newly created interface and port with an ofport assigned, which means that this time the interface was created successfully:

// Monday, July 17, 2023 7:35:28.124 AM
{"Port":{"5502f07d-d1bb-42ee-b090-00af944a5f88":{"name":"300cf7d291f454f","interfaces":["uuid","39ba0dea-406d-489e-8757-deb799c757f0"],"other_config":["map",[["transient","true"]]]}},"Interface":{"39ba0dea-406d-489e-8757-deb799c757f0":{"name":"300cf7d291f454f","external_ids":["map",[["attached_mac","01:02:03:04:05:06"],["iface-id","stress-test-density-706_nginx-1-5745dddb7c-ldjkl"],["iface-id-ver","01c5fa41-83d0-439d-9289-b27abe137c87"],["ip_addresses","172.16.0.99/22"],["sandbox","300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50"]]]}},"_date":1689579328124,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"ports":["uuid","5502f07d-d1bb-42ee-b090-00af944a5f88"]}},"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 add-port br-int 300cf7d291f454f other_config:transient=true -- set interface 300cf7d291f454f external_ids:attached_mac=01:02:03:04:05:06 external_ids:iface-id=stress-test-density-706_nginx-1-5745dddb7c-ldjkl external_ids:iface-id-ver=01c5fa41-83d0-439d-9289-b27abe137c87 external_ids:ip_addresses=172.16.0.99/22 external_ids:sandbox=300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50","Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"next_cfg":142571}}}
// Monday, July 17, 2023 7:35:28.348 AM
{"Interface":{"cf3cc598-a384-4d01-b91e-ebc9da69351f":{"ofport":22563},"ad0ddbdc-faec-4d3e-806d-96285da88ed2":{"ofport":22564},"39ba0dea-406d-489e-8757-deb799c757f0":{"ofport":22565}},"_date":1689579328348,"_is_diff":true,"Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"cur_cfg":142572}}}

The interface UUID and the ct-zone entry in external_ids can be found on the bridge (see conf.db):

// Monday, July 17, 2023 7:35:49.909 AM
{"Interface":{"39ba0dea-406d-489e-8757-deb799c757f0"... "Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"external_ids":["map",["ct-zone-ichp-kubelet-density-706_nginx-1-5745dddb7c-ldjkl","183"]...}

It turns out that the real problem happens when the stale resources are removed:

_, stderr, err := util.RunOVSVsctl("--if-exists", "--with-iface", "del-port", ifaceInfo.Name)

// Monday, July 17, 2023 7:36:22.762 AM
{"Interface":{"7ff3cc64-2a9e-46bc-994f-58d363e8e9ac":null,"2ae5e60a-8412-48c9-9f20-253b2b536ce6":null,"5b0f8f5d-bc9e-4732-ac50-e67de72172f7":null,"6bec85ea-a09b-4d6d-98f8-0eaa8e85544e":null,"cbf29c8c-8d1b-4adb-8161-28f8efb7ff0c":null,"3b25eb5a-8a95-49a0-9a31-090ca404938b":null,"8e732e10-76c4-4849-ad1d-8b0df7cffc41":null,"39fa6f4e-c67d-449b-aa88-c7586cf6ab9c":null},"Port":{"8193b121-3d6c-4bb5-acba-c7be2b6bed88":null,"db765055-6020-43d2-8a22-35d9d18d9b1f":null,"d5471baf-764e-4dbc-aa2d-96869b7061d6":null,"388e3385-8573-4240-b914-a08aa3a623d2":null,"61336f6e-3691-435d-9dca-2445301240f1":null,"cb4fcab1-a80c-4819-8342-97b700622bc8":null,"132951aa-3e10-410c-b5d0-1b5b1341bef8":null,"bf2eb754-38a9-479a-a3ca-6f876315830b":null},"_date":1689579382762,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"ports":["set",[["uuid","132951aa-3e10-410c-b5d0-1b5b1341bef8"],["uuid","388e3385-8573-4240-b914-a08aa3a623d2"],["uuid","61336f6e-3691-435d-9dca-2445301240f1"],["uuid","8193b121-3d6c-4bb5-acba-c7be2b6bed88"],["uuid","bf2eb754-38a9-479a-a3ca-6f876315830b"],["uuid","cb4fcab1-a80c-4819-8342-97b700622bc8"],["uuid","d5471baf-764e-4dbc-aa2d-96869b7061d6"],["uuid","db765055-6020-43d2-8a22-35d9d18d9b1f"]]]}},"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=15 --if-exists --with-iface del-port b2d117d5fc0c4ec -- --if-exists --with-iface del-port 63cfedb3828707c -- --if-exists --with-iface del-port 8824323d5ff74ba -- --if-exists --with-iface del-port 58bdfe3ae2dacc8 -- --if-exists --with-iface del-port 3743fd52d7a73b4 -- --if-exists --with-iface del-port de100db60e6b221 -- --if-exists --with-iface del-port fe7b9333cb2f5ed -- --if-exists --with-iface del-port ddbeccc99518185","Open_vSwitch":{"08d1652-0470-4421-b42b-49f93ee66dae":{"next_cfg":142712}}}

A few milliseconds later, the ovn-controller releases the logical port stress-test-density-706_nginx-1-5745dddb7c-ldjkl. This is BAD:

2023-07-17T07:36:22.767Z|37013|binding|INFO|Releasing lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl from this chassis (sb_readonly=0)

Then the same existing ct-zone is "re-added" to the bridge. But something went wrong: if you run ovs-vsctl list bridge, this ct-zone is non-existent:

// Monday, July 17, 2023 7:36:27.008 AM
{"_date":1689579387008,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"external_ids":["map",[["ct-zone-stress-test-density-706_nginx-1-5745dddb7c-ldjkl","183"]]]}},"_is_diff":true,"_comment":"ovn-controller\novn-controller: modifying OVS tunnels '55fc8491-3fff-4894-808b-937302978c36'"}

Only the port UUID of the corresponding interface can be found inside the bridge's ports array:

sh-4.4# ovs-vsctl list bridge | grep -c 5502f07d-d1bb-42ee-b090-00af944a5f88
1

The problem is that the required ct-zone is missing from the external_ids dictionary, so the priority=120 logical flows related to tables 7 and 12 are missing too:

sh-4.4# ovs-vsctl list bridge | grep -c stress-test-density-706_nginx-1-5745dddb7c-ldjkl
0

At this point I was wondering: we saw just one pod crashing, so why didn't the others crash too?

W0717 07:36:22.729412 1987398 node.go:989] Found stale interface b2d117d5fc0c4ec, so queuing it to be deleted
W0717 07:36:22.729452 1987398 node.go:989] Found stale interface 63cfedb3828707c, so queuing it to be deleted
W0717 07:36:22.729469 1987398 node.go:989] Found stale interface 8824323d5ff74ba, so queuing it to be deleted # --> this is the one
W0717 07:36:22.729483 1987398 node.go:989] Found stale interface 58bdfe3ae2dacc8, so queuing it to be deleted
W0717 07:36:22.729498 1987398 node.go:989] Found stale interface 3743fd52d7a73b4, so queuing it to be deleted
W0717 07:36:22.729511 1987398 node.go:989] Found stale interface de100db60e6b221, so queuing it to be deleted
W0717 07:36:22.729524 1987398 node.go:989] Found stale interface fe7b9333cb2f5ed, so queuing it to be deleted
W0717 07:36:22.729539 1987398 node.go:989] Found stale interface ddbeccc99518185, so queuing it to be deleted

It is interesting that some interfaces could not be created (marked with ofport=-1), although they could still be found:

// Find and remove any existing OVS port with this iface-id. Pods can
// have multiple sandboxes if some are waiting for garbage collection,
// but only the latest one should have the iface-id set.
uuids, _ := ovsFind("Interface", "_uuid", "external-ids:iface-id="+ifaceID)

Very important: one interface was not found (exactly the one we had a problem with), due to the same database connection errors described above:

//58bdfe3ae2dacc8 -> Monday, July 17, 2023 7:35:17.618 AM
{"Interface":{"3b25eb5a-8a95-49a0-9a31-090ca404938b":{"external_ids":["map",[["iface-id","ichp-kubelet-density-384_nginx-1-5745dddb7c-x4ncp"]]]}},"_date":1689579317618,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface 3b25eb5a-8a95-49a0-9a31-090ca404938b external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142490}}}
//63cfedb3828707c -> Monday, July 17, 2023 7:35:26.275 AM
{"Interface":{"8e732e10-76c4-4849-ad1d-8b0df7cffc41":{"external_ids":["map",[["iface-id","ichp-kubelet-density-574_nginx-1-5745dddb7c-52xz5"]]]}},"_date":1689579326275,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface 8e732e10-76c4-4849-ad1d-8b0df7cffc41 external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142560}}}
//ddbeccc99518185 -> Monday, July 17, 2023 7:36:01.524 AM
{"Interface":{"6bec85ea-a09b-4d6d-98f8-0eaa8e85544e":{"external_ids":["map",[["iface-id","ichp-kubelet-density-402_nginx-1-5745dddb7c-67758"]]]}},"_date":1689579361524,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface 6bec85ea-a09b-4d6d-98f8-0eaa8e85544e external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142669}}}
//b2d117d5fc0c4ec -> Monday, July 17, 2023 7:36:01.607 AM
{"Interface":{"2ae5e60a-8412-48c9-9f20-253b2b536ce6":{"external_ids":["map",[["iface-id","ichp-kubelet-density-510_nginx-1-5745dddb7c-hv5mc"]]]}},"_date":1689579361607,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface 2ae5e60a-8412-48c9-9f20-253b2b536ce6 external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142675}}}
//de100db60e6b221 -> Monday, July 17, 2023 7:36:01.913 AM
{"Interface":{"7ff3cc64-2a9e-46bc-994f-58d363e8e9ac":{"external_ids":["map",[["iface-id","ichp-kubelet-density-513_nginx-1-5745dddb7c-jj7f5"]]]}},"_date":1689579361913,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface 7ff3cc64-2a9e-46bc-994f-58d363e8e9ac external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142684}}}
//3743fd52d7a73b4 -> Monday, July 17, 2023 7:36:02.311 AM
{"Interface":{"5b0f8f5d-bc9e-4732-ac50-e67de72172f7":{"external_ids":["map",[["iface-id","ichp-kubelet-density-414_nginx-1-5745dddb7c-fnfpv"]]]}},"_date":1689579362311,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface 5b0f8f5d-bc9e-4732-ac50-e67de72172f7 external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142694}}}
//fe7b9333cb2f5ed -> Monday, July 17, 2023 7:36:02.346 AM
{"Interface":{"cbf29c8c-8d1b-4adb-8161-28f8efb7ff0c":{"external_ids":["map",[["iface-id","ichp-kubelet-density-373_nginx-1-5745dddb7c-w5z55"]]]}},"_date":1689579362346,"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 remove Interface cbf29c8c-8d1b-4adb-8161-28f8efb7ff0c external-ids iface-id","Open_vSwitch":{"08d16522-0470-4421-b42b-49f93ee66dae":{"next_cfg":142697}}}

The newly created pod has internal networking, but network packets cannot be routed properly, so the health checks fail and the container enters the CrashLoopBackOff state.

If I am not mistaken, this seems to be a bug in ovn-controller, because I believe the ct-zone should not be deleted if there is a corresponding interface attached to the bridge. I will raise an issue and ask the contributors to double-check this.
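
To make the intended fix concrete, here is a hedged, self-contained sketch of the kind of guard this PR is after: clean up any stale interface that still carries the pod's iface-id before configuring OVS for the new sandbox, and abort (instead of only warning) if that cleanup fails. The helper and exact commands below mirror the transactions shown above but are illustrative only, not the PR's actual diff:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// ovsVsctl is a tiny helper for this sketch only; the real ovsExec/ovsFind
// helpers in ovnkube differ.
func ovsVsctl(args ...string) (string, error) {
	all := append([]string{"--timeout=30", "--retry"}, args...)
	out, err := exec.Command("ovs-vsctl", all...).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

// removeStaleInterfaces clears the iface-id from any leftover interface that
// still carries it and fails hard, so that OVS is not configured for the new
// sandbox while a stale port with the same iface-id remains and can later take
// the bridge's ct-zone entry with it.
func removeStaleInterfaces(ifaceID string) error {
	out, err := ovsVsctl("--no-heading", "--format=csv", "--data=bare",
		"--columns=_uuid", "find", "Interface", "external-ids:iface-id="+ifaceID)
	if err != nil {
		return fmt.Errorf("failed to find stale interfaces for %s: %v", ifaceID, err)
	}
	for _, uuid := range strings.Fields(out) {
		if _, err := ovsVsctl("remove", "Interface", uuid, "external-ids", "iface-id"); err != nil {
			return fmt.Errorf("failed to clear iface-id from stale interface %s: %v", uuid, err)
		}
	}
	return nil
}

func main() {
	if err := removeStaleInterfaces("stress-test-density-706_nginx-1-5745dddb7c-ldjkl"); err != nil {
		fmt.Println("aborting pod interface configuration:", err)
	}
}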

- Special notes for reviewers

These changes affect ovnkube and the way the ovs-vsctl command is invoked.

- Conclusion

  • after adding a simple retry and ensuring the removal of stale resources, I am able to deploy and redeploy more than 350 pods per node successfully without any crashes
  • all exit status 1 error messages are gone

I am also convinced that this problem affects many users.
Furthermore, I know that some functions would eventually retry by themselves once the next backoff is reached. So we should ensure everything is rolled back if something goes wrong, before configuring OVS for the next pod in the same namespace.

- Description for the changelog

Add a retry parameter to the ovs-vsctl command and ensure stale interfaces are removed before configuring new pod networking.

@rchicoli rchicoli force-pushed the master branch 4 times, most recently from c46ee92 to c188ec7 Compare July 19, 2023 09:14
@rchicoli rchicoli changed the title Add retry parameter to ovs-vsctl and ensure stale interfaces are removed WIP: Add retry parameter to ovs-vsctl and ensure stale interfaces are removed Jul 19, 2023
@rchicoli rchicoli force-pushed the master branch 7 times, most recently from 7bb828e to 1ae5d85 Compare July 19, 2023 12:19
@rchicoli rchicoli changed the title WIP: Add retry parameter to ovs-vsctl and ensure stale interfaces are removed [wip]: Add retry parameter to ovs-vsctl and ensure stale interfaces are removed Jul 19, 2023
@rchicoli rchicoli force-pushed the master branch 2 times, most recently from bbd1afc to 656929e Compare July 19, 2023 14:46
@rchicoli
Author

@trozet @dcbw @girishmg @jcaamano before I continue to fix the other 1000 unit tests that are failing because of adding the retry parameter, I would like to ask you:

What do you guys think about the code changes (2 commits)?
Was the PR description clear enough about the database connection errors and the pods that enter the CrashLoopBackOff state due to an OVN misconfiguration?

dcbw added 6 commits July 20, 2023 08:52
Used when we absolutely do not want to create the object (like a bridge)
if it doesn't exist, because that should be a hard error, but just want
to update it. Mostly to make sure testcases do the right thing, but we
never actually want to create 'br-int' either, so it works for both
testcases and real code.

Signed-off-by: Dan Williams <[email protected]>
Complement to Lookup() for chained operations.

Signed-off-by: Dan Williams <[email protected]>
cmdAdd/cmdDel use similar setup code; consolidate it. Also make
cmdAdd use the same error logging mechanism as cmdDel does.

Signed-off-by: Dan Williams <[email protected]>
@dcbw
Contributor

dcbw commented Jul 20, 2023

@rchicoli note that I'm trying to remove ovs-vsctl usage via #3616 but I assume similar logic could be done with the new libovsdb-based workflow?

Equivalent code would be 6a0b7ca#diff-01a2ef868473804cd3d9b894b093e5be0735cfb21bf27f375ece3a33cc07edc9R388

The port lookup shouldn't need a retry since it's coming directly from the internal cache, but the libovsdbops.DeleteInterfacesWithPredicate() probably would.

Bonus: you don't have to fix unit tests at all if you rebase onto my PR :)

@rchicoli
Author

rchicoli commented Jul 21, 2023

That is great news @dcbw, and these changes are very promising. Give me a shout once it is ready for release, and I could run the stress-test tool again.

Btw, I didn't get the idea behind the rebase onto your node-libovsdb branch, because you removed all the ovs-vsctl commands.

So is it about checking the return value of DeleteInterfacesWithPredicate to ensure it succeeds before continuing to configure OVS? If so, I could do that, but on which branch?

@dcbw
Contributor

dcbw commented Jul 21, 2023

Btw, I didn't get the idea behind the rebase onto your node-libovsdb branch, because you removed all the ovs-vsctl commands.

My PR is in the pipeline to get merged soon-ish (a week or two), so when that happens you'd have to rebase your changes on top of mine anyway :) I guess it was "rebase" not in a literal "git rebase" sense, but rather redoing the functionality of the patches against the libovsdb-based bits instead.

So is it about checking the return value of DeleteInterfacesWithPredicate to ensure it succeeds before continuing to configure OVS? If so, I could do that, but on which branch?

git remote add dcbw https://github.com/dcbw/ovn-kubernetes.git
git fetch dcbw
git rebase dcbw/node-libovsdb

will get your master branch based on my PR. Then you can do your changes as a new commit on top, and when my PR merges you can "git rebase" onto the actual upstream (I typically add a remote called "upstream", in this case https://github.com/ovn-org/ovn-kubernetes.git and then I can git rebase upstream/master in my fork to keep things current).

Anyway, my suggestion would be to wrap the DeleteInterfacesWithPredicate() call with a PollUntilContextCancel:

	if err := wait.PollUntilContextCancel(ctx, 3 * time.Second, true, func(context.Context) (done bool, err error) {
		if err := libovsdbops.DeleteInterfacesWithPredicate(vsClient, p); err != nil {
			klog.Warningf("Failed to delete stale OVS ports with iface-id %q from br-int: %v", ifaceID, err)
			return false, nil
		}
		// success
		return true, nil
	}); err != nil {
		return err
	}
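
(For context, the wait package referenced above is presumably k8s.io/apimachinery/pkg/util/wait; its PollUntilContextCancel keeps invoking the condition function at the given interval until it returns true, returns an error, or the context is cancelled.)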

@rchicoli rchicoli force-pushed the master branch 2 times, most recently from b514fcb to 86f2e2d Compare July 23, 2023 19:26
jcaamano and others added 16 commits July 23, 2023 21:37
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
This is handled the same for all networks so it makes sense for it to be
at the base handler

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
We will introduce a specific layer2 event handler and the current name
for the base layer 2 event handler (shared between layer2 and localnet
controllers) would collide, so rename it.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Instead of having common Start/Stop entry points in the base layer2
controller, let the layer2 and localnet controllers have their own so
that they are in full control of how they start or stop.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Layer2 controller needs to be aware of the node's zones, so it needs to
handle node events through its own event handler.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Currently in use by the default & layer 3 network controllers, the
zoneICHandler will also be used by the layer2 network controller in
sections of the code that are shared across all controllers.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
All network IDs are annotated on all nodes, and the layer2 network
controller will need them outside the context of any particular node.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
There are three aspects worth highlighting:

* Support is achieved through specific configuration of the logical
  switch and logical ports that the layer2 network controller was
  already the owner of. As such, the IC handler does provide this specific
  configuration but does not add, delete or clean up those entries, which
  is still done by the network controller.

* The base layer 2 controller no longer handles local or remote pod
  events in different code flows. This commit brings them together, as
  the only difference between them is whether they create the remote
  port (layer2) or not (localnet). I did try different things, but this
  was the easiest way forward at this time by at least an order of
  magnitude of effort.

* This commit also brings together layer3 and layer2 pod synchronization
  as there were not that many differences between them.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
Commit 5d6b136 added child stop channels to stop the network policy
handlers independently from the network controller when the policy is
deleted, while also stopping them if the network controller is stopped.

Unfortunately, when both things happen at the same time, one of those
events will end up attempting to close an already-closed channel, which panics.

Introduce a CancelableContext utility that will wrap a cancelable context
that can be chained to achieve the same effect.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
This would remove the node name comparison between cloud private IP config
and egress IP status, as this is not valid in the case of egress IP
failover to another node during upgrade or reboot scenarios.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
After live migrating a VM, the original node can be shut down and a new
one can appear, taking over the node subnet. There were some unit tests
for that, but they exercised this part for non-live-migration scenarios,
creating a race condition between cleanup and addLogicalPort.

This change runs that part of the tests only for live-migration
scenarios.

Signed-off-by: Enrique Llorente <[email protected]>
…ed before creating a new pod networking

Do not try to configure OVS if the interface matching the iface-id cannot be garbage-collected.
Otherwise pods might enter the CrashLoopBackOff state, because the related ct-zone disappears from the bridge.

Signed-off-by: Rafael Chicoli <[email protected]>
@rchicoli
Author

Thanks for helping out with git rebase; now it is all clear.
For now I've prepared the changes as you suggested.
Later on I will need more time to go through the new libovsdb client, to see how it uses the cache and whether a retry would be helpful there too.

I am really looking forward to testing these changes, because they might solve some of the issues we were having with ovnkube while trying to connect to the OVN database.

Next week I will be on vacation, so I would have time to rebase once again by the end of next week. Thanks again.

@coveralls

Coverage Status

coverage: 52.87% (+0.1%) from 52.774% when pulling 904d4f7 on rchicoli:master into fa9028b on ovn-org:master.

@trozet trozet marked this pull request as draft May 29, 2024 13:44
@trozet
Contributor

trozet commented May 29, 2024

Moving to draft until the node side libovsdb PR is complete.


This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or reach out to maintainers for code reviews or consider closing this if you do not plan to work on it.

@github-actions github-actions bot added the lifecycle/stale All issues (> 60 days) and PRs (>90 days) with no activity. label Aug 28, 2024