Lately I have been trying to find out why some pods enter a CrashLoopBackOff state in a Kubernetes cluster running ovn-controller and ovnkube. It turns out that the issue is caused by a missing ct-zone on the bridge:
This happens when the system is overloaded and an interface cannot be created due to internal errors (ofport=-1); a new pod then reuses the same iface-id:
// Monday, July 17, 2023 7:35:21.982 AM
{"Interface":{"39fa6f4e-c67d-449b-aa88-c7586cf6ab9c":{"name":"8824323d5ff74ba","external_ids":["map",[["attached_mac","01:02:03:04:05:06"],["iface-id","stress-test-density-706_nginx-1-5745dddb7c-ldjkl"],["iface-id-ver","01c5fa41-83d0-439d-9289-b27abe137c87"],["ip_addresses","172.16.0.99/22"],["sandbox","8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3"]]]}},"Port":{"db765055-6020-43d2-8a22-35d9d18d9b1f":{"name":"8824323d5ff74ba","interfaces":["uuid","39fa6f4e-c67d-449b-aa88-c7586cf6ab9c"],"other_config":["map",[["transient","true"]]]}},"_date":1689579321982,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"ports":["uuid","db765055-6020-43d2-8a22-35d9d18d9b1f"]}},"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 add-port br-int 8824323d5ff74ba other_config:transient=true -- set interface 8824323d5ff74ba external_ids:attached_mac=01:02:03:04:05:06 external_ids:iface-id=stress-test-density-706_nginx-1-5745dddb7c-ldjkl external_ids:iface-id-ver=01c5fa41-83d0-439d-9289-b27abe137c87 external_ids:ip_addresses=172.16.0.99/22 external_ids:sandbox=8824323d5ff74bab767acf132914c8edd36c22fcef81c26a40a51a8b670cc7a3","Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"next_cfg":142523}}}
// Monday, July 17, 2023 7:35:25.957 AM
{"Interface":{"0fe370c5-d6a7-4aa4-9243-e818fdf629ba":{"ofport":22553},"38791b57-6f42-42c0-bf79-92ce8d7bc11f":{"ofport":22556},"41619eb1-ed6a-4365-8b48-dbeb0214e31a":{"ofport":22557},"c6b90d74-c38a-4ac5-8904-c43ff6a753db":{"ofport":22554},"28c13ef7-2277-4d4e-88ce-f4ae30a3b610":{"ofport":22555},"39fa6f4e-c67d-449b-aa88-c7586cf6ab9c":{"ofport":-1,"error":"could not open network device 8824323d5ff74ba (No such device)"}},"_date":1689579325957,"_is_diff":true,"Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"cur_cfg":142555}}}
// Monday, July 17, 2023 7:35:28.124 AM
{"Port":{"5502f07d-d1bb-42ee-b090-00af944a5f88":{"name":"300cf7d291f454f","interfaces":["uuid","39ba0dea-406d-489e-8757-deb799c757f0"],"other_config":["map",[["transient","true"]]]}},"Interface":{"39ba0dea-406d-489e-8757-deb799c757f0":{"name":"300cf7d291f454f","external_ids":["map",[["attached_mac","01:02:03:04:05:06"],["iface-id","stress-test-density-706_nginx-1-5745dddb7c-ldjkl"],["iface-id-ver","01c5fa41-83d0-439d-9289-b27abe137c87"],["ip_addresses","172.16.0.99/22"],["sandbox","300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50"]]]}},"_date":1689579328124,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"ports":["uuid","5502f07d-d1bb-42ee-b090-00af944a5f88"]}},"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=30 add-port br-int 300cf7d291f454f other_config:transient=true -- set interface 300cf7d291f454f external_ids:attached_mac=01:02:03:04:05:06 external_ids:iface-id=stress-test-density-706_nginx-1-5745dddb7c-ldjkl external_ids:iface-id-ver=01c5fa41-83d0-439d-9289-b27abe137c87 external_ids:ip_addresses=172.16.0.99/22 external_ids:sandbox=300cf7d291f454fee923f5c58dad64411e4c982ccc4e916b6a1681e8667b2a50","Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"next_cfg":142571}}}
// Monday, July 17, 2023 7:35:28.348 AM
{"Interface":{"cf3cc598-a384-4d01-b91e-ebc9da69351f":{"ofport":22563},"ad0ddbdc-faec-4d3e-806d-96285da88ed2":{"ofport":22564},"39ba0dea-406d-489e-8757-deb799c757f0":{"ofport":22565}},"_date":1689579328348,"_is_diff":true,"Open_vSwitch":{"11d16522-1170-1121-b42b-49f93ee66dae":{"cur_cfg":142572}}}
// Monday, July 17, 2023 7:35:49.909 AM
{"Interface":{"39ba0dea-406d-489e-8757-deb799c757f0"... "Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"external_ids":["map",["ct-zone-ichp-kubelet-density-706_nginx-1-5745dddb7c-ldjkl","183"]...}
// Monday, July 17, 2023 7:36:22.762 AM
{"Interface":{"7ff3cc64-2a9e-46bc-994f-58d363e8e9ac":null,"2ae5e60a-8412-48c9-9f20-253b2b536ce6":null,"5b0f8f5d-bc9e-4732-ac50-e67de72172f7":null,"6bec85ea-a09b-4d6d-98f8-0eaa8e85544e":null,"cbf29c8c-8d1b-4adb-8161-28f8efb7ff0c":null,"3b25eb5a-8a95-49a0-9a31-090ca404938b":null,"8e732e10-76c4-4849-ad1d-8b0df7cffc41":null,"39fa6f4e-c67d-449b-aa88-c7586cf6ab9c":null},"Port":{"8193b121-3d6c-4bb5-acba-c7be2b6bed88":null,"db765055-6020-43d2-8a22-35d9d18d9b1f":null,"d5471baf-764e-4dbc-aa2d-96869b7061d6":null,"388e3385-8573-4240-b914-a08aa3a623d2":null,"61336f6e-3691-435d-9dca-2445301240f1":null,"cb4fcab1-a80c-4819-8342-97b700622bc8":null,"132951aa-3e10-410c-b5d0-1b5b1341bef8":null,"bf2eb754-38a9-479a-a3ca-6f876315830b":null},"_date":1689579382762,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"ports":["set",[["uuid","132951aa-3e10-410c-b5d0-1b5b1341bef8"],["uuid","388e3385-8573-4240-b914-a08aa3a623d2"],["uuid","61336f6e-3691-435d-9dca-2445301240f1"],["uuid","8193b121-3d6c-4bb5-acba-c7be2b6bed88"],["uuid","bf2eb754-38a9-479a-a3ca-6f876315830b"],["uuid","cb4fcab1-a80c-4819-8342-97b700622bc8"],["uuid","d5471baf-764e-4dbc-aa2d-96869b7061d6"],["uuid","db765055-6020-43d2-8a22-35d9d18d9b1f"]]]}},"_is_diff":true,"_comment":"ovs-vsctl (invoked by /usr/bin/ovnkube): /usr/bin/ovs-vsctl --timeout=15 --if-exists --with-iface del-port b2d117d5fc0c4ec -- --if-exists --with-iface del-port 63cfedb3828707c -- --if-exists --with-iface del-port 8824323d5ff74ba -- --if-exists --with-iface del-port 58bdfe3ae2dacc8 -- --if-exists --with-iface del-port 3743fd52d7a73b4 -- --if-exists --with-iface del-port de100db60e6b221 -- --if-exists --with-iface del-port fe7b9333cb2f5ed -- --if-exists --with-iface del-port ddbeccc99518185","Open_vSwitch":{"08d1652-0470-4421-b42b-49f93ee66dae":{"next_cfg":142712}}}
// Monday, July 17, 2023 7:36:27.008 AM
{"_date":1689579387008,"Bridge":{"4805dd32-af2a-4ac6-8917-3a27975c1ab5":{"external_ids":["map",[["ct-zone-stress-test-density-706_nginx-1-5745dddb7c-ldjkl","183"]]]}},"_is_diff":true,"_comment":"ovn-controller\novn-controller: modifying OVS tunnels '55fc8491-3fff-4894-808b-937302978c36'"}
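The failure pattern in the transactions above can be spotted mechanically: an Interface row whose ofport is -1 (with an error), followed by a later row that reuses the same external_ids:iface-id under a different port name. A minimal sketch in Python; the record layout mirrors the OVSDB JSON diffs above, but the function names are my own and this is not an official OVS tool:

```python
def find_failed_ifaces(records):
    """Return UUIDs of Interface rows whose ofport is -1 (creation failed)."""
    failed = set()
    for rec in records:
        for uuid, row in rec.get("Interface", {}).items():
            if row and row.get("ofport") == -1:
                failed.add(uuid)
    return failed


def find_reused_iface_ids(records):
    """Map each external_ids:iface-id to every port name that claimed it.
    OVSDB serializes maps as ["map", [[key, value], ...]]."""
    claims = {}
    for rec in records:
        for row in (rec.get("Interface") or {}).values():
            if not row or "external_ids" not in row:
                continue
            ext = dict(row["external_ids"][1])
            iface_id = ext.get("iface-id")
            if iface_id:
                claims.setdefault(iface_id, []).append(row.get("name"))
    # An iface-id claimed by more than one port name indicates reuse.
    return {k: v for k, v in claims.items() if len(v) > 1}
```

Running this over the transactions above would flag interface 39fa6f4e (ofport=-1) and show the lport's iface-id claimed by both 8824323d5ff74ba and 300cf7d291f454f.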
After removing the old port, I can see the logical port being released, which should not happen:
[ovn-controller] 2023-07-17T07:35:25.294Z|36242|binding|INFO|Claiming lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl for this chassis.
[ovn-controller] 2023-07-17T07:35:25.294Z|36243|binding|INFO|stress-test-density-706_nginx-1-5745dddb7c-ldjkl: Claiming 01:02:03:04:05:06 172.16.0.99
[ovn-controller] 2023-07-17T07:35:25.961Z|36252|binding|INFO|Releasing lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl from this chassis (sb_readonly=0)
--->
[ovn-controller] 2023-07-17T07:35:28.353Z|36280|binding|INFO|Claiming lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl for this chassis.
[ovn-controller] 2023-07-17T07:35:28.353Z|36281|binding|INFO|stress-test-density-706_nginx-1-5745dddb7c-ldjkl: Claiming 01:02:03:04:05:06 172.16.0.99
[ovn-controller] 2023-07-17T07:35:49.878Z|36729|binding|INFO|Setting lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl ovn-installed in OVS
[ovn-controller] 2023-07-17T07:35:49.878Z|36730|binding|INFO|Setting lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl up in Southbound
--->
[ovn-controller] 2023-07-17T07:36:22.767Z|37013|binding|INFO|Releasing lport stress-test-density-706_nginx-1-5745dddb7c-ldjkl from this chassis (sb_readonly=0) # this is BAD <--
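The claim/release churn above can be checked for by extracting the binding events per lport from the ovn-controller log; a final "Releasing" after the port has already been set up is the bad case. A rough sketch, assuming only the log line format shown above (no official parser exists for this):

```python
import re

# Matches the binding log lines shown above, e.g.
# "...|binding|INFO|Claiming lport <name> for this chassis."
# "...|binding|INFO|Releasing lport <name> from this chassis (sb_readonly=0)"
EVENT_RE = re.compile(
    r"\|binding\|INFO\|(Claiming|Releasing) lport (\S+?)(?: for| from) this chassis"
)


def lport_event_sequence(lines, lport):
    """Return the ordered Claiming/Releasing events for one logical port."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m and m.group(2) == lport:
            events.append(m.group(1))
    return events
```

For the log excerpt above this yields Claiming, Releasing, Claiming, Releasing; the trailing Releasing (while the replacement pod is still running) is the bug being reported.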
It is important to note that if the old interface's iface-id is removed before running the del-port command, then it seems to work:
In the end, a pod crashes only if the port UUID of the corresponding new interface is still listed on the bridge:
sh-4.4# ovs-vsctl list bridge | grep -c 5502f07d-d1bb-42ee-b090-00af944a5f88
1
But the required ct-zone is missing, so the priority=120 logical flows related to tables 7 and 12 will be missing too:
sh-4.4# ovs-vsctl list bridge | grep -c stress-test-density-706_nginx-1-5745dddb7c-ldjkl
0
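The two checks above boil down to one predicate: the new port's UUID is attached to the bridge while the bridge's external_ids no longer contain a ct-zone key for the lport. A hypothetical helper expressing that broken state (the inputs correspond to the `ovs-vsctl list bridge` output being grepped above):

```python
def pod_networking_broken(bridge_port_uuids, bridge_external_ids,
                          port_uuid, lport_name):
    """True when the new port is attached to the bridge but its ct-zone
    entry is gone -- the state that leaves the priority=120 flows missing
    and causes the pod's health checks to fail."""
    port_present = port_uuid in bridge_port_uuids
    ct_zone_present = any(
        key.startswith("ct-zone") and lport_name in key
        for key in bridge_external_ids
    )
    return port_present and not ct_zone_present
```

With the values from this report (port 5502f07d present, no ct-zone key containing the lport name), the predicate is true, matching the observed CrashLoopBackOff.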
The newly created pod has internal networking, but network packets cannot be routed properly and the health checks fail, causing the container to enter the CrashLoopBackOff state.
I've created a fix for the ovn-kubernetes code, but I believe this is a bug in ovn-controller, because the ct-zone should not be deleted while a corresponding interface is still attached to the bridge. If you disagree, please feel free to close this issue.
Thanks in advance, and let me know if I should provide more information.
@rchicoli sorry for the delay in replying, does this still happen with the latest version ovn-kubernetes uses upstream (I think that's ovn23.09.x from Fedora)?
I am no longer actively maintaining the platform. It is a little upsetting that the required "retry" option wasn't taken into consideration, although it had fixed a huge problem when the system was overloaded. Anyway, I've heard performance is better with the latest releases.
Here is the related PR: ovn-org/ovn-kubernetes#3784