Do not reconcile egressIPPod objects that are being deleted when the namespace no longer exists #3751

Merged: 1 commit into ovn-org:master on Jul 7, 2023

Conversation

@kyrtapz (Contributor) commented Jul 6, 2023

When a namespace with pods gets removed and the pod removal event is handled after the namespace is already gone, we should ignore the NotFound error in reconcileEgressIPPod. Any potential configuration will get removed in reconcileEgressIPNamespace.

Found in downstream CI, tracked in https://issues.redhat.com/browse/OCPBUGS-15804
Example error:

I0707 01:15:21.862524       1 namespace.go:260] [e2e-netpol-z-1913] deleting namespace
...
E0707 01:15:21.894356       1 obj_retry.go:680] Failed to delete *factory.egressIPPod e2e-netpol-z-1913/b, error: namespace "e2e-netpol-z-1913" not found
I0707 01:15:21.894452       1 pods.go:100] Deleting pod: e2e-netpol-z-1913/b
E0707 01:15:21.904170       1 obj_retry.go:680] Failed to delete *factory.egressIPPod e2e-netpol-z-1913/c, error: namespace "e2e-netpol-z-1913" not found
I0707 01:15:21.904206       1 pods.go:100] Deleting pod: e2e-netpol-z-1913/c

/cc @tssurya @jcaamano
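
For readers skimming this thread, a minimal, self-contained sketch of the condition the description asks for. The helper name isDeleteForGoneNamespace and the standalone main are illustrative only; in the actual PR the check lives inline in reconcileEgressIPPod, as the diff further down in the review shows.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// isDeleteForGoneNamespace mirrors the guard described above: a pod deletion
// event (new == nil) whose namespace lookup failed with NotFound can be
// ignored, because reconcileEgressIPNamespace removes any leftover config.
// The helper name is illustrative; the real check is inline in reconcileEgressIPPod.
func isDeleteForGoneNamespace(newPodIsNil bool, err error) bool {
	return newPodIsNil && apierrors.IsNotFound(err)
}

func main() {
	nsGone := apierrors.NewNotFound(schema.GroupResource{Resource: "namespaces"}, "e2e-netpol-z-1913")
	fmt.Println(isDeleteForGoneNamespace(true, nsGone))  // true: pod delete event, safe to skip the error
	fmt.Println(isDeleteForGoneNamespace(false, nsGone)) // false: pod update event, keep returning the error
}
```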

@kyrtapz kyrtapz marked this pull request as ready for review July 6, 2023 16:42
@tssurya (Member) commented Jul 7, 2023

hey @flavio-fernandes: dualstack is failing again on IC: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5483719976/jobs/9990837646?pr=3751 PTAL! I really don't want to reopen our tracker card again HAHA
@kyrtapz you are rebased on #3724, I guess

… and the namespace no longer exists

When a namespace with pods gets removed and the pod removal event is handled after the namespace is already gone,
we should ignore the NotFound error in reconcileEgressIPPod.
Any potential configuration will get removed in reconcileEgressIPNamespace.

Signed-off-by: Patryk Diak <[email protected]>
@kyrtapz kyrtapz changed the title Do not reconcile egressIPPod objects that are being deleted and the namespace no longer exists Do not reconcile egressIPPod objects that are being deleted when the namespace no longer exists Jul 7, 2023
@tssurya (Member) left a comment

hope we have coverage for this somewhere in our UTs? I am afraid each if condition we are adding is now for a bug fix (another sample: #3692; these should have tests...) and I don't want someone in the future (like I did for IC ROFL) to come and change these :D - won't hold the PR though for the missing UT.

@@ -327,6 +327,12 @@ func (oc *DefaultNetworkController) reconcileEgressIPPod(old, new *v1.Pod) (err
oldPod = old
namespace, err = oc.watchFactory.GetNamespace(oldPod.Namespace)
if err != nil {
// when the whole namespace gets removed, we can ignore the NotFound error here
// any potential configuration will get removed in reconcileEgressIPNamespace
if new == nil && apierrors.IsNotFound(err) {
@tssurya (Member) commented on the diff:
@kyrtapz: what about the other paths? reconcileEgressIP might get called and let's say the pod is not found or the namespace is not found.. are there other places where this protection needs to be added? (Having said that, each if in egressIP code at this point is a specific bug fix, so it's like walking on eggshells if we ignore something we shouldn't be.. as long as at least one path handles the deletion we should be safe.)

@kyrtapz (Contributor, Author) replied Jul 7, 2023

reconcileEgressIPPod doesn't fetch the pod, so it won't be directly affected by the pod no longer existing.
As for the namespace not being found, I think we should still return an error on update, since it seems like a messed-up setup if there are pods running in a non-existent namespace.

@kyrtapz (Contributor, Author) commented Jul 7, 2023

> hope we have coverage for this somewhere in our UTs? I am afraid each if condition we are adding is now for a bug fix (another sample: #3692; these should have tests...) and I don't want someone in the future (like I did for IC ROFL) to come and change these :D - won't hold the PR though for the missing UT.

I can try to think of how to trigger this specific issue in our tests, but I think we might need to get that PR in first as it is affecting CI.

@dcbw (Contributor) commented Jul 7, 2023

So we don't forget... the test could do Namespace("foo").Delete() via the fake client, which shouldn't trigger any additional events because it's not a real apiserver.

Then use an Eventually() calling oc.getNamespaceLocked("foo", false) and wait for a nil return to indicate the namespace is gone from our cache.

Then delete the pod and verify that it doesn't get retried.
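
A rough Gomega sketch of that flow. The fakeClient and oc variables, the pod name "bar", the usual context/metav1/gomega/ginkgo imports, and the exact return shape of getNamespaceLocked are assumptions taken from the comment above, not verified against the actual test suite.

```go
// Inside an existing egress IP unit test, after the controller (oc) and the
// fake clientset (fakeClient) have been set up as usual.
ginkgo.By("deleting the namespace directly via the fake client (no extra events fire)")
err := fakeClient.CoreV1().Namespaces().Delete(context.TODO(), "foo", metav1.DeleteOptions{})
gomega.Expect(err).NotTo(gomega.HaveOccurred())

ginkgo.By("waiting for the namespace to drop out of the controller cache")
gomega.Eventually(func() bool {
	// Per the comment above, a nil return from getNamespaceLocked means the
	// namespace is gone from the cache; its exact signature isn't shown here.
	return oc.getNamespaceLocked("foo", false) == nil
}).Should(gomega.BeTrue())

ginkgo.By("deleting a pod in the now-gone namespace")
err = fakeClient.CoreV1().Pods("foo").Delete(context.TODO(), "bar", metav1.DeleteOptions{})
gomega.Expect(err).NotTo(gomega.HaveOccurred())
// Asserting that the egressIPPod delete is not retried depends on how the
// suite exposes the retry framework (e.g. checking the retry cache stays
// empty); that assertion is left as a placeholder here.
```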

@flavio-fernandes (Contributor) commented:
> hey @flavio-fernandes: dualstack is failing again on IC: https://github.com/ovn-org/ovn-kubernetes/actions/runs/5483719976/jobs/9990837646?pr=3751 PTAL! I really don't want to reopen our tracker card again HAHA @kyrtapz you are rebased on #3724, I guess

@tssurya I think the test is not being patient enough. Check this out:

...
+ kubectl wait --for=condition=ready pods --all --timeout=100s
pod/httpd-deployment-6b68fbf665-r8fzm condition met
pod/httpd-deployment-6b68fbf665-vfwd9 condition met
pod/server-deployment-8468968cd5-c5q8j condition met
pod/server-deployment-8468968cd5-vmznw condition met
+ IPS=()
++ kubectl get services my-service-v4 -o 'jsonpath={.spec.clusterIPs[0]}'
+ CLUSTER_IPV4=10.96.222.173
+ IPS+=("${CLUSTER_IPV4}:8081")
++ kubectl get services my-service-v6 -o 'jsonpath={.spec.clusterIPs[0]}'
+ CLUSTER_IPV6=fd00:10:96::57f5
+ IPS+=("\[${CLUSTER_IPV6}\]:8080")
+ for n in $NODES
+ for ip in ${IPS[@]}
+ docker exec ovn-worker2 curl --connect-timeout 5 10.96.222.173:8081
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    45  100    45    0     0     42 <html><body><h1>It works!</h1></body></html>
     0  0:00:01  0:00:01 --:--:--    42
+ for ip in ${IPS[@]}
+ docker exec ovn-worker2 curl --connect-timeout 5 '\[fd00:10:96::57f5\]:8080'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
curl: (28) Connection timeout after 5000 ms


❯ docker exec ovn-worker2 curl --connect-timeout 5 '\[fd00:10:96::57f5\]:8080'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    45  100    45    0     0   6695      0 --:--:-- --:--:-- --:--:--  7500
<html><body><h1>It works!</h1></body></html>

The kubectl wait --for=condition=ready pods --all --timeout=100s is not taking into account the milliseconds it will take ovnk to set up the static routes across the nodes, I think. I will tweak the test to accommodate that.

@dcbw dcbw merged commit 08033f4 into ovn-org:master Jul 7, 2023
24 of 25 checks passed