
[RayCluster][Fix] Add expectations of RayCluster #2150

Open · Eikykun wants to merge 9 commits into master
Conversation

@Eikykun commented May 16, 2024

Why are these changes needed?

This PR attempts to address issues #715 and #1936 by adding expectation capabilities to ensure the pod is in the desired state during the next Reconcile following pod deletion/creation.
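As a minimal, hypothetical sketch of the idea (the names below are illustrative and are not the types introduced by this PR): record a pending create/delete right after the API call, and treat the cached Pod list as stale until the change has been observed.

```go
package expectations

import (
	"sync"
	"time"
)

// Expectations tracks Pods whose creation or deletion has been requested but
// has not yet shown up in the informer cache. All names are illustrative.
type Expectations struct {
	mu      sync.Mutex
	pending map[string]time.Time // "namespace/name" -> when the request was made
}

func New() *Expectations {
	return &Expectations{pending: map[string]time.Time{}}
}

// Expect is called right after a successful Create or Delete request.
func (e *Expectations) Expect(podKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[podKey] = time.Now()
}

// Observe is called when the informer cache finally reflects the change.
func (e *Expectations) Observe(podKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.pending, podKey)
}

// Satisfied reports whether the cached Pod list can be trusted again.
func (e *Expectations) Satisfied() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return len(e.pending) == 0
}
```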

Similar solutions can be found in:

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 (Member)

Hi @Eikykun, thank you for the PR! I will review it next week. Are you on Ray Slack? We can iterate more quickly there since this is a large PR. My Slack handle is "Kai-Hsun Chen (ray team)". Thanks!

@kevin85421 (Member)

I will review this PR tomorrow.

@kevin85421 (Member)

cc @rueian Would you mind giving this PR a review? I think I don't have enough time to review it today. Thanks!

Comment on lines 142 to 148
defer func() {
	if satisfied {
		ae.subjects.Delete(expectation)
	}
}()

satisfied, err = expectation.(*ActiveExpectation).isSatisfied()
Contributor

There are many read-after-write operations in the ActiveExpectations. Should we use a mutex to wrap these operations? For example, will the above ae.subjects.Delete(expectation) delete an unsatisfied expectation?

Author

The subjects field of ActiveExpectations is a ThreadSafeStore provided by k8s.io/client-go/tools/cache, so operations on ActiveExpectations.subjects are thread-safe. Each item stored in ActiveExpectations.subjects (an ActiveExpectation) also uses a ThreadSafeStore internally.
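For reference, a minimal sketch of the ThreadSafeStore usage described above (assumed standard client-go behavior, not code from this PR):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

func main() {
	// NewThreadSafeStore guards its internal map with a lock, so each
	// individual Add/Get/Delete is safe to call from many goroutines.
	store := cache.NewThreadSafeStore(cache.Indexers{}, cache.Indices{})

	store.Add("default/raycluster-head", "expectation item")
	if obj, ok := store.Get("default/raycluster-head"); ok {
		fmt.Println("found:", obj)
	}
	store.Delete("default/raycluster-head")
}
```

Note that this makes each individual call atomic; a read-then-write sequence spanning several calls still relies on higher-level coordination, which in this PR comes from the one-worker-per-request guarantee discussed below.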

Contributor

👍

return fmt.Errorf("fail to get active expectation item for %s when expecting: %s", key, err)
}

ae.recordTimestamp = time.Now()
Contributor

Should we use a mutex for updating the recordTimestamp?

Author

> Should we use a mutex for updating the recordTimestamp?

Thanks for your review. 😺

Whether ActiveExpectation's recordTimestamp needs a mutex depends on how ActiveExpectations is used. Currently it is only used within the controller's Reconcile func: multiple workers reconcile in parallel, but a given reconcile.Request is handled by only one worker at any given time.

As a result, only one goroutine handles the ActiveExpectations associated with a given RayCluster, so there are no concurrent reads and writes.

However, this could become an issue if ActiveExpectations were used externally, for example by an EventHandler, but we don't have that use case currently.
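As a rough illustration of that guarantee (assumed controller-runtime usage and hypothetical setup code, not part of this PR), the number of parallel workers is configured per controller, while the workqueue still hands each key to at most one worker at a time:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

func setupRayClusterController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&rayv1.RayCluster{}).
		// Several workers reconcile in parallel, but the underlying workqueue
		// never processes the same key in more than one worker at once.
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
		Complete(r)
}
```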

Contributor

👍

@rueian (Contributor) commented May 30, 2024

Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

@Eikykun (Author) commented Jun 3, 2024

> Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?

Apologies, I'm not quite clear about what "related informer cache" refers to.

@rueian (Contributor) commented Jun 8, 2024

> Just wondering: if client-go's workqueue ensures that no more than one consumer can process an equivalent reconcile.Request at any given time, why don't we clear the related informer cache when needed?
>
> Apologies, I'm not quite clear about what "related informer cache" refers to.

According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by triggering a manual Resync somehow.

@kevin85421 (Member)

I am reviewing this PR now. I will try to do a review iteration every 1 or 2 days.

@kevin85421 (Member) left a comment

I just reviewed a small part of this PR. I will try to do another iteration tomorrow.

ray-operator/controllers/ray/raycluster_controller.go (outdated review thread, resolved)
resource := ResourceInitializers[i.Kind]()
if err := i.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: i.Name}, resource); err == nil {
	return true, nil
} else if errors.IsNotFound(err) && i.RecordTimestamp.Add(30*time.Second).Before(time.Now()) {
Member

What does this mean? Do you mean:

(1) The Pod is not found in the informer cache.
(2) KubeRay has already submitted a Create request to the K8s API server at t=RecordTimestamp. If the Create request was made more than 30 seconds ago, we assume it satisfies the expectation.

I can't understand (2). If we sent a request 30 seconds ago and the informer still hasn't received information about the Pod, there are two possibilities:

  • (a) There are delays between the K8s API server and the informer cache.
  • (b) The creation failed.

For case (a), it is OK for the function to say that the expectation is satisfied. However, for case (b), what will happen if the creation fails and we tell the KubeRay operator it is satisfied?

Author

Case (b) may not occur here, because the expectation is only recorded after the Pod has been created successfully:

if err := r.Create(ctx, &pod); err != nil {
	return err
}
rayClusterExpectation.ExpectCreateHeadPod(key, pod.Namespace, pod.Name)

@kevin85421 (Member)

Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

@Eikykun (Author) commented Jun 12, 2024

> According to #715, the root cause is the stale informer cache, so I am wondering if the issue can be solved by fixing the cache, for example by triggering a manual Resync somehow.

Got it. From a problem-solving standpoint, if we didn't rely on an informer in the controller and instead queried the API server for Pods directly, the cache-consistency issue with etcd wouldn't occur. However, this approach would increase network traffic and hurt reconciliation efficiency.
As far as I understand, the Resync() method in DeltaFIFO is not intended to keep the cache consistent with etcd, but rather to prevent event loss through periodic reconciliation.
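For completeness, a sketch of that alternative (assumed controller-runtime usage, not part of this PR): a Manager exposes an uncached reader that bypasses the informer cache for a specific lookup, at the cost of extra API-server traffic.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
)

// getPodUncached reads a Pod straight from the API server via the manager's
// uncached reader, skipping the (possibly stale) informer cache.
func getPodUncached(ctx context.Context, mgr ctrl.Manager, namespace, name string) (*corev1.Pod, error) {
	pod := &corev1.Pod{}
	if err := mgr.GetAPIReader().Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, pod); err != nil {
		return nil, err
	}
	return pod, nil
}
```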

@Eikykun (Author) commented Jun 12, 2024

> Btw, @Eikykun would you mind rebasing with the master branch and resolving the conflict? Thanks!

Thanks for your review. I will go through the PR issues and resolve the conflicts later.

@kevin85421 (Member)

@Eikykun would you mind installing pre-commit (https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md) and fixing the linter issues? Thanks!

@kevin85421 (Member) left a comment

At a quick glance, it seems that we create an ActiveExpectationItem for each Pod creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to track only the number of Pods expected to be created or deleted per ReplicaSet.

@kevin85421 (Member)

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to track only the number of Pods expected to be created or deleted per ReplicaSet.

Follow up for ^

@Eikykun (Author) commented Jun 18, 2024

> At a quick glance, it seems that we create an ActiveExpectationItem for each Pod creation, deletion, or update. I have some concerns about a scalability bottleneck caused by the memory usage. In the ReplicaSet controller's source code, it seems to track only the number of Pods expected to be created or deleted per ReplicaSet.

Sorry, I didn't have time to reply a few days ago.

An ActiveExpectationItem is removed once its expectation is fulfilled, so the memory usage depends on how many Pods being created or deleted have not yet been synchronized to the cache; it might not actually consume much memory. Also, ControllerExpectations caches each Pod's UID: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L364
So I'm not quite sure which one is lighter, ActiveExpectationItem or ControllerExpectations.

I actually started with ControllerExpectations in RayCluster, but I'm not entirely sure why I switched to ActiveExpectationItem; perhaps ControllerExpectations was more complicated to use: it requires a PodEventHandler to drive the Observed logic, so RayCluster would need to implement that PodEventHandler logic separately.
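To make the comparison concrete, here is a rough, self-contained sketch of the counter-based pattern used by the ReplicaSet controller (illustrative only; the real ControllerExpectations API in k8s.io/kubernetes/pkg/controller differs in its details). It keeps two counters per owner instead of one item per Pod, which is also why it needs a PodEventHandler to decrement them when events arrive:

```go
package controllers

import "sync"

// counterExpectations stores only two counters per owner key instead of one
// entry per Pod, trading per-Pod detail for a smaller memory footprint.
type counterExpectations struct {
	mu   sync.Mutex
	adds map[string]int // pending creations per owner key
	dels map[string]int // pending deletions per owner key
}

func newCounterExpectations() *counterExpectations {
	return &counterExpectations{adds: map[string]int{}, dels: map[string]int{}}
}

// ExpectCreations is called before issuing n create requests for an owner.
func (e *counterExpectations) ExpectCreations(ownerKey string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[ownerKey] += n
}

// CreationObserved is called from the Pod event handler when an add event for
// one of the owner's Pods reaches the informer cache.
func (e *counterExpectations) CreationObserved(ownerKey string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.adds[ownerKey] > 0 {
		e.adds[ownerKey]--
	}
}

// Satisfied reports whether every previously requested change for the owner
// has been observed, i.e. the informer cache can be trusted again.
func (e *counterExpectations) Satisfied(ownerKey string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[ownerKey] == 0 && e.dels[ownerKey] == 0
}
```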
