
Support max-allowed Pods for nodes #3

Open · wants to merge 1 commit into dev

Conversation

@m1093782566

/cc @shivramsrivastava

@islinwb changed the title from "[WIP] support max-allowed Pods for nodes" to "Support max-allowed Pods for nodes" on Jun 9, 2018
@islinwb

islinwb commented Jun 9, 2018

@shivramsrivastava The PR's ready.

string core_id_substr = label.substr(idx + 4, label.size() - idx - 4);
uint32_t core_id = strtoul(core_id_substr.c_str(), 0, 10);
float available_cpu_cores =
latest_stats.cpus_stats(core_id).cpu_capacity() *


I'm a bit confused: why do we need to touch the CPU stats here?


Here we read the latest machine stats sample for a machine, fetch the particular core's CPU utilization and available CPU, and store them in the PU's resource descriptor. We then accumulate the combined CPU stats of all PUs (a PU is a core; a machine may have more than one PU) in the machine's resource descriptor. When deciding costs and drawing arcs, we use the accumulated available CPU from the machine's resource descriptor.
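
A minimal sketch of that flow, assuming available CPU per core is derived as capacity × (1 − utilization); the plain structs below merely stand in for Firmament's stats and ResourceDescriptor protobufs and are not the real types:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical stand-ins for Firmament's stats and ResourceDescriptor protobufs.
struct PuStats {
  double cpu_capacity;     // cores provided by this PU
  double cpu_utilization;  // fraction currently in use, in [0, 1]
};

struct ResourceDescriptor {
  double available_cpu_cores = 0.0;
};

int main() {
  // Latest machine stats sample: one entry per PU (core).
  std::vector<PuStats> latest_stats = {{1.0, 0.25}, {1.0, 0.50}};

  ResourceDescriptor machine_rd;  // machine-level descriptor
  std::vector<ResourceDescriptor> pu_rds(latest_stats.size());

  for (std::size_t core_id = 0; core_id < latest_stats.size(); ++core_id) {
    // Per-PU available CPU (assumed here as capacity * (1 - utilization)),
    // stored in that PU's resource descriptor.
    double available = latest_stats[core_id].cpu_capacity *
                       (1.0 - latest_stats[core_id].cpu_utilization);
    pu_rds[core_id].available_cpu_cores = available;
    // Accumulated at the machine's resource descriptor; arc costs and
    // capacities are later derived from this aggregated value.
    machine_rd.available_cpu_cores += available;
  }
  std::cout << "machine available cpu: " << machine_rd.available_cpu_cores << "\n";
  return 0;
}
```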

}
// Running/idle task count
rd_ptr->set_num_running_tasks_below(rd_ptr->current_running_tasks_size());
rd_ptr->set_num_slots_below(FLAGS_max_tasks_per_pu);


Shouldn't we set rd.num_slots_below()?


The current Firmament code fixes the maximum number of pods that can be scheduled on a PU using the constant FLAGS_max_tasks_per_pu. Because of this, when the number of running tasks on a PU reaches FLAGS_max_tasks_per_pu, the capacity of the arc from the Machine to that PU is set to zero, so we can schedule only FLAGS_max_tasks_per_pu tasks/pods on the PU. To remove this restriction, we need to set rd.num_slots_below() to the maximum number of pods that can be scheduled on that PU, and only for PU nodes.

The code changes could look like this:
1) Add a new field 'max_pods' to ResourceDescriptor, which gets its value from the kubelet 'max-pods' parameter only once, when the machine is added.
2) For PU nodes only, while updating the resource descriptor, update num_slots_below to max_pods, like below:
rd.set_num_slots_below(machine_rd.max_pods());

This way we can schedule max_pods pods on that PU, rather than just FLAGS_max_tasks_per_pu.
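
A minimal sketch of that change, using a simplified stand-in for ResourceDescriptor (the enum, field, and function names below are illustrative, not the actual Firmament code):

```cpp
#include <cstdint>
#include <iostream>

// Simplified stand-in for Firmament's ResourceDescriptor (the real one is a protobuf).
enum class ResourceType { MACHINE, PU };

struct ResourceDescriptor {
  ResourceType type = ResourceType::MACHINE;
  uint64_t max_pods = 0;         // new field: taken from kubelet max-pods once, on machine add
  uint64_t num_slots_below = 0;  // recomputed every scheduling round
};

// Called while updating a resource descriptor during a scheduling round.
void UpdateSlots(ResourceDescriptor* rd, const ResourceDescriptor& machine_rd) {
  if (rd->type == ResourceType::PU) {
    // Use the machine's max_pods instead of the fixed FLAGS_max_tasks_per_pu,
    // so the Machine -> PU arc capacity is no longer capped by the constant.
    rd->num_slots_below = machine_rd.max_pods;
  }
}

int main() {
  ResourceDescriptor machine_rd;
  machine_rd.max_pods = 110;  // e.g. the value passed to the kubelet for this node
  ResourceDescriptor pu_rd;
  pu_rd.type = ResourceType::PU;
  UpdateSlots(&pu_rd, machine_rd);
  std::cout << "PU num_slots_below: " << pu_rd.num_slots_below << "\n";  // prints 110
  return 0;
}
```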

@islinwb (Author)


Currently max-pods is passed to num_slots_below. Do we need a new field?


Since num_slots_below gets changed every scheduling round, that value does not persist. So it is better to add a new field.

rd_ptr->mutable_available_resources()->set_cpu_cores(
available_cpu_cores);
}
// Running/idle task count


353 to 324 is already done at 322 to 324, so it should be removed. When accumulator->type_ is PU, both code sections get executed, because 'other' is 'SINK' when the accumulator is 'PU'.

@islinwb (Author)


353 to 354?


Yes. Sorry for the mistake.

@islinwb force-pushed the max_pod_num branch 4 times, most recently from 5ba6239 to 972b86e on June 21, 2018
@islinwb

islinwb commented Jun 22, 2018

@shivramsrivastava PTAL

@nrshrivatsan

TL;DR

Schedulers should be cognizant of:

  1. Workload-Priority
  2. Predicates/Prerequisite for Workloads

Priority

Instead of discrete logic for the maximum permissible Pods on a Node, would you encourage us to have more condition-tree-based scheduling?

Node ---- HAS ----> Pods ---> have ---> Priority

Schedulers could honor Pod Priority [ P1, P2, P3 ] alongside the maximum permissible number of Pods.

Scheduler ---> Node A ---> Spawn max `P1` Pods ----> Spawn max `P2` Pods
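
A rough sketch of what priority-aware admission against a max-pods cap could look like (all the types and names below are hypothetical, not Poseidon/Firmament APIs):

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical pod with a numeric priority (lower value = higher priority, e.g. P1 before P2).
struct Pod {
  std::string name;
  int priority;
};

// Admit pending pods onto a node in priority order until the node's max-pods cap is reached.
std::vector<Pod> AdmitByPriority(std::vector<Pod> pending, uint32_t max_pods,
                                 uint32_t already_running) {
  std::sort(pending.begin(), pending.end(),
            [](const Pod& a, const Pod& b) { return a.priority < b.priority; });
  std::vector<Pod> admitted;
  for (const Pod& p : pending) {
    if (already_running + admitted.size() >= max_pods) break;  // cap reached
    admitted.push_back(p);
  }
  return admitted;
}

int main() {
  std::vector<Pod> pending = {{"batch-job", 3}, {"frontend", 1}, {"cache", 2}};
  for (const Pod& p : AdmitByPriority(pending, /*max_pods=*/2, /*already_running=*/0)) {
    std::cout << p.name << " (P" << p.priority << ")\n";  // frontend, then cache
  }
  return 0;
}
```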

Predicate

Context

It's a good start to have a Rate-Limiting function for workload scheduling.
However, workloads in production-grade systems tend to have criteria like:

  • Affinity/Anti-affinity
  • Taint-tolerant workloads
  • Volume-dependent workloads
  • Fault-domain-resilient workloads

Suggestion

If we abstract the rate limit to be a type of predicate for scheduling, we could have elegant predicate-based Poseidon scheduling.
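
One way to picture that abstraction (a sketch only; the predicate interface and class names below are hypothetical, not the actual Poseidon/Firmament API): the max-pods cap and the rate limit both become predicates that a pod must pass before it is placed on a node.

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical types: each predicate decides whether a pod may be placed on a node right now.
struct Node { uint32_t running_pods; uint32_t max_pods; };
struct Pod  { std::string name; };

class Predicate {
 public:
  virtual ~Predicate() = default;
  virtual bool Admit(const Pod& pod, const Node& node) = 0;
};

// The max-pods cap expressed as just another predicate.
class MaxPodsPredicate : public Predicate {
 public:
  bool Admit(const Pod&, const Node& node) override {
    return node.running_pods < node.max_pods;
  }
};

// Rate limiting expressed as a predicate: at most `limit` admissions per fixed window.
class RateLimitPredicate : public Predicate {
 public:
  RateLimitPredicate(uint32_t limit, std::chrono::seconds window)
      : limit_(limit), window_(window),
        window_start_(std::chrono::steady_clock::now()) {}

  bool Admit(const Pod&, const Node&) override {
    auto now = std::chrono::steady_clock::now();
    if (now - window_start_ >= window_) {  // start a new window
      window_start_ = now;
      count_ = 0;
    }
    if (count_ >= limit_) return false;    // window budget exhausted
    ++count_;
    return true;
  }

 private:
  uint32_t limit_;
  uint32_t count_ = 0;
  std::chrono::seconds window_;
  std::chrono::steady_clock::time_point window_start_;
};
```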

Reasons

  • From a business-availability standpoint, every workload has a priority
  • MUST-have Pods need to be scheduled first, followed by nice-to-have Pods

@shivramsrivastava @m1093782566 @islinwb - RFC

@deepak-vij

Thanks for your feedback. This is exactly what we currently support as part of the Poseidon/Firmament scheduler: Node Filtering (Hard Constraints) -> Priority/Scoring (Soft Constraints) -> Taints -> Volume Dependency. We are about to incorporate the max pods/node capability to go along with all of the above, if this is what you meant.

@nrshrivatsan

@deepak-vij thanks for the comment. Could I request a few reference hyperlinks to the documentation for hard & soft constraints?

@shivramsrivastava

@nrshrivatsan

  1. Taints and Tolerations

  2. Pod Level Affinity/Anti-Affinity

  3. Node Level Affinity/Anti-Affinity

  4. Volume-dependent workloads:
    We are currently working on this one; we support only a few volume types and only pre-bound PVCs, and do not yet support dynamic binding/storage classes. We will publish a document on this soon.

Can you please elaborate on the rate-limiting predicate?
Are you suggesting introducing rate limiting for pod scheduling as part of the predicate operation?

@nrshrivatsan

nrshrivatsan commented Sep 19, 2018

@shivramsrivastava Love the links!

Rate Limiting

  • A set of Pods could satisfy the predicates
  • While scheduling these Pods, since resource quotas might be constrained, it would be wise for the scheduler to schedule them in a rate-limited fashion
  • The value gain of rate limiting: determinism in the pod-scheduling rate (see the sketch below)
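
A tiny sketch of the determinism point (illustrative only, not Poseidon code): if pending pods are dispatched in fixed-size batches at a fixed interval, the scheduling rate is bounded by batch_size / interval no matter how deep the queue gets.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

int main() {
  std::vector<std::string> pending = {"p1", "p2", "p3", "p4", "p5"};
  const std::size_t batch_size = 2;               // pods admitted per tick
  const auto interval = std::chrono::seconds(1);  // fixed tick length

  // At most batch_size pods are dispatched per interval, so the pod-scheduling
  // rate stays predictable even when resource quotas are tight.
  for (std::size_t i = 0; i < pending.size(); i += batch_size) {
    std::size_t end = std::min(i + batch_size, pending.size());
    for (std::size_t j = i; j < end; ++j) {
      std::cout << "scheduling " << pending[j] << "\n";
    }
    if (end < pending.size()) std::this_thread::sleep_for(interval);
  }
  return 0;
}
```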
