Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable create workload for h150 #133

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

enable create workload for h150 #133

wants to merge 3 commits into from

Conversation

NinaCai
Copy link
Collaborator

@NinaCai NinaCai commented Apr 18, 2024

Fixes / Features

  • hot fix for reservation
  • enable create workload for h150
  • remove nccl plugin installation in workload creation

Testing / Documentation

Testing details.

  • [ y/n ] Tests pass
  • [ y/n ] Appropriate changes to documentation are included in the PR

@NinaCai NinaCai requested a review from Obliviour as a code owner April 18, 2024 23:30
xpk.py Outdated
' kubectl config view && kubectl config set-context --current'
' --namespace=default'
)
if args.device_type == h150_device_type:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Obliviour Here we need to add --device-type back to workload deletion, because A3 and A3+ handle zone/region differently.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get-credentials command is searching for the cluster. XPK only creates regional clusters so it should only need to search for a cluster using --region arg.


A scenario where you would need --zone is if the cluster was created outside of xpk using --zone. I am guessing that is what happened in the h150 cluster you are testing with. So we shouldn't need a different code path.

Copy link
Collaborator

@Obliviour Obliviour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial read, probably need to take a deeper look Monday.

@@ -2123,7 +2102,7 @@ def parse_env_config(args, tensorboard_config):
for key, value in tensorboard_config.items():
env[key.upper()] = value

if device_type == h100_device_type:
if device_type == h100_device_type or device_type == h150_device_type:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe change this to if accelerator_type = AcceleratorType['gpu']?

@@ -3120,7 +3099,7 @@ def get_capacity_arguments_from_capacity_type(
capacity_args = '--spot'
case CapacityType.RESERVATION:
capacity_args = (
f'--reservation-affinity=specific --reservation={args.reservation}'
f'--reservation-affinity=specific --reservation={args.reservation} --placement-policy={args.reservation}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed in TPUs / CPUs? I see the build test failing with TPUs, I assume related to this.

xpk.py Outdated
' kubectl config view && kubectl config set-context --current'
' --namespace=default'
)
if args.device_type == h150_device_type:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get-credentials command is searching for the cluster. XPK only creates regional clusters so it should only need to search for a cluster using --region arg.


A scenario where you would need --zone is if the cluster was created outside of xpk using --zone. I am guessing that is what happened in the h150 cluster you are testing with. So we shouldn't need a different code path.

if args.device_type == h100_device_type:
return "kueue.x-k8s.io/queue-name: multislice-queue" # Name of the LocalQueue
else:
return "" #A3+ doesn't support Kueue yet
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why A3+ doesn't support kueue? and what is missing there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants