Verify reservation exists and print describe details if it does (#37)
* Verify reservation exists and print describe details if it does

* more readme details
Obliviour committed Dec 5, 2023
1 parent 0efa2b0 commit 7c7c4b6
Showing 2 changed files with 64 additions and 2 deletions.
37 changes: 36 additions & 1 deletion README.md
@@ -63,6 +63,19 @@ cleanup with a `Cluster Delete`.

## Cluster Create

First set the project and zone through gcloud config or xpk arguments.

```shell
PROJECT_ID=my-project-id
ZONE=us-east5-b
# gcloud config:
gcloud config set project $PROJECT_ID
gcloud config set compute/zone $ZONE
# xpk arguments
xpk .. --zone $ZONE --project $PROJECT_ID
```
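The same choice between gcloud config and explicit arguments can be sketched in Python. This is a minimal illustration, not xpk's actual resolution logic: the helper names `gcloud_config_value` and `resolve_setting` are hypothetical, and the rule that an explicit argument takes precedence over gcloud config is an assumption.

```python
import subprocess
from typing import Callable, Optional


def gcloud_config_value(key: str) -> str:
    """Read one property (e.g. 'project' or 'compute/zone') from gcloud config."""
    result = subprocess.run(
        ["gcloud", "config", "get-value", key],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def resolve_setting(
    cli_value: Optional[str],
    key: str,
    lookup: Callable[[str], str] = gcloud_config_value,
) -> str:
    """Prefer the explicit argument, else fall back to gcloud config.

    Assumption: explicit arguments win. This mirrors common CLI behavior,
    not necessarily what xpk does internally.
    """
    if cli_value:
        return cli_value
    value = lookup(key)
    if not value:
        raise ValueError(f"{key} is not set; pass it explicitly or set it in gcloud config")
    return value
```

Passing a stub for `lookup` keeps the function usable without gcloud installed, for example `resolve_setting(None, "compute/zone", lookup=lambda key: "us-east5-b")`.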


The cluster created is a regional cluster to enable the GKE control plane across
all zones.

@@ -71,7 +84,7 @@ all zones.
```shell
# Find your reservations
gcloud compute reservations list --project=$PROJECT_ID
-# Run cluster create with reservation
+# Run cluster create with reservation.
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-256 \
--num-slices=2 \
@@ -86,6 +99,14 @@ all zones.
--num-slices=4 --on-demand
```

* Cluster Create (provision spot / preemptible capacity):

```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
--num-slices=4 --spot
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.

@@ -375,3 +396,17 @@ python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
* `requires one of ["container.*"] permission(s)`
Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user.
## Reservation Troubleshooting:
### How to determine your reservation and its size / utilization:
```shell
PROJECT_ID=my-project
ZONE=us-east5-b
RESERVATION=my-reservation-name
# Find the reservations in your project
gcloud beta compute reservations list --project=$PROJECT_ID
# Find the TPU machine type and current utilization of a reservation.
gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE
```
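To read size and utilization programmatically, the describe command can be run with gcloud's global `--format=json` flag and the output parsed. The sketch below only parses the JSON. The field names (`specificReservation.count`, `inUseCount`, `instanceProperties.machineType`) are assumed from the Compute Engine reservation resource and may differ for TPU reservations on the beta surface. The function name `summarize_reservation` and the sample payload are hypothetical.

```python
import json


def summarize_reservation(describe_json: str) -> dict:
    """Summarize a reservation from `reservations describe --format=json` output.

    Assumption: the payload follows the Compute Engine reservation resource
    layout; missing fields fall back to defaults instead of raising.
    """
    data = json.loads(describe_json)
    specific = data.get("specificReservation", {})
    # 64-bit counts are typically serialized as strings, so convert explicitly.
    count = int(specific.get("count", 0))
    in_use = int(specific.get("inUseCount", 0))
    machine_type = specific.get("instanceProperties", {}).get("machineType", "unknown")
    return {
        "name": data.get("name", "unknown"),
        "machine_type": machine_type,
        "count": count,
        "in_use": in_use,
        "free": count - in_use,
    }


# Hypothetical payload for illustration only; not captured from a real project.
SAMPLE = json.dumps({
    "name": "my-reservation-name",
    "specificReservation": {
        "count": "16",
        "inUseCount": "4",
        "instanceProperties": {"machineType": "example-machine-type"},
    },
})

if __name__ == "__main__":
    print(summarize_reservation(SAMPLE))
```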
29 changes: 28 additions & 1 deletion xpk.py
@@ -791,7 +791,7 @@ def print_reservations(args) -> int:
0 if successful and 1 otherwise.
"""
command = (
-f'gcloud compute reservations list --project={args.project}'
+f'gcloud beta compute reservations list --project={args.project}'
)
return_code = (
run_command_with_updates(
@@ -803,6 +803,30 @@ def print_reservations(args) -> int:
return 0


def verify_reservation_exists(args) -> int:
  """Verify the reservation exists.

  Args:
    args: user provided arguments for running the command.

  Returns:
    0 if successful and 1 otherwise.
  """
  command = (
      f'gcloud beta compute reservations describe {args.reservation}'
      f' --project={args.project} --zone={args.zone}'
  )
  return_code = (
      run_command_with_updates(
          command, 'Describe reservation', args)
  )
  if return_code != 0:
    xpk_print(f'Describe reservation returned ERROR {return_code}')
    xpk_print('Please confirm that your reservation name is correct.')
    return 1
  return 0
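
The check above depends on xpk's own `run_command_with_updates` and `xpk_print` helpers. A standalone sketch of the same idea, using only the standard library, might look like the following. The names `build_describe_command` and `reservation_exists` are hypothetical, and the injectable `run` parameter is an addition for testability rather than part of this commit.

```python
import subprocess
from typing import Callable, Optional


def build_describe_command(reservation: str, project: str, zone: str) -> list[str]:
    """Build the same describe command as above, as an argument list."""
    return [
        "gcloud", "beta", "compute", "reservations", "describe", reservation,
        f"--project={project}", f"--zone={zone}",
    ]


def reservation_exists(
    reservation: str,
    project: str,
    zone: str,
    run: Optional[Callable[[list[str]], int]] = None,
) -> bool:
    """Return True when the describe command exits with code 0."""
    if run is None:
        # Default runner: execute gcloud and report its exit code.
        run = lambda cmd: subprocess.run(cmd, check=False).returncode
    return run(build_describe_command(reservation, project, zone)) == 0
```

Passing the command as a list avoids shell quoting problems if a reservation name contains unexpected characters.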


def get_capacity_arguments(args) -> tuple[str, int]:
"""Determine the TPU Nodepool creation capacity arguments needed.
@@ -822,6 +846,9 @@ def get_capacity_arguments(args) -> tuple[str, int]:
capacity_args = ""
num_types+=1
  if args.reservation:
    return_code = verify_reservation_exists(args)
    if return_code > 0:
      return capacity_args, return_code
    capacity_args = (
        f'--reservation-affinity=specific --reservation={args.reservation}'
    )
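
The early return in this hunk means a missing reservation stops the flow before any node pool arguments are produced. The sketch below illustrates that control flow in isolation. It is not the real `get_capacity_arguments`: the function name `capacity_arguments`, the rule that exactly one capacity type must be selected, and the `--spot` mapping are assumptions, and `verify` stands in for `verify_reservation_exists`.

```python
from types import SimpleNamespace
from typing import Callable


def capacity_arguments(args, verify: Callable[[object], int]) -> tuple[str, int]:
    """Return (capacity flags, return code) for one selected capacity type."""
    selected = [bool(args.on_demand), bool(args.reservation), bool(args.spot)]
    if sum(selected) != 1:
        # Assumption: exactly one of on-demand, reservation, or spot is required.
        return "", 1
    if args.reservation:
        # Fail fast when the reservation cannot be described.
        if verify(args) > 0:
            return "", 1
        return (
            f"--reservation-affinity=specific --reservation={args.reservation}",
            0,
        )
    if args.spot:
        # Assumption: spot capacity maps to the node pool `--spot` flag.
        return "--spot", 0
    # On-demand capacity needs no extra flags.
    return "", 0


if __name__ == "__main__":
    example = SimpleNamespace(on_demand=False, reservation="my-reservation-name", spot=False)
    print(capacity_arguments(example, verify=lambda a: 0))
```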