improve readme with max-restart, and env-file example (#32)
* improve readme with max-restart, and env-file example

* fixes to comments
Obliviour committed Dec 5, 2023
1 parent 933c8a0 commit 93d66c0
Showing 1 changed file with 18 additions and 10 deletions.
28 changes: 18 additions & 10 deletions README.md
@@ -94,7 +94,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=4
+ --num-slices=4 --reservation=$RESERVATION_ID
```

and recreates the cluster with 8 slices. The command will rerun to create 4
@@ -103,7 +103,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=8
+ --num-slices=8 --reservation=$RESERVATION_ID
```

and recreates the cluster with 6 slices. The command will rerun to delete 2
@@ -113,13 +113,13 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --reservation=$RESERVATION_ID
# Skip delete prompts using --force.
python3 xpk.py cluster create --force \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --reservation=$RESERVATION_ID
```
## Cluster Delete
@@ -160,6 +160,14 @@ all zones.
xpk-test --tpu-type=v5litepod-16
```

### Set `max-restarts` for production jobs

* `--max-restarts <value>`: Defaults to 0. The job is restarted up to `<value>`
times when it terminates. For production jobs, we recommend increasing this
to a large number, such as 50, because real jobs can be interrupted by
hardware failures and software updates. This assumes your job implements
checkpointing, so that it restarts near where it was interrupted.
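
The checkpointing assumption can be illustrated with a minimal Python sketch. It is not part of xpk: the function names, the JSON checkpoint format, and the `fail_at` parameter (which simulates an interruption) are all hypothetical. The point is only that a restarted job should read its last saved step and continue from there rather than from zero.

```python
import json
import os
import tempfile


def load_step(checkpoint_path):
    """Return the last completed step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["step"]


def save_step(checkpoint_path, step):
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    directory = os.path.dirname(checkpoint_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, checkpoint_path)


def train(checkpoint_path, total_steps, fail_at=None):
    """Resume from the checkpoint and run until total_steps.

    fail_at is a hypothetical knob that raises at a given step to
    imitate a hardware failure or a software update.
    """
    step = load_step(checkpoint_path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated interruption")
        step += 1  # stand-in for one real training step
        save_step(checkpoint_path, step)
    return step
```

With a job structured this way, each restart triggered by `--max-restarts` picks up at the last saved step, so a large restart budget does not repeat completed work.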

### Workload Priority and Preemption
* Set the priority level of your workload with `--priority=LEVEL`

@@ -294,14 +302,16 @@ workload.
# More advanced facts:
* Workload create accepts a --docker-name and --docker-image.
By using custom images you can achieve very fast boots and hence very fast
feedback.
* Workload create accepts a --env-file flag that sets the container's
environment from a file. Usage is the same as Docker's
[--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env).
Example file:
```shell
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
```
* Workload create accepts a --debug-dump-gcs flag, which takes a path to a GCS bucket.
Passing this flag sets XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads
HLO dumps to the specified GCS bucket for each worker.
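
To make the env-file format above concrete, here is a simplified Python sketch of how such a file might be read. It is neither xpk's nor Docker's actual parser, and `parse_env_file` is a hypothetical name. It assumes one `KEY=VALUE` pair per line, treats lines starting with `#` as comments, and splits only on the first `=`, so a value such as `--my-flag=true --performance=high` is kept whole.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines into a dict (simplified, illustrative only)."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; the value may itself contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            # A bare name with no '=' is ignored in this sketch.
            continue
        env[key] = value
    return env
```

Applied to the example file, this yields `LIBTPU_INIT_ARGS` with the full string `--my-flag=true --performance=high` as its value. No quotes are needed around the value, and in this sketch none are removed.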
@@ -360,8 +370,6 @@ python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
gcloud auth login
```
### Roles needed based on permission errors:
* `requires one of ["container.*"] permission(s)`