improve readme with max-restart, and env-file example
Obliviour committed Nov 30, 2023
1 parent 933c8a0 commit 7252238
Showing 1 changed file with 18 additions and 10 deletions.
28 changes: 18 additions & 10 deletions README.md
@@ -94,7 +94,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=4
+ --num-slices=4 --on-demand
```

and recreates the cluster with 8 slices. The command will rerun to create 4
@@ -103,7 +103,7 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=8
+ --num-slices=8 --on-demand
```

and recreates the cluster with 6 slices. The command will rerun to delete 2
@@ -113,13 +113,13 @@ all zones.
```shell
python3 xpk.py cluster create \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --on-demand
# Skip delete prompts using --force.
python3 xpk.py cluster create --force \
--cluster xpk-test --tpu-type=v5litepod-16 \
- --num-slices=6
+ --num-slices=6 --on-demand
```
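
The resize behavior in these examples comes down to the difference between the current and the requested slice count. Here is a minimal sketch of that arithmetic, assuming a hypothetical helper named `slice_delta`; it is not part of xpk and does not claim to mirror how xpk implements resizing:

```shell
# Hypothetical helper (not an xpk command): report what a rerun of
# `cluster create` would have to do to get from CURRENT to DESIRED slices.
slice_delta() {
  current="$1"
  desired="$2"
  if [ "${desired}" -gt "${current}" ]; then
    echo "create $((desired - current))"
  elif [ "${desired}" -lt "${current}" ]; then
    echo "delete $((current - desired))"
  else
    echo "no change"
  fi
}

slice_delta 4 8   # 4 slices exist, 8 requested
slice_delta 8 6   # 8 slices exist, 6 requested
```

The shrinking case is the one that triggers delete prompts, which is where `--force` applies.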
## Cluster Delete
@@ -160,6 +160,14 @@ all zones.
xpk-test --tpu-type=v5litepod-16
```

### Set `max-restarts` for production jobs

* `--max-restarts <value>`: Defaults to 0. When the job terminates, it is
restarted up to `<value>` times. For production jobs, we recommend setting
`<value>` to 50, because real jobs can be interrupted by hardware failures and
software updates. Combined with checkpointing, this lets your job resume from
where it last left off.
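
As a rough illustration, the sketch below prints a `workload create` invocation with `--max-restarts` set to the recommended value. The workload name and the training command are placeholders, not values from this README, and the `--workload` and `--command` flags are assumed from the workload create usage elsewhere in it. The command is only printed, so you can review it before running it yourself:

```shell
MAX_RESTARTS=50                       # recommended value for production jobs
WORKLOAD_NAME="xpk-prod-workload"     # hypothetical workload name
WORKLOAD_COMMAND="python3 train.py"   # hypothetical; must save and restore its own checkpoints

# Print the command instead of running it.
print_workload_cmd() {
  printf 'python3 xpk.py workload create \\\n'
  printf '  --cluster xpk-test --tpu-type=v5litepod-16 \\\n'
  printf '  --workload %s --command "%s" \\\n' "${WORKLOAD_NAME}" "${WORKLOAD_COMMAND}"
  printf '  --max-restarts %s\n' "${MAX_RESTARTS}"
}

print_workload_cmd
```

Restarts only help if the command itself resumes from its latest checkpoint; otherwise each restart begins from scratch.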

### Workload Priority and Preemption
* Set the priority level of your workload with `--priority=LEVEL`

@@ -294,14 +302,16 @@ workload.
# More advanced facts:
* Workload create accepts a --docker-name and --docker-image.
By using custom images you can achieve very fast boots and hence very fast
feedback.
* Workload create accepts a --env-file flag that sets the container's
environment from a file. Usage is the same as Docker's
[--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env).
Example file:
```shell
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
```
* Workload create accepts a --debug-dump-gcs flag, which takes a path to a GCS bucket.
Passing this flag sets XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads
each worker's HLO dumps to the specified GCS bucket.
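
Because `--env-file` follows Docker's format, each line of the file is read as `KEY=VALUE` and split at the first `=` only, so a value may itself contain `=` and spaces, as `LIBTPU_INIT_ARGS` does in the example file. The sketch below illustrates that splitting rule with a hypothetical `parse_env_file` helper; it is an approximation for illustration, not xpk's or Docker's actual parser, and it does not handle lines without an `=`:

```shell
# Recreate the example env file in a temporary location.
ENV_FILE="$(mktemp)"
cat > "${ENV_FILE}" <<'EOF'
LIBTPU_INIT_ARGS=--my-flag=true --performance=high
MY_ENV_VAR=hello
EOF

# Hypothetical helper: print "KEY -> VALUE" for each entry, splitting at the
# first '=' and skipping blank lines and '#' comment lines.
parse_env_file() {
  while IFS= read -r line || [ -n "${line}" ]; do
    case "${line}" in
      ''|'#'*) continue ;;
    esac
    key="${line%%=*}"     # everything before the first '='
    value="${line#*=}"    # everything after the first '='
    printf '%s -> %s\n' "${key}" "${value}"
  done < "$1"
}

parse_env_file "${ENV_FILE}"
rm -f "${ENV_FILE}"
```

Note that values are taken literally: quotes around a value would become part of the value rather than being stripped.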
@@ -360,8 +370,6 @@ python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
gcloud auth login
```
### Roles needed based on permission errors:
* `requires one of ["container.*"] permission(s)`