From 7252238f270fb7659897dda119cdde76755bc1fa Mon Sep 17 00:00:00 2001
From: Obliviour
Date: Thu, 30 Nov 2023 00:50:40 +0000
Subject: [PATCH] improve readme with max-restart, and env-file example

---
 README.md | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index feac5d9..bc8e579 100644
--- a/README.md
+++ b/README.md
@@ -94,7 +94,7 @@ all zones.
   ```shell
   python3 xpk.py cluster create \
   --cluster xpk-test --tpu-type=v5litepod-16 \
-  --num-slices=4
+  --num-slices=4 --on-demand
   ```
 
   and recreates the cluster with 8 slices. The command will rerun to create 4
@@ -103,7 +103,7 @@ all zones.
   ```shell
   python3 xpk.py cluster create \
   --cluster xpk-test --tpu-type=v5litepod-16 \
-  --num-slices=8
+  --num-slices=8 --on-demand
   ```
 
   and recreates the cluster with 6 slices. The command will rerun to delete 2
@@ -113,13 +113,13 @@ all zones.
   ```shell
   python3 xpk.py cluster create \
   --cluster xpk-test --tpu-type=v5litepod-16 \
-  --num-slices=6
+  --num-slices=6 --on-demand
 
   # Skip delete prompts using --force.
 
   python3 xpk.py cluster create --force \
   --cluster xpk-test --tpu-type=v5litepod-16 \
-  --num-slices=6
+  --num-slices=6 --on-demand
   ```
 
 ## Cluster Delete
@@ -160,6 +160,14 @@ all zones.
   xpk-test --tpu-type=v5litepod-16
   ```
 
+### Set `max-restarts` for production jobs
+
+* `--max-restarts `: By default, this is 0. This will restart the job N times
+when the job terminates. For production jobs, it is recommended to increase
+`` to 50. Real jobs can be interrupted due to hardware failures and
+software updates. This works with checkpoints to restart your job where it last
+left off.
+
 ### Workload Priority and Preemption
 * Set the priority level of your workload with `--priority=LEVEL`
 
@@ -294,14 +302,16 @@ workload.
 
 # More advanced facts:
 
-* Workload create accepts a --docker-name and --docker-image.
-By using custom images you can achieve very fast boots and hence very fast
-feedback.
-
 * Workload create accepts a --env-file flag to allow specifying the container's
 environment from a file. Usage is the same as Docker's
 [--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env)
 
+    Example File:
+    ```shell
+    LIBTPU_INIT_ARGS=--my-flag=true --performance=high
+    MY_ENV_VAR=hello
+    ```
+
 * Workload create accepts a --debug-dump-gcs flag which is a path to GCS bucket.
 Passing this flag sets the XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads
 hlo dumps to the specified GCS bucket for each worker.
@@ -360,8 +370,6 @@ python3 xpk.py cluster create --cluster-cpu-machine-type=CPU_TYPE ...
   gcloud auth login
   ```
 
-
-
 ### Roles needed based on permission errors:
 
 * `requires one of ["container.*"] permission(s)`