separate lammps from hello-world and install nightly release of hq
Signed-off-by: vsoch <[email protected]>
vsoch committed Jun 21, 2023
1 parent 39537bc commit a6f7b76
Showing 11 changed files with 215 additions and 49 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -191,7 +191,7 @@ KUSTOMIZE ?= $(LOCALBIN)/kustomize
CONTROLLER_GEN ?= $(LOCALBIN)/controller-gen
ENVTEST ?= $(LOCALBIN)/setup-envtest

# Buidl config
# Build config
.PHONY: build-config
build-config: manifests kustomize ## Deploy controller to the K8s cluster specified in ~/.kube/config.
cd config/manager && $(KUSTOMIZE) edit set image controller=${IMG}
109 changes: 108 additions & 1 deletion README.md
@@ -78,13 +78,111 @@ See logs for the operator
$ kubectl logs -n hyperqueue-operator-system hyperqueue-operator-controller-manager-6f6945579-9pknp
```

#### Hello World Example

Create a "hello-world" interactive cluster:

```bash
$ kubectl apply -f examples/tests/hello-world/hyperqueue.yaml
```

Look at the logs to see the worker/server starting:
The access pod runs first and generates the node access file, which you can inspect:

<details>

<summary>Node access.json generation</summary>

```console
$ kubectl logs -n hyperqueue-operator hyperqueue-sample-access
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:3 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [938 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [108 kB]
Get:6 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:7 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [631 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/multiverse amd64 Packages [36.3 kB]
Get:10 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [541 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [545 kB]
Get:14 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [42.2 kB]
Get:15 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [919 kB]
Get:16 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1191 kB]
Get:17 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [27.0 kB]
Get:18 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [49.4 kB]
Fetched 25.2 MB in 5s (4622 kB/s)
Reading package lists... Done
2023-06-21T00:59:18Z INFO Storing access file as 'operator-access.json'
CUT HERE
{
"version": "nightly-2023-06-20-db011ed5a3faecf31168709417cd8a736a297a50",
"server_uid": "VhbvXU",
"client": {
"host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
"port": 6789,
"secret_key": "4ba72c008fb72b29f8db96ba5c103fc29b73157afe10b6e53bb2832b5e152d46"
},
"worker": {
"host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
"port": 1234,
"secret_key": "63ca4ca518a568da934edc7652c0178ca05436d720362693d257c68a2b7286aa"
}
}
```

</details>
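
The access file printed after the `CUT HERE` marker is plain JSON, so it can be decoded with the standard library if you want to check the host or ports programmatically. The following is a minimal sketch, not code from this repository: the `AccessFile` and `Endpoint` types and the `parseAccess` helper are hypothetical names, and they cover only the fields visible in the log above (the secret keys are replaced with placeholders).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Endpoint mirrors the "client" and "worker" objects shown in the log above.
type Endpoint struct {
	Host      string `json:"host"`
	Port      int    `json:"port"`
	SecretKey string `json:"secret_key"`
}

// AccessFile is a hypothetical type covering only the fields printed above.
type AccessFile struct {
	Version   string   `json:"version"`
	ServerUID string   `json:"server_uid"`
	Client    Endpoint `json:"client"`
	Worker    Endpoint `json:"worker"`
}

// parseAccess decodes the JSON that follows the "CUT HERE" marker.
func parseAccess(raw []byte) (AccessFile, error) {
	var access AccessFile
	err := json.Unmarshal(raw, &access)
	return access, err
}

func main() {
	raw := []byte(`{
  "version": "nightly-2023-06-20-db011ed5a3faecf31168709417cd8a736a297a50",
  "server_uid": "VhbvXU",
  "client": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 6789,
    "secret_key": "<placeholder>"
  },
  "worker": {
    "host": "hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local",
    "port": 1234,
    "secret_key": "<placeholder>"
  }
}`)

	access, err := parseAccess(raw)
	if err != nil {
		panic(err)
	}
	// Print the server UID and the two ports the server log should also report.
	fmt.Println(access.ServerUID, access.Client.Port, access.Worker.Port)
}
```

The server UID and ports decoded here should match the table the server prints on startup.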

You should then see the server start and submit the job. It will `--wait` for the job to finish and then cat the output
file:

```bash
$ kubectl logs -n hyperqueue-operator hyperqueue-sample-server-0-0-fnnnj -f
```
```console
Found extra command echo hello world
2023-06-21T00:59:44Z INFO No online server found, starting a new server
2023-06-21T00:59:44Z INFO Storing access file as '/root/.hq-server/001/access.json'
+------------------+-------------------------------------------------------------------------------+
| Server directory | /root/.hq-server |
| Server UID | VhbvXU |
| Client host | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Client port | 6789 |
| Worker host | hyperqueue-sample-server-0-0.hq-service.hyperqueue-operator.svc.cluster.local |
| Worker port | 1234 |
| Version | nightly-2023-06-20-db011ed5a3faecf31168709417cd8a736a297a50 |
| Pid | 2710 |
| Start date | 2023-06-21 00:59:44 UTC |
+------------------+-------------------------------------------------------------------------------+
2023-06-21T00:59:44Z INFO Worker 1 registered from 10.244.0.22:60396
2023-06-21T00:59:45Z INFO Worker 2 registered from 10.244.0.23:57144
hq submit --wait --name hello-world --nodes 2 --log hello-world.out echo hello world
Job submitted successfully, job ID: 1
Wait finished in 46ms 37us 130ns: 1 job finished
HQ:log
hello world
```

Since `interactive: true` is set, you can now shell in and interact with your cluster!
See the [interactive example](#interactive-example) for how we did this with the LAMMPS example.
Clean up the example when you are done:

```bash
$ kubectl delete -f examples/tests/hello-world/hyperqueue.yaml
```

#### LAMMPS Example

This example (for the time being) uses a custom image with hq already installed.

```bash
$ kubectl apply -f examples/tests/lammps/hyperqueue.yaml
```

Note that since we are pulling large custom containers, this will take a little bit longer.
Make sure the pods are running before trying to look at logs! When they are,
look at the logs to see the worker/server starting:

```console
2023-06-04T06:03:50Z INFO No online server found, starting a new server
@@ -211,6 +309,12 @@ Dangerous builds not checked
Total wall time: 0:00:12
```

In the above, we see the two workers registering, and then MPI/LAMMPS running with 2 processes
(one thread per node I think). Akin to the first example, we have `interactive: true` here so
you can proceed to the next section for an interactive example.

#### Interactive Example

Since our job sets `interactive: true`, the cluster stays running after the job is finished,
and we can shell in and submit a job interactively, e.g.:

@@ -243,6 +347,9 @@ $ hq job list --all

If you don't specify a `--log` file, the logs can show up on any worker depending on where the job runs,
typically in a directory called `log-N` (e.g., `log-4`) under the working directory. And that's it!

#### Cleanup

When you are finished:

```bash
18 changes: 11 additions & 7 deletions api/v1alpha1/hyperqueue_types.go
@@ -28,7 +28,7 @@ type HyperqueueSpec struct {
// INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
// Important: Run "make" to regenerate code after modifying this file

// Server is the main server to run hyperqueue
// Server is the main server to run Hyperqueue
Server Node `json:"server"`

// Name for the cluster service
@@ -48,11 +48,15 @@ type HyperqueueSpec struct {
// +kubebuilder:default="0.15.0"
// +default="0.15.0"
// +optional
HyperqueueVersion string `json:"hyperqueueVersion,omitempty"`
HyperqueueVersion string `json:"HyperqueueVersion,omitempty"`

// Size of the hyperqueue (1 server + (N-1) nodes)
// Size of the Hyperqueue (1 server + (N-1) nodes)
Size int32 `json:"size"`

// Global commands to run on all nodes
// +optional
Commands Commands `json:"commands,omitempty"`

// Interactive mode keeps the cluster running
// +optional
Interactive bool `json:"interactive"`
@@ -87,13 +91,13 @@ type Job struct {
// Node corresponds to a pod (server or worker)
type Node struct {

// Image to use for hyperqueue
// Image to use for Hyperqueue
// +kubebuilder:default="ubuntu"
// +default="ubuntu"
// +optional
Image string `json:"image"`

// Port for hyperqueue to use.
// Port for Hyperqueue to use.
// Since we have a headless service, this
// is not represented in the operator, just
// in starting the server or a worker
@@ -112,7 +116,7 @@ type Node struct {
// +optional
Command string `json:"command,omitempty"`

// Commands to run around different parts of the hyperqueu setup
// Commands to run around different parts of the hyperqueue setup
// +optional
Commands Commands `json:"commands,omitempty"`

@@ -201,7 +205,7 @@ type HyperqueueStatus struct{}
//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// Hyperqueue is the Schema for the hyperqueues API
// Hyperqueue is the Schema for the Hyperqueues API
type Hyperqueue struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
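
One detail in the hunks above is easy to miss: the JSON tag on `HyperqueueVersion` changes from `hyperqueueVersion` to `HyperqueueVersion`, and the generated CRD further down picks up the capitalized property name. The commit does not say whether this was intended. The sketch below only illustrates the consequence for serialization; `lowerTag`, `upperTag`, and `marshal` are hypothetical, trimmed-down stand-ins and not types from this repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lowerTag is a hypothetical stand-in for the field before the change.
type lowerTag struct {
	HyperqueueVersion string `json:"hyperqueueVersion,omitempty"`
}

// upperTag is a hypothetical stand-in for the field after the change.
type upperTag struct {
	HyperqueueVersion string `json:"HyperqueueVersion,omitempty"`
}

// marshal returns the JSON encoding of v as a string.
func marshal(v interface{}) string {
	out, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	fmt.Println(marshal(lowerTag{HyperqueueVersion: "0.15.0"})) // {"hyperqueueVersion":"0.15.0"}
	fmt.Println(marshal(upperTag{HyperqueueVersion: "0.15.0"})) // {"HyperqueueVersion":"0.15.0"}
}
```

Since the CRD schema is generated from these tags, a manifest would likely need to use whichever spelling the tag carries.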
1 change: 1 addition & 0 deletions api/v1alpha1/zz_generated.deepcopy.go


35 changes: 21 additions & 14 deletions config/crd/bases/flux-framework.org_hyperqueues.yaml
@@ -18,7 +18,7 @@ spec:
- name: v1alpha1
schema:
openAPIV3Schema:
description: Hyperqueue is the Schema for the hyperqueues API
description: Hyperqueue is the Schema for the Hyperqueues API
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
@@ -35,17 +35,24 @@ spec:
spec:
description: HyperqueueSpec defines the desired state of Hyperqueue
properties:
HyperqueueVersion:
default: 0.15.0
description: Release of Hyperqueue to install (if hq binary not
found in PATH)
type: string
commands:
description: Global commands to run on all nodes
properties:
init:
description: Init runs before anything in both scripts
type: string
type: object
deadlineSeconds:
default: 31500000
description: Time limit for the job Approximately one year. This cannot
be zero or job won't start
format: int64
type: integer
hyperqueueVersion:
default: 0.15.0
description: Release of Hyperqueue to install (if hq binary not
found in PATH)
type: string
interactive:
description: Interactive mode keeps the cluster running
type: boolean
@@ -72,13 +79,13 @@ spec:
description: Resources include limits and requests
type: object
server:
description: Server is the main server to run hyperqueue
description: Server is the main server to run Hyperqueue
properties:
command:
description: Command will be honored by a server node
type: string
commands:
description: Commands to run around different parts of the hyperqueu
description: Commands to run around different parts of the hyperqueue
setup
properties:
init:
@@ -92,10 +99,10 @@ spec:
type: object
image:
default: ubuntu
description: Image to use for hyperqueue
description: Image to use for Hyperqueue
type: string
port:
description: Port for hyperqueue to use. Since we have a headless
description: Port for Hyperqueue to use. Since we have a headless
service, this is not represented in the operator, just in starting
the server or a worker
format: int32
@@ -140,7 +147,7 @@ spec:
description: Name for the cluster service
type: string
size:
description: Size of the hyperqueue (1 server + (N-1) nodes)
description: Size of the Hyperqueue (1 server + (N-1) nodes)
format: int32
type: integer
worker:
@@ -151,7 +158,7 @@ spec:
description: Command will be honored by a server node
type: string
commands:
description: Commands to run around different parts of the hyperqueu
description: Commands to run around different parts of the hyperqueue
setup
properties:
init:
@@ -165,10 +172,10 @@ spec:
type: object
image:
default: ubuntu
description: Image to use for hyperqueue
description: Image to use for Hyperqueue
type: string
port:
description: Port for hyperqueue to use. Since we have a headless
description: Port for Hyperqueue to use. Since we have a headless
service, this is not represented in the operator, just in starting
the server or a worker
format: int32
2 changes: 2 additions & 0 deletions controllers/hyperqueue/templates.go
@@ -37,6 +37,7 @@ type NodeTemplate struct {
Node api.Node
Spec api.HyperqueueSpec
ClusterName string
Namespace string
}

// combineTemplates into one "start"
@@ -58,6 +59,7 @@ func generateScript(cluster *api.Hyperqueue, node api.Node, startTemplate string
Node: node,
Spec: cluster.Spec,
ClusterName: cluster.Name,
Namespace: cluster.Namespace,
}

// Wrap the named template to identify it later
4 changes: 2 additions & 2 deletions controllers/hyperqueue/templates/access.sh
@@ -7,8 +7,8 @@
# This should start and exit cleanly
# For now use the same port for server and worker, not sure why should be different?
mkdir -p /tmp/access
hq server generate-access operator-access.json --client-port={{ .Spec.Server.Port }} --worker-port={{ .Spec.Worker.Port }} --host {{ .ClusterName }}-server-0-0.{{ .Spec.ServiceName }}.hyperqueue-operator.svc.cluster.local
hq server generate-access operator-access.json --client-port={{ .Spec.Server.Port }} --worker-port={{ .Spec.Worker.Port }} --host {{ .ClusterName }}-server-0-0.{{ .Spec.ServiceName }}.{{ .Namespace }}.svc.cluster.local

sleep 2
echo "CUT HERE"
cat operator-access.json
cat operator-access.json
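
The change to `access.sh` replaces the hard-coded `hyperqueue-operator` namespace in the server hostname with the new `.Namespace` template field. A minimal sketch of how that substitution renders with `text/template` follows. It assumes a flattened, hypothetical `hostParams` struct in place of the real `NodeTemplate` (where the service name lives under `.Spec.ServiceName`), and the `renderHost` helper is illustrative only.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// hostParams is a hypothetical, flattened stand-in for NodeTemplate.
type hostParams struct {
	ClusterName string
	ServiceName string
	Namespace   string
}

// hostTemplate follows the --host argument in access.sh, with the service
// name flattened to a top-level field for brevity.
const hostTemplate = `{{ .ClusterName }}-server-0-0.{{ .ServiceName }}.{{ .Namespace }}.svc.cluster.local`

// renderHost fills in the template and returns the fully qualified hostname.
func renderHost(p hostParams) (string, error) {
	t, err := template.New("host").Parse(hostTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	host, err := renderHost(hostParams{
		ClusterName: "hyperqueue-sample",
		ServiceName: "hq-service",
		Namespace:   "hyperqueue-operator",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(host)
}
```

With the values used in the hello-world example this yields the same hostname that appears in the generated access file, and deploying to a different namespace changes only the third label.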
12 changes: 8 additions & 4 deletions controllers/hyperqueue/templates/components.sh
@@ -3,19 +3,23 @@
# Shared components for the broker and worker template
{{define "init"}}

# Initialization commands
# Initialization commands - first global, then node specific
{{ .Spec.Commands.Init}} > /dev/null 2>&1
{{ .Node.Commands.Init}} > /dev/null 2>&1

which wget > /dev/null 2>&1 || (echo "Please install wget"; exit);

function download() {
wget https://github.com/It4innovations/hyperqueue/releases/download/v{{ .Spec.HyperqueueVersion }}/hq-v{{ .Spec.HyperqueueVersion }}-linux-x64.tar.gz
tar -xvzf hq-v{{ .Spec.HyperqueueVersion }}-linux-x64.tar.gz
# This is just here for development, while our feature isn't provided in a release
wget https://github.com/It4innovations/hyperqueue/releases/download/nightly/hq-nightly-2023-06-20-db011ed5a3faecf31168709417cd8a736a297a50-linux-x64.tar.gz
# wget https://github.com/It4innovations/hyperqueue/releases/download/v{{ .Spec.HyperqueueVersion }}/hq-v{{ .Spec.HyperqueueVersion }}-linux-x64.tar.gz
# tar -xvzf hq-v{{ .Spec.HyperqueueVersion }}-linux-x64.tar.gz
tar -xzvf hq-nightly-2023-06-20-db011ed5a3faecf31168709417cd8a736a297a50-linux-x64.tar.gz
mv hq /usr/bin/hq
}

# If hyperqueue isn't installed, install it
# which hq > /dev/null 2>&1 || (download > /dev/null 2>&1);
which hq > /dev/null 2>&1 || (download > /dev/null 2>&1);
# Download development version for now

# The working directory should be set by the CRD or the container
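
The `init` block above now runs the global `.Spec.Commands.Init` before the node-specific `.Node.Commands.Init`. The ordering can be sketched with a named template as below. The `Commands`, `Spec`, `Node`, and `scriptData` types are hypothetical, trimmed-down stand-ins for the API types, and the template omits the output redirection and the download logic from the real script.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical, trimmed-down stand-ins for the API types used by the templates.
type Commands struct{ Init string }
type Spec struct{ Commands Commands }
type Node struct{ Commands Commands }

// scriptData carries the two sources of init commands into the template.
type scriptData struct {
	Spec Spec
	Node Node
}

// initTemplate defines a named "init" block and then invokes it,
// emitting the global command first and the node-specific command second.
const initTemplate = `{{define "init"}}# global
{{ .Spec.Commands.Init }}
# node
{{ .Node.Commands.Init }}
{{end}}{{template "init" .}}`

// renderInit renders the init block for one global and one node command.
func renderInit(global, node string) (string, error) {
	t, err := template.New("script").Parse(initTemplate)
	if err != nil {
		return "", err
	}
	data := scriptData{
		Spec: Spec{Commands: Commands{Init: global}},
		Node: Node{Commands: Commands{Init: node}},
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderInit("echo global", "echo node")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because both commands land in the same script, anything the global command sets up (for example installed packages) is available when the node-specific command runs.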