Pytorch ML Training Job on EKS with GPUs, orchestrated by airflow #394

kevinsoucy · 2024-01-16T21:44:33Z

Description of changes:
This pull request adds an example manifest for training ML models using Airflow, EKS, FSx for Lustre, and GPU instances. The example uses PyTorch to train the model.

The manifest can be deployed with the command below, and it can be adapted for training other ML models on AWS:
seedfarmer apply manifests/ml-training-on-eks/deployment.yaml

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

dgraeber

Please see comments....

dgraeber · 2024-01-22T16:02:51Z

data/mwaa/requirements/requirements-eks-operator.txt

@@ -0,0 +1 @@
+airflow-kubernetes-job-operator~=2.0.14


Does this work with the specified airflow version (2.2.2)?

Do we still need this file, since mwaa is gone?

No, we don't. I removed it.

dgraeber · 2024-01-22T16:03:11Z

manifests/ml-training-on-eks/core-modules.yaml

+  - name: dag-path
+    value: dags
+  - name: airflow-version
+    value: "2.2.2"


Not a fan of this version...but ok ;)

MWAA has been removed, so this no longer applies.

dgraeber · 2024-01-22T16:04:21Z

modules/ml-training/k8s-managed/README.md

+- `eks-cluster-name`: name of the EKS Cluster to send Jobs to
+- `eks-cluster-kubectl-role-arn`: ARN of the IAM Role used to execute `kubectl` commands on the EKS Cluster
+- `eks-oidc-arn`: ARN of the OpenID Connect Provider assigned to the EKS Cluster
+- `eks-cluster-admin-role-arn`: ARN of the IAM Role configured as a Cluster Admin and associated with the `system:masters` Kubernetes Group


Is this the admin or the masters role (just confirming)? Will this role need to make IAM updates (i.e., go to the AWS control plane), or is it only needed in k8s?

This is the input for:

kubectl_role_arn: An IAM role with cluster administrator and "system:masters" permissions.

cluster = aws_eks.Cluster.from_cluster_attributes(
    self,
    f"eks-{self.deployment_name}-{self.module_name}",
    cluster_name=eks_cluster_name,
    open_id_connect_provider=provider,
    kubectl_role_arn=eks_admin_role_arn,
)

dgraeber · 2024-01-22T16:05:02Z

modules/ml-training/k8s-managed/tests/test_stack.py

@@ -0,0 +1,2 @@
+def test_placeholder() -> None:


We need 80% code coverage for the app and the stack.

dgraeber · 2024-01-22T16:05:33Z

modules/ml-training/k8s-managed/stack.py

+        super().__init__(
+            scope,
+            id,
+            description="(SO9154) Autonomous Driving Data Framework (ADDF) - k8s-managed",


This is not part of the guidance... why this description?

It's legacy from the k8s-managed simulation module. I updated it to: "Autonomous Driving Data Framework (ADDF) - k8s-managed ML training"

dgraeber · 2024-01-22T16:06:53Z

modules/ml-training/k8s-managed/requirements-dev.txt

+# to these versions
+
+apache-airflow~=2.7.0
+airflow-kubernetes-job-operator~=2.0.4


This does not match your requirements in the data file in this PR.

Is airflow no longer needed here?

Correct, the file has been removed.

dgraeber · 2024-01-22T16:07:55Z

modules/ml-training/k8s-managed/modulestack.yaml

+              - "logs:CreateLogGroup"
+              - "logs:Describe*"
+              - "logs:DescribeLogGroups"
+            Resource: '*'


This gives the role full access to all ECR... can you restrict it further?
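
A minimal sketch of what tighter scoping could look like on the ECR side, assuming the role only needs to pull images from a single repository. The function name, the `mnist-training` repository name, and the account ID are hypothetical placeholders, not values from this PR. `ecr:GetAuthorizationToken` does not support resource-level scoping, so it stays on `*`, while the pull actions are pinned to one repository ARN.

```python
import json
from typing import Any, Dict, List


def scoped_ecr_pull_statements(
    region: str, account_id: str, repository: str, partition: str = "aws"
) -> List[Dict[str, Any]]:
    """Build IAM statements that allow pulling images from one ECR repository only."""
    repo_arn = f"arn:{partition}:ecr:{region}:{account_id}:repository/{repository}"
    return [
        {
            # The auth token call is account-wide and cannot be limited to a repository.
            "Effect": "Allow",
            "Action": ["ecr:GetAuthorizationToken"],
            "Resource": "*",
        },
        {
            # Pull-related actions restricted to the single repository ARN.
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
            ],
            "Resource": repo_arn,
        },
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    statements = scoped_ecr_pull_statements("us-east-1", "123456789012", "mnist-training")
    print(json.dumps({"Version": "2012-10-17", "Statement": statements}, indent=2))
```

The same idea may apply to the `logs:` actions in the hunk above: where an action supports resource-level permissions, `Resource: '*'` could be replaced with the ARN of the specific log group.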

dgraeber · 2024-01-22T16:08:48Z

modules/ml-training/k8s-managed/training_dags/simple_mock.py

+    start_date=days_ago(1),  # type: ignore
+    schedule_interval="@once",
+) as dag:
+    # caller_identity = PythonOperator(task_id="log_caller_identity", dag=dag, python_callable=log_caller_identity)


All dags have been removed

srinivasreddych · 2024-02-09T18:56:08Z

data/mwaa/requirements/requirements-eks-operator.txt

@@ -0,0 +1 @@
+airflow-kubernetes-job-operator~=2.0.14


Do we still need this file, since mwaa is gone?

srinivasreddych · 2024-02-09T18:57:25Z

modules/ml-training/k8s-managed/README.md

+
+## Description
+
+This module:


Can you explain in more depth what this module does?

srinivasreddych · 2024-02-09T18:58:55Z

modules/ml-training/k8s-managed/requirements-dev.txt

+# to these versions
+
+apache-airflow~=2.7.0
+airflow-kubernetes-job-operator~=2.0.4


airflow no longer needed here?

a13zen · 2024-02-27T14:55:12Z

@kevinsoucy. With the current configuration, this module won't be able to deploy.

The fsx-on-eks integration module requires an EKS namespace to deploy to.
The ml-on-eks module has a dependency on the fsx-on-eks integration module and also creates the namespace.

a13zen · 2024-03-01T17:44:23Z

@kevinsoucy. With the current configuration, this module won't be able to deploy.

The fsx-on-eks integration module requires an EKS namespace to deploy to.
The ml-on-eks module has a dependency on the fsx-on-eks integration module and also creates the namespace.

I've added a fix for this issue.

dgraeber

@a13zen please see my comment about the fsx-lustre version.
If you want the auto-import policy capability, you need to add this to the manifest for fsx-lustre.
REF: https://github.com/awslabs/idf-modules/blob/main/modules/storage/fsx-lustre/README.md

  - name: import_policy
    value: "NEW_CHANGED_DELETED"

dgraeber · 2024-03-01T18:44:44Z

manifests/ml-training-on-eks/core-modules.yaml

@@ -127,7 +127,7 @@ parameters:
          #   - mitigate-log4shell
 ---
 name: fsx-lustre
-path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=main&depth=1 
+path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.4.0&depth=1 


@a13zen The latest changes for FSx-Lustre that we discussed have NOT been released on IDF yet, so this version is not correct. I can release IDF now, but that will be a different version.

Please change this to release/1.4.1 to get the proper version of fsx-lustre
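
A small sketch of how the pinned ref in a module `path` could be checked before deploying, for example in a lint step. It assumes the `git::<url>?ref=...&depth=...` shape shown in the diff above; the function name is hypothetical and this is not part of seedfarmer.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def module_ref(path: str) -> Optional[str]:
    """Return the `ref` query value of a git-sourced module path, or None for local paths."""
    path = path.strip()
    prefix = "git::"
    if not path.startswith(prefix):
        # Local paths such as "modules/ml-training/k8s-managed" carry no ref.
        return None
    query = parse_qs(urlsplit(path[len(prefix):]).query)
    refs = query.get("ref")
    return refs[0] if refs else None


if __name__ == "__main__":
    fsx_path = (
        "git::https://github.com/awslabs/idf-modules.git"
        "//modules/storage/fsx-lustre?ref=release/1.4.1&depth=1"
    )
    print(module_ref(fsx_path))
```

A check like this could flag modules that still point at `main` or at an unexpected release branch.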

dgraeber · 2024-03-01T21:09:05Z

manifests/ml-training-on-eks/integration-modules.yaml

@@ -0,0 +1,55 @@
+name: lustre-on-eks


@a13zen you will have to change this local path ref at some time...I guess once the PR is merged?

dgraeber · 2024-03-01T21:11:58Z

manifests/ml-training-on-eks/training-modules.yaml

@@ -0,0 +1,30 @@
+name: training
+path: modules/ml-training/k8s-managed


@a13zen also here...we will need to change from local path after pr is merged

dgraeber · 2024-03-01T22:17:57Z

modules/ml-training/k8s-managed/images/pytorch-mnist/Dockerfile

+# We need to use the nvcr.io/nvidia/pytorch image as a base image to support both linux/amd64 and linux_arm64 platforms.
+# PyTorch=1.13.0, cuda=11.8.0
+# Ref: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/trial-images/pytorch-mnist
+FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2


When I run this in us-east-1, I cannot pull this image... I think this cannot be hardcoded (or region-specific?).
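
One way to avoid the region-specific pull failure is to compose the base image URI from the deployment region instead of hardcoding it. A rough sketch follows, assuming the registry account from the Dockerfile above (`763104351884`) also hosts the image in the target region, which should be verified per region. The function name and its defaults are illustrative only.

```python
def base_image_uri(
    region: str,
    tag: str = "2.1.0-gpu-py310-cu121-ubuntu20.04-ec2",
    repository: str = "pytorch-training",
    account_id: str = "763104351884",  # copied from the Dockerfile; may differ in some regions
    dns_suffix: str = "amazonaws.com",
) -> str:
    """Compose an ECR image URI for the given region."""
    return f"{account_id}.dkr.ecr.{region}.{dns_suffix}/{repository}:{tag}"


if __name__ == "__main__":
    print(base_image_uri("us-east-1"))
```

The result could then be passed to the build with `docker build --build-arg BASE_IMAGE=...`, with the Dockerfile declaring `ARG BASE_IMAGE` before `FROM ${BASE_IMAGE}`. The `BASE_IMAGE` argument name is an assumption here.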

dgraeber

The mnist module is too deeply nested... can you move the module up a level? This is also not reflected in the CHANGELOG. And the Dockerfile has a hard-coded image, which will fail in any region other than the one you are referencing. The mnist image needs to be reviewed, as the README is not accurate.

dgraeber · 2024-03-07T21:44:46Z

manifests/ml-training-on-eks/deployment.yaml

@@ -0,0 +1,29 @@
+name: ml-eks-ks


dgraeber · 2024-03-07T21:46:00Z

modules/ml-training/training-images/mnist/src/Dockerfile

+# We need to use the nvcr.io/nvidia/pytorch image as a base image to support both linux/amd64 and linux_arm64 platforms.
+# PyTorch=1.13.0, cuda=11.8.0
+# Ref: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/trial-images/pytorch-mnist
+FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2


Hardcoded... this needs to be corrected.

dgraeber · 2024-03-07T21:47:32Z

modules/ml-training/training-images/mnist/deployspec.yaml

+  phases:
+    install:
+      commands:
+      - npm install -g [email protected]


Update this CDK version to a more recent one.

dgraeber · 2024-03-07T21:55:15Z

modules/ml-training/training-images/mnist/README.md

+## Description
+
+This module contains a Docker container for detecting lanes on images using 
+LaneDet (https://github.com/Turoad/lanedet), with the resnet34_tusimple backbone (configs and weights).  It is designed to incorporate the weights, the model code, and the transformation/processing code into one image with the entry point being `tools/detect_lanes.py` when processing.  That entry point as one (1) positional required arguement to indicate the local path to the configuration (`configs/laneatt/resnet34_tusimple.py` in this case).  


Is this readme correct?? I didn't know mnist does lane detection... ;)

dgraeber

I understand that you are passing in the ECR repo info rather than creating it... I think all you need is the removal of the image from the repo. The latest changes that removed the cdk dependency resolved the nesting issue for automation.

dgraeber · 2024-03-08T13:17:17Z

modules/ml-training/training-images/mnist/deployspec.yaml

+  phases:
+    build:
+      commands:
+      - echo "TODO Remove all images"


An important TODO

a13zen · 2024-03-08T16:12:21Z

LGTM!

guarpi and others added 6 commits December 7, 2023 22:51

Adding ml-training modules

e255d2d

Adding ml-training modules

396ee55

chore: delete modules that should use gitref

a2b1de0

WIP: update modules to 1.2

15bb162

Merge branch 'awslabs:main' into feat/cleanup

a399619

add requirements for mwaa

a38ed53

dgraeber requested review from dgraeber, malachi-constant and srinivasreddych January 22, 2024 16:02

dgraeber reviewed Jan 22, 2024

Kevin Soucy added 3 commits January 25, 2024 14:50

wip: add sfn

60d1f46

use sfn

6737d13

readme

8700b48

srinivasreddych reviewed Feb 9, 2024

a13zen and others added 8 commits February 22, 2024 10:25

Merge branch 'awslabs:main' into feat/cleanup

6b35287

fix: module files

1c318be

fix: fixes for ml-training-on-eks

bd8185b

fix: training container code fixes

7e7f015

fix: revert to using networking module and remote references

5625d84

fix: point to main for idf for eks

104bc1f

fix: scale gpu to 0

60c6536

fix: subnets for isg

a8b9ae9

a13zen added 4 commits February 28, 2024 17:41

fix: deployment for ml-on-eks

8349ec2

fix: add asg tag propogation for cluster-autoscaler

d9c386f

fix: remove dag references

ad7401e

feat: remove dag references

f0ea58b

a13zen force-pushed the feat/cleanup branch from 5ca31f1 to f0ea58b Compare March 1, 2024 16:33

a13zen and others added 2 commits March 1, 2024 17:34

Merge branch 'awslabs:main' into feat/cleanup

daa8981

fix: point to idf main

6c332c7

a13zen added 2 commits March 1, 2024 18:09

fix: linting, remove mwaa files

a50b0ed

fix: update README and Changelog

9d34055

a13zen force-pushed the feat/cleanup branch from c2f5cdf to 9d34055 Compare March 1, 2024 17:43

a13zen added 6 commits March 1, 2024 18:48

fix: update requirements, removed unused requirements-dev

ff7e3b7

fix: update idf ref to 1.4.0

81a3a3f

fix: removed stack description

7fe7011

fix: remove module setup.cfg and pyproject

c750c1e

chore: linting

9ed1286

chore: linting

a7bcc67

dgraeber self-requested a review March 1, 2024 18:56

dgraeber requested changes Mar 1, 2024

a13zen added 3 commits March 1, 2024 20:02

fix: update to idf 1.4.1

d8b8a1c

fix: ecr permissions

dea1ffe

fix: add more ecr permissions for ml image repo

75fb78b

dgraeber requested changes Mar 1, 2024

a13zen and others added 3 commits March 6, 2024 11:50

fix: use specific region for sagemaker images

2ac7018

chore: linting

e5d9e8e

move image to its own module

f3aeecb

dgraeber requested changes Mar 7, 2024

fix: use idf ECR module, scope permissions, fix training docker image

aa590a8

dgraeber requested changes Mar 8, 2024

parameterize base image

3a2f312

dgraeber self-requested a review March 8, 2024 14:15

dgraeber approved these changes Mar 8, 2024

dgraeber merged commit e4cf49b into awslabs:main Mar 8, 2024
69 checks passed
