Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pytorch ML Training Job on EKS with GPUs, orchestrated by airflow #394

Merged
merged 39 commits into from
Mar 8, 2024

Conversation

kevinsoucy
Copy link
Contributor

Description of changes:
This pull request adds an example manifest for training ml models using airflow, eks, FSx Lustre, and gpu instances. We leverage Pytorch in this example for training the model.

The manifest can be deployed with and adapted for training any ML model on AWS:
seedfarmer apply manifests/ml-training-on-eks/deployment.yaml

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Contributor

@dgraeber dgraeber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments....

@@ -0,0 +1 @@
airflow-kubernetes-job-operator~=2.0.14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work with the specified airflow version (2.2.2)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this file, since mwaa is gone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we don't I removed it

- name: dag-path
value: dags
- name: airflow-version
value: "2.2.2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of this version...but ok ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MWAA is removed so this no longer applies

- `eks-cluster-name`: name of the EKS Cluster to send Jobs to
- `eks-cluster-kubectl-role-arn`: ARN of the IAM Role used to execute `kubectl` commands on the EKS Cluster
- `eks-oidc-arn`: ARN of the OpenID Connect Provider assigned to the EKS Cluster
- `eks-cluster-admin-role-arn`: ARN of the IAM Role configured as a Cluster Admin and associated with the `system:masters` Kubernetes Group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the admin or master role (just confirming). Will this role need to make IAM updates (i.e go to the AWS control plane) or only necessary in k8s?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the input for:

kubectl_role_arn: An IAM role with cluster administrator and "system:masters" permissions.

cluster = aws_eks.Cluster.from_cluster_attributes(
self,
f"eks-{self.deployment_name}-{self.module_name}",
cluster_name=eks_cluster_name,
open_id_connect_provider=provider,
kubectl_role_arn=eks_admin_role_arn,
)

@@ -0,0 +1,2 @@
def test_placeholder() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need 80% code coverage for the app and stack

super().__init__(
scope,
id,
description="(SO9154) Autonomous Driving Data Framework (ADDF) - k8s-managed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a part of the guidance..why this description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's legacy from the k8s managed simulation module. I updated to: ""Autonomous Driving Data Framework (ADDF) - k8s-managed ML training""

# to these versions

apache-airflow~=2.7.0
airflow-kubernetes-job-operator~=2.0.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not match your requirements in the data file in this PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

airflow no longer needed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct - file removed

- "logs:CreateLogGroup"
- "logs:Describe*"
- "logs:DescribeLogGroups"
Resource: '*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this gives the role full access to all ecr .... can you restrict further?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

start_date=days_ago(1), # type: ignore
schedule_interval="@once",
) as dag:
# caller_identity = PythonOperator(task_id="log_caller_identity", dag=dag, python_callable=log_caller_identity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All dags have been removed

@@ -0,0 +1 @@
airflow-kubernetes-job-operator~=2.0.14
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this file, since mwaa is gone?


## Description

This module:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain in depth about what this module does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

# to these versions

apache-airflow~=2.7.0
airflow-kubernetes-job-operator~=2.0.4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

airflow no longer needed here?

@a13zen
Copy link
Contributor

a13zen commented Feb 27, 2024

@kevinsoucy. With the current configuration, this module won't be able to deploy.

The fsx-on-eks integration module requires an EKS namespace to deploy to
The ml-on-eks module has a dependency on the fsx-on-eks integration module and also creates the namespace.

@a13zen
Copy link
Contributor

a13zen commented Mar 1, 2024

@kevinsoucy. With the current configuration, this module won't be able to deploy.

The fsx-on-eks integration module requires an EKS namespace to deploy to The ml-on-eks module has a dependency on the fsx-on-eks integration module and also creates the namespace.

I've added fix for this issue.

@dgraeber dgraeber self-requested a review March 1, 2024 18:56
Copy link
Contributor

@dgraeber dgraeber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a13zen please see my comment about the fsx-lustre version.
If you want to the auto-import policy capablity...need to add this to the manifest for fsx-lustre
REF:https://github.com/awslabs/idf-modules/blob/main/modules/storage/fsx-lustre/README.md

  - name: import_policy
    value: "NEW_CHANGED_DELETED"

@@ -127,7 +127,7 @@ parameters:
# - mitigate-log4shell
---
name: fsx-lustre
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=main&depth=1
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.4.0&depth=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a13zen The latest changes for FSx-Lustre as we have discussed have NOT been released on IDF yet....this version is not correct. I can release IDF now...but that will be a different version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change this to release/1.4.1 to get the proper version of fsx-lustre

@@ -0,0 +1,55 @@
name: lustre-on-eks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a13zen you will have to change this local path ref at some time...I guess once the PR is merged?

@@ -0,0 +1,30 @@
name: training
path: modules/ml-training/k8s-managed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@a13zen also here...we will need to change from local path after pr is merged

# We need to use the nvcr.io/nvidia/pytorch image as a base image to support both linux/amd64 and linux_arm64 platforms.
# PyTorch=1.13.0, cuda=11.8.0
# Ref: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/trial-images/pytorch-mnist
FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run this in us-east-1, I cannot pull this image......I think this cannot be hardcoded (or be region specific??)

Copy link
Contributor

@dgraeber dgraeber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mnist module is too far deeply nested...can you move the module up a level? This is also not reflected in the CHANGELOG. And, the docker file has a hard-coded image...this will fail in any region other than the one you are referencing... The mnist image needs to be reviewed as the README is not accurate

@@ -0,0 +1,29 @@
name: ml-eks-ks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ks?

# We need to use the nvcr.io/nvidia/pytorch image as a base image to support both linux/amd64 and linux_arm64 platforms.
# PyTorch=1.13.0, cuda=11.8.0
# Ref: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/trial-images/pytorch-mnist
FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded....this needs to be corrected

phases:
install:
commands:
- npm install -g [email protected]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this cdk version to a more modern version

modules/ml-training/training-images/mnist/deployspec.yaml Outdated Show resolved Hide resolved
## Description

This module contains a Docker container for detecting lanes on images using
LaneDet (https://github.com/Turoad/lanedet), with the resnet34_tusimple backbone (configs and weights). It is designed to incorporate the weights, the model code, and the transformation/processing code into one image with the entry point being `tools/detect_lanes.py` when processing. That entry point as one (1) positional required arguement to indicate the local path to the configuration (`configs/laneatt/resnet34_tusimple.py` in this case).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this readme correct?? I didn't know mnist does lane detection... ;)

Copy link
Contributor

@dgraeber dgraeber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you are passing in the ECR repo info rather than creating it....I think all you need is the removal of the image from the repo. The latest changes that removed the cdk dependency resolved the nesting issue for automation

phases:
build:
commands:
- echo "TODO Remove all images"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An important TODO

@dgraeber dgraeber self-requested a review March 8, 2024 14:15
@a13zen
Copy link
Contributor

a13zen commented Mar 8, 2024

LGTM!

@dgraeber dgraeber merged commit e4cf49b into awslabs:main Mar 8, 2024
69 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants