PyTorch ML Training Job on EKS with GPUs, orchestrated by Airflow #394
Conversation
Please see comments....
@@ -0,0 +1 @@
airflow-kubernetes-job-operator~=2.0.14
Does this work with the specified airflow version (2.2.2)?
Do we still need this file, since mwaa is gone?
No, we don't. I removed it.
- name: dag-path
  value: dags
- name: airflow-version
  value: "2.2.2"
Not a fan of this version...but ok ;)
MWAA is removed so this no longer applies
- `eks-cluster-name`: name of the EKS Cluster to send Jobs to
- `eks-cluster-kubectl-role-arn`: ARN of the IAM Role used to execute `kubectl` commands on the EKS Cluster
- `eks-oidc-arn`: ARN of the OpenID Connect Provider assigned to the EKS Cluster
- `eks-cluster-admin-role-arn`: ARN of the IAM Role configured as a Cluster Admin and associated with the `system:masters` Kubernetes Group
Is this the admin or the master role (just confirming)? Will this role need to make IAM updates (i.e., go to the AWS control plane), or is it only needed in k8s?
This is the input for:
kubectl_role_arn: An IAM role with cluster administrator and "system:masters" permissions.
cluster = aws_eks.Cluster.from_cluster_attributes(
    self,
    f"eks-{self.deployment_name}-{self.module_name}",
    cluster_name=eks_cluster_name,
    open_id_connect_provider=provider,
    kubectl_role_arn=eks_admin_role_arn,
)
@@ -0,0 +1,2 @@
def test_placeholder() -> None:
need 80% code coverage for the app and stack
super().__init__(
    scope,
    id,
    description="(SO9154) Autonomous Driving Data Framework (ADDF) - k8s-managed",
This is not part of the guidance... why this description?
It's legacy from the k8s-managed simulation module. I updated it to: "Autonomous Driving Data Framework (ADDF) - k8s-managed ML training"
# to these versions

apache-airflow~=2.7.0
airflow-kubernetes-job-operator~=2.0.4
this does not match your requirements in the data file in this PR
airflow no longer needed here?
correct - file removed
- "logs:CreateLogGroup"
- "logs:Describe*"
- "logs:DescribeLogGroups"
Resource: '*'
this gives the role full access to all ECR... can you restrict further?
will do
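One way to act on that restriction is to scope the statement's `Resource` to a specific log-group ARN instead of `'*'`. The sketch below is only an illustration, not code from this PR; the account ID, region, and log-group prefix are placeholder values.

```python
def scoped_logs_statement(account_id: str, region: str, log_group_prefix: str) -> dict:
    """Build an IAM policy statement limiting CloudWatch Logs actions
    to log groups under a given prefix (placeholder values, for illustration)."""
    return {
        "Effect": "Allow",
        "Action": [
            "logs:CreateLogGroup",
            "logs:DescribeLogGroups",
        ],
        # Resource-level scoping instead of '*'; verify each action
        # supports resource-level permissions before relying on this.
        "Resource": f"arn:aws:logs:{region}:{account_id}:log-group:{log_group_prefix}*",
    }

stmt = scoped_logs_statement("111122223333", "us-east-1", "/addf/ml-training/")
```

Whether every `logs:*` action in the original statement accepts a resource-level ARN should be checked against the IAM documentation; actions that do not would need to stay in a separate statement.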
    start_date=days_ago(1),  # type: ignore
    schedule_interval="@once",
) as dag:
    # caller_identity = PythonOperator(task_id="log_caller_identity", dag=dag, python_callable=log_caller_identity)
remove
All dags have been removed
@@ -0,0 +1 @@
airflow-kubernetes-job-operator~=2.0.14
Do we still need this file, since mwaa is gone?
## Description

This module:
Can you explain in depth about what this module does?
will do
# to these versions

apache-airflow~=2.7.0
airflow-kubernetes-job-operator~=2.0.4
airflow no longer needed here?
@kevinsoucy With the current configuration, this module won't be able to deploy. The fsx-on-eks integration module requires an EKS namespace to deploy to.
I've added a fix for this issue.
@a13zen please see my comment about the fsx-lustre version.
If you want the auto-import policy capability, you need to add this to the manifest for fsx-lustre.
Ref: https://github.com/awslabs/idf-modules/blob/main/modules/storage/fsx-lustre/README.md
- name: import_policy
  value: "NEW_CHANGED_DELETED"
@@ -127,7 +127,7 @@ parameters:
  # - mitigate-log4shell
---
name: fsx-lustre
- path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=main&depth=1
+ path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.4.0&depth=1
@a13zen The latest changes for FSx-Lustre as we have discussed have NOT been released on IDF yet....this version is not correct. I can release IDF now...but that will be a different version
Please change this to release/1.4.1 to get the proper version of fsx-lustre
@@ -0,0 +1,55 @@
name: lustre-on-eks
@a13zen you will have to change this local path ref at some time...I guess once the PR is merged?
@@ -0,0 +1,30 @@
name: training
path: modules/ml-training/k8s-managed
@a13zen also here... we will need to change from the local path after the PR is merged
# We need to use the nvcr.io/nvidia/pytorch image as a base image to support both linux/amd64 and linux_arm64 platforms.
# PyTorch=1.13.0, cuda=11.8.0
# Ref: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/trial-images/pytorch-mnist
FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2
When I run this in us-east-1, I cannot pull this image......I think this cannot be hardcoded (or be region specific??)
The mnist module is too deeply nested... can you move the module up a level? This is also not reflected in the CHANGELOG. And the Dockerfile has a hard-coded image... this will fail in any region other than the one you are referencing. The mnist image needs to be reviewed, as the README is not accurate.
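One way to address the hard-coded image would be to derive the image URI from the deployment region rather than fixing `eu-central-1`. The sketch below is only an illustration, not code from this PR: it reuses the account ID and tag from the quoted Dockerfile, but whether that account hosts the image in every target region should be verified against the AWS Deep Learning Containers documentation.

```python
def training_image_uri(
    region: str,
    tag: str = "2.1.0-gpu-py310-cu121-ubuntu20.04-ec2",
    account_id: str = "763104351884",
) -> str:
    """Build the ECR URI for the PyTorch training image in a given region.
    Assumes the same registry account serves that region; verify per region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/pytorch-training:{tag}"

# The resulting URI could then be passed into the Docker build,
# e.g. via `--build-arg BASE_IMAGE=...` with `ARG BASE_IMAGE` / `FROM ${BASE_IMAGE}`
# in the Dockerfile.
uri = training_image_uri("us-east-1")
```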
@@ -0,0 +1,29 @@
name: ml-eks-ks
ks?
# We need to use the nvcr.io/nvidia/pytorch image as a base image to support both linux/amd64 and linux_arm64 platforms.
# PyTorch=1.13.0, cuda=11.8.0
# Ref: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/trial-images/pytorch-mnist
FROM 763104351884.dkr.ecr.eu-central-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2
hardcoded....this needs to be corrected
phases:
  install:
    commands:
      - npm install -g [email protected]
Update this CDK version to a more modern version
## Description

This module contains a Docker container for detecting lanes on images using
LaneDet (https://github.com/Turoad/lanedet), with the resnet34_tusimple backbone (configs and weights). It is designed to incorporate the weights, the model code, and the transformation/processing code into one image, with the entry point being `tools/detect_lanes.py` when processing. That entry point has one (1) required positional argument to indicate the local path to the configuration (`configs/laneatt/resnet34_tusimple.py` in this case).
Is this README correct?? I didn't know mnist does lane detection... ;)
I understand that you are passing in the ECR repo info rather than creating it... I think all you need is the removal of the image from the repo. The latest changes that removed the CDK dependency resolved the nesting issue for automation.
phases:
  build:
    commands:
      - echo "TODO Remove all images"
An important TODO
LGTM!
Description of changes:
This pull request adds an example manifest for training ML models using Airflow, EKS, FSx for Lustre, and GPU instances. We leverage PyTorch in this example for training the model.
The manifest can be deployed, and adapted for training any ML model on AWS, with:
seedfarmer apply manifests/ml-training-on-eks/deployment.yaml
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.