Pytorch ML Training Job on EKS with GPUs, orchestrated by airflow #394

Merged: 39 commits, Mar 8, 2024
Commits
e255d2d Adding ml-training modules (guarpi, Dec 8, 2023)
396ee55 Adding ml-training modules (guarpi, Dec 8, 2023)
a2b1de0 chore: delete modules that should use gitref (gonzalobarbeito, Dec 12, 2023)
15bb162 WIP: update modules to 1.2 (gonzalobarbeito, Dec 14, 2023)
a399619 Merge branch 'awslabs:main' into feat/cleanup (kevinsoucy, Jan 16, 2024)
a38ed53 add requirements for mwaa (Jan 16, 2024)
60d1f46 wip: add sfn (Jan 25, 2024)
6737d13 use sfn (Feb 1, 2024)
8700b48 readme (Feb 1, 2024)
6b35287 Merge branch 'awslabs:main' into feat/cleanup (a13zen, Feb 22, 2024)
1c318be fix: module files (a13zen, Feb 19, 2024)
bd8185b fix: fixes for ml-training-on-eks (a13zen, Feb 21, 2024)
7e7f015 fix: training container code fixes (a13zen, Feb 21, 2024)
5625d84 fix: revert to using networking module and remote references (a13zen, Feb 22, 2024)
104bc1f fix: point to main for idf for eks (a13zen, Feb 22, 2024)
60c6536 fix: scale gpu to 0 (a13zen, Feb 27, 2024)
a8b9ae9 fix: subnets for isg (a13zen, Feb 27, 2024)
8349ec2 fix: deployment for ml-on-eks (a13zen, Feb 28, 2024)
d9c386f fix: add asg tag propogation for cluster-autoscaler (a13zen, Mar 1, 2024)
ad7401e fix: remove dag references (a13zen, Mar 1, 2024)
f0ea58b feat: remove dag references (a13zen, Mar 1, 2024)
daa8981 Merge branch 'awslabs:main' into feat/cleanup (a13zen, Mar 1, 2024)
6c332c7 fix: point to idf main (a13zen, Mar 1, 2024)
a50b0ed fix: linting, remove mwaa files (a13zen, Mar 1, 2024)
9d34055 fix: update README and Changelog (a13zen, Mar 1, 2024)
ff7e3b7 fix: update requirements, removed unused requirements-dev (a13zen, Mar 1, 2024)
81a3a3f fix: update idf ref to 1.4.0 (a13zen, Mar 1, 2024)
7fe7011 fix: removed stack description (a13zen, Mar 1, 2024)
c750c1e fix: remove module setup.cfg and pyproject (a13zen, Mar 1, 2024)
9ed1286 chore: linting (a13zen, Mar 1, 2024)
a7bcc67 chore: linting (a13zen, Mar 1, 2024)
d8b8a1c fix: update to idf 1.4.1 (a13zen, Mar 1, 2024)
dea1ffe fix: ecr permissions (a13zen, Mar 1, 2024)
75fb78b fix: add more ecr permissions for ml image repo (a13zen, Mar 1, 2024)
2ac7018 fix: use specific region for sagemaker images (a13zen, Mar 6, 2024)
e5d9e8e chore: linting (a13zen, Mar 6, 2024)
f3aeecb move image to its own module (Mar 7, 2024)
aa590a8 fix: use idf ECR module, scope permissions, fix training docker image (a13zen, Mar 8, 2024)
3a2f312 parameterize base image (Mar 8, 2024)
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## UNRELEASED

### **Added**
- Added `ml-training/k8s-managed` module: Run ML Training jobs on EKS via AWS Step Functions (Sample PyTorch Training Job)

### **Changed**

@@ -22,6 +23,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- updated `typing-extensions~=4.6.3` in modules that were below that spec to support seed-farmer 3.1.x
- fixed security updates about apache-airflow version
- made os-tunnel module generic by importing seedfarmer project name
- updated `fsx-lustre-on-eks` module to create the EKS namespace if it does not exist
- removed the default region from the `fsx-lustre-on-eks` deployspec
- updated the CDK version for the `fsx-lustre-on-eks` module

### **Removed**

164 changes: 164 additions & 0 deletions manifests/ml-training-on-eks/core-modules.yaml
@@ -0,0 +1,164 @@
name: eks
# TODO: Update and point to new release when PR is merged: https://github.com/awslabs/idf-modules/pull/128
path: git::https://github.com/awslabs/idf-modules.git//modules/compute/eks?ref=release/1.4.1&depth=1
dataFiles:
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.25.yaml?ref=release/1.4.1&depth=1
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.4.1&depth=1
parameters:
  - name: replicated-ecr-images-metadata-s3-path
    valueFrom:
      moduleMetadata:
        group: replication
        name: replication
        key: s3_full_path
  - name: vpc-id
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: VpcId
  - name: controlplane-subnet-ids
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: PrivateSubnetIds
  - name: dataplane-subnet-ids
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: PrivateSubnetIds
  - name: eks-admin-role-name
    value: Admin
  - name: eks-poweruser-role-name
    value: PowerUser
  - name: eks-read-only-role-name
    value: ReadOnly
  - name: eks-version
    value: "1.25"
    # valueFrom:
    #   envVariable: GLOBAL_EKS_VERSION
  - name: eks-compute
    value:
      eks_nodegroup_config:
        - eks_ng_name: ng1
          eks_node_quantity: 2
          eks_node_max_quantity: 5
          eks_node_min_quantity: 1
          eks_node_disk_size: 50
          eks_node_instance_type: "m5.large"
        - eks_ng_name: ng-gpu
          eks_node_quantity: 0
          eks_node_max_quantity: 2
          eks_node_min_quantity: 0
          eks_node_disk_size: 100
          eks_node_instance_type: "g4dn.xlarge"
          eks_node_labels:
            usage: gpu
      eks_node_spot: False
      eks_secrets_envelope_encryption: False
      eks_api_endpoint_private: False
  - name: eks-addons
    value:
      deploy_aws_lb_controller: True # We deploy it unless set to False
      deploy_external_dns: True # We deploy it unless set to False
      deploy_aws_ebs_csi: True # We deploy it unless set to False
      deploy_aws_efs_csi: True # We deploy it unless set to False
      deploy_aws_fsx_csi: True # We deploy it unless set to False
      deploy_cluster_autoscaler: True # We deploy it unless set to False
      deploy_metrics_server: True # We deploy it unless set to False
      deploy_secretsmanager_csi: True # We deploy it unless set to False
      deploy_external_secrets: False
      deploy_cloudwatch_container_insights_metrics: True # We deploy it unless set to False
      deploy_cloudwatch_container_insights_logs: True
      cloudwatch_container_insights_logs_retention_days: 7
      deploy_adot: False
      deploy_amp: False
      deploy_grafana_for_amp: False
      deploy_kured: False
      deploy_calico: False
      deploy_nginx_controller:
        value: False
        nginx_additional_annotations:
          nginx.ingress.kubernetes.io/whitelist-source-range: "100.64.0.0/10,10.0.0.0/8"
      deploy_kyverno:
        value: False
        kyverno_policies:
          validate:
            - block-ephemeral-containers
            - block-stale-images
            - block-updates-deletes
            - check-deprecated-apis
            - disallow-cri-sock-mount
            - disallow-custom-snippets
            - disallow-empty-ingress-host
            - disallow-helm-tiller
            - disallow-latest-tag
            - disallow-localhost-services
            - disallow-secrets-from-env-vars
            - ensure-probes-different
            - ingress-host-match-tls
            - limit-hostpath-vols
            - prevent-naked-pods
            - require-drop-cap-net-raw
            - require-emptydir-requests-limits
            - require-labels
            - require-pod-requests-limits
            - require-probes
            - restrict-annotations
            - restrict-automount-sa-token
            - restrict-binding-clusteradmin
            - restrict-clusterrole-nodesproxy
            - restrict-escalation-verbs-roles
            - restrict-ingress-classes
            - restrict-ingress-defaultbackend
            - restrict-node-selection
            - restrict-path
            - restrict-service-external-ips
            - restrict-wildcard-resources
            - restrict-wildcard-verbs
            - unique-ingress-host-and-path
          # mutate:
          #   - add-networkpolicy-dns
          #   - add-pod-priorityclassname
          #   - add-ttl-jobs
          #   - always-pull-images
          #   - mitigate-log4shell
---
name: fsx-lustre
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.4.1&depth=1
parameters:
  - name: vpc-id
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: VpcId
  - name: private-subnet-ids
    valueFrom:
      moduleMetadata:
        group: optionals
        name: networking
        key: PrivateSubnetIds
  - name: fs_deployment_type
    value: SCRATCH_2
  - name: storage_throughput
    value: 50
  - name: data_bucket_name
    valueFrom:
      moduleMetadata:
        group: optionals
        name: datalake-buckets
        key: IntermediateBucketName
  - name: export_path
    value: "/fsx/export/"
  - name: import_path
    value: "/fsx/import/"
  - name: fsx_version
    value: "2.15"
  - name: Namespace
    valueFrom:
      parameterValue: trainingNamespaceName
  - name: import_policy
    value: "NEW_CHANGED_DELETED"
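
Note on the `ng-gpu` node group in the `eks-compute` parameter above: it starts with zero nodes and carries the node label `usage: gpu`, so a workload has to select that label and request a GPU before the cluster autoscaler has a reason to add a node (commit d9c386f adds ASG tag propagation for the autoscaler). The Job below is a minimal sketch of such a workload and is not taken from this PR. The Job name, the image placeholder, and the command are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is running on the GPU nodes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test              # hypothetical name
  namespace: training               # trainingNamespaceName in deployment.yaml
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        usage: gpu                  # label set through eks_node_labels on ng-gpu
      containers:
        - name: trainer
          image: "<training-image-uri>"   # placeholder, e.g. the image built by the images group
          command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
          resources:
            limits:
              nvidia.com/gpu: 1     # assumes the NVIDIA device plugin exposes this resource
```

While the Job's pod is pending, the autoscaler can grow `ng-gpu` up to `eks_node_max_quantity: 2`; once the pod finishes and the node sits idle, it can scale back to `eks_node_min_quantity: 0`.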
29 changes: 29 additions & 0 deletions manifests/ml-training-on-eks/deployment.yaml
@@ -0,0 +1,29 @@
name: ml-eks
toolchainRegion: eu-central-1
forceDependencyRedeploy: True
groups:
  - name: optionals
    path: manifests/ml-training-on-eks/optional-modules.yaml
  - name: replication
    path: manifests/ml-training-on-eks/replicator-modules.yaml
  - name: core
    path: manifests/ml-training-on-eks/core-modules.yaml
  - name: integration
    path: manifests/ml-training-on-eks/integration-modules.yaml
  - name: images
    path: manifests/ml-training-on-eks/images.yaml
  - name: training
    path: manifests/ml-training-on-eks/training-modules.yaml
targetAccountMappings:
  - alias: primary
    accountId:
      valueFrom:
        envVariable: ACCOUNT_ID
    default: true
    codebuildImage: aws/codebuild/standard:7.0
    parametersGlobal:
      dockerCredentialsSecret: aws-addf-docker-credentials
      trainingNamespaceName: training
    regionMappings:
      - region: eu-central-1
        default: true
20 changes: 20 additions & 0 deletions manifests/ml-training-on-eks/images.yaml
@@ -0,0 +1,20 @@
name: mnist
path: modules/training-images/mnist
parameters:
  - name: ecr-repository-name
    valueFrom:
      moduleMetadata:
        group: optionals
        name: ecr-ml-images
        key: EcrRepositoryName
  - name: ecr-repository-arn
    valueFrom:
      moduleMetadata:
        group: optionals
        name: ecr-ml-images
        key: EcrRepositoryArn
  # Base Image = 763104351884.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2:
  - name: base-image-ecr-account-id
    value: 763104351884
  - name: base-image-name
    value: pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2
55 changes: 55 additions & 0 deletions manifests/ml-training-on-eks/integration-modules.yaml
@@ -0,0 +1,55 @@
name: lustre-on-eks

Review comment (Contributor): @a13zen you will have to change this local path ref at some time...I guess once the PR is merged?

# TODO: Fix local ref to release
path: modules/integration/fsx-lustre-on-eks
parameters:
  - name: EksClusterAdminRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterMasterRoleArn
  - name: EksClusterName
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: EksOidcArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: EksClusterSecurityGroupId
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterSecurityGroupId
  - name: Namespace
    valueFrom:
      parameterValue: trainingNamespaceName
  - name: FsxFileSystemId
    valueFrom:
      moduleMetadata:
        group: core
        name: fsx-lustre
        key: FSxLustreFileSystemId
  - name: FsxSecurityGroupId
    valueFrom:
      moduleMetadata:
        group: core
        name: fsx-lustre
        key: FSxLustreSecurityGroup
  - name: FsxMountName
    valueFrom:
      moduleMetadata:
        group: core
        name: fsx-lustre
        key: FSxLustreMountName
  - name: FsxDnsName
    valueFrom:
      moduleMetadata:
        group: core
        name: fsx-lustre
        key: FSxLustreAttrDnsName
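
Following up on the review comment in this file: once the module is available from a release, the local `path` could be replaced with a remote reference in the same `git::` form that the other manifests in this PR use. This is only a sketch. It assumes the module stays at `modules/integration/fsx-lustre-on-eks` in the ADDF repository, and `<release-ref>` is a placeholder because the release that will contain the module is not known here.

```yaml
name: lustre-on-eks
# Hypothetical: replace <release-ref> with the branch or tag that ships this module.
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/integration/fsx-lustre-on-eks?ref=<release-ref>&depth=1
# parameters: unchanged from the manifest above
```

The same pattern would apply to any other module in this deployment that is still referenced by a local path.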
21 changes: 21 additions & 0 deletions manifests/ml-training-on-eks/optional-modules.yaml
@@ -0,0 +1,21 @@
name: datalake-buckets
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/optionals/datalake-buckets
parameters:
  - name: encryption-type
    value: SSE
---
name: networking
path: git::https://github.com/awslabs/idf-modules.git//modules/network/basic-cdk?ref=release/1.4.0&depth=1
parameters:
  - name: internet-accessible
    value: false
---
name: ecr-ml-images
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/ecr?ref=release/1.4.1&depth=1
parameters:
  - name: repository-name
    value: ml-mnist-images
  - name: image-tag-mutability
    value: "MUTABLE"
  - name: lifecycle-max-image-count
    value: 10
10 changes: 10 additions & 0 deletions manifests/ml-training-on-eks/replicator-modules.yaml
@@ -0,0 +1,10 @@
name: replication
path: git::https://github.com/awslabs/idf-modules.git//modules/replication/dockerimage-replication?ref=release/1.4.0&depth=1
dataFiles:
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.25.yaml?ref=release/1.4.0&depth=1
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.4.0&depth=1
parameters:
  - name: eks-version
    value: "1.25"
    # valueFrom:
    #   envVariable: GLOBAL_EKS_VERSION
36 changes: 36 additions & 0 deletions manifests/ml-training-on-eks/training-modules.yaml
@@ -0,0 +1,36 @@
name: training
path: modules/ml-training/k8s-managed

Review comment (Contributor): @a13zen also here...we will need to change from local path after pr is merged

parameters:
  - name: eks-cluster-name
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: eks-oidc-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: eks-cluster-admin-role-arn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterMasterRoleArn
  - name: pvc-name
    valueFrom:
      moduleMetadata:
        group: integration
        name: lustre-on-eks
        key: PersistentVolumeClaimName
  - name: training-namespace-name
    valueFrom:
      parameterValue: trainingNamespaceName
  - name: training-image-uri
    valueFrom:
      moduleMetadata:
        group: images
        name: mnist
        key: ImageUri
8 changes: 4 additions & 4 deletions modules/integration/fsx-lustre-on-eks/deployspec.yaml
@@ -2,25 +2,25 @@ deploy:
   phases:
     install:
       commands:
-      - npm install -g aws-cdk@2.49.1
+      - npm install -g aws-cdk@2.128.0
       - pip install -r requirements.txt
     build:
       commands:
       - >
         if [[ ${ADDF_PARAMETER_NAMESPACE_SECRET} ]]; then
-          export EKS_NAMESPACE=$(aws secretsmanager get-secret-value --secret-id ${ADDF_PARAMETER_NAMESPACE_SECRET} --query SecretString --output text --region us-east-1 | jq -r '.username');
+          export EKS_NAMESPACE=$(aws secretsmanager get-secret-value --secret-id ${ADDF_PARAMETER_NAMESPACE_SECRET} --query SecretString --output text | jq -r '.username');
         elif [[ ${ADDF_PARAMETER_NAMESPACE_SSM} ]]; then
           export EKS_NAMESPACE=${ADDF_PARAMETER_NAMESPACE_SSM} ;
         else
           export EKS_NAMESPACE=${ADDF_PARAMETER_NAMESPACE} ;
-        fi;
+        fi;
       - cdk deploy --require-approval never --progress events --app "python app.py" --outputs-file ./cdk-exports.json
       - export ADDF_MODULE_METADATA=$(python -c "import json; file=open('cdk-exports.json'); print(json.load(file)['addf-${ADDF_DEPLOYMENT_NAME}-${ADDF_MODULE_NAME}']['metadata'])")
 destroy:
   phases:
     install:
       commands:
-      - npm install -g aws-cdk@2.49.1
+      - npm install -g aws-cdk@2.128.0
       - pip install -r requirements.txt
     build:
       commands:
3 changes: 3 additions & 0 deletions modules/integration/fsx-lustre-on-eks/requirements.in
@@ -0,0 +1,3 @@
aws-cdk-lib==2.128.0
cdk-nag==2.28.39
constructs==10.3.0