PyTorch ML Training Job on EKS with GPUs, orchestrated by Airflow #394
Merged
39 commits
e255d2d
Adding ml-training modules
guarpi 396ee55
Adding ml-training modules
guarpi a2b1de0
chore: delete modules that should use gitref
gonzalobarbeito 15bb162
WIP: update modules to 1.2
gonzalobarbeito a399619
Merge branch 'awslabs:main' into feat/cleanup
kevinsoucy a38ed53
add requirements for mwaa
60d1f46
wip: add sfn
6737d13
use sfn
8700b48
readme
6b35287
Merge branch 'awslabs:main' into feat/cleanup
a13zen 1c318be
fix: module files
a13zen bd8185b
fix: fixes for ml-training-on-eks
a13zen 7e7f015
fix: training container code fixes
a13zen 5625d84
fix: revert to using networking module and remote references
a13zen 104bc1f
fix: point to main for idf for eks
a13zen 60c6536
fix: scale gpu to 0
a13zen a8b9ae9
fix: subnets for isg
a13zen 8349ec2
fix: deployment for ml-on-eks
a13zen d9c386f
fix: add asg tag propogation for cluster-autoscaler
a13zen ad7401e
fix: remove dag references
a13zen f0ea58b
feat: remove dag references
a13zen daa8981
Merge branch 'awslabs:main' into feat/cleanup
a13zen 6c332c7
fix: point to idf main
a13zen a50b0ed
fix: linting, remove mwaa files
a13zen 9d34055
fix: update README and Changelog
a13zen ff7e3b7
fix: update requirements, removed unused requirements-dev
a13zen 81a3a3f
fix: update idf ref to 1.4.0
a13zen 7fe7011
fix: removed stack description
a13zen c750c1e
fix: remove module setup.cfg and pyproject
a13zen 9ed1286
chore: linting
a13zen a7bcc67
chore: linting
a13zen d8b8a1c
fix: update to idf 1.4.1
a13zen dea1ffe
fix: ecr permissions
a13zen 75fb78b
fix: add more ecr permissions for ml image repo
a13zen 2ac7018
fix: use specific region for sagemaker images
a13zen e5d9e8e
chore: linting
a13zen f3aeecb
move image to its own module
aa590a8
fix: use idf ECR module, scope permissions, fix training docker image
a13zen 3a2f312
parameterize base image
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
name: eks | ||
# TODO: Update and point to new release when PR is merged: https://github.com/awslabs/idf-modules/pull/128 | ||
path: git::https://github.com/awslabs/idf-modules.git//modules/compute/eks?ref=release/1.4.1&depth=1 | ||
dataFiles: | ||
- filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.25.yaml?ref=release/1.4.1&depth=1 | ||
- filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.4.1&depth=1 | ||
parameters: | ||
- name: replicated-ecr-images-metadata-s3-path | ||
valueFrom: | ||
moduleMetadata: | ||
group: replication | ||
name: replication | ||
key: s3_full_path | ||
- name: vpc-id | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: networking | ||
key: VpcId | ||
- name: controlplane-subnet-ids | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: networking | ||
key: PrivateSubnetIds | ||
- name: dataplane-subnet-ids | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: networking | ||
key: PrivateSubnetIds | ||
- name: eks-admin-role-name | ||
value: Admin | ||
- name: eks-poweruser-role-name | ||
value: PowerUser | ||
- name: eks-read-only-role-name | ||
value: ReadOnly | ||
- name: eks-version | ||
value: "1.25" | ||
# valueFrom: | ||
# envVariable: GLOBAL_EKS_VERSION | ||
- name: eks-compute | ||
value: | ||
eks_nodegroup_config: | ||
- eks_ng_name: ng1 | ||
eks_node_quantity: 2 | ||
eks_node_max_quantity: 5 | ||
eks_node_min_quantity: 1 | ||
eks_node_disk_size: 50 | ||
eks_node_instance_type: "m5.large" | ||
- eks_ng_name: ng-gpu | ||
eks_node_quantity: 0 | ||
eks_node_max_quantity: 2 | ||
eks_node_min_quantity: 0 | ||
eks_node_disk_size: 100 | ||
eks_node_instance_type: "g4dn.xlarge" | ||
eks_node_labels: | ||
usage: gpu | ||
eks_node_spot: False | ||
eks_secrets_envelope_encryption: False | ||
eks_api_endpoint_private: False | ||
- name: eks-addons | ||
value: | ||
deploy_aws_lb_controller: True # We deploy it unless set to False | ||
deploy_external_dns: True # We deploy it unless set to False | ||
deploy_aws_ebs_csi: True # We deploy it unless set to False | ||
deploy_aws_efs_csi: True # We deploy it unless set to False | ||
deploy_aws_fsx_csi: True # We deploy it unless set to False | ||
deploy_cluster_autoscaler: True # We deploy it unless set to False | ||
deploy_metrics_server: True # We deploy it unless set to False | ||
deploy_secretsmanager_csi: True # We deploy it unless set to False | ||
deploy_external_secrets: False | ||
deploy_cloudwatch_container_insights_metrics: True # We deploy it unless set to False | ||
deploy_cloudwatch_container_insights_logs: True | ||
cloudwatch_container_insights_logs_retention_days: 7 | ||
deploy_adot: False | ||
deploy_amp: False | ||
deploy_grafana_for_amp: False | ||
deploy_kured: False | ||
deploy_calico: False | ||
deploy_nginx_controller: | ||
value: False | ||
nginx_additional_annotations: | ||
nginx.ingress.kubernetes.io/whitelist-source-range: "100.64.0.0/10,10.0.0.0/8" | ||
deploy_kyverno: | ||
value: False | ||
kyverno_policies: | ||
validate: | ||
- block-ephemeral-containers | ||
- block-stale-images | ||
- block-updates-deletes | ||
- check-deprecated-apis | ||
- disallow-cri-sock-mount | ||
- disallow-custom-snippets | ||
- disallow-empty-ingress-host | ||
- disallow-helm-tiller | ||
- disallow-latest-tag | ||
- disallow-localhost-services | ||
- disallow-secrets-from-env-vars | ||
- ensure-probes-different | ||
- ingress-host-match-tls | ||
- limit-hostpath-vols | ||
- prevent-naked-pods | ||
- require-drop-cap-net-raw | ||
- require-emptydir-requests-limits | ||
- require-labels | ||
- require-pod-requests-limits | ||
- require-probes | ||
- restrict-annotations | ||
- restrict-automount-sa-token | ||
- restrict-binding-clusteradmin | ||
- restrict-clusterrole-nodesproxy | ||
- restrict-escalation-verbs-roles | ||
- restrict-ingress-classes | ||
- restrict-ingress-defaultbackend | ||
- restrict-node-selection | ||
- restrict-path | ||
- restrict-service-external-ips | ||
- restrict-wildcard-resources | ||
- restrict-wildcard-verbs | ||
- unique-ingress-host-and-path | ||
# mutate: | ||
# - add-networkpolicy-dns | ||
# - add-pod-priorityclassname | ||
# - add-ttl-jobs | ||
# - always-pull-images | ||
# - mitigate-log4shell | ||
--- | ||
name: fsx-lustre | ||
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/fsx-lustre?ref=release/1.4.1&depth=1 | ||
parameters: | ||
- name: vpc-id | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: networking | ||
key: VpcId | ||
- name: private-subnet-ids | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: networking | ||
key: PrivateSubnetIds | ||
- name: fs_deployment_type | ||
value: SCRATCH_2 | ||
- name: storage_throughput | ||
value: 50 | ||
- name: data_bucket_name | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: datalake-buckets | ||
key: IntermediateBucketName | ||
- name: export_path | ||
value: "/fsx/export/" | ||
- name: import_path | ||
value: "/fsx/import/" | ||
- name: fsx_version | ||
value: "2.15" | ||
- name: Namespace | ||
valueFrom: | ||
parameterValue: trainingNamespaceName | ||
- name: import_policy | ||
value: "NEW_CHANGED_DELETED" |
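
The `valueFrom.moduleMetadata` entries above wire one module's exported metadata into another module's parameters. As a rough illustration only, and not a description of how the deployment tooling is implemented, the lookup can be sketched in Python. The `resolve_parameter` helper and the sample metadata values are hypothetical and do not come from this PR.

```python
from typing import Any

# Hypothetical metadata exported by already-deployed modules, keyed by
# (group, module name). The values are placeholders, not real resource IDs.
DEPLOYED_METADATA: dict[tuple[str, str], dict[str, Any]] = {
    ("optionals", "networking"): {
        "VpcId": "vpc-EXAMPLE",
        "PrivateSubnetIds": ["subnet-EXAMPLE-a", "subnet-EXAMPLE-b"],
    },
}


def resolve_parameter(
    param: dict[str, Any],
    metadata: dict[tuple[str, str], dict[str, Any]],
) -> Any:
    """Return a parameter's literal value, following a moduleMetadata reference if present."""
    if "value" in param:
        return param["value"]
    ref = param["valueFrom"]["moduleMetadata"]
    module_outputs = metadata[(ref["group"], ref["name"])]
    return module_outputs[ref["key"]]


# Same shape as the `vpc-id` and `eks-version` parameters in the manifest above.
vpc_param = {
    "name": "vpc-id",
    "valueFrom": {
        "moduleMetadata": {"group": "optionals", "name": "networking", "key": "VpcId"}
    },
}
version_param = {"name": "eks-version", "value": "1.25"}

print(resolve_parameter(vpc_param, DEPLOYED_METADATA))
print(resolve_parameter(version_param, DEPLOYED_METADATA))
```

A missing group, module, or key raises `KeyError` in this sketch, which mirrors the practical point that a referenced module has to be deployed before any module that consumes its metadata.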
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
name: ml-eks | ||
toolchainRegion: eu-central-1 | ||
forceDependencyRedeploy: True | ||
groups: | ||
- name: optionals | ||
path: manifests/ml-training-on-eks/optional-modules.yaml | ||
- name: replication | ||
path: manifests/ml-training-on-eks/replicator-modules.yaml | ||
- name: core | ||
path: manifests/ml-training-on-eks/core-modules.yaml | ||
- name: integration | ||
path: manifests/ml-training-on-eks/integration-modules.yaml | ||
- name: images | ||
path: manifests/ml-training-on-eks/images.yaml | ||
- name: training | ||
path: manifests/ml-training-on-eks/training-modules.yaml | ||
targetAccountMappings: | ||
- alias: primary | ||
accountId: | ||
valueFrom: | ||
envVariable: ACCOUNT_ID | ||
default: true | ||
codebuildImage: aws/codebuild/standard:7.0 | ||
parametersGlobal: | ||
dockerCredentialsSecret: aws-addf-docker-credentials | ||
trainingNamespaceName: training | ||
regionMappings: | ||
- region: eu-central-1 | ||
default: true |
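
The order of `groups` matters if groups are deployed in the order listed, because later modules read metadata from earlier ones. The sketch below checks that assumption in Python. The `REFERENCES` table is transcribed by hand from the `moduleMetadata` references in this PR's module manifests, the assignment of modules to groups is inferred from those references, and `find_order_violations` is a hypothetical helper rather than part of any tool.

```python
# Group order as listed in the deployment manifest above.
GROUP_ORDER = ["optionals", "replication", "core", "integration", "images", "training"]

# Groups that each group reads module metadata from (hand-transcribed assumption).
REFERENCES = {
    "optionals": set(),
    "replication": set(),
    "core": {"optionals", "replication"},
    "integration": {"core"},
    "images": {"optionals"},
    "training": {"core", "integration", "images"},
}


def find_order_violations(order, references):
    """Return (group, dependency) pairs where the dependency is not listed earlier."""
    position = {group: index for index, group in enumerate(order)}
    violations = []
    for group in order:
        for dependency in sorted(references.get(group, ())):
            if dependency not in position or position[dependency] >= position[group]:
                violations.append((group, dependency))
    return violations


# An empty list means every referenced group appears before its consumer.
print(find_order_violations(GROUP_ORDER, REFERENCES))
```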
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
name: mnist | ||
path: modules/training-images/mnist | ||
parameters: | ||
- name: ecr-repository-name | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: ecr-ml-images | ||
key: EcrRepositoryName | ||
- name: ecr-repository-arn | ||
valueFrom: | ||
moduleMetadata: | ||
group: optionals | ||
name: ecr-ml-images | ||
key: EcrRepositoryArn | ||
# Base Image = 763104351884.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2: | ||
- name: base-image-ecr-account-id | ||
value: 763104351884 | ||
- name: base-image-name | ||
value: pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2 |
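
The comment in the manifest above gives the base image as `<account-id>.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/<image-name>`. A minimal Python sketch of that composition follows. The `base_image_uri` function is hypothetical, and how the module itself combines the two parameters is not shown in this diff. The region `eu-central-1` is taken from the deployment manifest and is assumed to match `AWS_DEFAULT_REGION`.

```python
def base_image_uri(account_id: str, region: str, image_name: str) -> str:
    """Compose an ECR image URI following the pattern in the manifest comment."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}"


uri = base_image_uri(
    account_id="763104351884",
    region="eu-central-1",  # assumed to match AWS_DEFAULT_REGION
    image_name="pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2",
)
print(uri)
```

The account ID is passed as a string here so that it is never reinterpreted as a number.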
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
name: lustre-on-eks | ||
# TODO: Fix local ref to release | ||
path: modules/integration/fsx-lustre-on-eks | ||
parameters: | ||
- name: EksClusterAdminRoleArn | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksClusterMasterRoleArn | ||
- name: EksClusterName | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksClusterName | ||
- name: EksOidcArn | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksOidcArn | ||
- name: EksClusterSecurityGroupId | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksClusterSecurityGroupId | ||
- name: Namespace | ||
valueFrom: | ||
parameterValue: trainingNamespaceName | ||
- name: FsxFileSystemId | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: fsx-lustre | ||
key: FSxLustreFileSystemId | ||
- name: FsxSecurityGroupId | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: fsx-lustre | ||
key: FSxLustreSecurityGroup | ||
- name: FsxMountName | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: fsx-lustre | ||
key: FSxLustreMountName | ||
- name: FsxDnsName | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: fsx-lustre | ||
key: FSxLustreAttrDnsName |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
name: datalake-buckets | ||
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/optionals/datalake-buckets | ||
parameters: | ||
- name: encryption-type | ||
value: SSE | ||
--- | ||
name: networking | ||
path: git::https://github.com/awslabs/idf-modules.git//modules/network/basic-cdk?ref=release/1.4.0&depth=1 | ||
parameters: | ||
- name: internet-accessible | ||
value: false | ||
--- | ||
name: ecr-ml-images | ||
path: git::https://github.com/awslabs/idf-modules.git//modules/storage/ecr?ref=release/1.4.1&depth=1 | ||
parameters: | ||
- name: repository-name | ||
value: ml-mnist-images | ||
- name: image-tag-mutability | ||
value: "MUTABLE" | ||
- name: lifecycle-max-image-count | ||
value: 10 |
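
The module paths above mix pinned refs (`release/1.4.0`, `release/1.4.1`) with one path that has no `ref` at all. A small Python sketch using only the standard library can list the ref each path pins, which may help when reviewing such manifests. The `parse_module_path` helper is hypothetical and reflects only the `git::<repo>//<module>?ref=<ref>&depth=<n>` shape visible in this diff.

```python
from urllib.parse import parse_qs, urlsplit


def parse_module_path(path: str) -> dict:
    """Split a git-style module path into repository URL, module directory, and ref."""
    url = path[len("git::"):] if path.startswith("git::") else path
    parts = urlsplit(url)
    # The double slash separates the repository from the directory inside it.
    repo_path, _, module_dir = parts.path.partition("//")
    query = parse_qs(parts.query)
    return {
        "repo": f"{parts.scheme}://{parts.netloc}{repo_path}",
        "module": module_dir,
        "ref": query.get("ref", [None])[0],  # None when no ref is pinned
    }


paths = [
    "git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/optionals/datalake-buckets",
    "git::https://github.com/awslabs/idf-modules.git//modules/network/basic-cdk?ref=release/1.4.0&depth=1",
    "git::https://github.com/awslabs/idf-modules.git//modules/storage/ecr?ref=release/1.4.1&depth=1",
]
for path in paths:
    parsed = parse_module_path(path)
    print(parsed["module"], parsed["ref"])
```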
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
name: replication | ||
path: git::https://github.com/awslabs/idf-modules.git//modules/replication/dockerimage-replication?ref=release/1.4.0&depth=1 | ||
dataFiles: | ||
- filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.25.yaml?ref=release/1.4.0&depth=1 | ||
- filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.4.0&depth=1 | ||
parameters: | ||
- name: eks-version | ||
value: "1.25" | ||
# valueFrom: | ||
# envVariable: GLOBAL_EKS_VERSION |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
name: training | ||
path: modules/ml-training/k8s-managed | ||
@a13zen also here... we will need to change from the local path after the PR is merged | ||
parameters: | ||
- name: eks-cluster-name | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksClusterName | ||
- name: eks-oidc-arn | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksOidcArn | ||
- name: eks-cluster-admin-role-arn | ||
valueFrom: | ||
moduleMetadata: | ||
group: core | ||
name: eks | ||
key: EksClusterMasterRoleArn | ||
- name: pvc-name | ||
valueFrom: | ||
moduleMetadata: | ||
group: integration | ||
name: lustre-on-eks | ||
key: PersistentVolumeClaimName | ||
- name: training-namespace-name | ||
valueFrom: | ||
parameterValue: trainingNamespaceName | ||
- name: training-image-uri | ||
valueFrom: | ||
moduleMetadata: | ||
group: images | ||
name: mnist | ||
key: ImageUri |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
aws-cdk-lib==2.128.0 | ||
cdk-nag==2.28.39 | ||
constructs==10.3.0 |
@a13zen you will have to change this local path ref at some point... I guess once the PR is merged?