
[BUG] Module deployment failure: jupyter-hub #583

serge-dolgavin-dxc opened this issue Sep 4, 2024 · 16 comments · Fixed by #585
Labels
bug Something isn't working

Comments

@serge-dolgavin-dxc

Describe the bug

The addf-demo-ide-jupyter-hub deployment fails because one of its Lambda functions uses a runtime (nodejs12.x) that is no longer supported.

To Reproduce
Deploy the jupyter-hub module.

Expected behavior
The jupyter-hub module deploys without issues.

Screenshots
N/A

Additional context
...
Failed resources:
addf-demo-ide-jupyter-hub | 10:17:07 AM | CREATE_FAILED | AWS::Lambda::Function | AWSCDKCfnUtilsProviderCustomResourceProvider/Handler
handler returned message: "The runtime parameter of nodejs12.x is no longer supported for creating or updating AWS Lambda functions. We recommend you use a supported runtime while creating or updating functions. (Service: Lambda, Status Code: 400,

❌ addf-demo-ide-jupyter-hub failed: Error: The stack named addf-demo-ide-jupyter-hub failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE
...
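The failing resource is a CDK-generated custom-resource provider Lambda (`AWSCDKCfnUtilsProviderCustomResourceProvider/Handler`) that still pins `nodejs12.x`. One way to catch this before a deployment is to scan the synthesized CloudFormation templates (the `*.template.json` files that `cdk synth` writes to `cdk.out`) for `AWS::Lambda::Function` resources on retired runtimes. This is a minimal sketch with hypothetical helper names, not part of seedfarmer or ADDF, and the runtime set is an illustrative, non-exhaustive assumption; the AWS Lambda runtime deprecation table is the authoritative list.

```python
from __future__ import annotations

import json
from pathlib import Path

# Illustrative, NON-exhaustive set of retired Lambda runtimes (assumption:
# check the AWS Lambda runtime deprecation policy for the current list).
DEPRECATED_RUNTIMES = {"nodejs10.x", "nodejs12.x", "python2.7", "python3.6"}


def find_deprecated_runtimes(template: dict) -> list[tuple[str, str]]:
    """Return (logical_id, runtime) for each Lambda function on a retired runtime."""
    hits = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::Lambda::Function":
            continue
        runtime = resource.get("Properties", {}).get("Runtime")
        if runtime in DEPRECATED_RUNTIMES:
            hits.append((logical_id, runtime))
    return hits


def scan_cdk_out(cdk_out: str = "cdk.out") -> dict[str, list[tuple[str, str]]]:
    """Scan every synthesized *.template.json under cdk_out; keep only files with hits."""
    results = {}
    for path in Path(cdk_out).glob("*.template.json"):
        hits = find_deprecated_runtimes(json.loads(path.read_text()))
        if hits:
            results[path.name] = hits
    return results
```

Run after `cdk synth`, a non-empty result from `scan_cdk_out()` would point at the stack and logical ID still using a retired runtime.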

@malachi-constant
Contributor

@serge-dolgavin-dxc

You can try this before the merge. I need to test from the ground up, so it may take a while before it is merged.

File: ide-modules.yaml

name: jupyter-hub
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/demo-only/jupyter-hub?ref=chore/583&depth=1
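A `git::` module path packs several pieces into one string: the repository URL, the module subdirectory after the `//` separator, and query options such as `ref` (the branch or tag to check out) and `depth`. If the branch named in `ref` does not exist on the remote, cloning the module fails. The hypothetical parser below only illustrates that layout, based on the paths used in this thread; it is not seedfarmer's own parsing code.

```python
from urllib.parse import parse_qs


def parse_git_module_path(path: str) -> dict:
    """Split 'git::<repo>//<subdir>?ref=...&depth=...' into its parts (hypothetical helper)."""
    prefix = "git::"
    if not path.startswith(prefix):
        raise ValueError(f"not a git:: module path: {path!r}")
    url = path[len(prefix):]
    # Everything after '?' holds options such as ref=<branch-or-tag> and depth=<n>
    location, _, query = url.partition("?")
    options = parse_qs(query)
    # Skip past 'https://' so the next '//' found is the module-subdirectory separator
    scheme, sep, rest = location.partition("://")
    repo_part, _, subdir = rest.partition("//")
    return {
        "repo": f"{scheme}{sep}{repo_part}",
        "subdir": subdir,
        "ref": options["ref"][0] if "ref" in options else None,
        "depth": int(options["depth"][0]) if "depth" in options else None,
    }
```

For the entry above, this yields the ADDF repository URL, `modules/demo-only/jupyter-hub`, `ref` of `chore/583`, and `depth` of 1. A local path such as `modules/demo-only/jupyter-hub/` is rejected, since it has no `git::` prefix.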

@serge-dolgavin-dxc
Author

@malachi-constant,

Unfortunately, this entry:

name: jupyter-hub
path: git::https://github.com/awslabs/autonomous-driving-data-framework.git//modules/demo-only/jupyter-hub?ref=chore/583&depth=1

doesn't work for me:

$ seedfarmer apply ./manifests/demo/deployment.yaml --dry-run
...
[2024-09-05 07:10:17,386 | INFO | _deployment_commands.py:636 | MainThread ]  Verifying all modules in ide for deploy 
Traceback (most recent call last):
...
  cmdline: git pull -v -- origin chore/583
  stderr: 'fatal: couldn't find remote ref chore/583'

During handling of the above exception, another exception occurred:
...
.../autonomous-driving-data-framework/.venv/lib/python3.8/site-packages/seedfarmer/mgmt/git_support.py", line 79, in clone_module_repo
    raise InvalidConfigurationError(f"\n Cannot Clone Repo: {ge} {messages.git_error_support()}")
seedfarmer.errors.seedfarmer_errors.InvalidConfigurationError: 
 Cannot Clone Repo: Cmd('git') failed due to: exit code(1)
  cmdline: git pull -v -- origin chore/583
  stderr: 'fatal: couldn't find remote ref chore/583' 
    1. Make sure your path to the repo is correct and valid (check your module manifests!)
    2. The credentials used to call SeedFarmer have access to the repo
    3. The credentials used to call SeedFarmer have not expired
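The traceback says the remote has no branch named `chore/583`. A quick pre-flight check is `git ls-remote --heads <repo-url>`, which prints one `<sha>\trefs/heads/<name>` line per branch on the remote. The helper names below are hypothetical; the sketch assumes `git` is on the PATH and that the same credentials SeedFarmer uses can read the repository.

```python
from __future__ import annotations

import subprocess


def parse_ls_remote_heads(output: str) -> set[str]:
    """Extract branch names from `git ls-remote --heads` output."""
    branches = set()
    for line in output.splitlines():
        _sha, _, ref = line.partition("\t")
        if ref.startswith("refs/heads/"):
            branches.add(ref[len("refs/heads/"):])
    return branches


def remote_branch_exists(repo_url: str, branch: str) -> bool:
    """Return True if `branch` exists on the remote at `repo_url`."""
    result = subprocess.run(
        ["git", "ls-remote", "--heads", repo_url],
        capture_output=True,
        text=True,
        check=True,
    )
    return branch in parse_ls_remote_heads(result.stdout)
```

If `remote_branch_exists("https://github.com/awslabs/autonomous-driving-data-framework.git", "chore/583")` returns False, the branch is gone from the remote and the manifest should point at another ref (for example `main`).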

@serge-dolgavin-dxc
Author

@malachi-constant,

With the local path:

name: jupyter-hub
path: modules/demo-only/jupyter-hub/

I got the following error:

...
addf-demo-ide-jupyter-hub | 4/11 | 7:17:32 AM | CREATE_IN_PROGRESS | Custom::AWSCDK-EKS-KubernetesResource | addf-demo-ide-jupyter-hub-eks-cluster/manifest-jupyter-hub-namespace/Resource/Default (addfdemoidejupyterhubeksclustermanifestjupyterhubnamespaceXXXXXXXXXXXX) Resource creation Initiated

addf-demo-ide-jupyter-hub | 4/11 | 7:17:33 AM | CREATE_FAILED | Custom::AWSCDK-EKS-KubernetesResource | addf-demo-ide-jupyter-hub-eks-cluster/manifest-jupyter-hub-namespace/Resource/Default (addfdemoidejupyterhubeksclustermanifestjupyterhubnamespaceXXXXXXXX) Received response status [FAILED] from custom resource. Message returned: Error: b'\nAn error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:sts::XXXXXXXXX:assumed-role/addf-demo-ide-jupyter-hub-HandlerServiceRoleXXXXXXXXXXXXXXX/addf-demo-ide-jupyter-hub-addfdemo-HandlerXXXXXXXXXXXX is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::XXXXXXXXXXXXX:role/addf-demo-core-eks-clusterCreationRoleXXXXXXXXXXX\nUnable to connect to the server: getting credentials: exec: executable aws failed with exit code 255\n'
...

@serge-dolgavin-dxc
Author

@malachi-constant,

Please find attached the CodeBuild log for the jupyter-hub module: jupyter-hub_CodeBuild.log

@malachi-constant
Contributor

@serge-dolgavin-dxc Can you try this module from main? That branch was deleted after the merge.

@serge-dolgavin-dxc
Author

serge-dolgavin-dxc commented Sep 10, 2024

@malachi-constant,

Sorry that my messages were unclear, and for the confusion.

I noticed that the branch was deleted, and I have been using main for the last 5 days.

Yesterday's CodeBuild log for the jupyter-hub module is based on the latest main branch.

@malachi-constant
Contributor

Gotcha, I missed that. Taking a look...

@malachi-constant
Contributor

@serge-dolgavin-dxc Could you provide the trust policy for arn:aws:iam::XXXXXXXXXXXXX:role/addf-demo-core-eks-clusterCreationRoleXXXXXXXXXXX, with account values sanitized, so I can compare it to what I have tested? I am not able to replicate the issue.

malachi-constant added the question (Further information is requested) and investigating labels and removed the bug (Something isn't working) and implementation labels on Sep 10, 2024
@serge-dolgavin-dxc
Author

@malachi-constant, please find attached the policy details along with the latest CodeBuild log.
jupyter-hub.zip

@malachi-constant
Contributor

OK, so the trust relationship is not being added for some reason. Can you also tell me which version of the eks module is deployed?
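The AccessDenied error in the CodeBuild log means the clusterCreationRole's trust policy does not let the jupyter-hub handler role call `sts:AssumeRole`. Given a sanitized trust policy like the one attached above, the sketch below checks whether a particular caller is trusted. It is a simplified, hypothetical check: it only looks at `Allow` statements that name the role ARN directly under `Principal.AWS`, and it ignores `Deny` statements, `Condition` blocks, wildcards, and account-root principals, all of which IAM also evaluates. It also assumes the role has no IAM path, because the `assumed-role` session ARN in the error message omits the path.

```python
def role_arn_from_session(session_arn: str) -> str:
    """Map arn:aws:sts::<acct>:assumed-role/<role>/<session> to arn:aws:iam::<acct>:role/<role>."""
    parts = session_arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sts" or not parts[5].startswith("assumed-role/"):
        raise ValueError(f"not an assumed-role session ARN: {session_arn}")
    role_name = parts[5].split("/")[1]
    # Assumes no IAM path on the role; the session ARN does not carry it
    return f"arn:{parts[1]}:iam::{parts[4]}:role/{role_name}"


def _as_list(value) -> list:
    """IAM policy fields may hold a single value or a list of values."""
    return value if isinstance(value, list) else [value]


def trust_allows(trust_policy: dict, principal_arn: str) -> bool:
    """Simplified check: does an Allow statement grant sts:AssumeRole to principal_arn?"""
    for stmt in _as_list(trust_policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action", [])):
            continue
        principal = stmt.get("Principal", {})
        aws_principals = _as_list(principal.get("AWS", [])) if isinstance(principal, dict) else []
        if principal_arn in aws_principals:
            return True
    return False
```

Feeding the handler's session ARN from the error through `role_arn_from_session` and then `trust_allows` would return False for a trust policy that is missing the handler role, matching the observation that the trust is not being added.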

@serge-dolgavin-dxc
Author

I am using the latest main branch (default demo / example-dev manifests).

name: eks
path: git::https://github.com/awslabs/idf-modules.git//modules/compute/eks?ref=release/1.11.0
dataFiles:
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/1.29.yaml?ref=release/1.11.0
  - filePath: git::https://github.com/awslabs/idf-modules.git//data/eks_dockerimage-replication/versions/default.yaml?ref=release/1.11.0

@malachi-constant
Contributor

OK, thanks. I was able to replicate it; working on it...

malachi-constant added the bug (Something isn't working) label and removed the question (Further information is requested) and investigating labels on Sep 12, 2024
@malachi-constant
Contributor

@serge-dolgavin-dxc

See the manifest in the PR.

This error is resolved by updating ide-modules.yaml as follows:

name: jupyter-hub
path: modules/demo-only/jupyter-hub/
parameters:
 - name: eks-cluster-admin-role-arn
   valueFrom:
     moduleMetadata:
       group: core
       name: eks
       key: EksClusterMasterRoleArn
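The fix passes the eks module's `EksClusterMasterRoleArn` output into jupyter-hub as the `eks-cluster-admin-role-arn` parameter. Conceptually, seedfarmer looks up the named key in the metadata published by the `core`/`eks` module and hands the value to jupyter-hub at deploy time. The sketch below only models that lookup; the function names, the metadata dictionary layout, and the environment-variable naming convention (`<PREFIX>_PARAMETER_<NAME>`, upper-cased with hyphens turned into underscores) are assumptions for illustration, so check the seedfarmer documentation for the exact behavior.

```python
from __future__ import annotations


def resolve_parameters(
    parameters: list[dict], metadata: dict[tuple[str, str], dict[str, str]]
) -> dict[str, str]:
    """Resolve literal values and valueFrom.moduleMetadata references (hypothetical model)."""
    resolved = {}
    for param in parameters:
        name = param["name"]
        if "value" in param:
            resolved[name] = param["value"]
            continue
        ref = param["valueFrom"]["moduleMetadata"]
        outputs = metadata.get((ref["group"], ref["name"]))
        if outputs is None:
            raise KeyError(f"no metadata for module {ref['group']}/{ref['name']}")
        if ref["key"] not in outputs:
            raise KeyError(f"{ref['key']!r} missing from {ref['group']}/{ref['name']} metadata")
        resolved[name] = outputs[ref["key"]]
    return resolved


def parameter_env_var(name: str, prefix: str = "SEEDFARMER") -> str:
    """Assumed naming convention for the env var a module reads a parameter from."""
    return f"{prefix}_PARAMETER_{name.upper().replace('-', '_')}"
```

Under these assumptions, the manifest entry above resolves to the eks master role ARN and would surface to the module as `SEEDFARMER_PARAMETER_EKS_CLUSTER_ADMIN_ROLE_ARN`. If the eks module has not been deployed, the lookup fails instead of silently producing a role without the required trust.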

@serge-dolgavin-dxc
Author

@malachi-constant, thanks a lot for your help! I was able to deploy the jupyter-hub module.

┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Account ┃ Region    ┃ Deployment ┃ Group       ┃ Module           ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ primary │ eu-west-1 │ demo       │ optionals   │ networking       │
│ primary │ eu-west-1 │ demo       │ optionals   │ datalake-buckets │
│ primary │ eu-west-1 │ demo       │ replication │ replication      │
│ primary │ eu-west-1 │ demo       │ core        │ metadata-storage │
│ primary │ eu-west-1 │ demo       │ core        │ eks              │
│ primary │ eu-west-1 │ demo       │ core        │ batch-compute    │
│ primary │ eu-west-1 │ demo       │ core        │ efs              │
│ primary │ eu-west-1 │ demo       │ ide         │ jupyter-hub      │
└─────────┴───────────┴────────────┴─────────────┴──────────────────┘

Unfortunately, I ran into two issues after the deployment:

  1. I was not able to query the DNS name of the JupyterHub ingress:
$ kubectl get ing jupyterhub -n jupyter-hub -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"
E0913 07:48:55.773416   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
E0913 07:48:55.778115   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
E0913 07:48:55.781898   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
E0913 07:48:55.785906   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
E0913 07:48:55.794678   20574 memcache.go:265] couldn't get current server API group list: Get "https://7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host
Unable to connect to the server: dial tcp: lookup 7AB5A3CFD6880B49EFACA781A5D20570.gr7.eu-central-1.eks.amazonaws.com on 172.20.48.1:53: no such host

Please note the regions: the ADDF demo was deployed in eu-west-1, not eu-central-1.

  2. Spawning the server failed after authenticating on jupyter-hub:
Event log
Server requested
2024-09-13T05:51:01.159711Z [Normal] Successfully assigned jupyter-hub/jupyter-testadmin to ip-10-0-5-247.eu-west-1.compute.internal
2024-09-13T05:51:05Z [Normal] AttachVolume.Attach succeeded for volume "pvc-476aa8dd-ff44-4961-bd31-e335e243b2c2"
2024-09-13T05:51:06Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "c32ec298a72142146904b12cd76eed4d0de1cb67d0bcffe61ace594ef57748f4": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:51:20Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "947b92fa3b8b39f8d5739e39c4c3fb9dd4ec4c086e9ab1c245c071f4d830ba01": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:51:33Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "812dcbfd7cc2d97c2557ff5e647fd0459b3578b5bf57266ea52a32f61e24b4be": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:51:46Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "814fcaf4f3b16a97d283b3ebec306bf2c04c8cf18f223648c954300a9ddfa72e": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:00Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "07333715950cac4d843c6d86f2c3cddff3ee6e9089303c427e7036dd0c255a83": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:12Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4adad6b61854075ae7cc294aa9879bdebb999dc3f12e32c882e5386fa4a711f6": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:25Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "9f420ca72cbf547e4e5fca53640b569ed68ede236b597f1c1b9d7ba00e666aea": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:39Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b8d0932f530f7347b9119f49782f83cbf9af1434a5330ad6ce3fa146187b5f31": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:52:51Z [Warning] Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "bc3e7ba7bd1eae2c9279fc247d72dcd87355413ce1919b58f3e451c159ff39cd": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
2024-09-13T05:53:04Z [Warning] (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "66b1d8132d73dcbba43f04acc1fe2926c56e641a060d9bcb5f12833f20f7c284": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
Spawn failed: pod jupyter-hub/jupyter-testadmin did not start in 300 seconds!

Could you please advise how to address these issues?

@dgraeber
Contributor

@serge-dolgavin-dxc

I think your kubectl credentials are pointing to the wrong cluster (do you have multiple clusters defined in .kube?). This command:
kubectl get ing jupyterhub -n jupyter-hub -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"
should be executed against the proper cluster.
REF: https://kubernetes.io/docs/tasks/access-application-cluster/configure-access-multiple-clusters/
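The region in the failing hostname (`eu-central-1`) already hints that kubectl's current context targets a different cluster than the eu-west-1 deployment. The hypothetical check below reads the merged kubeconfig via `kubectl config view -o json`, follows the current context to its cluster, and extracts the region from the EKS endpoint. It assumes the `<id>.<suffix>.<region>.eks.amazonaws.com` endpoint format seen in the log. If the right cluster is missing entirely, `aws eks update-kubeconfig --name <cluster-name> --region eu-west-1` adds a context for it.

```python
from __future__ import annotations

import json
import subprocess
from urllib.parse import urlsplit


def eks_region_from_server(server_url: str) -> str | None:
    """Return the label that precedes 'eks' in an EKS API endpoint hostname."""
    labels = (urlsplit(server_url).hostname or "").split(".")
    if "eks" in labels and labels.index("eks") >= 1:
        return labels[labels.index("eks") - 1]
    return None


def current_cluster_server(kubeconfig: dict) -> str:
    """Follow current-context -> context.cluster -> cluster.server."""
    ctx_name = kubeconfig["current-context"]
    context = next(c["context"] for c in kubeconfig["contexts"] if c["name"] == ctx_name)
    cluster = next(c["cluster"] for c in kubeconfig["clusters"] if c["name"] == context["cluster"])
    return cluster["server"]


def load_kubeconfig() -> dict:
    """Read the merged kubeconfig as JSON (requires kubectl on PATH)."""
    out = subprocess.run(
        ["kubectl", "config", "view", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return json.loads(out)
```

For example, `eks_region_from_server(current_cluster_server(load_kubeconfig()))` returning `eu-central-1` would confirm the mismatch; `kubectl config use-context <context-name>` then switches to the eu-west-1 cluster.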

@serge-dolgavin-dxc
Author

@dgraeber, thanks a lot for the hint!

The addf-demo-core-eks-cluster configuration was missing.

The first issue is solved, but the second one remains. Is it an issue with access rights?
