Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix many issues with examples #670

Open
wants to merge 26 commits into
base: wdl-1.2
Choose a base branch
from
Open

Fix many issues with examples #670

wants to merge 26 commits into from

Conversation

jdidion
Copy link
Collaborator

@jdidion jdidion commented Jun 26, 2024

Checklist

  • Pull request details were added to CHANGELOG.md

Closes #672

@jdidion
Copy link
Collaborator Author

jdidion commented Jul 24, 2024

@stxue1 @adamnovak would also appreciate your reviews

SPEC.md Outdated
}
```

An execution engine may support [other ways](#input-and-output-formats) to specify `File` and `Directory` inputs (e.g., as URIs), but prior to task execution it must [localize inputs](#task-input-localization) so that the runtime value of a `File`/`Directory` variable is a local path.
Within a WDL file, literal values for files and directories may only be paths that are local to the execution environment.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So an engine is not allowed to support something like:

inputs {
    File genome_reference = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz"
}

?

Copy link
Collaborator Author

@jdidion jdidion Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I've updated the spec to say that the execution engine is only required to support paths - the implication being it's also free to support URLs but that may not be portable across engines.

I'm open to expanding this in WDL 1.3 to require engines to support http(s) URLs as well. This would be consistent with the requirement to support http(s) URLs in imports.

SPEC.md Outdated

If a mount point is specified, then it must be an absolute path to a location in the host environment. If the mount point is omitted, it is assumed to be a persistent volume mounted at the root of the execution directory within a task.
If a mount point is specified, then it must be an absolute path to a location in the host environment (i.e., within the container). The specified path either must not already exist in the host environment, or it must be empty and have at least the requested amount of space available. The mount point should be assumed to be ephemeral, i.e., it will be deleted after the task completes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always seen "host environment" in the context of containerized applications used to mean the system outside the container; not the system within the container. If I am on a machine booted into Fedora and I run a WDL workflow that calls for an ubuntu:22.04 container to run, the "host" as I understand it would be the Fedora environment.

Also, I don't think that containers have a notion of some amount of space being available at a path in the container. I don't think I can make an image that has 20 GiB of free space at /mnt/freespace and upload it to Docker Hub, as far as I know. So I don't think saying must be empty and have at least the requested amount of space available is different from just saying must be empty. You can mount at paths that don't exist in your container image, or you can mount at paths that exist as empty directories, but you can't mount at paths that are files or paths that are directories that have contents.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should be more consistent with usage. I've updated the spec such that all uses of "execution environment" refers to the context in which the task is executed, and "host" refers to the system that hosts the execution environment.

SPEC.md Show resolved Hide resolved
@@ -3386,7 +3419,7 @@ A document is imported using it's [URI](https://en.wikipedia.org/wiki/Uniform_Re
* `https://`
* 🗑 `file://` - Using the `file://` protocol for local imports can be problematic. Its use is deprecated and will be removed in WDL 2.0.

In the event that there is no protocol specified, the import is resolved **relative to the location of the current document**. In the primary WDL document, a protocol-less import is relative to the host file system. If a protocol-less import starts with `/` it is interpreted as relative to the root of the host in the resolved URI.
In the event that there is no protocol specified, the import is resolved **relative to the location of the current document**. In the primary WDL document, a protocol-less import is relative to the folder that contains the primary WDL file. If a protocol-less import starts with `/` it is interpreted as relative to the root of the file system that contains the primary WDL file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The language here makes an assumption that path resolution happens on disk. This is not always true, as would be the case for resolving the imports of a Dockstore workflow. Its common practice to use relative import uris, but when resolving the workflow you would resolve it against the https:// context

@@ -5081,9 +5119,11 @@ Test config:
* `Array[String]` - An array of disk specifications.
* Default value: `1 GiB`

The `disks` attribute provides a way to request one or more persistent volumes, each of which has a minimum size and is mounted at a specific location. When the `disks` attribute is provided, the execution engine must guarantee the requested resources are available or immediately fail the task prior to instantiating the command.
The `disks` attribute provides a way to request one or more persistent volumes, each of which has a minimum size and is mounted at a specific location with both read and write permissions. When the `disks` attribute is provided, the execution engine must guarantee the requested resources are available or immediately fail the task prior to instantiating the command.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read and write permissions

I am worried this requirement constitutes a breaking change. I am not 1--% clear how cromwell works with data-disks and other existing disks but I would not be surprised if it treated some mounts as read-only


If a mount point is specified, then it must be an absolute path to a location in the host environment. If the mount point is omitted, it is assumed to be a persistent volume mounted at the root of the execution directory within a task.
If a mount point is specified, then it must be an absolute path to a location in the execution environment (i.e., within the container). The specified path either must not already exist in the execution environment, or it must be empty and have at least the requested amount of space available. The mount point should be assumed to be ephemeral, i.e., it will be deleted after the task completes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specified path either must not already exist in the execution environment, or it must be empt

I think this is technically hard to achieve when using a container technology like Docker, which simply allows overriding paths.

@@ -7957,7 +7996,7 @@ File join_paths(File, Array[String]+)
File join_paths(Array[String]+)
```

Joins together two or more paths into an absolute path in the host filesystem.
Joins together two or more paths into an absolute path in the execution environment's filesystem.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should define these somewhere. Execution environment can mean multiple different things in different contexts. IE in this case it could be

  • the trusted compute environment ie a cloud project
  • a VM
  • The running docker container


It is up to the execution engine to resolve input files and directories and stage them into the execution environment. The execution engine is free to specify the values that are allowed for `File` and `Directory` parameters, but at a minimum it is required to support POSIX absolute file paths (e.g., `/path/to/file`).

It is strongly recommended that input files and directories be specified as absolute paths to local files or as URLs. If relative paths are allowed, then it is suggested that they be resolved relative to the directory that contains the input JSON file (if a file is provided) or to the working directory in which the workflow is initially launched.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should change the suggestion from URL to URI` to capture use cases of cloud URIs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Semantics for runtime disks mount points are confusing
4 participants