Add Pull-through caching #1299

Merged: 1 commit from pull-through-cache into pulp:main, Jan 17, 2024
Conversation

@lubosmj (Member) commented Jun 4, 2023

  • Devise a workflow for adding content to a single repository version (adding content one by one will not follow the planned path for having repositories with consolidated repository versions)
  • Write a functional test (creating a pull-through cache remote and distribution and pulling content via the Pulp Container Registry)
  • Bump up the pulpcore requirement correspondingly because of migrations
  • Clean up the code (refactoring the code to remove duplicates and speeding up the operation)

closes #507

"""
TODO: Add permissions.
"""
TYPE = "container"
Member: This needs a new identifier. (almost certain)

@lubosmj lubosmj force-pushed the pull-through-cache branch 6 times, most recently from 6555d70 to c4028e0, June 12, 2023 18:49
@lubosmj lubosmj force-pushed the pull-through-cache branch 8 times, most recently from d4ab757 to 64640d5, June 15, 2023 11:25
@lubosmj (Member, Author) commented Jun 15, 2023

@ipanova, @mdellweg, would you mind reviewing this PR? Focus on the underlying logic.

Things to consider:

  1. The pull-through cache logic is placed exclusively in the live API (within a synchronous context). I could not lock a repository from the content app.
  2. Repository/distribution names and base_paths are created from the remotes' upstream names. This might conflict with existing repository/distribution names and base_paths.
  3. Similarly to the push workflow, I adopted the concept of pending blobs/manifests and extended it to classic repositories. The content is added to a repository only when the last remaining blob from the manifest's layers is requested, to prevent a situation in which the content is scattered across multiple repository versions (a minimal model sketch follows this list).
  4. In some cases (when some manifest layers are already stored locally), podman does not pull all blobs from Pulp. Because of this, we do not commit the content to the repository, since we are waiting for a user to request it first. Is this fine? I thought of removing the need for adding content to the repository. In pulp-to-pulp sync scenarios where one Pulp instance acts as a master (with pull-through caching enabled), the content will be committed to the repository because the master will serve it to another instance.
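For illustration, a minimal sketch of the pending-content idea from point 3, assuming Django many-to-many fields named after the pending_manifests/pending_blobs identifiers that appear later in this PR; the field definitions are a sketch, not the merged code:

from django.db import models
from pulpcore.plugin.models import Repository


class ContainerRepository(Repository):
    # Sketch only: content already downloaded through the cache but not yet
    # committed to a repository version; it is committed in one batch once
    # the last remaining blob of a manifest has been requested.
    pending_blobs = models.ManyToManyField("container.Blob")
    pending_manifests = models.ManyToManyField("container.Manifest")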

@lubosmj (Member, Author) commented Jun 15, 2023

Oh, the reason why I did not include sending HEAD requests beforehand is that "docker-content-digest" is not a required header and is probably not present in other, non-Docker registries.

@mdellweg (Member) commented:

  1. The pull-through cache logic is placed exclusively in the live API (within a synchronous context). I could not lock a repository from the content app.

Can't you dispatch the add_content task from the content app? In the end, the content app does not even need to wait for it to finish, right?
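For illustration, a hedged sketch of this suggestion; dispatch is pulpcore's task API, while add_pending_content and its kwargs are hypothetical:

from pulpcore.plugin.tasking import dispatch


def add_pending_content(repository_pk):
    """Hypothetical task body: commit pending content to a new repository version."""


def queue_add_content(repository):
    # Fire-and-forget: the repository is locked for the task's duration, but
    # the caller returns immediately and never awaits the result.
    dispatch(
        add_pending_content,
        exclusive_resources=[repository],
        kwargs={"repository_pk": str(repository.pk)},
    )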

@mdellweg mdellweg closed this Jun 26, 2023
@mdellweg mdellweg reopened this Jun 26, 2023
@mdellweg (Member) commented:

Sorry, wrong button.

@lubosmj (Member, Author) commented Jun 28, 2023

I believe the problem was due to the asynchronous context. I could not dispatch the task because of it.

@lubosmj (Member, Author) commented Jun 28, 2023

@ipanova and I concluded that we should preserve the idea of adding content to a repository (the exact opposite of what we do in other plugins). The 4th bullet point is no longer a concern if we assume that there is a user who does not have the cached layers on their system and will eventually download all pending blobs (which leads to committing the repository version). Repositories and distributions created from special distributions will be visible to users because we allow the pull operation.

Besides that, we identified two flaws in the current implementation:

  1. We need to address garbage collection. At this time, it is sufficient to mimic the behaviour of mirror=True synced repositories (add_and_remove: delete content which is no longer needed, plus retain_repo_version=1). Later we can handle garbage collection via #1268 ("As an admin/user I can maintain a controlled registry that doesn't continually grow in space").
  2. The name of a distribution and a repository should have the following format to mitigate conflicts: {special_cache_distribution_base_path}/{upstream_name} (see the sketch after the next list).

Things to work on next:

  • delete manifests and blobs from a caching repository, set retain_repo_version=1
  • update the base_path/name initialization for distribution/remote/repository in get_pull_through_drv
  • issue head/get requests to fetch the latest version of cached content
  • add support for namespaced caching (allow caching a specific organization besides the whole registry)
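For illustration, a minimal sketch of the naming scheme from flaw 2, combined with the get_or_create pattern visible later in this PR's diff; everything beyond the {base_path}/{upstream_name} format is an assumption:

from pulp_container.app import models


def get_or_create_cache_distribution(pull_through_distribution, upstream_name):
    # "{special_cache_distribution_base_path}/{upstream_name}" avoids clashes
    # with user-created distributions and repositories.
    path = f"{pull_through_distribution.base_path}/{upstream_name}"
    distribution, _ = models.ContainerDistribution.objects.get_or_create(
        name=path, base_path=path
    )
    return distribution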

@lubosmj lubosmj force-pushed the pull-through-cache branch 4 times, most recently from 86fb607 to c50e570, July 24, 2023 18:08
@mdellweg (Member) left a comment:

I think immediate tasks that run async code, called from other async code, will attempt to create their own event loop, and that is the setup for the error you are seeing. But I thought we were dispatching the task to run independently in the background anyway, never to be awaited.

@@ -16,7 +16,10 @@

log = getLogger(__name__)

InMemoryDownloadResult = namedtuple("InMemoryDownloadResult", ["data", "headers", "status_code"])
HeadResult = namedtuple(
"InMemoryDownloadResult",
Member: You might want to adjust that name.

response = await downloader.run(extra_data={"headers": V2_ACCEPT_HEADERS})
downloader = remote.get_downloader(url=tag_url)
try:
response = await downloader.run(extra_data={"headers": V2_ACCEPT_HEADERS})
Member: While this downloader runs, are we already streaming the data to the user?

Author: No, we want to ensure that we initialize the manifest and remote blobs first and add them to the pending_* content before dispatching the task and streaming data back to the client. The client can be faster than the task in this matter.
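A hedged sketch of that ordering; every helper below is a hypothetical stand-in, and the async many-to-many add assumes Django 4.1+:

async def init_manifest(manifest_data): ...  # hypothetical
async def init_remote_blobs(manifest_data, remote): ...  # hypothetical
def dispatch_add_content(repository): ...  # hypothetical fire-and-forget


async def serve_manifest(repository, remote, manifest_data, raw_manifest):
    manifest = await init_manifest(manifest_data)
    blobs = await init_remote_blobs(manifest_data, remote)

    # Record pending content *before* dispatching and before streaming, so a
    # fast client cannot request blobs the task does not yet know about.
    await repository.pending_manifests.aadd(manifest)
    await repository.pending_blobs.aadd(*blobs)

    dispatch_add_content(repository)  # never awaited
    return raw_manifest  # only now stream the manifest back to the client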

pulp_container/app/models.py (resolved review thread)
pulp_container/app/registry.py (outdated, resolved review thread)

digest = response.headers.get("docker-content-digest")
if tag.tagged_manifest.digest != digest:
downloader = remote.get_downloader(url=tag_url)
Member: You already have it on line 208.

Author: In that case, we check for the digest with the HEAD request.

Member: That's OK, but you don't need to write remote.get_downloader(url=tag_url) twice.

pulp_container/app/registry.py (outdated, resolved review thread)
media_type = determine_media_type(manifest_data, response)
if media_type not in (MEDIA_TYPE.MANIFEST_LIST, MEDIA_TYPE.INDEX_OCI):
await self.save_manifest_and_blobs(
digest, manifest_data, media_type, remote, repository, saved_artifact
Member: The digest you got from the docker-content-digest header is not reliable because it is not a required header. You should resort to calculate_digest, as in the except branch on line 176.
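For illustration, a minimal sketch of that approach; plain hashlib is shown here as a stand-in for the calculate_digest helper being referenced:

import hashlib


def calculate_digest(raw_manifest: bytes) -> str:
    # Compute the digest from the downloaded bytes instead of trusting the
    # optional docker-content-digest response header.
    return "sha256:" + hashlib.sha256(raw_manifest).hexdigest()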

async def save_manifest_and_blobs(
self, digest, manifest_data, media_type, remote, repository, artifact
):
config_digest = manifest_data["config"]["digest"]
Member: How are you sure this is not a schema1 manifest?

Author: I am not. 😭
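For context, a hedged sketch of one possible guard before dereferencing manifest_data["config"]; the MEDIA_TYPE constants mirror those used elsewhere in this diff, while the guard itself is an assumption, not the merged fix:

from pulp_container.constants import MEDIA_TYPE


def extract_config_digest(manifest_data, media_type):
    # Sketch only: schema1 manifests carry no "config" section, so guard
    # before reading it.
    if media_type in (MEDIA_TYPE.MANIFEST_V1, MEDIA_TYPE.MANIFEST_V1_SIGNED):
        return None
    return manifest_data["config"]["digest"]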

try:
manifest_data = json.loads(raw_data)
except json.decoder.JSONDecodeError:
raise PathNotResolved(digest)
Member: This should be path here, not digest.

@@ -318,7 +486,54 @@ async def get_by_digest(self, request):
"Docker-Content-Digest": ca_content.digest,
}
except ObjectDoesNotExist:
raise PathNotResolved(path)
distribution = await distribution.acast()
Member: You need to do something about this code being repeated 3 times in 3 places.

Author: I extracted a new class.

manifest = repository.pending_manifests.get(digest=pk)
manifest.touch()
except models.Manifest.DoesNotExist:
pass
Member: Why not raise the error?

Author: There is still a chance that the fired pull-through download task has not finished while the user is already trying to get a listed manifest. We do not pre-record all listed manifests and their blobs in the content app, so the manifest may not yet exist in pending_manifests and may not be associated with any repository.

@ipanova (Member) commented Dec 15, 2023

@lubosmj

2. Setting `mirror=True` in the staging resulted in older tags being removed from the repository. I am using the default `mirror=False`.

I thought this was the desired behavior. Since this is a pull-through-cache repository, it should exactly match the remote content. So if a tag was removed remotely, I do not see why we should keep it locally.

@lubosmj (Member, Author) commented Dec 15, 2023

I thought this was the desired behavior. Since this is a pull-through-cache repository, it should exactly match the remote content. So if a tag was removed remotely, I do not see why we should keep it locally.

But using mirror=True causes the sync pipeline to think that the old tag, which was previously pulled by the user, has gone, and it removes the tag even though it might still be present on the remote. See the test case test_pull_manifest_list, where I pull two different tags. With mirror=True, the test fails because of the missing latest tag.

How can I forcefully tell the sync pipeline to not remove the existing tag? Also, how do we know if the tag was removed from the remote registry if the user never asks for it and thus we never realize that?

@ipanova (Member) commented Dec 15, 2023

@lubosmj Since we are using ContainerPullThroughCacheDeclarativeVersion, which never checks the rest of the content the way the normal DeclarativeVersion would, it will always keep only the latest pulled tag with mirror=True. So you're right, we should use mirror=False because we have no choice.

@lubosmj lubosmj marked this pull request as draft December 20, 2023 17:18
@lubosmj lubosmj force-pushed the pull-through-cache branch 10 times, most recently from f5dab49 to cbd654e, January 2, 2024 19:40
@lubosmj lubosmj marked this pull request as ready for review January 2, 2024 20:33
else:
raise PathNotResolved(tag_name)
else:
if distribution.remote_id and distribution.pull_through_distribution_id:
Author: Add the cast call here.

Author: Explicitly state that inside this if branch we are working through the pull-through distribution.

extra_data={"headers": V2_ACCEPT_HEADERS, "http_method": "head"}
)
except ClientResponseError:
raise PathNotResolved(path)
Author: Instead, return the existing tag. The tag will just not be refreshed.

"Docker-Distribution-API-Version": "registry/2.0",
}
return web.Response(text=raw_manifest, headers=headers)
else:
Author: Maybe mention that we parse "blobs" and initialize a remote artifact here.

# it is necessary to pass this information back to the client
raise HTTPTooManyRequests()
else:
raise PathNotResolved(self.path)
Author: Leave a TODO comment about possible changes in the future. Right now, we are masking error messages that might be useful to the client.


manifest = Manifest(
digest=digest,
schema_version=2,
Author: Get the schema version from media_type.
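A tiny sketch of that note, assuming the MEDIA_TYPE constants used elsewhere in this diff:

from pulp_container.constants import MEDIA_TYPE

# Sketch only: derive schema_version from the manifest's media type instead
# of hardcoding 2.
schema_version = (
    1 if media_type in (MEDIA_TYPE.MANIFEST_V1, MEDIA_TYPE.MANIFEST_V1_SIGNED) else 2
)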

Comment on lines +1025 to +1031
tag = models.Tag(name=pk, tagged_manifest=manifest)
try:
tag.save()
except IntegrityError:
tag = models.Tag.objects.get(name=tag.name, tagged_manifest=manifest)
tag.touch()
Author: Add the tag to the repository via an immediate task.
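A hedged sketch of this note; dispatch comes from pulpcore, its immediate flag is assumed from recent pulpcore releases, and the task function is hypothetical:

from pulpcore.plugin.tasking import dispatch


def add_tag_to_repository(tag_pk, repository_pk):
    """Hypothetical task body: add the tag to a new repository version."""


dispatch(
    add_tag_to_repository,
    exclusive_resources=[repository],
    kwargs={"tag_pk": str(tag.pk), "repository_pk": str(repository.pk)},
    immediate=True,
)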

@@ -1207,12 +1325,18 @@ def head(self, request, path, pk=None):

def get(self, request, path, pk):
"""Return a signature identified by its sha256 checksum."""
_, _, repository_version = self.get_drv_pull(path)
_, repository, repository_version = self.get_drv_pull(path)
Author: Maybe revert this change.

@@ -1302,6 +1382,103 @@ def destroy(self, request, pk, **kwargs):
return OperationPostponedResponse(async_result, request)


class ContainerPullThroughDistributionViewSet(DistributionViewSet, RolesMixin):
Author: TODO: add a comment about inheriting the private flag from the pull-through cache distribution.

**remote_data,
)

cache_distribution, _ = models.ContainerDistribution.objects.get_or_create(
Author: TODO in the future? Propagate the permissions and private flag from the pull-through distribution to this distribution.

pre-configure a new repository and sync it to facilitate the retrieval of the actual content. This
speeds up the whole process of shipping containers from its early management stages to distribution.
Similarly to on-demand syncing, the feature also **reduces external network dependencies**, and
ensures a more reliable container deployment system in production environments.
Author: Distributions are public by default.

@ipanova (Member) left a comment:

We walked through the PR on the call, and @lubosmj has a few things to update and can merge afterwards.

@lubosmj lubosmj enabled auto-merge (rebase) January 17, 2024 17:29
@lubosmj lubosmj merged commit cbaa073 into pulp:main Jan 17, 2024
14 checks passed
Successfully merging this pull request may close these issues:

  • As a user I can pull-through cache container images when remote is defined on distribution

3 participants