Add Pull-through caching #1299

Merged: 1 commit from pull-through-cache into pulp:main, Jan 17, 2024
Conversation

@lubosmj (Member) commented Jun 4, 2023

  • Devise a workflow for adding content to a single repository version (adding content one by one will not follow the planned path for having repositories with consolidated repository versions)
  • Write a functional test (creating a pull-through cache remote and distribution and pulling content via the Pulp Container Registry)
  • Bump up the pulpcore requirement correspondingly because of migrations
  • Clean up the code (refactoring the code to remove duplicates and speeding up the operation)

closes #507

"""
TODO: Add permissions.
"""
TYPE = "container"
Member: This needs a new identifier. (almost certain)

@lubosmj lubosmj force-pushed the pull-through-cache branch 6 times, most recently from 6555d70 to c4028e0, June 12, 2023 18:49
@lubosmj lubosmj force-pushed the pull-through-cache branch 8 times, most recently from d4ab757 to 64640d5, June 15, 2023 11:25
@lubosmj (Member, Author) commented Jun 15, 2023

@ipanova, @mdellweg, would you mind reviewing this PR? Focus on the underlying logic.

Things to consider:

  1. The pull-through cache logic is placed exclusively in the live API (within a synchronous context). I could not lock a repository from the content app.
  2. Repository/distribution names and base_paths are created from the remotes' upstream names. This might conflict with existing repository/distribution names and base_paths.
  3. Similarly to the push workflow, I adopted the concept of pending blobs/manifests and extended it to classic repositories. The content is added to a repository only when the last remaining blob from the manifest's layers is requested, to prevent a situation in which the content is scattered across multiple repository versions (a minimal model sketch follows this list).
  4. In some cases (when some manifest layers are already stored locally), podman does not pull all blobs from Pulp. Because of this, we do not commit the content to the repository, since we are waiting for a user to request it first. Is this fine? I thought of removing the need for adding content to the repository. In pulp-to-pulp sync scenarios where one Pulp instance acts as a master (with pull-through caching enabled), the content will be committed to the repository because the master will serve it to another instance.
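For illustration, a minimal sketch of the pending-content idea from point 3, assuming Django many-to-many fields named after the pending_manifests/pending_blobs identifiers that appear later in this PR; the field definitions are a sketch, not the merged code:

from django.db import models
from pulpcore.plugin.models import Repository


class ContainerRepository(Repository):
    # Sketch only: content already downloaded through the cache but not yet
    # committed to a repository version; it is committed in one batch once
    # the last remaining blob of a manifest has been requested.
    pending_blobs = models.ManyToManyField("container.Blob")
    pending_manifests = models.ManyToManyField("container.Manifest")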

@lubosmj (Member, Author) commented Jun 15, 2023

Oh, the reason why I did not include sending HEAD requests beforehand is that "docker-content-digest" is not a required header and is probably not present in other, non-Docker registries.

@mdellweg (Member) commented:

  1. The pull-through cache logic is placed exclusively in the live API (within a synchronous context). I could not lock a repository from the content app.

Can't you dispatch the add_content task from the content app? In the end, the content app does not even need to wait for it to finish, right?
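For illustration, a hedged sketch of this suggestion; dispatch is pulpcore's task API, while add_pending_content and its kwargs are hypothetical:

from pulpcore.plugin.tasking import dispatch


def add_pending_content(repository_pk):
    """Hypothetical task body: commit pending content to a new repository version."""


def queue_add_content(repository):
    # Fire-and-forget: the repository is locked for the task's duration, but
    # the caller returns immediately and never awaits the result.
    dispatch(
        add_pending_content,
        exclusive_resources=[repository],
        kwargs={"repository_pk": str(repository.pk)},
    )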

@mdellweg mdellweg closed this Jun 26, 2023
@mdellweg mdellweg reopened this Jun 26, 2023
@mdellweg (Member) commented:

Sorry, wrong button.

@lubosmj (Member, Author) commented Jun 28, 2023

I believe the problem was due to the asynchronous context. I could not dispatch the task because of it.

@lubosmj (Member, Author) commented Jun 28, 2023

@ipanova and I concluded that we should preserve the idea of adding content to a repository (the exact opposite of what we do in other plugins). The 4th bullet point is no longer a concern if we assume that there is a user who does not have the cached layers on their system and will eventually download all pending blobs (which leads to committing the repository version). Repositories and distributions created from special distributions will be visible to users because we allow the pull operation.

Besides that, we identified two flaws in the current implementation:

  1. We need to address garbage collection. At this time, it is sufficient to mimic the behaviour of mirror=True synced repositories (add_and_remove: delete content which is no longer needed, plus retain_repo_version=1). Later we can handle garbage collection via #1268 ("As an admin/user I can maintain a controlled registry that doesn't continually grow in space").
  2. The name of a distribution and a repository should have the following format to mitigate conflicts: {special_cache_distribution_base_path}/{upstream_name} (see the sketch after the next list).

Things to work on next:

  • delete manifests and blobs from a caching repository, set retain_repo_version=1
  • update the base_path/name initialization for distribution/remote/repository in get_pull_through_drv
  • issue head/get requests to fetch the latest version of cached content
  • add support for namespaced caching (allow caching a specific organization besides the whole registry)
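For illustration, a minimal sketch of the naming scheme from flaw 2, combined with the get_or_create pattern visible later in this PR's diff; everything beyond the {base_path}/{upstream_name} format is an assumption:

from pulp_container.app import models


def get_or_create_cache_distribution(pull_through_distribution, upstream_name):
    # "{special_cache_distribution_base_path}/{upstream_name}" avoids clashes
    # with user-created distributions and repositories.
    path = f"{pull_through_distribution.base_path}/{upstream_name}"
    distribution, _ = models.ContainerDistribution.objects.get_or_create(
        name=path, base_path=path
    )
    return distribution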

@lubosmj lubosmj force-pushed the pull-through-cache branch 4 times, most recently from 86fb607 to c50e570, July 24, 2023 18:08
@mdellweg (Member) left a comment:

I think immediate tasks that run async code, called from other async code, will attempt to create their own event loop, and that is the setup for the error you are seeing. But I thought we were dispatching the task to run independently in the background anyway, never to be awaited.

@@ -16,7 +16,10 @@

log = getLogger(__name__)

InMemoryDownloadResult = namedtuple("InMemoryDownloadResult", ["data", "headers", "status_code"])
HeadResult = namedtuple(
"InMemoryDownloadResult",
Member: You might want to adjust that name.

response = await downloader.run(extra_data={"headers": V2_ACCEPT_HEADERS})
downloader = remote.get_downloader(url=tag_url)
try:
response = await downloader.run(extra_data={"headers": V2_ACCEPT_HEADERS})
Member: While this downloader runs, are we already streaming the data to the user?

Author: No, we want to ensure that we initialize the manifest and remote blobs first and add them to the pending_* content before dispatching the task and streaming data back to the client. The client can be faster than the task in this matter.
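A hedged sketch of that ordering; every helper below is a hypothetical stand-in, and the async many-to-many add assumes Django 4.1+:

async def init_manifest(manifest_data): ...  # hypothetical
async def init_remote_blobs(manifest_data, remote): ...  # hypothetical
def dispatch_add_content(repository): ...  # hypothetical fire-and-forget


async def serve_manifest(repository, remote, manifest_data, raw_manifest):
    manifest = await init_manifest(manifest_data)
    blobs = await init_remote_blobs(manifest_data, remote)

    # Record pending content *before* dispatching and before streaming, so a
    # fast client cannot request blobs the task does not yet know about.
    await repository.pending_manifests.aadd(manifest)
    await repository.pending_blobs.aadd(*blobs)

    dispatch_add_content(repository)  # never awaited
    return raw_manifest  # only now stream the manifest back to the client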

pulp_container/app/models.py (resolved review thread)
pulp_container/app/registry.py (outdated, resolved review thread)

digest = response.headers.get("docker-content-digest")
if tag.tagged_manifest.digest != digest:
downloader = remote.get_downloader(url=tag_url)
Member: You already have it on line 208.

Author: In that case, we check for the digest with the HEAD request.

Member: That's OK, but you don't need to write remote.get_downloader(url=tag_url) twice.

pulp_container/app/registry.py (outdated, resolved review thread)
media_type = determine_media_type(manifest_data, response)
if media_type not in (MEDIA_TYPE.MANIFEST_LIST, MEDIA_TYPE.INDEX_OCI):
await self.save_manifest_and_blobs(
digest, manifest_data, media_type, remote, repository, saved_artifact
Member: The digest you got from the docker-content-digest header is not reliable because it is not a required header. You should resort to calculate_digest, as in the except branch on line 176.
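For illustration, a minimal sketch of that approach; plain hashlib is shown here as a stand-in for the calculate_digest helper being referenced:

import hashlib


def calculate_digest(raw_manifest: bytes) -> str:
    # Compute the digest from the downloaded bytes instead of trusting the
    # optional docker-content-digest response header.
    return "sha256:" + hashlib.sha256(raw_manifest).hexdigest()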

async def save_manifest_and_blobs(
self, digest, manifest_data, media_type, remote, repository, artifact
):
config_digest = manifest_data["config"]["digest"]
Member: How are you sure this is not a schema1 manifest?

Author: I am not. 😭
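For context, a hedged sketch of one possible guard before dereferencing manifest_data["config"]; the MEDIA_TYPE constants mirror those used elsewhere in this diff, while the guard itself is an assumption, not the merged fix:

from pulp_container.constants import MEDIA_TYPE


def extract_config_digest(manifest_data, media_type):
    # Sketch only: schema1 manifests carry no "config" section, so guard
    # before reading it.
    if media_type in (MEDIA_TYPE.MANIFEST_V1, MEDIA_TYPE.MANIFEST_V1_SIGNED):
        return None
    return manifest_data["config"]["digest"]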

try:
manifest_data = json.loads(raw_data)
except json.decoder.JSONDecodeError:
raise PathNotResolved(digest)
Member: This should be path here, not digest.

@@ -318,7 +486,54 @@ async def get_by_digest(self, request):
"Docker-Content-Digest": ca_content.digest,
}
except ObjectDoesNotExist:
raise PathNotResolved(path)
distribution = await distribution.acast()
Member: You need to do something about this code being repeated 3 times in 3 places.

Author: I extracted a new class.

manifest = repository.pending_manifests.get(digest=pk)
manifest.touch()
except models.Manifest.DoesNotExist:
pass
Member: Why not raise the error?

Author: There is still a chance that the fired pull-through download task has not finished while the user is already trying to get a listed manifest. We do not pre-record all listed manifests and their blobs in the content app, so the manifest may not yet exist in pending_manifests and may not be associated with any repository.

@ipanova (Member) commented Dec 15, 2023

@lubosmj

2. Setting `mirror=True` in the staging resulted in older tags being removed from the repository. I am using the default `mirror=False`.

I thought this was the desired behavior. Since this is a pull-through-cache repository, it should exactly match the remote content. So if a tag was removed remotely, I do not see why we should keep it locally.

@lubosmj (Member, Author) commented Dec 15, 2023

I thought this was the desired behavior. Since this is a pull-through-cache repository, it should exactly match the remote content. So if a tag was removed remotely, I do not see why we should keep it locally.

But using mirror=True causes the sync pipeline to think that the old tag, which was previously pulled by the user, has gone, and it removes the tag even though it might still be present on the remote. See the test case test_pull_manifest_list, where I pull two different tags. With mirror=True, the test fails because of the missing latest tag.

How can I forcefully tell the sync pipeline to not remove the existing tag? Also, how do we know if the tag was removed from the remote registry if the user never asks for it and thus we never realize that?

@ipanova (Member) commented Dec 15, 2023

@lubosmj Since we are using ContainerPullThroughCacheDeclarativeVersion, which never checks the rest of the content the way the normal DeclarativeVersion would, it will always keep only the latest pulled tag with mirror=True. So you're right, we should use mirror=False because we have no choice.

@lubosmj lubosmj marked this pull request as draft December 20, 2023 17:18
@lubosmj lubosmj force-pushed the pull-through-cache branch 10 times, most recently from f5dab49 to cbd654e, January 2, 2024 19:40
@lubosmj lubosmj marked this pull request as ready for review January 2, 2024 20:33
else:
raise PathNotResolved(tag_name)
else:
if distribution.remote_id and distribution.pull_through_distribution_id:
Author: Add the cast call here.

Author: Explicitly state that inside this if branch we are working through the pull-through distribution.

extra_data={"headers": V2_ACCEPT_HEADERS, "http_method": "head"}
)
except ClientResponseError:
raise PathNotResolved(path)
Author: Instead, return the existing tag. The tag will just not be refreshed.

"Docker-Distribution-API-Version": "registry/2.0",
}
return web.Response(text=raw_manifest, headers=headers)
else:
Author: Maybe mention that we parse "blobs" and initialize a remote artifact here.

# it is necessary to pass this information back to the client
raise HTTPTooManyRequests()
else:
raise PathNotResolved(self.path)
Author: Leave a TODO comment about possible changes in the future. Right now, we are masking error messages that might be useful to the client.


manifest = Manifest(
digest=digest,
schema_version=2,
Author: Get the schema version from media_type.
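A tiny sketch of that note, assuming the MEDIA_TYPE constants used elsewhere in this diff:

from pulp_container.constants import MEDIA_TYPE

# Sketch only: derive schema_version from the manifest's media type instead
# of hardcoding 2.
schema_version = (
    1 if media_type in (MEDIA_TYPE.MANIFEST_V1, MEDIA_TYPE.MANIFEST_V1_SIGNED) else 2
)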

Comment on lines +1025 to +1031
tag = models.Tag(name=pk, tagged_manifest=manifest)
try:
tag.save()
except IntegrityError:
tag = models.Tag.objects.get(name=tag.name, tagged_manifest=manifest)
tag.touch()
Author: Add the tag to the repository via an immediate task.
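A hedged sketch of this note; dispatch comes from pulpcore, its immediate flag is assumed from recent pulpcore releases, and the task function is hypothetical:

from pulpcore.plugin.tasking import dispatch


def add_tag_to_repository(tag_pk, repository_pk):
    """Hypothetical task body: add the tag to a new repository version."""


dispatch(
    add_tag_to_repository,
    exclusive_resources=[repository],
    kwargs={"tag_pk": str(tag.pk), "repository_pk": str(repository.pk)},
    immediate=True,
)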

@@ -1207,12 +1325,18 @@ def head(self, request, path, pk=None):

def get(self, request, path, pk):
"""Return a signature identified by its sha256 checksum."""
_, _, repository_version = self.get_drv_pull(path)
_, repository, repository_version = self.get_drv_pull(path)
Author: Maybe revert this change.

@@ -1302,6 +1382,103 @@ def destroy(self, request, pk, **kwargs):
return OperationPostponedResponse(async_result, request)


class ContainerPullThroughDistributionViewSet(DistributionViewSet, RolesMixin):
Author: TODO: add a comment about inheriting the private flag from the pull-through cache distribution.

**remote_data,
)

cache_distribution, _ = models.ContainerDistribution.objects.get_or_create(
Author: TODO in the future? Propagate the permissions and private flag from the pull-through distribution to this distribution.

pre-configure a new repository and sync it to facilitate the retrieval of the actual content. This
speeds up the whole process of shipping containers from its early management stages to distribution.
Similarly to on-demand syncing, the feature also **reduces external network dependencies**, and
ensures a more reliable container deployment system in production environments.
Author: Distributions are public by default.

@ipanova (Member) left a comment:

We walked through the PR on the call, and @lubosmj has a few things to update and can merge afterwards.

@lubosmj lubosmj enabled auto-merge (rebase) January 17, 2024 17:29
@lubosmj lubosmj merged commit cbaa073 into pulp:main Jan 17, 2024
14 checks passed
Successfully merging this pull request may close these issues:

  • As a user I can pull-through cache container images when remote is defined on distribution

3 participants