
CI: make 090-events parallel-safe #23987

Merged · 1 commit merged into containers:main on Sep 18, 2024

Conversation

@edsantiago (Member) commented Sep 17, 2024

...or at least as much as possible. Some tests cannot
be run in parallel due to #23750: "--events-backend=file"
does not actually work the way a naïve user would intuit.
Stop/die events are asynchronous, and can be gathered
by ANY OTHER podman process running after it, and if
that process has the default events-backend=journal,
that's where the event will be logged. See #23987 for
further discussion.

Signed-off-by: Ed Santiago [email protected]

Release note: None

@openshift-ci bot added the release-note-none and approved labels Sep 17, 2024
@Luap99 (Member) left a comment

Some tests cannot
be run in parallel due to #23750 ("--events-backend=file"
does not actually mean "events-backend=file").

Well, I think this is just misleading; --events-backend=file works fine like this. What happens is that another process not configured to use the file logger can also see your container and sync its state there, which then creates the died event. The event logger is per process and NOT per container, so this seems totally fine to me.
In general it is not really sane to mix event loggers like this anyway, and no user should do that.

Comment on lines +36 to +40
# Wait for container to truly be gone.
# 99% of the time this will return immediately with a "no such container" error,
# which is fine. Under heavy load, it might actually catch the container while
# it's being cleaned up. Either way, this guarantees the "died" event is logged.
PODMAN_TIMEOUT=4 run_podman '?' wait $id
Member

Looking at this, you could just remove the -d from podman run: since podman run then runs in the foreground, it will ensure the container is deleted before it exits. But I guess we would then need another way to fetch the cid, so in terms of podman commands nothing really changes; the wait here is fine with me.

@edsantiago (Member Author)

The event logger is per process and NOT per container

That's an internal distinction that makes sense to you, but as you admitted on the bug page, it violates POLA for everyone else.

In general it is not really sane to mix the event loggers like this anyway and no user should do that.

This is something that's very hard for me to understand, even with your explanations. I'm sure in six months I will forget it, and run podman with --events-backend because it's an existing documented flag that's very handy for some cases. If it should not be used, maybe there should be a warning emitted every time it's seen on the command line?

run_podman run --name=$cname --rm $IMAGE true
# FIXME FIXME FIXME, please confirm that my 'container=' change is correct
Member Author

Eyeballs requested on this change, please

Member

I don't know what the indentation was, but yes, this seems correct; status=$cname clearly makes no sense.
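
For readers without the diff open, a rough sketch of the kind of check this filter feeds into; the actual events invocation is not visible above, so the last command (and its --stream=false flag) is an illustrative assumption rather than the test's exact code:

# Hypothetical reconstruction of the check under discussion, not the real test code.
run_podman run --name=$cname --rm $IMAGE true
# Filtering events by container name is what the test wants; the old
# status=$cname filter compared a container name against an event status,
# which is why it made no sense.
run_podman events --stream=false --filter container=$cname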

@edsantiago marked this pull request as draft September 17, 2024 17:31
@openshift-ci bot added the do-not-merge/work-in-progress label Sep 17, 2024
@edsantiago (Member Author)

Do not merge until my FIXME question is answered

@Luap99 (Member) commented Sep 17, 2024

That's an internal distinction that makes sense to you, but as you admitted in the bug page it violates POLA for everyone else.

Let me put it another way: if you do podman --events-backend journald run --name c1 ... and then podman --events-backend file stop c1, then which events should be generated where? In practice there is always a race between the foreground podman stop (file) and the background podman container cleanup (journald); both of them can create the died/cleanup and possibly other events, and which backend gets them just depends on who gets there first.

And --events-backend is a global option, not an option set when creating a container, so it applies only to the current libpod runtime. It really is no different from --db-backend, --network-backend, and others, which should just not be mixed without knowing the consequences.

And yes, I realize that in your parallel tests it is a bit more complicated, because you do not interact with the container directly from another test; rather, an operation that runs on all containers, such as podman mount, causes the died event to be generated. The reason for this is that the died event should be created as soon as we notice the exit, and this means every command that acts on this container and syncs its state can and will cause this event.

So if you want to run this in parallel, you cannot have other podman processes running in parallel that act on all containers, and of course that is much more difficult than simply not using --events-backend in parallel.

If it should not be used, maybe there should be a warning emitted every time it's seen on the command line?

You can use it, but it must be used consistently for all commands, like we do in e2e, not only for some.
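
To make the scenario concrete, here is a minimal command-level sketch of the mixed-backend race described above; the sleep command and the final events queries are illustrative, and which backend actually receives the died/cleanup events depends on which process syncs the container state first:

# Container started by a process using the journald event backend.
podman --events-backend journald run -d --name c1 $IMAGE sleep 100

# Stop issued by a process using the file backend. This foreground stop and
# the background "podman container cleanup" process (journald) race to record
# the died/cleanup events, so they can land in either backend.
podman --events-backend file stop c1

# Each backend only shows events written by processes configured to use it.
podman --events-backend journald events --stream=false --filter container=c1
podman --events-backend file events --stream=false --filter container=c1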

...or at least as much as possible. Some tests cannot
be run in parallel due to containers#23750: "--events-backend=file"
does not actually work the way a naïve user would intuit.
Stop/die events are asynchronous, and can be gathered
by *ANY OTHER* podman process running after it, and if
that process has the default events-backend=journal,
that's where the event will be logged. See containers#23987 for
further discussion.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member Author)

if you do podman --events-backend journald run --name c1 ... and then podman --events-backend file stop c1, then which events should be generated where?

Thank you, that is a good illustration that helps me understand. However... I think we're in a middle situation because there is no podman operation on c1 that does not also have --events-backend=file. What I think is happening (based on your comment in the issue) is something closer to:

  1. podman --events-backend=file run c1
  2. (c1 runs, then completes, and generates the die event)
  3. a completely unrelated test runs podman something, right after c1 finishes but before the podman command from step 1 writes the event. This podman something command uses the default journal backend; it sees the die event and "helpfully" writes it to the journal
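
(As a rough shell-level rendering of those three steps; the -d flag and the unrelated podman ps -a command are illustrative assumptions, not the test's actual commands:)

# Step 1: the test under discussion starts a container via the file backend.
podman --events-backend=file run -d --name c1 $IMAGE true
# Step 2: c1 completes; its die event is due to be written to the file backend.
# Step 3: before the test's own file-backend command records that event, an
# unrelated parallel test runs some podman command with the default journald
# backend; it syncs c1's state, notices the exit, and writes the died event
# to the journal instead.
podman ps -a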

Is that a sensible explanation of what is happening?

Anyhow, I updated the commit message, is that more acceptable?

@Luap99 (Member) commented Sep 18, 2024

Is that a sensible explanation of what is happening?

Yes

Anyhow, I updated the commit message, is that more acceptable?

sounds good

@Luap99 (Member) left a comment

LGTM

openshift-ci bot (Contributor) commented Sep 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago marked this pull request as ready for review September 18, 2024 15:55
@openshift-ci bot removed the do-not-merge/work-in-progress label Sep 18, 2024
@rhatdan (Member) commented Sep 18, 2024

/lgtm

@openshift-ci bot added the lgtm label Sep 18, 2024
@openshift-merge-bot bot merged commit 04d193d into containers:main Sep 18, 2024
56 checks passed
@edsantiago deleted the safename-090 branch September 18, 2024 16:12
Labels: approved, lgtm, release-note-none