Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HealthCheck log output options #23900

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented Sep 9, 2024

This PR creates three new flags that can affect the output of the HealtCheck log.

Currently, when a container is configured with HealthCheck, the output from the HealthCheck command is only logged to the container status file, which is accessible via podman inspect. It is also limited to the last five executions and the first 500 characters per execution.

This makes debugging past problems very difficult, since the only information available about the failure of the HealthCheck command is the generic healthcheck service failed record.

  • The --health-log-destination flag sets the destination of the HealthCheck log.

    • none: (default behavior) HealthCheckResults are stored in overlay containers. (For example: ./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log)
    • directory: creates a log file named <container-ID>-healthcheck.log with JSON HealthCheckResults in the specified directory.
    • events_logger: The log will be written with logging mechanism set by events_logger.
  • The --health-max-log-count flag sets the maximum number of attempts in the HealthCheck log file.

    • A value of 0 indicates an infinite number of attempts in the log file.
    • The default value is 5 attempts in the log file.
  • The --health-max-log-size flag sets the maximum length of the log stored.

    • A value of 0 indicates an infinite log length.
    • The default value is 500 log characters.

Does this PR introduce a user-facing change?

Added --health-log-destination, --health-max-log-count and --health-max-log-size flags that affect HealtCheck log output.

Fixes: RHEL-24623

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Sep 9, 2024
Copy link
Contributor

openshift-ci bot commented Sep 9, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Honny1
Once this PR has been reviewed and has the lgtm label, please assign flouthoc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Honny1 Honny1 force-pushed the healthcheck-log branch 11 times, most recently from 331ce42 to ebadb8b Compare September 10, 2024 15:58
@containers containers deleted a comment from packit-as-a-service bot Sep 10, 2024
@Honny1 Honny1 force-pushed the healthcheck-log branch 15 times, most recently from ee1d02a to bf7e1cb Compare September 13, 2024 10:56
@openshift-ci openshift-ci bot added release-note and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Sep 17, 2024
@containers containers deleted a comment from packit-as-a-service bot Sep 17, 2024
@Honny1 Honny1 changed the title Options for HealthCheck log output HealthCheck log output options Sep 17, 2024
@Honny1 Honny1 marked this pull request as ready for review September 17, 2024 13:37
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2024
@@ -413,6 +413,14 @@ type ContainerMiscConfig struct {
HealthCheckConfig *manifest.Schema2HealthConfig `json:"healthcheck"`
// HealthCheckOnFailureAction defines an action to take once the container turns unhealthy.
HealthCheckOnFailureAction define.HealthCheckOnFailureAction `json:"healthcheck_on_failure_action"`
// HealthLogDestination defines the destination where the log is stored
HealthLogDestination string `json:"healthLogDestination"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mark all of these as omitempty - we shouldn't bother including them in the JSON by default

if path := c.healthCheckLogPath(); path != "" {
path, err := c.healthCheckLogPath()
if err != nil {
logrus.Error(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this shouldn't be fatal? I think this seems like a hard-stop.

@@ -156,7 +148,18 @@ func (c *Container) runHealthCheck(ctx context.Context, isStartup bool) (define.
hcl := newHealthCheckLog(timeStart, timeEnd, returnCode, eventLog)
logStatus, err := c.updateHealthCheckLog(hcl, inStartPeriod, isStartup)
if err != nil {
return hcResult, "", fmt.Errorf("unable to update health check log %s for %s: %w", c.healthCheckLogPath(), c.ID(), err)
path, err := c.healthCheckLogPath()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, from reading this - do we support logging to file and journald at the same time? That seems unnecessary - one or the other should be sufficient.

}

// HealthCheckLogPath returns the path for where the health check log is
func (c *Container) healthCheckLogPath() string {
return filepath.Join(filepath.Dir(c.state.RunDir), "healthcheck.log")
func (c *Container) healthCheckLogPath() (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider replacing this with a more generic readHealthcheckLog() to better handle the fact that they can go in journald now?

@@ -186,3 +187,21 @@ func (c *Container) hcUnitName(isStartup, bare bool) string {
}
return unitName
}

func (c *Container) sendHealthCheckLogToJournal(status string, log string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an existing healthcheck event, right? Can you extend that with optional fields instead of creating a new event for the log?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but then the target could be affected by events_logger. I don't think this is a problem, but I wonder why HealthCheckResults were originally stored in a separate file.

@Honny1
Copy link
Member Author

Honny1 commented Sep 19, 2024

@mheon PTAL

### `HealthLogDestination=`

Set the destination of the HealthCheck log. Directory path or events_logger (none use container state file)
Equivalent to the Podman `--health-log-destination` option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a sentence at the end here saying "the default is none" so it's very clear - I initially missed the (default) bit below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, none doesn't feel like the right name. Maybe local?

### `HealthMaxLogCount=`

Set maximum number of attempts in the HealthCheck log file. ('0' value means an infinite number of attempts in the log file)
Equivalent to the Podman `--Health-max-log-count` option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the default here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 attempts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put these in the manpage, make it clear to users what the expectation is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added default values to man pages.

### `HealthMaxLogSize=`

Set maximum length in characters of stored HealthCheck log. ("0" value means an infinite log length)
Equivalent to the Podman `--Health-max-log-size` option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the default here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

500 characters

@mheon
Copy link
Member

mheon commented Sep 19, 2024

@Luap99 PTAL, particularly at the events bits. I don't really mind but we're getting a lot of feedback about how we're handling events.

Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your PR description is great, but I would love to see that information to fins its way into the git log. I don't see a lot of point in splitting you changes in three different commits without any description in them on its own. I would rather have you squash them and add the PR info to the commit message.

And yes in general I rather would write one event instead of a second one just add the msg to the existing heal_status event sounds fine to me.


Set the destination of the HealthCheck log. Directory path or events_logger (none use container state file) (Default: none)

* `none`: (default) HealthCheck logs are stored in overlay containers. (For example: `./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logically looking at this I fine the none name extremely confusing, this makes it sound like there will be no logs at all. But it uses the default current location in this case.
Second the mentioned path is not correct, it is only the default runroot for root but then can be changed via many ways so this should say under $runroot/...

Set the destination of the HealthCheck log. Directory path or events_logger (none use container state file) (Default: none)

* `none`: (default) HealthCheck logs are stored in overlay containers. (For example: `./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log`)
* `directory`: creates a log file named `<container-ID>-healthcheck.log` with HealthCheck logs in the specified directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I should have asked on the design doc but is there reason to use a directory here? I think it would make sense to just use the full path as file instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that there are multiple containers with the same path, to avoid mixing health check logs, the log file name is generated automatically. Probably the container name could be used instead of the container-id if it is specific, to improve the user experience. @Luap99 WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is properly fine like this then. Prevent using the same path on multiple container like this seems reasonable. Using the ID will be better than the name as the name may be larger than the 255 char file name limit which would break us.

libpod/events.go Outdated
e.HealthStatus = healthCheckResult.Status
if status == events.HealthLog {
if len(healthCheckResult.Log) > 0 {
log_, err := json.Marshal(healthCheckResult.Log[len(healthCheckResult.Log)-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not use underscore in a variable name

Comment on lines 496 to 499
sort.SliceStable(containerEvents, func(i, j int) bool {
return containerEvents[i].Time.After(containerEvents[j].Time)
})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Events should be in order why are you sorting them again?

@@ -642,6 +642,14 @@ func createContainerOptions(rt *libpod.Runtime, s *specgen.SpecGenerator, pod *l
options = append(options, libpod.WithHealthCheckOnFailureAction(s.ContainerHealthCheckConfig.HealthCheckOnFailureAction))
}

if healthCheckSet {
if s.ContainerHealthCheckConfig.HealthLogDestination != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a user supplied value here and never validate before we create the container which means we get random runtime failures instead so there must be validation here.
In general we should check that the string is either one of the special values or an existing absolute path, the absolute path is import because otherwise things just break weirdly if we or the users changes the cwd.
This also mean we keep the option to add new special value later that do not start with / because with your current version defining a new value could mean breaking a users who used that as relative dir (I know super unlikely).

"type=container",
}

containerEvents, err := c.runtime.GetEvents(context.Background(), filters)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This in general is very inefficient if you have many events as we have to reading every single one of them for every single inspect and healtchcheck update, this will kill performance. On a system with a ton of events this option will be unusable

$ date; bin/podman events --stream=false --filter container=abc  >/dev/null; date
Thu 19 Sep 17:50:48 CEST 2024
Thu 19 Sep 17:51:10 CEST 2024

20s on my laptop, and yes I have close to 300k events but I don't think that is to unusual.

Looking at your design doc you seem to mention that you log to both the journal and file but I don't see that here anymore? I would have assumed that the journal should be write only for this use case. And you keep using the file to read it for inspect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, from reading this - do we support logging to file and journald at the same time? That seems unnecessary - one or the other should be sufficient.

I did it according to @mheon's comment. But yes. I will extend the healt_status event with a log and use a file so that it doesn't affect performance as I wrote it in the design doc.



run_podman inspect $ctrname --format "{{.State.Health.Log}}"
[ $(echo "$output" | grep -o "$msg" - | wc -l) -eq 5 ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not acceptable per @edsantiago's rules, have a look at other tests, the template file and the Readme file in the test/system dir to see how to match things


* `none`: (default) HealthCheck logs are stored in overlay containers. (For example: `./run/containers/storage/overlay-containers/<container-ID>/healthcheck.log`)
* `directory`: creates a log file named `<container-ID>-healthcheck.log` with HealthCheck logs in the specified directory.
* `journal`: The log will be written to the journal. A log with HealthCheck logs will also be written to the default path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not match the description from the cli option.

@Honny1 Honny1 force-pushed the healthcheck-log branch 3 times, most recently from a7320a7 to 33d3069 Compare September 20, 2024 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants