Show aggregate health status in the dashboard #5770

Merged
merged 3 commits into from
Sep 19, 2024

Conversation

davidfowl
Member

@davidfowl davidfowl commented Sep 18, 2024

Description

Added the aggregated health state to the resource server protocol to enable showing it in the dashboard.

Contributes to #5569

healthstate.mp4

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?
    • Yes
      • Link to aspire-docs issue:
    • No

@@ -170,6 +170,16 @@ message Resource {

// The set of volumes mounted to the resource. Only applies to containers.
repeated Volume volumes = 15;

// The aggregate health state of the resource
optional HealthStateKind HealthState = 16;
Member

@JamesNK JamesNK Sep 18, 2024

Since this part is public API, I think we should do the right thing here and have complete health state information. In other words, have a list of the different health checks and what their status is.

e.g.

repeated HealthCheck health_checks = 16;

message HealthCheck {
    string name = 1;
    HealthStateKind state = 2;
}

It will be simple to use this information to figure out the aggregate health state. And in the future it will be easier to add more detailed information about the health status, such as a list in the resource details, or a new view.
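To illustrate how an aggregate could be derived from such a list, here is a minimal sketch in Python rather than the project's C#. The enum members and their ordering (lower value means worse health) are assumptions made for illustration, as is the choice to return `None` for a resource with no health checks; `aggregate_health` is a hypothetical helper, not part of the PR.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Sequence


class HealthStateKind(IntEnum):
    # Assumed members and ordering: a lower value means worse health.
    UNHEALTHY = 0
    DEGRADED = 1
    HEALTHY = 2


@dataclass(frozen=True)
class HealthCheck:
    # Mirrors the proposed proto message: a name plus a state.
    name: str
    state: HealthStateKind


def aggregate_health(checks: Sequence[HealthCheck]) -> Optional[HealthStateKind]:
    """Return the worst state among the checks, or None if there are none."""
    if not checks:
        # Assumption: a resource without health checks has no aggregate state.
        return None
    return min(check.state for check in checks)
```

Under these assumptions, one degraded check among otherwise healthy ones makes the aggregate degraded, and any unhealthy check makes it unhealthy.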

Member Author

@davidfowl davidfowl Sep 19, 2024

Do you want the dashboard to compute the aggregate? Don't you want to avoid that computation for the overview? That requires the dashboard to look at health changes per resource and track the aggregate.

Member

@JamesNK JamesNK Sep 19, 2024

I don't see a problem with the dashboard calculating it. When the dashboard builds the ResourceViewModel it can loop over the health checks and set a property.

Member Author

I think we want a single source of truth. The resource server in the case of the apphost needs the aggregate as well (it's how WaitFor works), so it makes sense to avoid recomputing it unnecessarily. I think we should send the aggregate and defer the health check detail for when we invent UI to show them.

Member

@drewnoakes drewnoakes left a comment

Looks good to me.

src/Aspire.Hosting/Dashboard/proto/Partials.cs (outdated, resolved)
src/Aspire.Hosting/Dashboard/ResourceSnapshot.cs (outdated, resolved)
src/Aspire.Dashboard/Model/ResourceViewModel.cs (outdated, resolved)
@davidfowl davidfowl changed the title WIP: Health checks in UI Show aggregate health status in the dashboard Sep 19, 2024
@davidfowl davidfowl added area-dashboard area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication labels Sep 19, 2024
@davidfowl davidfowl marked this pull request as ready for review September 19, 2024 05:39
@JamesNK
Member

JamesNK commented Sep 19, 2024

I pulled the branch to try it out.

My initial reaction when I see this in the dashboard is why do some resources say they're ready, and others don't?

image

IMO, don't display (Ready) if the resource is healthy. Conceptually a resource without health checks and a resource with passing health checks are both ready to use. I think they should both just show Running.
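The display rule suggested above could look roughly like the following Python sketch. The function name `state_label`, the string values, and the exact "(Not ready)" suffix are hypothetical and chosen only for illustration; the dashboard's real text and types may differ.

```python
from typing import Optional


def state_label(state: str, health: Optional[str]) -> str:
    """Build the state text for a resource row.

    Assumptions: `state` is the resource state (e.g. "Running") and `health`
    is the aggregate health state, or None when the resource has no health
    checks. A healthy resource and a resource without checks read the same.
    """
    if state == "Running" and health is not None and health != "Healthy":
        # Only call out the case where the resource is not yet usable.
        return "Running (Not ready)"
    return state
```

With this rule, a resource with passing health checks and one with no health checks both show `Running`, and only a running resource whose aggregate is not healthy gets the extra marker.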

@davidfowl
Member Author

So we only display "Not ready" and never "Ready".

@JamesNK
Member

JamesNK commented Sep 19, 2024

Yes. I think it is cleaner UI, and IMO it's less confusing.

As always, if someone wants more details then they can open the details view.

@JamesNK
Member

JamesNK commented Sep 19, 2024

Mobile view:

image

I think this is fine. Text is cut off but the icon shows that it isn't ready. The user can view details to get more info.

Member

@JamesNK JamesNK left a comment

LGTM in the resource service and dashboard layers

…enable showing it in the dashboard.

- Don't show healthy/unhealthy, just use the health state to show readiness in the dashboard UX.
- Use an internal annotation to keep track of DCP resource names for state updates.
- Publish health state updates to instances as well as the main resource.
@davidfowl
Member Author

@Alirexaa the pgweb test failed here:

[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] Aspire.Hosting.Tests.Resources.pg1-pgweb Information: 3: 2024-09-19T08:03:19.6335235Z Starting server...
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] Aspire.Hosting.Tests.Resources.pg1-pgweb Information: 4: 2024-09-19T08:03:19.6335934Z To view database open http://0.0.0.0:8081/ in browser
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] System.Net.Http.HttpClient.Default.LogicalHandler Information: Start processing HTTP request POST http://localhost:34781/api/connect
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] System.Net.Http.HttpClient.Default.ClientHandler Information: Sending HTTP request POST http://localhost:34781/api/connect
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] System.Net.Http.HttpClient.Default.ClientHandler Information: Received HTTP response headers after 7.4991ms - 400
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] Polly Information: Execution attempt. Source: '-standard//Standard-Retry', Operation Key: '', Result: '400', Handled: 'False', Attempt: '0', Execution Time: 8.0316ms
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] System.Net.Http.HttpClient.Default.LogicalHandler Information: End processing HTTP request after 8.7504ms - 400
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] Aspire.Hosting.Dcp.ApplicationExecutor Debug: Log streaming for pg1-ygakhxwc-3609f899 was cancelled.
[xUnit.net 00:02:49.46]         | [2024-09-19T08:03:20] Aspire.Hosting.Dcp.ApplicationExecutor Debug: Log streaming for pg1-pgweb-sbwkwmgr-3609f899 was cancelled.
[xUnit.net 00:02:49.47]   Finished:    Aspire.Hosting.PostgreSQL.Tests
Data collector 'Blame' message: All tests finished running, Sequence file will not be generated.
  Failed Aspire.Hosting.PostgreSQL.Tests.PostgresFunctionalTests.VerifyWithPgWeb [9 s]
  Error Message:
   System.Net.Http.HttpRequestException : Response status code does not indicate success: 400 (Bad Request).

I think we need a better wait for and potentially more retries.
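As a rough illustration of the "more retries" idea, here is a small Python sketch that repeats a request until it returns a 2xx status or the attempts run out. The helper `retry_until_ok` and its parameters are hypothetical; this is not the test's actual code, which relies on the .NET resilience pipeline visible in the log.

```python
import time
from typing import Callable


def retry_until_ok(send: Callable[[], int], attempts: int = 5,
                   delay_seconds: float = 1.0) -> int:
    """Call `send` until it returns a 2xx status code or attempts run out.

    `send` is assumed to perform one request and return its HTTP status.
    The last status seen is returned so the caller can assert on it.
    """
    status = send()
    for _ in range(attempts - 1):
        if 200 <= status < 300:
            break
        # Give the server a moment to finish starting before trying again.
        time.sleep(delay_seconds)
        status = send()
    return status
```

In the log above the 400 response was reported as `Handled: 'False'`, so a retry of this kind would have to treat that status as retryable explicitly.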

@davidfowl davidfowl enabled auto-merge (squash) September 19, 2024 08:58
@Alirexaa
Contributor

I will check.

@davidfowl davidfowl merged commit 3bcf24f into main Sep 19, 2024
11 checks passed
@davidfowl davidfowl deleted the davidfowl/health-check-ui branch September 19, 2024 15:35
Labels
area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication area-dashboard
5 participants