Add command to benchmark Web Vitals via web-vitals library #40

Merged: 12 commits, Mar 1, 2023

Conversation

@felixarntz (Collaborator) commented Feb 17, 2023

This introduces a new benchmark-web-vitals command that, similar to the existing benchmark-url command, allows making one or more requests to one or more URLs, capturing the results, and computing the median for each metric. The notable difference is that this command measures Core Web Vitals and other key client-side metrics, using https://github.com/GoogleChrome/web-vitals.

As part of this PR, the benchmark-url command is renamed to the more appropriate benchmark-server-timing, to clearly differentiate the two commands and clarify their different purposes. They also function quite differently: the existing command simply makes a curl request to the URL, while the new one uses a headless browser.

The initial iteration implemented in this PR only includes load-time metrics (FCP, LCP, and TTFB), as those are the most critical for us to measure and appear to be the easiest to implement. An early draft PR for adding the other Web Vitals (CLS, FID, and INP) exists at #41 and can be explored afterwards as an iteration.
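
For readers who want a feel for the approach before looking at the code, here is a rough sketch of the underlying idea. This is illustrative only, not the actual implementation in cli/commands/benchmark-web-vitals.mjs; it assumes Puppeteer and the web-vitals IIFE bundle from unpkg (which exposes a webVitals global).

    // Illustrative sketch only; not the code added in this PR.
    import puppeteer from 'puppeteer';

    const WEB_VITALS_URL = 'https://unpkg.com/web-vitals@3/dist/web-vitals.iife.js';

    async function measureOnce(url) {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle0' });

      // Inject web-vitals after load; its observers use buffered entries,
      // so TTFB/FCP/LCP that already happened are still reported.
      await page.addScriptTag({ url: WEB_VITALS_URL });
      const metrics = await page.evaluate(
        () =>
          new Promise((resolve) => {
            const results = {};
            const { onTTFB, onFCP, onLCP } = window.webVitals;
            // reportAllChanges makes LCP report candidates without waiting
            // for the page to become hidden.
            const opts = { reportAllChanges: true };
            onTTFB(({ value }) => (results.TTFB = value), opts);
            onFCP(({ value }) => (results.FCP = value), opts);
            onLCP(({ value }) => (results.LCP = value), opts);
            // Give the callbacks a moment to fire, then return what we have.
            setTimeout(() => resolve(results), 3000);
          })
      );

      await browser.close();
      return metrics;
    }

    function median(values) {
      const sorted = [...values].sort((a, b) => a - b);
      const mid = Math.floor(sorted.length / 2);
      return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    // For example: 10 runs against one URL, then the median per metric.
    const runs = [];
    for (let i = 0; i < 10; i++) {
      runs.push(await measureOnce('https://example.com/'));
    }
    for (const metric of ['TTFB', 'FCP', 'LCP']) {
      console.log(metric, median(runs.map((r) => r[metric])));
    }

The actual command supports multiple URLs and a configurable number of runs per the README; the above is just a mental model.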

Testing

You can test this command by following the instructions included in this PR (see https://github.com/GoogleChromeLabs/wpp-research/tree/add/benchmark-web-vitals/cli#benchmark-web-vitals), which also provide examples.

Comment on lines +102 to +105
    * In the future, additional Web Vitals like CLS, FID, and INP should be
    * added, however they are slightly more complex to retrieve through an
    * automated headless browser test.
    * See https://github.com/GoogleChromeLabs/wpp-research/pull/41.
@tunetheweb (Member) commented:

FID and INP are more complex as they require an interaction, but the load CLS should be similar to FCP/LCP/TTFB, shouldn't it?

Granted, that often isn't the full CLS of a page (as it can be impacted by scroll or other interactions), but it's still usually a good part of CLS. And that "load CLS" is what Lighthouse and PSI already capture, despite its limitations.

The only thing is that, kind of like LCP, CLS is not emitted by web-vitals.js until the page is hidden (including navigating away from it). For the web-vitals.js library tests we stub this visibility change:
https://github.com/GoogleChrome/web-vitals/blob/main/test/utils/stubVisibilityChange.js

And then call it like this:

    // Wait until all images are loaded and rendered, then change to hidden.
    await imagesPainted();
    await stubVisibilityChange('hidden');

Maybe that's enough to make it too complicated here?

The alternative is to use the reportAllChanges flag and only report at the end of the test.
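
For anyone following along, a minimal sketch of that reportAllChanges alternative (variable names here are just for illustration) could look like:

    import { onCLS } from 'web-vitals';

    let latestCLS = 0;

    // With reportAllChanges, the callback fires on every CLS update rather
    // than only once the page is hidden.
    onCLS((metric) => {
      latestCLS = metric.value;
    }, { reportAllChanges: true });

    // At the end of the test, read latestCLS instead of waiting for a
    // visibilitychange to 'hidden'.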

@felixarntz (Collaborator, Author) commented:

Thanks @tunetheweb, that is great context. I am a bit unsure about CLS being more similar to the load-time metrics; based on what you're saying it makes sense, but when I initially tried using all of them, I always got data for FID but not CLS.

In any case, due to those additional complexities I think it makes sense to separate the work into more incremental steps, and starting with only FCP/LCP/TTFB works well since those are all load-time metrics and the simplest to implement here. #41 covers the follow-up work, so I'd prefer to handle CLS, FID, and INP there (or maybe even break that down into further separate PRs).

@tunetheweb (Member) commented:

Yeah, sounds like it's due to not changing the visibility, so CLS is not being "finalised". I realised that and edited my comment after writing that first sentence (but before submitting the review).

cli/commands/benchmark-web-vitals.mjs (outdated review thread; resolved)
@felixarntz (Collaborator, Author) commented:

@oandregal Would you mind giving this PR a review? Have you been able to test the command yet?

@vishalkakadiya (Collaborator) previously approved these changes on Feb 28, 2023:

I tested the command with all the options, and it is working great. Approving PR now!

@10upsimon (Collaborator) previously approved these changes on Feb 28, 2023:

I've tested all commands as listed in the branch's README, and all are functioning correctly for me. Approved from my end.

@mukeshpanchal27 (Collaborator) left a comment:

Thanks @felixarntz. Quick review and approving for merge.

@oandregal (Collaborator) commented Mar 7, 2023:

> @oandregal Would you mind giving this PR a review? Have you been able to test the command yet?

Sorry for the late response. I had a use case in mind and only now was able to create a test case: WordPress/gutenberg#48728 (comment)

This command is very useful, thanks!

Some feedback:

LCP-TTFB.

I miss having data for the LCP-TTFB median computed from the LCP-TTFB of each individual result, rather than by subtracting the LCP and TTFB medians, which is what I had to do in the test case above.
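
To illustrate the difference with made-up numbers (this is just a sketch, not output from the command):

    // Hypothetical per-run results; the numbers are invented.
    const runs = [
      { TTFB: 120, LCP: 210 },
      { TTFB: 135, LCP: 205 },
      { TTFB: 125, LCP: 230 },
    ];

    const median = (values) => {
      const sorted = [...values].sort((a, b) => a - b);
      const mid = Math.floor(sorted.length / 2);
      return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    };

    // What the current output lets me compute: difference of the medians.
    const diffOfMedians =
      median(runs.map((r) => r.LCP)) - median(runs.map((r) => r.TTFB)); // 210 - 125 = 85

    // What I would find more useful: median of the per-run differences.
    const medianOfDiffs = median(runs.map((r) => r.LCP - r.TTFB)); // median of [90, 70, 105] = 90

The two values can differ, which is why having the per-result metric reported directly would help.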

Quality check.

This command has shown better data quality than the other approaches I've used. Nice!

I saw you are adding percentiles in #43, and that will be useful! I am still unsure about how best to address the high variability inherent in performance measurements. For example, using 10 runs, I've seen:

  • differences up to 15ms between the p25 and p75 for TTFB and 25ms between the p25 and p75 for LCP
  • differences up to 9ms in the mean of TTFB (123 vs 132)
  • differences up to 10ms in the mean of LCP (210 vs 220)

The data has more variability than I'd like. Finding ways to reduce it would be nice. Otherwise, executing many runs (hundreds) is still my preferred mechanism.
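
As a reference point for what p25/p75 mean across a set of runs, here is a rough sketch (nearest-rank percentiles; #43 may compute them differently):

    // Per-run TTFB values from a hypothetical 10-run invocation (made up).
    const ttfbRuns = [118, 121, 122, 123, 125, 126, 128, 130, 133, 137];

    // Nearest-rank percentile; illustrative only.
    function percentile(values, p) {
      const sorted = [...values].sort((a, b) => a - b);
      const index = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
      return sorted[index];
    }

    console.log(percentile(ttfbRuns, 25), percentile(ttfbRuns, 75)); // 122 and 130 here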

Access to the full dataset.

If I have access to the full dataset, I can do analysis after the fact.

@felixarntz (Collaborator, Author) commented:

Thanks @oandregal for the feedback!

Regarding LCP - TTFB, that's a great idea to add as a metric; I'll take a note to do that (though I'm AFK for the next few days, so it'll take a couple of days). Feel free to open a PR if you have capacity, but otherwise I'll get to it once I'm back.

For the variability, are those ms differences you're sharing between different runs of the same command? For example, one time you ran it with 10 runs and got a certain p25, and another time the p25 was 15ms different? I'm definitely interested in exploring this further to see how we can get those numbers down, though partially it will of course be impossible due to varying machine resources etc. For example, I'd be curious how the results differed for the other percentiles in those specific runs you did. Maybe we need to conduct a more detailed analysis of how the results differ between runs of the tool. It would be great if we could use that to find the "sweet spot" for how many runs to do per invocation of the command.
