feat: collect and cache builtin instructions cost and count per transaction #2692

tao-stones · 2024-08-22T00:18:13Z

Problem

#2561

Summary of Changes

Add a new bench case in which the tx has 355 instructions, all of them builtin instructions (including compute-budget ixs). This is the worst case, since every instruction needs to resolve its cost.
Collect the tx's builtin instruction count and cost; remove the compute_budget_ prefix from the instruction_details name to reflect that the struct caches more than just compute-budget details.
Update the filter to cache resolved builtin instruction costs, avoiding repeated hashing and lookups in BUILTIN_INSTRUCTION_COSTS.

Fixes #2561
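As a rough illustration of the "cache resolved builtin cost" idea in the summary, here is a minimal sketch. It is not the PR's code: `ProgramId`, `BuiltinCache`, `collect_builtin_details`, and the cost table passed in as a `HashMap` are hypothetical stand-ins, and the value 150 is only an example cost. The point is that each program-id index is hashed and looked up at most once per transaction, however many instructions reference it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a 32-byte program id.
type ProgramId = [u8; 32];

#[derive(Clone, Copy, Debug, Default, PartialEq)]
enum BuiltinCheckStatus {
    #[default]
    Unchecked,
    NotBuiltin,
    Builtin { is_compute_budget: bool, default_cost: u32 },
}

/// Per-transaction cache, indexed by an instruction's program-id index.
struct BuiltinCache {
    slots: Vec<BuiltinCheckStatus>,
    lookups: usize, // number of hash lookups performed, for illustration only
}

impl BuiltinCache {
    fn new(num_account_keys: usize) -> Self {
        Self {
            slots: vec![BuiltinCheckStatus::Unchecked; num_account_keys],
            lookups: 0,
        }
    }

    /// Resolves the status for `index`, hitting the cost table only on first use.
    fn resolve(
        &mut self,
        index: usize,
        program_id: &ProgramId,
        costs: &HashMap<ProgramId, (bool, u32)>,
    ) -> BuiltinCheckStatus {
        if self.slots[index] == BuiltinCheckStatus::Unchecked {
            self.lookups += 1;
            self.slots[index] = match costs.get(program_id) {
                Some(&(is_compute_budget, default_cost)) => BuiltinCheckStatus::Builtin {
                    is_compute_budget,
                    default_cost,
                },
                None => BuiltinCheckStatus::NotBuiltin,
            };
        }
        self.slots[index]
    }
}

/// Returns (builtin instruction count, summed default cost, hash lookups).
fn collect_builtin_details(
    account_keys: &[ProgramId],
    program_indexes: &[usize],
    costs: &HashMap<ProgramId, (bool, u32)>,
) -> (u32, u64, usize) {
    let mut cache = BuiltinCache::new(account_keys.len());
    let (mut count, mut cost) = (0u32, 0u64);
    for &i in program_indexes {
        if let BuiltinCheckStatus::Builtin { default_cost, .. } =
            cache.resolve(i, &account_keys[i], costs)
        {
            count += 1;
            cost += u64::from(default_cost);
        }
    }
    (count, cost, cache.lookups)
}

fn main() {
    let system = [1u8; 32]; // pretend builtin
    let custom = [9u8; 32]; // pretend non-builtin program
    let mut costs = HashMap::new();
    costs.insert(system, (false, 150)); // example cost, not taken from the PR
    let keys = [system, custom];
    let ixs = [0usize, 0, 1, 0, 1]; // program-id index of each instruction
    let (count, cost, lookups) = collect_builtin_details(&keys, &ixs, &costs);
    println!("{count} builtin ixs, total default cost {cost}, {lookups} hash lookups");
}
```

The trade-off matches what the benches in this thread show: allocating the per-transaction array costs something even for an empty transaction, but the number of lookups stops growing with the instruction count.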

rename compute_budget_instruction_details to instruction_details as it contains more than just compute-budget ix info;


tao-stones · 2024-08-22T00:31:47Z

The three commits are organized as incremental bench steps: the original code, a simple addition of the needed function, and a perf-optimized version.

Commit 1: bench before the change

     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-c23596bf6c26a34b)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [6.9498 µs 6.9685 µs 6.9868 µs]
                        thrpt:  [146.56 Melem/s 146.95 Melem/s 147.34 Melem/s]
                 change:
                        time:   [+3.1493% +3.4355% +3.7214%] (p = 0.00 < 0.05)
                        thrpt:  [-3.5879% -3.3214% -3.0531%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [12.728 µs 12.744 µs 12.759 µs]
                        thrpt:  [80.259 Melem/s 80.354 Melem/s 80.450 Melem/s]
                 change:
                        time:   [+1.8458% +2.0980% +2.3406%] (p = 0.00 < 0.05)
                        thrpt:  [-2.2870% -2.0549% -1.8124%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  9 (9.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [26.258 µs 26.293 µs 26.332 µs]
                        thrpt:  [38.888 Melem/s 38.945 Melem/s 38.998 Melem/s]
                 change:
                        time:   [+0.8114% +1.0765% +1.3665%] (p = 0.00 < 0.05)
                        thrpt:  [-1.3481% -1.0651% -0.8049%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [16.261 µs 16.285 µs 16.310 µs]
                        thrpt:  [62.782 Melem/s 62.882 Melem/s 62.973 Melem/s]
                 change:
                        time:   [-0.3143% -0.1345% +0.0376%] (p = 0.13 > 0.05)
                        thrpt:  [-0.0376% +0.1347% +0.3153%]
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [496.86 µs 497.54 µs 498.27 µs]
                        thrpt:  [2.0551 Melem/s 2.0581 Melem/s 2.0609 Melem/s]
                 change:
                        time:   [-0.1292% +0.0754% +0.2721%] (p = 0.47 > 0.05)
                        thrpt:  [-0.2714% -0.0753% +0.1293%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [490.41 µs 490.74 µs 491.12 µs]
                        thrpt:  [2.0850 Melem/s 2.0867 Melem/s 2.0880 Melem/s]
                 change:
                        time:   [+0.2029% +0.3755% +0.5466%] (p = 0.00 < 0.05)
                        thrpt:  [-0.5436% -0.3741% -0.2025%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Commit 2: add a function to collect builtin cost. It adds hashing for every program_id. Results: the 0-ix bench regresses a bit due to the added math to calculate the non-compute-budget ix count; hashing makes the 4-ix benches worse and the many-ix benches much worse.

     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-9affcf53954823c1)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [7.8715 µs 7.8805 µs 7.8878 µs]
                        thrpt:  [129.82 Melem/s 129.94 Melem/s 130.09 Melem/s]
                 change:
                        time:   [+17.913% +18.243% +18.527%] (p = 0.00 < 0.05)
                        thrpt:  [-15.631% -15.428% -15.192%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [24.825 µs 24.861 µs 24.901 µs]
                        thrpt:  [41.122 Melem/s 41.189 Melem/s 41.249 Melem/s]
                 change:
                        time:   [+100.37% +100.68% +101.03%] (p = 0.00 < 0.05)
                        thrpt:  [-50.257% -50.170% -50.092%]
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  14 (14.00%) high severe

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [41.173 µs 41.188 µs 41.205 µs]
                        thrpt:  [24.851 Melem/s 24.862 Melem/s 24.871 Melem/s]
                 change:
                        time:   [+59.869% +60.266% +60.630%] (p = 0.00 < 0.05)
                        thrpt:  [-37.745% -37.604% -37.449%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [34.823 µs 34.838 µs 34.857 µs]
                        thrpt:  [29.377 Melem/s 29.393 Melem/s 29.406 Melem/s]
                 change:
                        time:   [+112.60% +113.18% +113.73%] (p = 0.00 < 0.05)
                        thrpt:  [-53.211% -53.091% -52.964%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

Benchmarking bench_process_compute_budget_instructions_mixed/355 mixed instructions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.1s, enable flat sampling, or reduce sample count to 50.
bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [1.8023 ms 1.8044 ms 1.8067 ms]
                        thrpt:  [566.78 Kelem/s 567.51 Kelem/s 568.15 Kelem/s]
                 change:
                        time:   [+258.39% +262.14% +264.37%] (p = 0.00 < 0.05)
                        thrpt:  [-72.555% -72.386% -72.097%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

Benchmarking bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [1.8198 ms 1.8211 ms 1.8227 ms]
                        thrpt:  [561.81 Kelem/s 562.31 Kelem/s 562.69 Kelem/s]
                 change:
                        time:   [+272.41% +272.99% +273.55%] (p = 0.00 < 0.05)
                        thrpt:  [-73.230% -73.190% -73.148%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Commit 3: update the filter to cache resolved builtin ix costs. This adds the cost of allocating a larger array per tx but removes all repeated hashing. Results: the 0-ix bench regressed, the 4-ix benches moved in both directions (roughly -10% to +40% relative to commit 2), and the many-ix benches improved significantly.

     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-b53a7f97abe58117)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [16.950 µs 16.990 µs 17.026 µs]
                        thrpt:  [60.144 Melem/s 60.271 Melem/s 60.412 Melem/s]
                 change:
                        time:   [+114.95% +115.54% +116.15%] (p = 0.00 < 0.05)
                        thrpt:  [-53.735% -53.606% -53.478%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [22.542 µs 22.604 µs 22.656 µs]
                        thrpt:  [45.198 Melem/s 45.302 Melem/s 45.427 Melem/s]
                 change:
                        time:   [-9.8919% -9.6990% -9.4811%] (p = 0.00 < 0.05)
                        thrpt:  [+10.474% +10.741% +10.978%]
                        Performance has improved.

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [48.071 µs 48.132 µs 48.174 µs]
                        thrpt:  [21.256 Melem/s 21.275 Melem/s 21.302 Melem/s]
                 change:
                        time:   [+16.602% +16.779% +16.923%] (p = 0.00 < 0.05)
                        thrpt:  [-14.474% -14.368% -14.238%]
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [48.941 µs 48.973 µs 49.008 µs]
                        thrpt:  [20.894 Melem/s 20.909 Melem/s 20.923 Melem/s]
                 change:
                        time:   [+40.020% +40.242% +40.451%] (p = 0.00 < 0.05)
                        thrpt:  [-28.801% -28.695% -28.582%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [556.23 µs 557.12 µs 558.17 µs]
                        thrpt:  [1.8346 Melem/s 1.8380 Melem/s 1.8410 Melem/s]
                 change:
                        time:   [-69.185% -69.137% -69.087%] (p = 0.00 < 0.05)
                        thrpt:  [+223.49% +224.01% +224.52%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [639.73 µs 641.01 µs 642.66 µs]
                        thrpt:  [1.5934 Melem/s 1.5975 Melem/s 1.6007 Melem/s]
                 change:
                        time:   [-64.939% -64.876% -64.810%] (p = 0.00 < 0.05)
                        thrpt:  [+184.17% +184.71% +185.22%]
                        Performance has improved.

runtime-transaction/benches/process_compute_budget_instructions.rs

apfitzge

Mostly looks good to me, small preference on the nested-enum choice

apfitzge · 2024-08-22T16:21:28Z

runtime-transaction/src/builtin_auxiliary_data_store.rs

+    //   None - un-checked
+    //   Some<None> - checked, not builtin
+    //   Some<Some<(bool, u32)>> - checked, is builtin and (is-compute-budget, default-cost)


It seems to me this would be better represented by a new enum:

    #[derive(Default)]
    enum BuiltinCheckStatus {
        #[default]
        Unchecked,
        NotBuiltin,
        Builtin {
            is_compute_budget: bool,
            default_cost: u32,
        },
    }

I think this saves a byte as well, since we don't need 2 Option discriminants.
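To check the size claim on a given toolchain, a small sketch like the following can print both sizes. The `NestedOption` alias is a name introduced here for the `Option<Option<(bool, u32)>>` shape described in the diff comment. Rust does not guarantee a layout for either type, and niche optimization may make them the same size, so the sketch reports the sizes rather than assuming a saving.

```rust
use std::mem::size_of;

/// Alias introduced for illustration: the nested-Option shape from the diff comment.
type NestedOption = Option<Option<(bool, u32)>>;

#[derive(Default)]
#[allow(dead_code)] // variants are only inspected for their size here
enum BuiltinCheckStatus {
    #[default]
    Unchecked,
    NotBuiltin,
    Builtin {
        is_compute_budget: bool,
        default_cost: u32,
    },
}

fn main() {
    // Sizes depend on the compiler's layout decisions; print rather than assume.
    println!("nested Option: {} bytes", size_of::<NestedOption>());
    println!("enum:          {} bytes", size_of::<BuiltinCheckStatus>());
}
```

Even where the sizes match, the enum names its three states, which is arguably the larger readability win.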

An enum is the way to go.

d76291d

Savings across all benches, due to the smaller memory footprint:

     Running benches/process_compute_budget_instructions.rs (target/release/deps/process_compute_budget_instructions-b53a7f97abe58117)
bench_process_compute_budget_instructions_empty/0 instructions
                        time:   [11.957 µs 11.975 µs 11.995 µs]
                        thrpt:  [85.368 Melem/s 85.513 Melem/s 85.641 Melem/s]
                 change:
                        time:   [-29.911% -29.703% -29.488%] (p = 0.00 < 0.05)
                        thrpt:  [+41.820% +42.253% +42.676%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_instructions_no_builtins/4 dummy Instructions
                        time:   [18.924 µs 18.941 µs 18.959 µs]
                        thrpt:  [54.011 Melem/s 54.062 Melem/s 54.112 Melem/s]
                 change:
                        time:   [-15.416% -15.210% -15.005%] (p = 0.00 < 0.05)
                        thrpt:  [+17.654% +17.938% +18.225%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

bench_process_compute_budget_instructions_compute_budgets/4 compute-budget instructions
                        time:   [37.252 µs 37.332 µs 37.424 µs]
                        thrpt:  [27.362 Melem/s 27.430 Melem/s 27.488 Melem/s]
                 change:
                        time:   [-21.462% -21.177% -20.898%] (p = 0.00 < 0.05)
                        thrpt:  [+26.419% +26.867% +27.328%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bench_process_compute_budget_instructions_builtins/4 dummy builtins
                        time:   [41.314 µs 41.521 µs 41.730 µs]
                        thrpt:  [24.539 Melem/s 24.662 Melem/s 24.786 Melem/s]
                 change:
                        time:   [-8.4225% -8.0835% -7.7755%] (p = 0.00 < 0.05)
                        thrpt:  [+8.4311% +8.7944% +9.1972%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  18 (18.00%) high mild

bench_process_compute_budget_instructions_mixed/355 mixed instructions
                        time:   [539.98 µs 540.43 µs 540.90 µs]
                        thrpt:  [1.8931 Melem/s 1.8948 Melem/s 1.8964 Melem/s]
                 change:
                        time:   [-2.9610% -2.8023% -2.6445%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7164% +2.8831% +3.0513%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

bench_process_compute_budget_and_transfer_only/355 transfer instructions and compute budget ixs
                        time:   [607.07 µs 607.59 µs 608.19 µs]
                        thrpt:  [1.6837 Melem/s 1.6853 Melem/s 1.6868 Melem/s]
                 change:
                        time:   [-4.9817% -4.8051% -4.6671%] (p = 0.00 < 0.05)
                        thrpt:  [+4.8956% +5.0477% +5.2429%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

jstarry · 2024-08-23T08:48:56Z

Implementation with the aux cache looks much better than what you had before! But shouldn't this code be behind a feature gate and shouldn't we at least have a SIMD written up describing the intended feature gated change in behavior?

apfitzge · 2024-08-23T14:15:35Z

Implementation with the aux cache looks much better than what you had before! But shouldn't this code be behind a feature gate and shouldn't we at least have a SIMD written up describing the intended feature gated change in behavior?

This doesn't change the behavior yet though, right? It's caching this data because we plan to use it, but the compute-budget-details we get from the sanitize_compute... function is unchanged by this (afaict).

tao-stones · 2024-08-23T19:10:53Z

Implementation with the aux cache looks much better than what you had before! But shouldn't this code be behind a feature gate and shouldn't we at least have a SIMD written up describing the intended feature gated change in behavior?

This doesn't change the behavior yet though, right? It's caching this data because we plan to use it, but the compute-budget-details we get from the sanitize_compute... function is unchanged by this (afaict).

Yes. There is no behavior change in this PR; it only adds the "collect and cache" function. The follow-up PR will use the cached builtin cost, which will change behavior. Its feature gate: #2562

I didn't create a SIMD because this feature gate isn't to change protocol, but to fix a bug; the bug being "compute budget allocates 200K per builtin, yet only consume its default cost; except for compute-budget instructions, that it does not allocate units but still consume its default cost".

jstarry · 2024-08-24T02:01:23Z

This doesn't change the behavior yet though, right? It's caching this data because we plan to use it, but the compute-budget-details we get from the sanitize_compute... function is unchanged by this (afaict).

There's a non-zero perf hit, which I see as a behavior change. There's no reason to cache this data when the feature isn't enabled right? But if it's too difficult to put this new caching behind a feature gate, maybe it's fine to keep it as is.

I didn't create a SIMD because this feature gate isn't to change protocol, but to fix a bug; the bug being "compute budget allocates 200K per builtin, yet only consume its default cost; except for compute-budget instructions, that it does not allocate units but still consume its default cost".

This is a protocol change to fix a bug. If firedancer isn't aware of this protocol change they could process transactions differently. Imagine a transaction doesn't set a compute limit but relies on the fact that adding a few builtin instructions to their transaction will increase their tx compute limit which is used fully by an invocation to a custom program. The transaction would succeed before the feature gate and would fail after the feature gate leading to a divergence if all clients aren't in sync for implementation. Given that this feature needs coordination between client teams and that it could break downstream users, I think we should have a SIMD to discuss.
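The divergence scenario described above can be put into numbers with a small sketch. Everything here is an assumption made for illustration: the 200K figure comes from this thread, the 1.4M cap and the example builtin cost of 150 are assumed values, and `limit_after` is only one guess at what a fix might allocate (the thread does not specify it). Compute-budget instructions are left out of both counts.

```rust
/// "200K per builtin" as quoted in this thread; applied per non-compute-budget ix.
const DEFAULT_UNITS_PER_INSTRUCTION: u32 = 200_000;
/// Assumed per-transaction cap on the compute unit limit.
const ASSUMED_MAX_COMPUTE_UNIT_LIMIT: u32 = 1_400_000;
/// Hypothetical default cost of one builtin instruction, for illustration only.
const HYPOTHETICAL_BUILTIN_COST: u32 = 150;

/// Default limit when every non-compute-budget instruction is allocated 200K.
fn limit_before(num_builtin: u32, num_other: u32) -> u32 {
    (num_builtin + num_other)
        .saturating_mul(DEFAULT_UNITS_PER_INSTRUCTION)
        .min(ASSUMED_MAX_COMPUTE_UNIT_LIMIT)
}

/// One possible post-fix rule (an assumption): builtins get only their default cost.
fn limit_after(num_builtin: u32, num_other: u32) -> u32 {
    num_builtin
        .saturating_mul(HYPOTHETICAL_BUILTIN_COST)
        .saturating_add(num_other.saturating_mul(DEFAULT_UNITS_PER_INSTRUCTION))
        .min(ASSUMED_MAX_COMPUTE_UNIT_LIMIT)
}

fn main() {
    // Two builtin ixs pad the limit; one custom-program ix actually needs 500K units.
    let needed = 500_000;
    let before = limit_before(2, 1);
    let after = limit_after(2, 1);
    println!("before: limit {before}, succeeds: {}", before >= needed);
    println!("after:  limit {after}, succeeds: {}", after >= needed);
}
```

Under these assumptions the same transaction fits within its default limit before the change and exceeds it afterwards, which is the kind of client divergence the comment warns about.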

tao-stones · 2024-08-26T15:40:34Z

I synced with FD previously. FD planned to rebase their cost model implementation after this fix is in (they currently use the agave runtime and its cost model implementation, but do not handle adjust-up, so they over-pack in some cases).

Just chatted with Philip. Considering the current schedule, it makes more sense to do all of that after Breakpoint. In that case, a SIMD would be very helpful to document the change and perhaps discuss other possible solutions. I'll open one, then link it to the feature gate issue #2562.

As for this PR, wdyt about merging it if it has no other open issues of its own?

jstarry · 2024-08-27T08:55:04Z

I really don't think it makes sense to merge yet, what's the rush?

tao-stones · 2024-08-27T16:57:47Z

I really don't think it makes sense to merge yet, what's the rush?

I have a few PRs after this, but I can reorg my pipeline. Let's keep this open while SIMD solana-foundation/solana-improvement-documents#170 is being discussed.

apfitzge · 2024-09-09T16:09:20Z

I think rather than waiting on the SIMD process, which could take a while, how about we split this up?

Keep the old version so that we can get the compute budget without the builtin-check overhead, and also add this new version, which has the builtins checked.

Ideally we'd have a cost-model fn that we could pass this new struct into (with feature_set) so we can get the cost without doing separate scans for compute-budget AND builtins. Having that would also help the transition to the new tx type and runtime-transaction.
Once runtime transaction is used, we can remove the old version since it's no longer accessed, and this more detailed meta info will be cached so we only calculate it once.

@jstarry @tao-stones does that seem reasonable to you?

tao-stones added 3 commits August 21, 2024 16:49

add another bench full of builtins

d38228d

collect builtin instructions cost and count;

28e6492

rename compute_budget_instruction_details to instruction_details as it contains more than just compute-budget ix info;

rename Filter to builtin_auxiliary_data_store, caches looked-up built…

24ccfdb

…in cost

tao-stones requested review from apfitzge and jstarry August 22, 2024 15:48

apfitzge reviewed Aug 22, 2024

runtime-transaction/benches/process_compute_budget_instructions.rs

apfitzge reviewed Aug 22, 2024

add enum BuiltinCheckStatus

d76291d

tao-stones requested a review from apfitzge August 22, 2024 18:49
