Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test-utils] Pretty-print Nexus config parse error #6500

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Aug 31, 2024

Presently, errors parsing the test Nexus config are printed using
fmt::Debug, due to the use of expect(). The Debug format for
TomlError is...pretty hard to read, since it looks like this:

EXTREMELY UGLY WALL OF TEXT, LOOK AT YOUR OWN RISK
--- STDERR:              nexus-reconfigurator-execution omicron_zones::test::test_clean_up_cockroach_zones ---
thread 'omicron_zones::test::test_clean_up_cockroach_zones' panicked at nexus/test-utils/src/lib.rs:186:10:
failed to load config.test.toml: LoadError { path: "tests/config.test.toml", kind: Parse(Error { inner: Error { inner: TomlError { message: "missing field `region_snapshot_replacement_finish`", raw: Some("#\n# Oxide API: configuration file for test suite\n#\n\n[console]\n# Directory for static assets. Absolute path or relative to CWD.\nstatic_dir = \"tests/static\"\nsession_idle_timeout_minutes = 480 # 8 hours\nsession_absolute_timeout_minutes = 1440 # 24 hours\n\n# List of authentication schemes to support.\n[authn]\nschemes_external = [\"spoof\", \"session_cookie\"]\n\n#\n# NOTE: for the test suite, if mode = \"file\", the file path MUST be the sentinel\n# string \"UNUSED\".  The actual path will be generated by the test suite for each\n# test.\n#\n[log]\nlevel = \"trace\"\nmode = \"file\"\nif_exists = \"fail\"\npath = \"UNUSED\"\n\n# Configuration for interacting with the timeseries database. This is overwritten\n# by the test suite once ClickHouse starts, with the actual address on which it\n# is listening.\n[timeseries_db]\naddress = \"[::1]:0\"\n\n# Tunable parameters used during tests\n[tunables]\n# Allow small subnets, so we can test IP address exhaustion easily / quickly\nmax_vpc_ipv4_subnet_prefix = 29\n\n[deployment]\n# Identifier for this instance of Nexus.\n# NOTE: The test suite always overrides this.\nid = \"913233fe-92a8-4635-9572-183f495429c4\"\nrack_id = \"c19a698f-c6f9-4a17-ae30-20d711b8f7dc\"\ntechport_external_server_port = 0\n\n# Nexus may need to resolve external hosts (e.g. to grab IdP metadata).\n# These are the DNS servers it should use.\nexternal_dns_servers = [\"1.1.1.1\", \"9.9.9.9\"]\n\n[deployment.dropshot_external]\n# NOTE: for the test suite, the port MUST be 0 (in order to bind to any\n# available port) because the test suite will be running many servers\n# concurrently.\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n[deployment.dropshot_internal]\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n#\n# NOTE: for the test suite, the internal DNS address will be replaced with one\n# that's started by the test runner.\n#\n[deployment.internal_dns]\ntype = \"from_subnet\"\nsubnet.net = \"fd00:1122:3344:0100::/56\"\n\n#\n# NOTE: for the test suite, the database URL will be replaced with one\n# appropriate for the database that's started by the test runner.\n#\n[deployment.database]\ntype = \"from_url\"\nurl = \"postgresql://[email protected]:0/omicron?sslmode=disable\"\n\n# Dendrite (dataplane daemon) API configuration\n[dendrite.switch0]\naddress = \"[::1]:12224\"\n\n[background_tasks]\ndns_internal.period_secs_config = 60\ndns_internal.period_secs_servers = 60\ndns_internal.period_secs_propagation = 60\ndns_internal.max_concurrent_server_updates = 5\ndns_external.period_secs_config = 60\ndns_external.period_secs_servers = 60\ndns_external.period_secs_propagation = 60\ndns_external.max_concurrent_server_updates = 5\nmetrics_producer_gc.period_secs = 60\n# How frequently we check the list of stored TLS certificates.  This is\n# approximately an upper bound on how soon after updating the list of\n# certificates it will take _other_ Nexus instances to notice and stop serving\n# them (on a sunny day).\nexternal_endpoints.period_secs = 60\nnat_cleanup.period_secs = 30\nbfd_manager.period_secs = 30\n# How frequently to collect hardware/software inventory from the whole system\n# (even if we don't have reason to believe anything has changed).\ninventory.period_secs = 600\n# Maximum number of past collections to keep in the database\ninventory.nkeep = 3\n# Disable inventory collection altogether (for emergencies)\ninventory.disable = false\nphantom_disks.period_secs = 30\nphysical_disk_adoption.period_secs = 30\n# Disable automatic disk adoption to avoid interfering with tests.\nphysical_disk_adoption.disable = true\ndecommissioned_disk_cleaner.period_secs = 60\n# Disable disk decommissioning cleanup to avoid interfering with tests.\ndecommissioned_disk_cleaner.disable = true\nblueprints.period_secs_load = 100\nblueprints.period_secs_execute = 600\nblueprints.period_secs_collect_crdb_node_ids = 600\nsync_service_zone_nat.period_secs = 30\nswitch_port_settings_manager.period_secs = 30\nregion_replacement.period_secs = 30\n# The driver task should wake up frequently, something like every 10 seconds.\n# however, if it's this low it affects the test_omdb_success_cases test output.\n# keep this 30 seconds, so that the test shows \"triggered by an explicit\n# signal\" instead of \"triggered by a periodic timer firing\"\nregion_replacement_driver.period_secs = 30\ninstance_watcher.period_secs = 30\nservice_firewall_propagation.period_secs = 300\nv2p_mapping_propagation.period_secs = 30\nabandoned_vmm_reaper.period_secs = 60\nsaga_recovery.period_secs = 600\nlookup_region_port.period_secs = 60\n# The purpose of the `instance-updater` background task is to ensure that update\n# sagas are always *eventually* started for instances whose database state has\n# changed, even if the update saga was not started by the Nexus replica handling\n# an update from sled-agent. This is to ensure that updates are performed even\n# in cases where a Nexus crashes or otherwise disappears between when the\n# updated VMM and migration state is written to CRDB and when the resulting\n# update saga actually starts executing. However, we would prefer update sagas\n# to be executed in a timely manner, so for integration tests, we don't want to\n# *rely* on the instance-updater background task for running these sagas.\n#\n# Therefore, disable the background task during tests.\ninstance_updater.disable = true\ninstance_updater.period_secs = 60\nregion_snapshot_replacement_start.period_secs = 30\nregion_snapshot_replacement_garbage_collection.period_secs = 30\nregion_snapshot_replacement_step.period_secs = 30\n\n[default_region_allocation_strategy]\n# we only have one sled in the test environment, so we need to use the\n# `Random` strategy, instead of `RandomWithDistinctSleds`\ntype = \"random\"\n"), keys: [], span: Some(0..0) } } }) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This commit changes the load_test_config function to match the result
and use the panic! macro to format the error using fmt::Display if
the test config file cannot be parsed. This should make debugging
malformed configs way nicer.

I previously made the same change for the
gateway-test-utils::setup::load_test_config function in commit
3b89711

Presently, errors parsing the test Nexus config are printed using
`fmt::Debug`, due to the use of `expect()`. The `Debug` format for
`TomlError` is...pretty hard to read, since it looks like this:

<details>
<summary>EXTREMELY UGLY WALL OF TEXT, LOOK AT YOUR OWN RISK</summary>

```
--- STDERR:              nexus-reconfigurator-execution omicron_zones::test::test_clean_up_cockroach_zones ---
thread 'omicron_zones::test::test_clean_up_cockroach_zones' panicked at nexus/test-utils/src/lib.rs:186:10:
failed to load config.test.toml: LoadError { path: "tests/config.test.toml", kind: Parse(Error { inner: Error { inner: TomlError { message: "missing field `region_snapshot_replacement_finish`", raw: Some("#\n# Oxide API: configuration file for test suite\n#\n\n[console]\n# Directory for static assets. Absolute path or relative to CWD.\nstatic_dir = \"tests/static\"\nsession_idle_timeout_minutes = 480 # 8 hours\nsession_absolute_timeout_minutes = 1440 # 24 hours\n\n# List of authentication schemes to support.\n[authn]\nschemes_external = [\"spoof\", \"session_cookie\"]\n\n#\n# NOTE: for the test suite, if mode = \"file\", the file path MUST be the sentinel\n# string \"UNUSED\".  The actual path will be generated by the test suite for each\n# test.\n#\n[log]\nlevel = \"trace\"\nmode = \"file\"\nif_exists = \"fail\"\npath = \"UNUSED\"\n\n# Configuration for interacting with the timeseries database. This is overwritten\n# by the test suite once ClickHouse starts, with the actual address on which it\n# is listening.\n[timeseries_db]\naddress = \"[::1]:0\"\n\n# Tunable parameters used during tests\n[tunables]\n# Allow small subnets, so we can test IP address exhaustion easily / quickly\nmax_vpc_ipv4_subnet_prefix = 29\n\n[deployment]\n# Identifier for this instance of Nexus.\n# NOTE: The test suite always overrides this.\nid = \"913233fe-92a8-4635-9572-183f495429c4\"\nrack_id = \"c19a698f-c6f9-4a17-ae30-20d711b8f7dc\"\ntechport_external_server_port = 0\n\n# Nexus may need to resolve external hosts (e.g. to grab IdP metadata).\n# These are the DNS servers it should use.\nexternal_dns_servers = [\"1.1.1.1\", \"9.9.9.9\"]\n\n[deployment.dropshot_external]\n# NOTE: for the test suite, the port MUST be 0 (in order to bind to any\n# available port) because the test suite will be running many servers\n# concurrently.\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n[deployment.dropshot_internal]\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n#\n# NOTE: for the test suite, the internal DNS address will be replaced with one\n# that's started by the test runner.\n#\n[deployment.internal_dns]\ntype = \"from_subnet\"\nsubnet.net = \"fd00:1122:3344:0100::/56\"\n\n#\n# NOTE: for the test suite, the database URL will be replaced with one\n# appropriate for the database that's started by the test runner.\n#\n[deployment.database]\ntype = \"from_url\"\nurl = \"postgresql://[email protected]:0/omicron?sslmode=disable\"\n\n# Dendrite (dataplane daemon) API configuration\n[dendrite.switch0]\naddress = \"[::1]:12224\"\n\n[background_tasks]\ndns_internal.period_secs_config = 60\ndns_internal.period_secs_servers = 60\ndns_internal.period_secs_propagation = 60\ndns_internal.max_concurrent_server_updates = 5\ndns_external.period_secs_config = 60\ndns_external.period_secs_servers = 60\ndns_external.period_secs_propagation = 60\ndns_external.max_concurrent_server_updates = 5\nmetrics_producer_gc.period_secs = 60\n# How frequently we check the list of stored TLS certificates.  This is\n# approximately an upper bound on how soon after updating the list of\n# certificates it will take _other_ Nexus instances to notice and stop serving\n# them (on a sunny day).\nexternal_endpoints.period_secs = 60\nnat_cleanup.period_secs = 30\nbfd_manager.period_secs = 30\n# How frequently to collect hardware/software inventory from the whole system\n# (even if we don't have reason to believe anything has changed).\ninventory.period_secs = 600\n# Maximum number of past collections to keep in the database\ninventory.nkeep = 3\n# Disable inventory collection altogether (for emergencies)\ninventory.disable = false\nphantom_disks.period_secs = 30\nphysical_disk_adoption.period_secs = 30\n# Disable automatic disk adoption to avoid interfering with tests.\nphysical_disk_adoption.disable = true\ndecommissioned_disk_cleaner.period_secs = 60\n# Disable disk decommissioning cleanup to avoid interfering with tests.\ndecommissioned_disk_cleaner.disable = true\nblueprints.period_secs_load = 100\nblueprints.period_secs_execute = 600\nblueprints.period_secs_collect_crdb_node_ids = 600\nsync_service_zone_nat.period_secs = 30\nswitch_port_settings_manager.period_secs = 30\nregion_replacement.period_secs = 30\n# The driver task should wake up frequently, something like every 10 seconds.\n# however, if it's this low it affects the test_omdb_success_cases test output.\n# keep this 30 seconds, so that the test shows \"triggered by an explicit\n# signal\" instead of \"triggered by a periodic timer firing\"\nregion_replacement_driver.period_secs = 30\ninstance_watcher.period_secs = 30\nservice_firewall_propagation.period_secs = 300\nv2p_mapping_propagation.period_secs = 30\nabandoned_vmm_reaper.period_secs = 60\nsaga_recovery.period_secs = 600\nlookup_region_port.period_secs = 60\n# The purpose of the `instance-updater` background task is to ensure that update\n# sagas are always *eventually* started for instances whose database state has\n# changed, even if the update saga was not started by the Nexus replica handling\n# an update from sled-agent. This is to ensure that updates are performed even\n# in cases where a Nexus crashes or otherwise disappears between when the\n# updated VMM and migration state is written to CRDB and when the resulting\n# update saga actually starts executing. However, we would prefer update sagas\n# to be executed in a timely manner, so for integration tests, we don't want to\n# *rely* on the instance-updater background task for running these sagas.\n#\n# Therefore, disable the background task during tests.\ninstance_updater.disable = true\ninstance_updater.period_secs = 60\nregion_snapshot_replacement_start.period_secs = 30\nregion_snapshot_replacement_garbage_collection.period_secs = 30\nregion_snapshot_replacement_step.period_secs = 30\n\n[default_region_allocation_strategy]\n# we only have one sled in the test environment, so we need to use the\n# `Random` strategy, instead of `RandomWithDistinctSleds`\ntype = \"random\"\n"), keys: [], span: Some(0..0) } } }) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

</details>

This commit changes the `load_test_config` function to match the result
and use the `panic!` macro to format the error using `fmt::Display` if
the test config file cannot be parsed. This should make debugging
malformed configs way nicer.

I previously made the same change for the
`gateway-test-utils::setup::load_test_config` function in commit
3b89711
@sunshowers
Copy link
Contributor

Should we also display the chain of errors?

@hawkw
Copy link
Member Author

hawkw commented Sep 1, 2024

Should we also display the chain of errors?

Agh, I thought this would do that...i always forget which format specifier I need to print the whole chain, is it {:#}?

@davepacheco
Copy link
Collaborator

Should we also display the chain of errors?

Agh, I thought this would do that...i always forget which format specifier I need to print the whole chain, is it {:#}?

I think {:#} does for anyhow::Error but I'm not sure how universal that is. slog-error-chain is really nice (even if you're not using slog). It looks to me like nexus_config::LoadError already includes the cause's message in its own message, which means you'd get the "double-speak" problem mentioned in the slog-error-chain README. But on the other hand, that might mean you don't have to do anything at all here to get the error chain.

@hawkw
Copy link
Member Author

hawkw commented Sep 3, 2024

Should we also display the chain of errors?

Agh, I thought this would do that...i always forget which format specifier I need to print the whole chain, is it {:#}?

I think {:#} does for anyhow::Error but I'm not sure how universal that is. slog-error-chain is really nice (even if you're not using slog). It looks to me like nexus_config::LoadError already includes the cause's message in its own message, which means you'd get the "double-speak" problem mentioned in the slog-error-chain README. But on the other hand, that might mean you don't have to do anything at all here to get the error chain.

Ah, yup, since the LoadError type repeats the cause, I think the current format specifier is fine. FWIW, we use {} for config load errors in the gateway_test_utils::setup::load_test_config function, since the MGS config errors also include the immediate cause, which is why I used the same thing here.

@sunshowers
Copy link
Contributor

Yeah -- you can use display-error-chain, and we also have an inline error chain somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants