[test-utils] Pretty-print Nexus config parse error #6500

hawkw · 2024-08-31T16:23:29Z

Presently, errors parsing the test Nexus config are printed using
fmt::Debug, due to the use of expect(). The Debug format for
TomlError is...pretty hard to read, since it looks like this:

EXTREMELY UGLY WALL OF TEXT, LOOK AT YOUR OWN RISK

--- STDERR:              nexus-reconfigurator-execution omicron_zones::test::test_clean_up_cockroach_zones ---
thread 'omicron_zones::test::test_clean_up_cockroach_zones' panicked at nexus/test-utils/src/lib.rs:186:10:
failed to load config.test.toml: LoadError { path: "tests/config.test.toml", kind: Parse(Error { inner: Error { inner: TomlError { message: "missing field `region_snapshot_replacement_finish`", raw: Some("#\n# Oxide API: configuration file for test suite\n#\n\n[console]\n# Directory for static assets. Absolute path or relative to CWD.\nstatic_dir = \"tests/static\"\nsession_idle_timeout_minutes = 480 # 8 hours\nsession_absolute_timeout_minutes = 1440 # 24 hours\n\n# List of authentication schemes to support.\n[authn]\nschemes_external = [\"spoof\", \"session_cookie\"]\n\n#\n# NOTE: for the test suite, if mode = \"file\", the file path MUST be the sentinel\n# string \"UNUSED\".  The actual path will be generated by the test suite for each\n# test.\n#\n[log]\nlevel = \"trace\"\nmode = \"file\"\nif_exists = \"fail\"\npath = \"UNUSED\"\n\n# Configuration for interacting with the timeseries database. This is overwritten\n# by the test suite once ClickHouse starts, with the actual address on which it\n# is listening.\n[timeseries_db]\naddress = \"[::1]:0\"\n\n# Tunable parameters used during tests\n[tunables]\n# Allow small subnets, so we can test IP address exhaustion easily / quickly\nmax_vpc_ipv4_subnet_prefix = 29\n\n[deployment]\n# Identifier for this instance of Nexus.\n# NOTE: The test suite always overrides this.\nid = \"913233fe-92a8-4635-9572-183f495429c4\"\nrack_id = \"c19a698f-c6f9-4a17-ae30-20d711b8f7dc\"\ntechport_external_server_port = 0\n\n# Nexus may need to resolve external hosts (e.g. to grab IdP metadata).\n# These are the DNS servers it should use.\nexternal_dns_servers = [\"1.1.1.1\", \"9.9.9.9\"]\n\n[deployment.dropshot_external]\n# NOTE: for the test suite, the port MUST be 0 (in order to bind to any\n# available port) because the test suite will be running many servers\n# concurrently.\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n[deployment.dropshot_internal]\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n#\n# NOTE: for the test suite, the internal DNS address will be replaced with one\n# that's started by the test runner.\n#\n[deployment.internal_dns]\ntype = \"from_subnet\"\nsubnet.net = \"fd00:1122:3344:0100::/56\"\n\n#\n# NOTE: for the test suite, the database URL will be replaced with one\n# appropriate for the database that's started by the test runner.\n#\n[deployment.database]\ntype = \"from_url\"\nurl = \"postgresql://[email protected]:0/omicron?sslmode=disable\"\n\n# Dendrite (dataplane daemon) API configuration\n[dendrite.switch0]\naddress = \"[::1]:12224\"\n\n[background_tasks]\ndns_internal.period_secs_config = 60\ndns_internal.period_secs_servers = 60\ndns_internal.period_secs_propagation = 60\ndns_internal.max_concurrent_server_updates = 5\ndns_external.period_secs_config = 60\ndns_external.period_secs_servers = 60\ndns_external.period_secs_propagation = 60\ndns_external.max_concurrent_server_updates = 5\nmetrics_producer_gc.period_secs = 60\n# How frequently we check the list of stored TLS certificates.  This is\n# approximately an upper bound on how soon after updating the list of\n# certificates it will take _other_ Nexus instances to notice and stop serving\n# them (on a sunny day).\nexternal_endpoints.period_secs = 60\nnat_cleanup.period_secs = 30\nbfd_manager.period_secs = 30\n# How frequently to collect hardware/software inventory from the whole system\n# (even if we don't have reason to believe anything has changed).\ninventory.period_secs = 600\n# Maximum number of past collections to keep in the database\ninventory.nkeep = 3\n# Disable inventory collection altogether (for emergencies)\ninventory.disable = false\nphantom_disks.period_secs = 30\nphysical_disk_adoption.period_secs = 30\n# Disable automatic disk adoption to avoid interfering with tests.\nphysical_disk_adoption.disable = true\ndecommissioned_disk_cleaner.period_secs = 60\n# Disable disk decommissioning cleanup to avoid interfering with tests.\ndecommissioned_disk_cleaner.disable = true\nblueprints.period_secs_load = 100\nblueprints.period_secs_execute = 600\nblueprints.period_secs_collect_crdb_node_ids = 600\nsync_service_zone_nat.period_secs = 30\nswitch_port_settings_manager.period_secs = 30\nregion_replacement.period_secs = 30\n# The driver task should wake up frequently, something like every 10 seconds.\n# however, if it's this low it affects the test_omdb_success_cases test output.\n# keep this 30 seconds, so that the test shows \"triggered by an explicit\n# signal\" instead of \"triggered by a periodic timer firing\"\nregion_replacement_driver.period_secs = 30\ninstance_watcher.period_secs = 30\nservice_firewall_propagation.period_secs = 300\nv2p_mapping_propagation.period_secs = 30\nabandoned_vmm_reaper.period_secs = 60\nsaga_recovery.period_secs = 600\nlookup_region_port.period_secs = 60\n# The purpose of the `instance-updater` background task is to ensure that update\n# sagas are always *eventually* started for instances whose database state has\n# changed, even if the update saga was not started by the Nexus replica handling\n# an update from sled-agent. This is to ensure that updates are performed even\n# in cases where a Nexus crashes or otherwise disappears between when the\n# updated VMM and migration state is written to CRDB and when the resulting\n# update saga actually starts executing. However, we would prefer update sagas\n# to be executed in a timely manner, so for integration tests, we don't want to\n# *rely* on the instance-updater background task for running these sagas.\n#\n# Therefore, disable the background task during tests.\ninstance_updater.disable = true\ninstance_updater.period_secs = 60\nregion_snapshot_replacement_start.period_secs = 30\nregion_snapshot_replacement_garbage_collection.period_secs = 30\nregion_snapshot_replacement_step.period_secs = 30\n\n[default_region_allocation_strategy]\n# we only have one sled in the test environment, so we need to use the\n# `Random` strategy, instead of `RandomWithDistinctSleds`\ntype = \"random\"\n"), keys: [], span: Some(0..0) } } }) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This commit changes the load_test_config function to match the result
and use the panic! macro to format the error using fmt::Display if
the test config file cannot be parsed. This should make debugging
malformed configs way nicer.

I previously made the same change for the
gateway-test-utils::setup::load_test_config function in commit
3b89711

Presently, errors parsing the test Nexus config are printed using `fmt::Debug`, due to the use of `expect()`. The `Debug` format for `TomlError` is...pretty hard to read, since it looks like this: <details> <summary>EXTREMELY UGLY WALL OF TEXT, LOOK AT YOUR OWN RISK</summary> ``` --- STDERR: nexus-reconfigurator-execution omicron_zones::test::test_clean_up_cockroach_zones --- thread 'omicron_zones::test::test_clean_up_cockroach_zones' panicked at nexus/test-utils/src/lib.rs:186:10: failed to load config.test.toml: LoadError { path: "tests/config.test.toml", kind: Parse(Error { inner: Error { inner: TomlError { message: "missing field `region_snapshot_replacement_finish`", raw: Some("#\n# Oxide API: configuration file for test suite\n#\n\n[console]\n# Directory for static assets. Absolute path or relative to CWD.\nstatic_dir = \"tests/static\"\nsession_idle_timeout_minutes = 480 # 8 hours\nsession_absolute_timeout_minutes = 1440 # 24 hours\n\n# List of authentication schemes to support.\n[authn]\nschemes_external = [\"spoof\", \"session_cookie\"]\n\n#\n# NOTE: for the test suite, if mode = \"file\", the file path MUST be the sentinel\n# string \"UNUSED\". The actual path will be generated by the test suite for each\n# test.\n#\n[log]\nlevel = \"trace\"\nmode = \"file\"\nif_exists = \"fail\"\npath = \"UNUSED\"\n\n# Configuration for interacting with the timeseries database. This is overwritten\n# by the test suite once ClickHouse starts, with the actual address on which it\n# is listening.\n[timeseries_db]\naddress = \"[::1]:0\"\n\n# Tunable parameters used during tests\n[tunables]\n# Allow small subnets, so we can test IP address exhaustion easily / quickly\nmax_vpc_ipv4_subnet_prefix = 29\n\n[deployment]\n# Identifier for this instance of Nexus.\n# NOTE: The test suite always overrides this.\nid = \"913233fe-92a8-4635-9572-183f495429c4\"\nrack_id = \"c19a698f-c6f9-4a17-ae30-20d711b8f7dc\"\ntechport_external_server_port = 0\n\n# Nexus may need to resolve external hosts (e.g. to grab IdP metadata).\n# These are the DNS servers it should use.\nexternal_dns_servers = [\"1.1.1.1\", \"9.9.9.9\"]\n\n[deployment.dropshot_external]\n# NOTE: for the test suite, the port MUST be 0 (in order to bind to any\n# available port) because the test suite will be running many servers\n# concurrently.\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n[deployment.dropshot_internal]\nbind_address = \"127.0.0.1:0\"\nrequest_body_max_bytes = 8388608\n\n#\n# NOTE: for the test suite, the internal DNS address will be replaced with one\n# that's started by the test runner.\n#\n[deployment.internal_dns]\ntype = \"from_subnet\"\nsubnet.net = \"fd00:1122:3344:0100::/56\"\n\n#\n# NOTE: for the test suite, the database URL will be replaced with one\n# appropriate for the database that's started by the test runner.\n#\n[deployment.database]\ntype = \"from_url\"\nurl = \"postgresql://[email protected]:0/omicron?sslmode=disable\"\n\n# Dendrite (dataplane daemon) API configuration\n[dendrite.switch0]\naddress = \"[::1]:12224\"\n\n[background_tasks]\ndns_internal.period_secs_config = 60\ndns_internal.period_secs_servers = 60\ndns_internal.period_secs_propagation = 60\ndns_internal.max_concurrent_server_updates = 5\ndns_external.period_secs_config = 60\ndns_external.period_secs_servers = 60\ndns_external.period_secs_propagation = 60\ndns_external.max_concurrent_server_updates = 5\nmetrics_producer_gc.period_secs = 60\n# How frequently we check the list of stored TLS certificates. This is\n# approximately an upper bound on how soon after updating the list of\n# certificates it will take _other_ Nexus instances to notice and stop serving\n# them (on a sunny day).\nexternal_endpoints.period_secs = 60\nnat_cleanup.period_secs = 30\nbfd_manager.period_secs = 30\n# How frequently to collect hardware/software inventory from the whole system\n# (even if we don't have reason to believe anything has changed).\ninventory.period_secs = 600\n# Maximum number of past collections to keep in the database\ninventory.nkeep = 3\n# Disable inventory collection altogether (for emergencies)\ninventory.disable = false\nphantom_disks.period_secs = 30\nphysical_disk_adoption.period_secs = 30\n# Disable automatic disk adoption to avoid interfering with tests.\nphysical_disk_adoption.disable = true\ndecommissioned_disk_cleaner.period_secs = 60\n# Disable disk decommissioning cleanup to avoid interfering with tests.\ndecommissioned_disk_cleaner.disable = true\nblueprints.period_secs_load = 100\nblueprints.period_secs_execute = 600\nblueprints.period_secs_collect_crdb_node_ids = 600\nsync_service_zone_nat.period_secs = 30\nswitch_port_settings_manager.period_secs = 30\nregion_replacement.period_secs = 30\n# The driver task should wake up frequently, something like every 10 seconds.\n# however, if it's this low it affects the test_omdb_success_cases test output.\n# keep this 30 seconds, so that the test shows \"triggered by an explicit\n# signal\" instead of \"triggered by a periodic timer firing\"\nregion_replacement_driver.period_secs = 30\ninstance_watcher.period_secs = 30\nservice_firewall_propagation.period_secs = 300\nv2p_mapping_propagation.period_secs = 30\nabandoned_vmm_reaper.period_secs = 60\nsaga_recovery.period_secs = 600\nlookup_region_port.period_secs = 60\n# The purpose of the `instance-updater` background task is to ensure that update\n# sagas are always *eventually* started for instances whose database state has\n# changed, even if the update saga was not started by the Nexus replica handling\n# an update from sled-agent. This is to ensure that updates are performed even\n# in cases where a Nexus crashes or otherwise disappears between when the\n# updated VMM and migration state is written to CRDB and when the resulting\n# update saga actually starts executing. However, we would prefer update sagas\n# to be executed in a timely manner, so for integration tests, we don't want to\n# *rely* on the instance-updater background task for running these sagas.\n#\n# Therefore, disable the background task during tests.\ninstance_updater.disable = true\ninstance_updater.period_secs = 60\nregion_snapshot_replacement_start.period_secs = 30\nregion_snapshot_replacement_garbage_collection.period_secs = 30\nregion_snapshot_replacement_step.period_secs = 30\n\n[default_region_allocation_strategy]\n# we only have one sled in the test environment, so we need to use the\n# `Random` strategy, instead of `RandomWithDistinctSleds`\ntype = \"random\"\n"), keys: [], span: Some(0..0) } } }) } note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` </details> This commit changes the `load_test_config` function to match the result and use the `panic!` macro to format the error using `fmt::Display` if the test config file cannot be parsed. This should make debugging malformed configs way nicer. I previously made the same change for the `gateway-test-utils::setup::load_test_config` function in commit 3b89711

sunshowers · 2024-08-31T20:42:33Z

Should we also display the chain of errors?

hawkw · 2024-09-01T22:09:28Z

Should we also display the chain of errors?

Agh, I thought this would do that...i always forget which format specifier I need to print the whole chain, is it {:#}?

davepacheco · 2024-09-03T16:27:05Z

Should we also display the chain of errors?

Agh, I thought this would do that...i always forget which format specifier I need to print the whole chain, is it {:#}?

I think {:#} does for anyhow::Error but I'm not sure how universal that is. slog-error-chain is really nice (even if you're not using slog). It looks to me like nexus_config::LoadError already includes the cause's message in its own message, which means you'd get the "double-speak" problem mentioned in the slog-error-chain README. But on the other hand, that might mean you don't have to do anything at all here to get the error chain.

hawkw · 2024-09-03T17:23:23Z

Should we also display the chain of errors?

Agh, I thought this would do that...i always forget which format specifier I need to print the whole chain, is it {:#}?

I think {:#} does for anyhow::Error but I'm not sure how universal that is. slog-error-chain is really nice (even if you're not using slog). It looks to me like nexus_config::LoadError already includes the cause's message in its own message, which means you'd get the "double-speak" problem mentioned in the slog-error-chain README. But on the other hand, that might mean you don't have to do anything at all here to get the error chain.

Ah, yup, since the LoadError type repeats the cause, I think the current format specifier is fine. FWIW, we use {} for config load errors in the gateway_test_utils::setup::load_test_config function, since the MGS config errors also include the immediate cause, which is why I used the same thing here.

sunshowers · 2024-09-03T20:42:50Z

Yeah -- you can use display-error-chain, and we also have an inline error chain somewhere.

hawkw requested review from sunshowers and bnaecker August 31, 2024 16:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[test-utils] Pretty-print Nexus config parse error #6500

[test-utils] Pretty-print Nexus config parse error #6500

hawkw commented Aug 31, 2024

sunshowers commented Aug 31, 2024

hawkw commented Sep 1, 2024

davepacheco commented Sep 3, 2024

hawkw commented Sep 3, 2024

sunshowers commented Sep 3, 2024

[test-utils] Pretty-print Nexus config parse error #6500

Are you sure you want to change the base?

[test-utils] Pretty-print Nexus config parse error #6500

Conversation

hawkw commented Aug 31, 2024

sunshowers commented Aug 31, 2024

hawkw commented Sep 1, 2024

davepacheco commented Sep 3, 2024

hawkw commented Sep 3, 2024

sunshowers commented Sep 3, 2024