[21.05] Sensu monitoring for EVPN control plane #1088
Merged
This change adds a sensu check script for detecting sync issues between the EVPN control plane (i.e. the FRR daemons) and the kernel, along with supporting infrastructure.
The check script itself is `check_rib_integrity.py`, which loads and compares either the IPv4 unicast RIB (for monitoring the underlay network routing) or the EVPN RIB (for monitoring the overlay network configuration, i.e. mostly MAC addresses). If the state in the kernel and the state in FRR do not match, the script returns a critical result. When the check is critical, the script outputs a list of the problems detected (e.g. which IP addresses or MAC addresses are in the kernel but not in FRR and vice versa, or which destinations the kernel is missing nexthops for) in a space-separated format which should be straightforwardly machine-readable.

Given that this script has to deal with relatively complex system network state, I've additionally expanded our FRR test suite to check that the sensu script detects the relevant faults correctly. In order to properly simulate an EVPN environment, I've written a `ping-on-tap.py` script, which opens a tap interface and then responds to ARP and ICMP echo on that interface. The test VMs then run one or more instances of this script so the kernel can dynamically learn MAC addresses from the Python script on the tap interfaces, which are then exported to its peers by FRR.

I also realised that we never actually added the FRR tests to the default Hydra tests (as far as I can tell), so I've fixed that as well.
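For reviewers who want a mental model of the RIB comparison described above, here is a minimal sketch. It is not the code in `check_rib_integrity.py`. The function names, the `Rib` type, and the problem labels (`missing-in-kernel`, `missing-in-frr`, `missing-nexthop`) are hypothetical. It assumes both RIBs have already been loaded into plain dictionaries. Only the exit codes (0 for OK, 2 for critical) follow the usual Sensu/Nagios check convention.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical in-memory form of a RIB: destination -> set of nexthops.
# For the EVPN case the destination would be a MAC address instead of a prefix.
Rib = Dict[str, FrozenSet[str]]

OK = 0        # Sensu/Nagios convention: exit status 0 means OK
CRITICAL = 2  # exit status 2 means critical


def compare_ribs(frr: Rib, kernel: Rib) -> List[str]:
    """Return one space-separated line per detected mismatch."""
    problems: List[str] = []
    # Destinations FRR knows about but the kernel does not.
    for dest in sorted(frr.keys() - kernel.keys()):
        problems.append(f"missing-in-kernel {dest}")
    # Destinations the kernel has but FRR does not.
    for dest in sorted(kernel.keys() - frr.keys()):
        problems.append(f"missing-in-frr {dest}")
    # Destinations present on both sides where the kernel lacks a nexthop.
    for dest in sorted(frr.keys() & kernel.keys()):
        for nexthop in sorted(frr[dest] - kernel[dest]):
            problems.append(f"missing-nexthop {dest} {nexthop}")
    return problems


def check(frr: Rib, kernel: Rib) -> Tuple[int, str]:
    """Map the comparison result to a check status and output text."""
    problems = compare_ribs(frr, kernel)
    if problems:
        return CRITICAL, "CRITICAL\n" + "\n".join(problems)
    return OK, "OK"
```

Because each problem line is just a label followed by space-separated fields, the output can be split with standard tools such as `awk` or `str.split()`.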
PL-132595
@flyingcircusio/release-managers
Release process
Impact: internal.
Changelog: none.
PR release workflow (internal)
Design notes
on or off. Example: rate limiting.
Security implications