[21.05] Sensu monitoring for EVPN control plane #1088

Merged
merged 7 commits into fc-21.05-dev from PL-132595-vxlan-monitor-control-plane
Aug 26, 2024

sysvinit
Member

This change adds a sensu check script for detecting sync issues between the EVPN control plane (i.e. the FRR daemons) and the kernel, along with supporting infrastructure.

The check script itself is check_rib_integrity.py, which loads and compares either the IPv4 unicast RIB (for monitoring the underlay network routing) or the EVPN RIB (for monitoring the overlay network configuration, i.e. mostly MAC addresses). If the state in the kernel and the state in FRR do not match, the script returns a critical result. When the check is critical, the script outputs a list of the problems detected (e.g. which IP addresses or MAC addresses are in the kernel but not in FRR and vice versa, or for which destinations the kernel is missing nexthops) in a space-separated format which should be straightforwardly machine-readable.
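To illustrate the kind of comparison described above, here is a minimal sketch, not the actual check_rib_integrity.py. It assumes both views have already been loaded into dictionaries mapping a destination (prefix or MAC address) to a set of nexthops. The function names, the problem labels (`kernel-only`, `frr-only`, `kernel-missing-nexthop`) and the exact output layout are hypothetical; only the exit codes (0 for OK, 2 for critical) follow the usual Nagios-style convention that Sensu checks use.

```python
from typing import Dict, List, Set, Tuple

# Nagios-style exit codes, as conventionally used by Sensu checks.
OK, CRITICAL = 0, 2

# Hypothetical in-memory representation: destination -> set of nexthops.
Rib = Dict[str, Set[str]]


def compare_ribs(kernel: Rib, frr: Rib) -> List[Tuple[str, str, str]]:
    """Return (problem, destination, detail) tuples for every mismatch."""
    problems = []
    # Destinations known to the kernel but not to FRR.
    for dest in sorted(kernel.keys() - frr.keys()):
        problems.append(("kernel-only", dest, "-"))
    # Destinations known to FRR but not installed in the kernel.
    for dest in sorted(frr.keys() - kernel.keys()):
        problems.append(("frr-only", dest, "-"))
    # Destinations present in both, where the kernel lacks a nexthop.
    for dest in sorted(kernel.keys() & frr.keys()):
        for nexthop in sorted(frr[dest] - kernel[dest]):
            problems.append(("kernel-missing-nexthop", dest, nexthop))
    return problems


def report(problems: List[Tuple[str, str, str]]) -> Tuple[int, str]:
    """Turn the problem list into an exit code and space-separated output."""
    if not problems:
        return OK, "OK: kernel and FRR agree"
    lines = ["CRITICAL: %d mismatches" % len(problems)]
    lines += [" ".join(problem) for problem in problems]
    return CRITICAL, "\n".join(lines)
```

A caller would print the returned text and exit with the returned code. Emitting one problem per line with space-separated fields keeps the output easy to process with standard text tools.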

Given that this script has to deal with relatively complex system network state, I've additionally expanded our FRR test suite to check that the sensu script detects the relevant faults correctly. In order to properly simulate an EVPN environment, I've written a ping-on-tap.py script, which opens a tap interface and then responds to ARP and ICMP echo requests on that interface. The test VMs then run one or more instances of this script so that the kernel can dynamically learn MAC addresses from the Python script on the tap interfaces, which are then exported to its peers by FRR.
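As a rough illustration of the ARP half of such a responder, the sketch below builds an ARP reply from a raw Ethernet frame as it would be read from a tap device. It is not taken from ping-on-tap.py: the function name and parameters are hypothetical, opening the tap interface and answering ICMP echo are left out, and only plain Ethernet/IPv4 ARP requests are handled.

```python
import struct
from typing import Optional

ETH_P_ARP = 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2
ETH_FMT = "!6s6sH"            # destination MAC, source MAC, ethertype
ARP_FMT = "!HHBBH6s4s6s4s"    # htype, ptype, hlen, plen, oper, sha, spa, tha, tpa


def build_arp_reply(frame: bytes, my_mac: bytes, my_ip: bytes) -> Optional[bytes]:
    """Return an ARP reply frame for a request asking for my_ip, else None."""
    if len(frame) < 42:  # 14 bytes of Ethernet header + 28 bytes of ARP
        return None
    _dst, src, ethertype = struct.unpack(ETH_FMT, frame[:14])
    if ethertype != ETH_P_ARP:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, _tha, tpa = struct.unpack(
        ARP_FMT, frame[14:42]
    )
    # Only handle Ethernet (1) / IPv4 (0x0800) requests for our own address.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    if oper != ARP_REQUEST or tpa != my_ip:
        return None
    # Swap roles: we become the sender, the requester becomes the target.
    eth = struct.pack(ETH_FMT, src, my_mac, ETH_P_ARP)
    arp = struct.pack(ARP_FMT, 1, 0x0800, 6, 4, ARP_REPLY, my_mac, my_ip, sha, spa)
    return eth + arp
```

Writing the returned frame back to the tap device lets the kernel learn `my_mac` on that interface, which is the effect the test setup relies on.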

I also realised that we never actually added the FRR tests to the default Hydra tests (as far as I can tell), so I've fixed that as well.

PL-132595

@flyingcircusio/release-managers

Release process

Impact: internal.

Changelog: none.

PR release workflow (internal)

  • PR has internal ticket
  • internal issue ID (PL-…) part of branch name
  • internal issue ID mentioned in PR description text
  • ticket is on Platform agile board
  • ticket state set to Pull request ready
  • if ticket is more urgent than within the next few days, directly contact a member of the Platform team

Design notes

  • Provide a feature toggle if the change might need to be adjusted/reverted quickly depending on context. Consider whether the default should be on or off. Example: rate limiting.
  • All customer-facing features and (NixOS) options need to be discoverable from documentation. Add or update relevant documentation such that hosted and guided customers can understand it as well.

Security implications

  • Security requirements defined? (WHERE)
    • This change adds monitoring for the VXLAN overlay network in order to improve observability when the control plane and data plane get out of sync.
  • Security requirements tested? (EVIDENCE)
    • Tested manually on a dev host.
    • Additional Hydra tests to automatically verify check script functionality.

All files inside this directory already have appropriate permissions
set. This allows processes with group frrvty to access the vty socket
without needing frr or superuser permissions.

PL-132595
This test artificially causes the FIB and RIB to get out of sync, and
checks that the sensu script correctly detects these problems.

PL-132595
@sysvinit sysvinit requested a review from ctheune August 26, 2024 07:39
Contributor

@ctheune ctheune left a comment

LGTM

@ctheune ctheune merged commit fd93945 into fc-21.05-dev Aug 26, 2024
2 checks passed
@ctheune ctheune deleted the PL-132595-vxlan-monitor-control-plane branch August 26, 2024 08:01