[21.05] zebra restart tuning and keepalived integration #1097

Merged

@sysvinit sysvinit commented Sep 4, 2024

This change raises the startup backoff limits for zebra. The system default restart settings allow zebra to restart up to five times in 60 seconds. However, when zebra has crashed because the system is overloaded, the conditions causing the crash may take longer than that to clear, so this change raises the limit to 20 times in 120 seconds.
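To make the effect of the two limits concrete, here is a minimal sketch of a fixed-window start rate limit. It is a simplified model, not systemd's implementation: the `StartLimit` class, the `starts_before_refusal` helper and the assumption that zebra crashes every 5 seconds are all hypothetical and only serve to compare the two settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimit:
    """Simplified model: at most `burst` starts per `interval` seconds."""
    burst: int
    interval: float
    window_start: Optional[float] = None
    count: int = 0

    def allow(self, now: float) -> bool:
        # Open a new window on the first start, or once the old one has expired.
        if self.window_start is None or now - self.window_start > self.interval:
            self.window_start = now
            self.count = 1
            return True
        # Inside the current window: allow until the burst is used up.
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def starts_before_refusal(limit: StartLimit, crash_every: float, attempts: int) -> int:
    """Return how many evenly spaced start attempts succeed before one is refused."""
    for i in range(attempts):
        if not limit.allow(i * crash_every):
            return i
    return attempts


# Hypothetical scenario: zebra crashes and is started again every 5 seconds.
old = starts_before_refusal(StartLimit(burst=5, interval=60), crash_every=5, attempts=30)
new = starts_before_refusal(StartLimit(burst=20, interval=120), crash_every=5, attempts=30)
```

In this model the default limit refuses the sixth start after 25 seconds, while the raised limit keeps retrying until the 21st start at 100 seconds, which gives an overloaded system more time to recover.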

Additionally, when zebra crashes on the primary router in a location, this should ideally cause the router to demote itself and let the secondary take over. This change adds a new check script to the keepalived configuration in all locations to monitor the liveness of zebra.
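As an illustration of what such a liveness check can look like, the sketch below scans `/proc` for a process named `zebra` and maps the result to an exit status, since keepalived check scripts report health through their exit status (0 for success). This is not the script added by this PR: the function names, the `/proc` scan and the choice of Python are assumptions made for the example, and a running process is only a rough proxy for zebra working correctly.

```python
import os


def process_running(name: str, proc_root: str = "/proc") -> bool:
    """Return True if any process under proc_root reports `name` as its command name."""
    for entry in os.listdir(proc_root):
        # Only numeric directories correspond to process IDs.
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    return True
        except OSError:
            # The process exited while we were scanning; skip it.
            continue
    return False


def check_exit_code(name: str = "zebra", proc_root: str = "/proc") -> int:
    """Map liveness to an exit status: 0 if the process is running, 1 otherwise."""
    return 0 if process_running(name, proc_root) else 1
```

A wrapper script would end with `sys.exit(check_exit_code())`, so that a missing zebra process is reported to keepalived as a failed check.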

PL-132950

@flyingcircusio/release-managers

Release process

Impact: internal.

Changelog: none.

PR release workflow (internal)

  • PR has internal ticket
  • internal issue ID (PL-…) part of branch name
  • internal issue ID mentioned in PR description text
  • ticket is on Platform agile board
  • ticket state set to Pull request ready
  • if ticket is more urgent than within the next few days, directly contact a member of the Platform team

Design notes

  • Provide a feature toggle if the change might need to be adjusted/reverted quickly depending on context. Consider whether the default should be on or off. Example: rate limiting.
  • All customer-facing features and (NixOS) options need to be discoverable from documentation. Add or update relevant documentation such that hosted and guided customers can understand it as well.

Security implications

  • Security requirements defined? (WHERE)
    • Improve site reliability and service availability with better integration between FRR and keepalived.
  • Security requirements tested? (EVIDENCE)
    • Load test in DEV used to artificially cause zebra crashes and test keepalived demotion behaviour.

@sysvinit sysvinit force-pushed the PL-132950-frr-keepalived-crash-tuning-integration branch from 0a9ea0b to ae41ca0 Compare September 4, 2024 16:45
@ctheune ctheune self-requested a review September 6, 2024 09:29
@ctheune ctheune left a comment

Generally fine, one small adjustment: add a 10 sec restartsec option (and increase the startlimitinterval accordingly, to 320s).
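The reason the interval has to grow with the restart delay can be checked with a little arithmetic. The sketch below assumes that the limit counts start attempts and that consecutive starts are at least the restart delay apart; it ignores the time zebra runs before crashing, and `limit_can_trigger` is a hypothetical helper, not an existing API.

```python
def limit_can_trigger(burst: int, interval_s: float, restart_delay_s: float) -> bool:
    """Return True if burst + 1 start attempts can fit inside one interval."""
    # Start number burst + 1 is the first one that can be refused. With a fixed
    # delay between starts it happens no sooner than burst * delay seconds
    # after the first start.
    earliest_refusal_s = burst * restart_delay_s
    return earliest_refusal_s <= interval_s


# 20 starts with 10 seconds between them take at least 200 seconds.
with_120s = limit_can_trigger(burst=20, interval_s=120, restart_delay_s=10)
with_320s = limit_can_trigger(burst=20, interval_s=320, restart_delay_s=10)
```

Under these assumptions a 120-second interval could never be exhausted with a 10-second delay, so zebra would be restarted indefinitely, whereas a 320-second interval leaves room for the limit to take effect.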

Give zebra a more generous limit for restarts after crashes. Allow up
to 20 restarts in 5 minutes, and wait 10 seconds between restart
attempts.

PL-132950

Routers are only capable of fulfilling the role of primary when zebra
is running and working correctly. Add a check to the keepalived
configuration which transitions to FAULT state if zebra is not
running.

PL-132950
@sysvinit sysvinit force-pushed the PL-132950-frr-keepalived-crash-tuning-integration branch from ae41ca0 to a3ed0f4 Compare September 19, 2024 11:20
@ctheune ctheune merged commit 93dd37f into fc-21.05-dev Sep 19, 2024
1 check passed
@ctheune ctheune deleted the PL-132950-frr-keepalived-crash-tuning-integration branch September 19, 2024 11:28