rafttest: reduce waitLeader flakiness #195

Open · pav-kv wants to merge 1 commit into main
rafttest/node_test.go: 41 changes (23 additions, 18 deletions)
@@ -35,7 +35,7 @@ func TestBasicProgress(t *testing.T) {
 		nodes = append(nodes, n)
 	}
 
-	waitLeader(nodes)
+	waitStableLeader(nodes)
@pav-kv (Contributor, Author) commented on Apr 10, 2024:

Maybe a better fix for this test, specifically, would be to relax the expectation in line 44 that all 100 entries are committed. In fact, we don't have any guarantee that any entry is going to be committed. If the leader election is completely unstable, and every single submitted entry that gets routed to a now-leader ends up wiped because there is already a higher-term leader, we won't commit anything.

We can require in this test, however, that most entries are committed (say, at least 70 out of 100), to demonstrate that there is liveness in this implementation.

Alternatively, once we have a leader, we can stop ticking the nodes, so that no re-election will occur and disrupt the flow of the 100 entries.
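
To make the first suggestion concrete, the relaxed check could look something like the sketch below. The helper, the committed count it takes, and the 70-of-100 threshold are illustrative assumptions, not code in this PR:

package rafttest // hypothetical placement; illustration only

import "testing"

// checkMostCommitted sketches the relaxed expectation: the test passes as long
// as at least minCommitted of the proposed entries ended up committed (e.g. 70
// out of 100), which demonstrates liveness without requiring that every single
// proposal survives leader churn.
func checkMostCommitted(t *testing.T, committed, proposed, minCommitted int) {
	t.Helper()
	if committed < minCommitted {
		t.Fatalf("committed %d of %d proposed entries, want at least %d",
			committed, proposed, minCommitted)
	}
}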


 	for i := 0; i < 100; i++ {
 		nodes[0].Propose(context.TODO(), []byte("somedata"))
@@ -59,7 +59,7 @@ func TestRestart(t *testing.T) {
 		nodes = append(nodes, n)
 	}
 
-	l := waitLeader(nodes)
+	l := waitStableLeader(nodes)
 	k1, k2 := (l+1)%5, (l+2)%5
 
 	for i := 0; i < 30; i++ {
@@ -97,7 +97,7 @@ func TestPause(t *testing.T) {
 		nodes = append(nodes, n)
 	}
 
-	waitLeader(nodes)
+	waitStableLeader(nodes)
 
 	for i := 0; i < 30; i++ {
 		nodes[0].Propose(context.TODO(), []byte("somedata"))
@@ -123,26 +123,31 @@
 	}
 }
 
-func waitLeader(ns []*node) int {
-	var l map[uint64]struct{}
-	var lindex = -1
-
+// waitStableLeader waits until there is a stable leader in the cluster. It
+// heuristically assumes that there is a stable leader when there is a node in
+// StateLeader among the highest-term nodes.
+//
+// Note that this function would not work properly in clusters with "network"
+// partitions, in which a node can have the highest term, and yet never become a
+// leader.
+func waitStableLeader(ns []*node) int {
 	for {
-		l = make(map[uint64]struct{})
-
+		lead := -1
+		var maxTerm uint64
 		for i, n := range ns {
-			lead := n.Status().SoftState.Lead
-			if lead != 0 {
-				l[lead] = struct{}{}
-				if n.id == lead {
-					lindex = i
-				}
-			}
+			st := n.Status()
+			if st.Term > maxTerm {
+				lead = -1
+				maxTerm = st.Term
+			}
+			if st.RaftState == raft.StateLeader {
+				lead = i
+			}
 		}
 
-		if len(l) == 1 && lindex != -1 {
-			return lindex
+		if lead != -1 {
+			return lead
 		}
 		time.Sleep(time.Millisecond)
 	}
 }
