Verify timeline before lock acquire #24
Conversation
Force-pushed from f3a9695 to 4a8b497
@@ -23,8 +23,10 @@ Feature: Check maintenance mode
    """
    Then <lock_type> "<lock_host>" has holder "pgconsul_postgresql1_1.pgconsul_pgconsul_net" for lock "/pgconsul/postgresql/leader"
    When we set value "enable" for key "/pgconsul/postgresql/maintenance" in <lock_type> "<lock_host>"
+   And we wait "10.0" seconds
Here and below we should wait until pgconsul on all hosts has read the maintenance node.
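The suggestion above is to replace a fixed sleep with a condition-based wait. A minimal sketch of that pattern (the helper name `wait_until` is hypothetical, not part of pgconsul's test framework):

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns True or the timeout expires.

    Unlike a fixed sleep, this returns as soon as the condition holds
    (e.g. every host has read the maintenance node) and fails loudly
    if it never does.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

In a test step this would wrap a check such as "every pgconsul reports maintenance mode" instead of waiting an arbitrary 10 seconds.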
Force-pushed from 4a8b497 to 3e43c52
        logging.error('Some error with ZK.')
        # Actually we should never get here but checking it just in case.
        # Here we should end iteration and check and probably close primary
        # at the begin of primary_iter
        return None
    # If ZK does not have timeline info, write it.
-   elif zk_state[self.zk.TIMELINE_INFO_PATH] is None:
+   elif zk_state[self.zk.TIMELINE_INFO_PATH] is None and not just_check:
        logging.warning('Could not get timeline from ZK. Saving it.')
        self.zk.write(self.zk.TIMELINE_INFO_PATH, db_state['timeline'])
    # If there is a mismatch in timeline:
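The intent of the `just_check` guard in the diff above can be sketched as a standalone function. This is a simplified illustration, not pgconsul's actual code; `FakeZK` and the flat function signature are assumptions made for the example:

```python
import logging

TIMELINE_INFO_PATH = '/pgconsul/postgresql/timeline'


class FakeZK:
    """Minimal in-memory stand-in for the ZooKeeper client (illustration only)."""

    def __init__(self):
        self.data = {}

    def write(self, path, value):
        self.data[path] = value


def check_timeline(zk, zk_state, db_state, just_check=False):
    """Compare the local timeline with the one recorded in ZK.

    When just_check is True the caller does not hold the leader lock,
    so this function must only report a verdict and never write to ZK.
    """
    zk_timeline = zk_state.get(TIMELINE_INFO_PATH)
    if zk_timeline is None:
        if just_check:
            # No lock held: report "unknown" instead of writing.
            return None
        logging.warning('Could not get timeline from ZK. Saving it.')
        zk.write(TIMELINE_INFO_PATH, db_state['timeline'])
        return True
    return zk_timeline == db_state['timeline']
```

Without the `and not just_check` condition, a check-only call would fall into the write branch, which is exactly the problem the reviewer points out below.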
You can end up with tli_res being False
(https://github.com/yandex/pgconsul/pull/24/files#diff-2e5ad92c43aa96cc3a9cef6c6aec998b216f1379c43b1f651013d25e55989312R934) while just_check is True.
In that case you would be making modifications in ZooKeeper during a check-only call.
That is wrong because you don't hold the lock, and the information that no one holds it can be stale.
We can somehow end up with a cluster that has two primary hosts (at most one of them open) [A with the lower timeline and B with the greater timeline] and maintenance enabled.
Right after maintenance is disabled, host A can take the leader_lock faster than the preferred primary, host B.
Host B will then try to return to the cluster, but that makes no sense: connecting to host A is impossible because it is closed by the local pgconsul, so host B will wait out the recovery timeout. The cluster therefore stays unavailable until the recovery timeout has passed.
That is why we must verify the timeline before acquiring the leader lock.
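The fix this PR argues for can be sketched as follows. All names here (`FakeZK`, `try_become_primary`, the lock helpers) are hypothetical stand-ins for pgconsul's actual ZooKeeper client, shown only to illustrate "verify timeline before lock acquire":

```python
LEADER_LOCK = '/pgconsul/postgresql/leader'
TIMELINE_INFO_PATH = '/pgconsul/postgresql/timeline'


class FakeZK:
    """Minimal in-memory stand-in for the ZooKeeper client (illustration only)."""

    def __init__(self, data=None):
        self.data = dict(data or {})
        self.locks = set()

    def get(self, path):
        return self.data.get(path)

    def acquire_lock(self, path):
        if path in self.locks:
            return False
        self.locks.add(path)
        return True

    def release_lock(self, path):
        self.locks.discard(path)


def try_become_primary(zk, db_state):
    # Verify the timeline BEFORE acquiring the leader lock: a host whose
    # timeline lags the one recorded in ZK (host A in the scenario above)
    # must not grab leadership ahead of the preferred host B.
    zk_timeline = zk.get(TIMELINE_INFO_PATH)
    if zk_timeline is not None and db_state['timeline'] < zk_timeline:
        return False
    if not zk.acquire_lock(LEADER_LOCK):
        return False
    # Re-check after acquiring: the value read without the lock may be stale.
    zk_timeline = zk.get(TIMELINE_INFO_PATH)
    if zk_timeline is not None and db_state['timeline'] < zk_timeline:
        zk.release_lock(LEADER_LOCK)
        return False
    return True
```

The re-check after acquisition addresses the reviewer's point that any information read without holding the lock can be stale.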