
Verify timeline before lock acquire #24

Merged
merged 2 commits into yandex:main from verify_timeline on Jul 16, 2024

Conversation

munakoiso (Contributor) commented:

Somehow we can end up with a cluster that has two primary hosts, at most one of which is open [A, with the lower timeline, and B, with the higher timeline], and with maintenance enabled.

Right after maintenance is disabled, host A can acquire the leader_lock faster than the preferred primary host, B.
Host B will then try to rejoin the cluster, which makes no sense in this situation. Unfortunately, host B will wait for the recovery timeout: connecting to host A is impossible because it is closed by the local pgconsul. Therefore, the cluster will be unavailable until the recovery timeout has passed.

That is why we must verify the timeline before acquiring the leader lock.
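
Conceptually, the fix boils down to comparing the local timeline with the one recorded in ZooKeeper before racing for the leader lock. A minimal sketch of that idea, assuming the zk_state/db_state structures visible in the diff below; the helper name _timeline_is_current is hypothetical, not the actual pgconsul API:

def _timeline_is_current(self, db_state, zk_state):
    """Return True only if our timeline is not behind the one stored in ZK."""
    zk_timeline = zk_state.get(self.zk.TIMELINE_INFO_PATH)
    if zk_timeline is None:
        # Nothing recorded yet, so there is nothing to be behind of.
        return True
    # Refuse to compete for the leader lock from an older timeline (host A
    # in the scenario above); leave it to the preferred primary (host B).
    return int(db_state['timeline']) >= int(zk_timeline)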

@munakoiso munakoiso requested a review from a team as a code owner July 12, 2024 08:44
@munakoiso munakoiso force-pushed the verify_timeline branch 2 times, most recently from f3a9695 to 4a8b497 on July 15, 2024 12:35
@@ -23,8 +23,10 @@ Feature: Check maintenance mode
"""
Then <lock_type> "<lock_host>" has holder "pgconsul_postgresql1_1.pgconsul_pgconsul_net" for lock "/pgconsul/postgresql/leader"
When we set value "enable" for key "/pgconsul/postgresql/maintenance" in <lock_type> "<lock_host>"
And we wait "10.0" seconds
munakoiso (Contributor, Author) commented:
Here and below we should wait until pgconsul on all hosts has read the maintenance node.

@munakoiso munakoiso merged commit 8c3f94d into yandex:main Jul 16, 2024
30 checks passed
logging.error('Some error with ZK.')
# Actually we should never get here, but we check it just in case.
# Here we should end the iteration and check (and probably close) primary
# at the beginning of primary_iter
return None
# If ZK does not have timeline info, write it.
-elif zk_state[self.zk.TIMELINE_INFO_PATH] is None:
+elif zk_state[self.zk.TIMELINE_INFO_PATH] is None and not just_check:
logging.warning('Could not get timeline from ZK. Saving it.')
self.zk.write(self.zk.TIMELINE_INFO_PATH, db_state['timeline'])
# If there is a mismatch in timeline:
Collaborator commented:

You can end up with tli_res being False (https://github.com/yandex/pgconsul/pull/24/files#diff-2e5ad92c43aa96cc3a9cef6c6aec998b216f1379c43b1f651013d25e55989312R934) while just_check is True.
In that case you will still be making modifications in ZooKeeper.
That is wrong, because you don't hold the lock, and the information that no one holds it can be stale.
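
The point, illustrated as a hedged sketch of the branch at the linked line (tli_res is assumed to be the result of the timeline comparison; the exact surrounding code may differ):

if not tli_res:
    if just_check:
        # We do not hold the leader lock on this path, so the observation
        # may already be stale: report the mismatch and leave ZooKeeper
        # untouched.
        return False
    # Only with the lock held is it safe to act on the mismatch in ZK.
    ...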
