Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

statusengine service stuck after encountering "Mysql server has gone away" #18

Open
raghav8888 opened this issue Dec 20, 2018 · 3 comments

Comments

@raghav8888
Copy link

Hi @nook24

This morning I encountered an issue when I was trying to put some servers in maintenance. The job was submitted but the maintenance never started. Upon investigation further, I found that the statusengine service reported the error couple of hours back.

Dec 20 03:28:36 broker1 StatusengineWorker.php[5006]: statusengine-worker[26979]: [2006] MySQL server has gone away
Dec 20 03:28:36 broker1 statusengine-worker[26979]: [2006] MySQL server has gone away

It was when I restarted the statusengine service that it regained the connection back. Below are the logs from restart (just FYI)

Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: State 'stop-sigterm' timed out. Killing.
Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: Killing process 5006 (StatusengineWor) with signal SIGKILL.
Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: Main process exited, code=killed, status=9/KILL
Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: Failed with result 'timeout'.

Now I do see a similar entry which seems to have resolved on its own yesterday. Logs are below:

Dec 19 13:04:08 broker1 StatusengineWorker.php[5006]: statusengine-worker[5045]: [2006] MySQL server has gone away
Dec 19 13:04:08 broker1 statusengine-worker[5045]: [2006] MySQL server has gone away
Dec 19 13:20:37 broker1 StatusengineWorker.php[5006]: statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 13:20:37 broker1 StatusengineWorker.php[5006]: statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 13:20:37 broker1 statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 13:20:37 broker1 statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 14:20:44 broker1 statusengine-worker[26984]: [1213] Deadlock found when trying to get lock; try restarting transaction
Dec 19 14:20:44 broker1 StatusengineWorker.php[5006]: statusengine-worker[26984]: [1213] Deadlock found when trying to get lock; try restarting transaction
Dec 19 15:39:33 broker1 StatusengineWorker.php[5006]: statusengine-worker[5006]: Execute external command (via Query Handler): [1545233967] SCHEDULE_HOST_DOWNTIME;HOST1;1545233966;1702913966;1;0;157680000;user1;“NOT RELEASED TO OPERATIONS”

My questions:

  1. Is there a recovery mechanism for such a scenario? If yes, could you shed some light?
  2. I can place some kind of external monitoring to see if the service state and restart if Mysql error is seen, but again, is this something that you as a potential issue?

My MySQL is actually an Azure DB within the same Resource group and network and everything. I'll try to find out why these connections are failing (feel free if you have any ideas around this but this is ofcourse out of your scope to fix). Ref art: https://blogs.msdn.microsoft.com/azuresqldbsupport/2018/11/20/azure-database-for-mysql-server-has-gone-away/

@nook24
Copy link
Member

nook24 commented Dec 20, 2018

The basic concept is:

  • Reconnect to MySQL DB on MySQL server has gone away error. Code All other errors will just ignored. May be the worker should also reconnect on SQLSTATE[HY000] [2002] Connection timed out? What are you thoughts on this?
  • Respawn dead child processes in case of php fatal error or some like this. Code
  • Hope that systemd restart Statusengine Worker on doomsday Config

I think the worker should also retry to submit the data on [1213] Deadlock found when trying to get lock; try restarting transaction. I guess i will add this in holiday break.

What data was affected on your system?

  1. Status data got not written to MySQL anymore?
  2. External Commands set via Statusengine UI (like downtimes) where not send to Nagios/Naemon?
  3. both

@raghav8888
Copy link
Author

raghav8888 commented Dec 21, 2018

The data was not affected. I can see the entries of the service checks and everything from the duration when the SQL connection had gone away. My guess is gearman held onto the information and dumped it once the DB became available. Also, the downtimes got scheduled as soon as the service restarted which were triggered when the DB connection was unavailable.

I'm not much of a DBA so I'm gonna limit my comments on the things that I see on my setup for Azure Mysql. Here is a blog that talks about the errors. May be it would be worth checking the errors related to the intermittent disconnection section in the link and apply logic for those things?

As for the last part, the service restarts on failure but in my case it didnt fail. It noticed and error and kept running. May be a trigger can be initiated to restart the service on "MySQL has gone away" error. What harm could it bring?

@nook24
Copy link
Member

nook24 commented Dec 21, 2018

The data was not affected. I can see the entries of the service checks and everything from the duration when the SQL connection had gone away. My guess is gearman held onto the information and dumped it once the DB became available.

For me this sounds like that a child process (Statusengine Worker use pcntl_fork() to create childs) was dead and not restated by the parent. This would explain why you have not been affected by any data loss.

Have you checked journalctl for any uncaught exceptions?

I read the blog. From my experience Mysql server has gone away is most likely caused by TCP timeouts or "Already Closed connection with the database". But I never used Azure so maybe you also need to change some MySQL Server parameters in Azure?

I pushed a few changes to a new branch mysql-connection to catch Timeout and Deadlock errors. Hopefully this helps you to resolve the issue. Could you please test this in your environment?

I forced some Mysql server has gone away by killing the MySQL connection. This worked without any issues:
bildschirmfoto 2018-12-21 um 18 07 32

bildschirmfoto 2018-12-21 um 18 07 39

I also killed my MySQL server but this was also recovered by the Worker without any issues.

I would recommend to monitor the amount of processes to see if any child process die and don't get restarted like so:

define command{
   command_name		check_statusengine
   command_line		/opt/naemon/libexec/check_procs  -c 9:9 -a '/opt/statusengine/worker/bin/StatusengineWorker.php'
}

I addition you should monitor the gearman queue, to see if a queue is without a worker. You can modify this script https://gist.github.com/nook24/71e5752130d179231fb506af5eacd19b for example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants