statusengine service stuck after encountering "Mysql server has gone away" #18

raghav8888 · 2018-12-20T08:42:14Z

This morning I encountered an issue when I was trying to put some servers in maintenance. The job was submitted but the maintenance never started. Upon investigation further, I found that the statusengine service reported the error couple of hours back.

Dec 20 03:28:36 broker1 StatusengineWorker.php[5006]: statusengine-worker[26979]: [2006] MySQL server has gone away
Dec 20 03:28:36 broker1 statusengine-worker[26979]: [2006] MySQL server has gone away

It was when I restarted the statusengine service that it regained the connection back. Below are the logs from restart (just FYI)

Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: State 'stop-sigterm' timed out. Killing.
Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: Killing process 5006 (StatusengineWor) with signal SIGKILL.
Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: Main process exited, code=killed, status=9/KILL
Dec 20 05:14:11 broker1 systemd[1]: statusengine.service: Failed with result 'timeout'.

Now I do see a similar entry which seems to have resolved on its own yesterday. Logs are below:

Dec 19 13:04:08 broker1 StatusengineWorker.php[5006]: statusengine-worker[5045]: [2006] MySQL server has gone away
Dec 19 13:04:08 broker1 statusengine-worker[5045]: [2006] MySQL server has gone away
Dec 19 13:20:37 broker1 StatusengineWorker.php[5006]: statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 13:20:37 broker1 StatusengineWorker.php[5006]: statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 13:20:37 broker1 statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 13:20:37 broker1 statusengine-worker[5006]: SQLSTATE[HY000] [2002] Connection timed out
Dec 19 14:20:44 broker1 statusengine-worker[26984]: [1213] Deadlock found when trying to get lock; try restarting transaction
Dec 19 14:20:44 broker1 StatusengineWorker.php[5006]: statusengine-worker[26984]: [1213] Deadlock found when trying to get lock; try restarting transaction
Dec 19 15:39:33 broker1 StatusengineWorker.php[5006]: statusengine-worker[5006]: Execute external command (via Query Handler): [1545233967] SCHEDULE_HOST_DOWNTIME;HOST1;1545233966;1702913966;1;0;157680000;user1;“NOT RELEASED TO OPERATIONS”

My questions:

Is there a recovery mechanism for such a scenario? If yes, could you shed some light?
I can place some kind of external monitoring to see if the service state and restart if Mysql error is seen, but again, is this something that you as a potential issue?

My MySQL is actually an Azure DB within the same Resource group and network and everything. I'll try to find out why these connections are failing (feel free if you have any ideas around this but this is ofcourse out of your scope to fix). Ref art: https://blogs.msdn.microsoft.com/azuresqldbsupport/2018/11/20/azure-database-for-mysql-server-has-gone-away/

The text was updated successfully, but these errors were encountered:

nook24 · 2018-12-20T18:56:21Z

The basic concept is:

Reconnect to MySQL DB on MySQL server has gone away error. Code All other errors will just ignored. May be the worker should also reconnect on SQLSTATE[HY000] [2002] Connection timed out? What are you thoughts on this?
Respawn dead child processes in case of php fatal error or some like this. Code
Hope that systemd restart Statusengine Worker on doomsday Config

I think the worker should also retry to submit the data on [1213] Deadlock found when trying to get lock; try restarting transaction. I guess i will add this in holiday break.

What data was affected on your system?

Status data got not written to MySQL anymore?
External Commands set via Statusengine UI (like downtimes) where not send to Nagios/Naemon?
both

raghav8888 · 2018-12-21T04:45:51Z

The data was not affected. I can see the entries of the service checks and everything from the duration when the SQL connection had gone away. My guess is gearman held onto the information and dumped it once the DB became available. Also, the downtimes got scheduled as soon as the service restarted which were triggered when the DB connection was unavailable.

I'm not much of a DBA so I'm gonna limit my comments on the things that I see on my setup for Azure Mysql. Here is a blog that talks about the errors. May be it would be worth checking the errors related to the intermittent disconnection section in the link and apply logic for those things?

As for the last part, the service restarts on failure but in my case it didnt fail. It noticed and error and kept running. May be a trigger can be initiated to restart the service on "MySQL has gone away" error. What harm could it bring?

nook24 · 2018-12-21T17:54:08Z

The data was not affected. I can see the entries of the service checks and everything from the duration when the SQL connection had gone away. My guess is gearman held onto the information and dumped it once the DB became available.

For me this sounds like that a child process (Statusengine Worker use pcntl_fork() to create childs) was dead and not restated by the parent. This would explain why you have not been affected by any data loss.

Have you checked journalctl for any uncaught exceptions?

I read the blog. From my experience Mysql server has gone away is most likely caused by TCP timeouts or "Already Closed connection with the database". But I never used Azure so maybe you also need to change some MySQL Server parameters in Azure?

I pushed a few changes to a new branch mysql-connection to catch Timeout and Deadlock errors. Hopefully this helps you to resolve the issue. Could you please test this in your environment?

I forced some Mysql server has gone away by killing the MySQL connection. This worked without any issues:

I also killed my MySQL server but this was also recovered by the Worker without any issues.

I would recommend to monitor the amount of processes to see if any child process die and don't get restarted like so:

define command{
   command_name		check_statusengine
   command_line		/opt/naemon/libexec/check_procs  -c 9:9 -a '/opt/statusengine/worker/bin/StatusengineWorker.php'
}

I addition you should monitor the gearman queue, to see if a queue is without a worker. You can modify this script https://gist.github.com/nook24/71e5752130d179231fb506af5eacd19b for example.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

statusengine service stuck after encountering "Mysql server has gone away" #18

statusengine service stuck after encountering "Mysql server has gone away" #18

raghav8888 commented Dec 20, 2018

nook24 commented Dec 20, 2018

raghav8888 commented Dec 21, 2018 •

edited

Loading

nook24 commented Dec 21, 2018

statusengine service stuck after encountering "Mysql server has gone away" #18

statusengine service stuck after encountering "Mysql server has gone away" #18

Comments

raghav8888 commented Dec 20, 2018

nook24 commented Dec 20, 2018

raghav8888 commented Dec 21, 2018 • edited Loading

nook24 commented Dec 21, 2018

raghav8888 commented Dec 21, 2018 •

edited

Loading