Naemon reload fails silently randomly #443

Open

Bryce-Souers opened this issue Jan 17, 2024 · 5 comments

@Bryce-Souers

I have an issue where sometimes a reload seems to work, and then Naemon crashes with no message explaining why.

I ran omd reload over 30 times before I hit this issue.
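
A minimal sketch of that kind of reload loop (the iteration count, sleep interval, and the omd status check are assumptions, not my exact commands):

for i in $(seq 1 30); do
    omd reload naemon
    sleep 60
    omd status naemon || break   # assumes omd status exits non-zero once naemon is stopped
done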

naemon.log

[1705509045] SERVICE ALERT: <REDACTED>;check_ap_ipv6;CRITICAL;SOFT;3;PING CRITICAL [fda9:ba49:dcb2:32a5:213:5009:1:9fa2] - Packet loss = 100%
[1705509050] Caught 'Hangup', restarting...
[1705509052] Retention data successfully saved.
[1705509052] Event broker module 'NERD' deinitialized successfully.
[1705509052] npcdmod: If you don't like me, I will go out! Bye.
[1705509052] Event broker module '/omd/sites/ssnomd/lib/npcdmod4.o' deinitialized successfully.
[1705509052] livestatus: deinitializing
[1705509052] livestatus: Logfile cache: flushing complete cache.
[1705509052] Event broker module '/omd/sites/ssnomd/lib/naemon/livestatus.o' deinitialized successfully.
[1705509053] Naemon 1.4.0 starting... (PID=2823702)
[1705509053] Local time is Wed Jan 17 16:30:53 UTC 2024
[1705509053] LOG VERSION: 2.0
[1705509053] qh: Socket '/omd/sites/ssnomd/var/naemon/naemon.qh' successfully initialized
[1705509053] nerd: Channel hostchecks registered successfully
[1705509053] nerd: Channel servicechecks registered successfully
[1705509053] nerd: Fully initialized and ready to rock!

debug.log

[1705509052.736467] [016.0] [pid=2823702] Service 'check_2days_uptime' on host 'REDACTED' handle_service_check_event()...
[1705509052.742485] [064.0] [pid=2823702] Attempting to unload module 'NERD': flags=1, reason=2
[1705509052.742527] [064.0] [pid=2823702] Module 'NERD' unloaded successfully.
[1705509052.742578] [064.0] [pid=2823702] Attempting to unload module '/omd/sites/ssnomd/lib/npcdmod4.o': flags=1, reason=2
[1705509052.742693] [064.0] [pid=2823702] Module '/omd/sites/ssnomd/lib/npcdmod4.o' unloaded successfully.
[1705509052.742744] [064.0] [pid=2823702] Attempting to unload module '/omd/sites/ssnomd/lib/naemon/livestatus.o': flags=1, reason=2
[1705509052.744596] [064.0] [pid=2823702] Module '/omd/sites/ssnomd/lib/naemon/livestatus.o' unloaded successfully.
[1705509053.382125] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382187] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382220] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382251] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382270] [4096.0] [pid=2823702] wproc: Successfully registered manager as @wproc with query handler
[1705509053.589750] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897369;pid=2897369
[1705509053.590014] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897370;pid=2897370
[1705509053.590135] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897371;pid=2897371
[1705509053.629639] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897372;pid=2897372
[1705509053.666278] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897374;pid=2897374
[1705509053.672590] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897373;pid=2897373

omd status

[ssnomd@lasssnpr01nag02 ~]$ omd status
rrdcached:      running
npcd:           running
naemon:         stopped
apache:         running
xinetd:         running
crontab:        running
-----------------------
Overall state:  partially running

ps aux

[ssnomd@lasssnpr01nag02 ~]$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  11928  2848 ?        Ss   Jan16   0:00 bash /root/start.sh
root         451  0.0  0.0  22964  2936 ?        Ss   Jan16   0:00 /usr/sbin/crond
root        1383  0.0  0.0  11928  2720 ?        S    Jan16   0:00 /bin/sh /usr/sbin/apachectl -D FOREGROUND
root        1387  0.0  0.0 267260 14672 ?        S    Jan16   0:05 /usr/sbin/httpd -D FOREGROUND
apache      1400  0.0  0.0 268696  8424 ?        S    Jan16   0:00 /usr/sbin/httpd -D FOREGROUND
apache      1401  0.0  0.0 270072  8760 ?        S    Jan16   0:01 /usr/sbin/httpd -D FOREGROUND
apache      1402  0.0  0.0 2704096 12792 ?       Sl   Jan16   0:36 /usr/sbin/httpd -D FOREGROUND
apache      1403  0.0  0.0 2572960 10416 ?       Sl   Jan16   0:35 /usr/sbin/httpd -D FOREGROUND
apache      1418  0.0  0.0 2572960 12984 ?       Sl   Jan16   0:35 /usr/sbin/httpd -D FOREGROUND
apache    390476  0.0  0.0 2507424 10444 ?       Sl   Jan16   0:28 /usr/sbin/httpd -D FOREGROUND
root     2711500  0.0  0.0  19248  3616 pts/1    Ss   14:44   0:00 bash
root     2711611  0.0  0.0 115204  7284 pts/1    S    14:44   0:00 sudo su - ssnomd
root     2711612  0.0  0.0 109184  6164 pts/1    S    14:44   0:00 su - ssnomd
ssnomd   2711613  0.0  0.0  22508  4400 pts/1    S    14:44   0:00 -bash
root     2739377  0.0  0.0  19248  3688 pts/2    Ss   14:55   0:00 bash
root     2739563  0.0  0.0 115204  7272 pts/2    S    14:55   0:00 sudo su - ssnomd
root     2739564  0.0  0.0 109184  6332 pts/2    S    14:55   0:00 su - ssnomd
ssnomd   2739566  0.0  0.0  22496  4428 pts/2    S    14:55   0:00 -bash
root     2766725  0.0  0.0  19248  3600 pts/0    Ss   15:06   0:00 bash
root     2766786  0.0  0.0 109184  6172 pts/0    S    15:06   0:00 su - ssnomd
ssnomd   2766787  0.0  0.0  22388  4436 pts/0    S+   15:06   0:00 -bash
root     2808475  0.0  0.0  19248  3716 pts/3    Ss   15:22   0:00 bash
root     2817334  0.0  0.0  19248  3684 pts/4    Ss   15:27   0:00 bash
root     2817353  0.0  0.0 115204  7492 pts/4    S    15:27   0:00 sudo su - ssnomd
root     2817354  0.0  0.0 109184  6184 pts/4    S    15:27   0:00 su - ssnomd
ssnomd   2817355  0.0  0.0  22492  4396 pts/4    S+   15:27   0:00 -bash
root     2819589  0.0  0.0 115204  7400 pts/3    S    15:40   0:00 sudo su - ssnomd
root     2819590  0.0  0.0 109184  6112 pts/3    S    15:40   0:00 su - ssnomd
ssnomd   2819591  0.0  0.0  22388  4416 pts/3    S+   15:40   0:00 -bash
ssnomd   2823608  0.0  0.0 317892  8072 ?        Ssl  16:01   0:01 /omd/sites/ssnomd/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 -s ssnomd -m 660 -l unix:/omd/sites/ssnomd/tmp/run/rrdcached.sock -p /omd/sites/ssnomd/tmp/rrdcached.pid -j /omd/sites/ssnomd/var/rrdcached -o /omd/sites/ssnomd/var/log/rrd
ssnomd   2823640  0.1  0.0 156184   380 ?        S    16:01   0:03 /omd/sites/ssnomd/bin/npcd -d -f /omd/sites/ssnomd/etc/pnp4nagios/npcd.cfg
ssnomd   2823760  0.0  0.0 118732  5936 ?        Ss   16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823762  0.0  0.1 105612 30812 ?        S    16:01   0:00 perl -x /omd/sites/ssnomd/share/thruk/script/thruk_auth
ssnomd   2823763  0.0  0.0 118884  2596 ?        S    16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823764  0.0  0.0 119400  5728 ?        S    16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823832  0.0  0.4 162832 76892 ?        S    16:01   0:02 perl /omd/sites/ssnomd/share/thruk/script/thruk_fastcgi.pl
ssnomd   2823839  0.0  0.0 119640  5752 ?        S    16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823920  0.0  0.0  55064  3400 ?        Ss   16:01   0:00 /omd/sites/ssnomd/var/tmp/xinetd -pidfile /omd/sites/ssnomd/tmp/run/xinetd.pid -filelog /omd/sites/ssnomd/var/log/xinetd.log -f /omd/sites/ssnomd/etc/xinetd.conf
ssnomd   2852767  0.0  0.0 118868  4860 ?        S    16:13   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2895736  0.0  0.0  26000  1540 pts/1    S+   16:30   0:00 /usr/bin/coreutils --coreutils-prog-shebang=tail /usr/bin/tail -n 500 -f var/log/naemon.log
ssnomd   2898643  200  0.0  47644  3848 pts/2    R+   16:54   0:00 ps aux

Any ideas for what could be going wrong here?

@michaelsbach

I suspect there is an issue with worker process creation, as that is the last thing debug.log shows; it hangs there and then the process dies.

@sni what say you?

@sni
Contributor

sni commented Jan 19, 2024

Looks like Naemon crashed during the reload. I have seen that occasionally in the wild as well, but it wasn't easily reproducible.
Which omd / naemon version is this in this case? It might also help to enable core files; a core dump could reveal the origin of the issue.
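
A rough sketch of what that could look like inside an OMD site (the binary path and core handling are assumptions and vary by distribution):

# as the site user, raise the core file size limit, then restart naemon from that shell
ulimit -c unlimited
omd restart naemon

# check where the kernel writes core files; on many systems they are piped
# to systemd-coredump or abrt instead of the working directory
cat /proc/sys/kernel/core_pattern

# after the next crash, load the core file into gdb and print a backtrace
gdb -batch -ex bt /omd/sites/ssnomd/bin/naemon /path/to/core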

@nook24
Member

nook24 commented Jan 19, 2024

From the output @Bryce-Souers provided, this is [1705509053] Naemon 1.4.0 starting... (PID=2823702)

So this is a version before the worker re-spawn feature, which was introduced in 1.4.2.
Were there any changes regarding livestatus? There is also another open issue, #428.

@Bryce-Souers
Author

Bryce-Souers commented Jan 19, 2024

> Looks like Naemon crashed during the reload. I have seen that occasionally in the wild as well, but it wasn't easily reproducible. Which omd / naemon version is this in this case? It might also help to enable core files; a core dump could reveal the origin of the issue.

@sni

What do you mean by enabling core files?

I have debug logging set to the maximum level, as far as I know.

@Bryce-Souers
Author

I'd also like to add that this is running on v5 of the omd DockerHub image.
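
For what it's worth, a sketch of the container-specific points to check (assumptions, not verified against the v5 image): core_pattern is a kernel-wide setting shared with the host, and the container needs a non-zero core size limit.

# check the host kernel's core handler from inside the container
cat /proc/sys/kernel/core_pattern

# one way to allow core dumps is to set the limit at container start; <image> is a placeholder
docker run --ulimit core=-1 <image>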
