Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential race condition in starting nut-server #277

Open
val-kulkov opened this issue Mar 24, 2016 · 12 comments
Open

Potential race condition in starting nut-server #277

val-kulkov opened this issue Mar 24, 2016 · 12 comments
Labels
service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug

Comments

@val-kulkov
Copy link

I was having the same issues with my Tripp Lite SMART1500LCDT as the ones described in PR #122. Unfortunately, there has been little progress with PR #122 since it was created almost two years ago. So I thought perhaps I should try a different strategy.

I replaced the generic USB cable that came with my Tripp Lite SMART1500LCDT unit with a StarTech Certified USB cable. It worked. Replacing the cable and restarting nut-server resulted in establishing communication with my UPS. I have been watching the connection between my UPS unit and nut-server for a few weeks now and it works reasonably well, losing the connection maybe once a week or so. Compared to what I had with the Tripp Lite-provided generic USB cable, that's a huge progress.

However, I came across another issue. I found that nut-server fails to initialize properly at boot time in about 50% of the cases. Once the computer is booted, a manual restart of nut-server through "/etc/init.d/nut-server restart" fixes the problem.

It looks like there may be a race condition when nut-server is initialized at boot time. Until the appropriate access permissions are set for the USB device by /lib/udev/rules.d/52-nut-usbups.rules, nut-server cannot initialize properly and upsd won't start.

I looked into /etc/init.d/nut-server and found a number of FIXME messages. When I realized that a proper fix would take more time for me than I can afford at this time, I decided to simply describe this issue here hoping that someone will assume responsibility for fixing the issue while I cannot -- at this time.

In my case, adding "sleep 20" into the "start" section of the initialization script fixed the problem. It is very much a Band-Aid solution, of course. But it worked.

@clepple
Copy link
Member

clepple commented Apr 9, 2016

Which distribution is this, and what version of the NUT package?

You might also check if setting maxretry and retrydelay in ups.conf would fix the problem (the default is to try once).

Reference: http://networkupstools.org/docs/man/ups.conf.html#_global_directives

@val-kulkov
Copy link
Author

My observations were made on a Ubuntu 14.04.4 LTS system, nut-server: 2.7.1-1ubuntu1. By trial and error, I found that the following settings in my ups.conf appear to result in a relatively reliable performance:

maxretry = 3
pollinterval = 10

The value of pollinterval has been set to 10 to address the known issue with usbhid-ups (named "Repetitive timeout and staleness") where some UPS models appear to become unresponsive with the default polling frequency, producing errors like:

usbfs: USBDEVFS_CONTROL failed cmd usbhid-ups

I am going to experiment a little more with maxretry values. Perhaps it may be possible to avoid adding "sleep 20" to the startup script and increase the value of maxretry instead.

@val-kulkov
Copy link
Author

I have been observing my system for a few weeks now, with maxretry = 30. The startup succeeds from cold boot in more than 50% of cases. In some cases however, upsd does not start even with maxretry = 30.

I found that when upsd does not start from cold boot, I have one of the two possible situations:

  1. upsd may still be started with "sudo /etc/init.d/nut-server stop ; sleep 5 ; sudo /etc/init.d/nut-server start". Sometimes upsd starts properly on the first attempt, sometimes it takes a few more attempts to start upsd, like two or three attempts.
  2. upsd will not start at all. The restart sequence in (1) does not start upsd even after a hundred attempts or so.

The following is a part of syslog that describes the situation in (1). Here, upsd failed to start successfully from the cold boot:

May 8 10:11:49 vk upsmon[1410]: Startup successful
May 8 10:11:49 vk upsmon[1412]: UPS [Smart1500LCDI@localhost]: connect failed: Connection failure: Connection refused
May 8 10:11:49 vk upsmon[1412]: Communications with UPS Smart1500LCDI@localhost lost
May 8 10:11:50 vk usbhid-ups[1438]: Startup successful
May 8 10:11:50 vk upsd[1439]: not listening on 192.168.15.6 port 3493
May 8 10:11:50 vk upsd[1439]: listening on 127.0.0.1 port 3493
May 8 10:11:50 vk upsd[1439]: no listening interface available

What's interesting here is that usbhid-ups reports a successful startup, but then upsd fails to start with "no listening interface available". A subsequent manual restart with "sudo /etc/init.d/nut-server stop ; sleep 5 ; sudo /etc/init.d/nut-server start" brings up upsd with no hiccup:

May 8 10:20:05 vk usbhid-ups[1438]: Signal 15: exiting
May 8 10:20:11 vk usbhid-ups[2876]: Startup successful
May 8 10:20:11 vk upsd[2877]: listening on 192.168.15.6 port 3493
May 8 10:20:11 vk upsd[2877]: listening on 127.0.0.1 port 3493
May 8 10:20:11 vk upsd[2877]: Connected to UPS [Smart1500LCDI]: usbhid-ups-Smart1500LCDI
May 8 10:20:11 vk upsd[2878]: Startup successful
May 8 10:20:15 vk upsd[2878]: User [email protected] logged into UPS [Smart1500LCDI]

Could it be the case that upsd fails to start with "no listening interface available" while it takes some time for the usbhid-ups driver to register a listening interface? Lines 293-314 in upsdrvctl.c suggest that if exec_error does not change on an attempt to fork a new process, it is assumed that the driver command has succeeded and then the startup sequence proceeds as if the driver has been fully loaded. However, if the driver still needs some more time to register listening interfaces while upsd assumes that the driver has been loaded (at line 305 in upsdrvctl.c) but then determines that no listening interface is available, upsd will fail to start. I wonder if this is what I am observing with my system.

@clepple
Copy link
Member

clepple commented May 8, 2016

Could it be the case that upsd fails to start with "no listening interface available" while it takes some time for the usbhid-ups driver to register a listening interface?

Actually, the listening (TCP) interface and the driver are on opposite sides of upsd, architecturally speaking. The driver does not need to start for upsd to start, so there must be something else thwarting upsd.

One of the preconditions for upsd (implied, I guess) is that the network interfaces for all of the LISTEN addresses have been completely brought up. Typically, this is done by listing a dependency on the network subsystem in the upsd startup script. ("# Required-Start: $local_fs $syslog $network $remote_fs udev" in /etc/init.d/nut-server?)

The logging for upsd socket creation seems to need debug level 3 or greater, and I haven't checked to see if that goes to syslog as well as stderr (setting the debug flags prevents the daemon from backgrounding, which ripples through to other behaviors.)

@val-kulkov
Copy link
Author

@clepple : my observations have been made on upsd v.2.7.1, which is the up-to-date upsd version in Ubuntu 14.04 repos. I am going to upgrade my Ubuntu system from 14.04 to 16.04 shortly and make more observations on what will likely be a newer upsd version.

If the issue persists with the newer upsd version, I am willing to clone this upsd repo and compile upsd locally in order to run tests involving debug logging.

Lastly, here is the excerpt from my today's syslog. Note that at 09:11:27, I executed "sudo /etc/init.d/nut-server start". Before running this command, I checked the running processes and found that "/lib/nut/usbhid-ups -a Smart1500LCDI" was running.

Therefore, it does indeed seem that upsd sometimes misses some dependency at start time. In my particular case, upsd starts properly 6-8 times out of 10. When it does not start properly, I can almost always manually start upsd by running "nut-server start". There was one particular instance when "nut-server start" did not help, but I am unable to reproduce this issue. I will keep observing.

May 13 09:08:20 vk kernel: [    1.977937] usb 2-1.2: Product: TRIPP LITE UPS
May 13 09:08:20 vk kernel: [    2.491583] hid-generic 0003:09AE:3016.0007: hiddev0,hidraw6: USB HID v1.11 Device [Tripp Lite TRIPP LITE UPS] on usb-0000:00:1d.0-1.2/input0
May 13 09:08:20 vk upsmon[1392]: Startup successful
May 13 09:08:20 vk upsmon[1394]: UPS [Smart1500LCDI@localhost]: connect failed: Connection failure: Connection refused
May 13 09:08:20 vk upsmon[1394]: Communications with UPS Smart1500LCDI@localhost lost
May 13 09:08:20 vk usbhid-ups[1418]: Startup successful
May 13 09:08:20 vk upsd[1419]: not listening on 192.168.15.6 port 3493
May 13 09:08:20 vk upsd[1419]: listening on 127.0.0.1 port 3493
May 13 09:08:20 vk upsd[1419]: no listening interface available
May 13 09:08:25 vk upsmon[1394]: UPS [Smart1500LCDI@localhost]: connect failed: Connection failure: Connection refused
May 13 09:08:25 vk upsmon[1394]: UPS Smart1500LCDI@localhost is unavailable
May 13 09:08:30 vk upsmon[1394]: UPS [Smart1500LCDI@localhost]: connect failed: Connection failure: Connection refused
May 13 09:11:27 vk usbhid-ups[1418]: Signal 15: exiting
May 13 09:11:37 vk kernel: [  203.599967] usb 2-1.2: usbfs: USBDEVFS_CONTROL failed cmd usbhid-ups rqt 129 rq 6 len 878 ret -110
May 13 09:11:47 vk kernel: [  213.728535] usb 2-1.2: usbfs: USBDEVFS_CONTROL failed cmd usbhid-ups rqt 129 rq 6 len 878 ret -110
May 13 09:11:53 vk usbhid-ups[3191]: Startup successful
May 13 09:11:53 vk upsd[3192]: listening on 192.168.15.6 port 3493
May 13 09:11:53 vk upsd[3192]: listening on 127.0.0.1 port 3493
May 13 09:11:53 vk upsd[3192]: Connected to UPS [Smart1500LCDI]: usbhid-ups-Smart1500LCDI
May 13 09:11:53 vk upsd[3193]: Startup successful
May 13 09:11:54 vk upsd[3193]: User [email protected] logged into UPS [Smart1500LCDI]
May 13 09:11:49 vk upsmon[1394]: message repeated 40 times: [ UPS [Smart1500LCDI@localhost]: connect failed: Connection failure: Connection refused]
May 13 09:11:54 vk upsmon[1394]: Communications with UPS Smart1500LCDI@localhost established

@clepple
Copy link
Member

clepple commented May 13, 2016

If the issue persists with the newer upsd version, I am willing to clone this upsd repo and compile upsd locally in order to run tests involving debug logging.

16.04's version isn't that much newer: http://packages.ubuntu.com/xenial/nut-server http://packages.ubuntu.com/xenial/nut-server (2.7.2)

But you could try this PPA before building from source: https://code.launchpad.net/~dobey/+recipe/nut-daily https://code.launchpad.net/~dobey/+recipe/nut-daily

@matib12
Copy link

matib12 commented Nov 5, 2016

Hi, val-kulkov I have the same symptoms on Raspbian Jessie. It is happening on the single core and multicore version of the RPi board. "no listening interface available" error appears during autostart.
What I have found useful is starting the interface you are using in LISTEN during startup (in the line auto) and with the static IP configuration. Any of these options by itself doesn't solve the problem.

@val-kulkov
Copy link
Author

@matib12 : I am not sure I understand what exactly you are trying to do. The problem, as I understand it, is likely related to the communications between the UPS device and the Linux kernel over the USB. This is the entry from my May 13, 2016 syslog (see a few posts above for the full syslog quote) that suggests that the problem may have its origin in the kernel-device communications:

May 13 09:11:37 vk kernel: [ 203.599967] usb 2-1.2: usbfs: USBDEVFS_CONTROL failed cmd usbhid-ups rqt 129 rq 6 len 878 ret -110

By the way, the problem completely disappeared once I upgraded my workstation to Ubuntu 16.04. What's more, Ubuntu 16.04 detects my UPS unit as a UPS device automatically and offers power management options for it. There must have been some changes to the kernel code or usbhid-ups or both that fixed the problem. My current kernel version is 4.4.0-45-generic.

@clepple
Copy link
Member

clepple commented Nov 7, 2016

@val-kulkov to be fair, your original posts did describe the upsd LISTEN problem that @matib12 worked around (see errors reported by upsmon), as well as the USB issue.

A few timeouts (ret -110) are normal. Continuous timeouts and device disconnections point to an issue between the UPS and the kernel, and it sounds like you have covered most of the physical layer potential problems (cables, hub, etc.)

While Ubuntu 16.04 did seem to run longer before the SMART1500LCDT disconnected, the problem has not gone away completely. I also rebuilt NUT with the libusb-1.0 branch, and pushed packages to this PPA, and it is no worse: https://launchpad.net/~clepple/+archive/ubuntu/nut

@daphatty
Copy link

FWIW, this issue is still present nearly three years later. My setup is a bit evolved (Raspberry Pi 3, Raspbian Buster, nut version 2.7.4, upshid-ups driver) but the behavior is identical. On boot-up, the LISTEN directive only works on localhost but not on the static IP. Manually restarting (stop/start) the nut-server service fixes the problem.

@jimklimov
Copy link
Member

jimklimov commented Nov 14, 2021

Is #749 a different aspect of this issue in general?

These two are not identical, since this issue happens with init scripts and that with (faulty earlier) systemd dependencies, but the systemic root cause seems to be similar.

@jimklimov jimklimov added the service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug label Nov 14, 2021
@daphatty
Copy link

Is #749 a different aspect of this issue in general?

These two are not identical, since this issue happens with init scripts and that with (faulty earlier) systemd dependencies, but the systemic root cause seems to be similar.

It’s certainly possible but I’ve long since decommissioned the environment where I encountered this issue and moved on. I have no way to easily verify whether or not the network was initialized when NUT server attempted to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug
Projects
None yet
Development

No branches or pull requests

5 participants