
Bug: Connectivity is lost once gluetun container is restarted #641

Open
rakbladsvalsen opened this issue Sep 24, 2021 · 64 comments

@rakbladsvalsen

rakbladsvalsen commented Sep 24, 2021

Is this urgent?: No (although it kind of is, since this causes complete connection loss whenever this "bug" happens)

Host OS: Tested on both Fedora 34 and (up-to-date) Arch Linux ARM (32bit/RPi 4B)

CPU arch or device name: amd64 & armv7

What VPN provider are you using: NordVPN

What are you using to run your container?: Docker Compose

What is the version of the program?

x64 & armv7: Running version latest built on 2021-09-23T17:23:28Z (commit 985cf7b)

Steps to reproduce issue:

  1. Using the recommended docker-compose.yml, configure gluetun and another container (in my case xyz, though it can be something like qbittorrent or whatever you want) to use gluetun's network stack. Publish xyz's ports through gluetun's network stack.
  2. Either: a) restart gluetun using good ol' docker restart gluetun, or b) manually cause a temporary network problem such that the gluetun container dies/exits, then restart gluetun.
  3. Now try to use xyz through its published ports: you'll receive a connection refused error unless you restart the xyz service again. You can also docker exec into the container and run curl/wget/ping/etc.:

Expected behavior:
xyz should have internet connectivity through gluetun's network stack and be accessible through gluetun's published/exposed ports, even if gluetun is restarted. This is, unfortunately, not the case: xyz's network stack just dies; no data in, no data out.

Additional notes:

  1. I did use FIREWALL_OUTBOUND_SUBNETS - it didn't make a difference.
  2. I noticed quite interesting stuff once gluetun is restarted: a) routing entries from containers using network_mode: service:gluetun completely disappear; b) restarting gluetun doesn't bring back the original routing tables; c) NetworkMode seems to be okay.

Terminal example

# At this point, gluetun has been manually restarted. Then I exec -it'd into an affected container that was using gluetun's network stack:
/app # ip ro sh 
/app # 
[root@fedora pepe]# docker restart xyz
[root@fedora pepe]# docker exec -it xyz /bin/sh 
/app # ip ro sh
0.0.0.0/1 via 10.8.1.1 dev tun0 
default via 172.17.0.1 dev eth0 
10.8.1.0/24 dev tun0 scope link  src 10.8.1.4 
37.120.209.219 via 172.17.0.1 dev eth0 
128.0.0.0/1 via 10.8.1.1 dev tun0 
172.17.0.0/16 dev eth0 scope link  src 172.17.0.2 

Brief docker inspect output from affected container

# snip
            "NetworkMode": "container:f77af999d9de92af66094dd9db0f854f1a2da9ceabddc47239bc5b89f577247f",
            "PortBindings": {},
            "RestartPolicy": {
                "Name": "unless-stopped",
                "MaximumRetryCount": 0
            },

f77[...] is gluetun's container ID.

Full gluetun logs:

2021/09/24 16:39:47 INFO Alpine version: 3.14.2
2021/09/24 16:39:47 INFO OpenVPN 2.4 version: 2.4.11
2021/09/24 16:39:47 INFO OpenVPN 2.5 version: 2.5.2
2021/09/24 16:39:47 INFO Unbound version: 1.13.2
2021/09/24 16:39:47 INFO IPtables version: v1.8.7
2021/09/24 16:39:47 INFO Settings summary below:
|--VPN:
   |--Type: openvpn
   |--OpenVPN:
      |--Version: 2.5
      |--Verbosity level: 1
      |--Network interface: tun0
      |--Run as root: enabled
   |--Nordvpn settings:
      |--Regions: mexico, sweden
      |--OpenVPN selection:
         |--Protocol: udp
|--DNS:
   |--Plaintext address: 1.1.1.1
   |--DNS over TLS:
      |--Unbound:
          |--DNS over TLS providers:
              |--Cloudflare
          |--Listening port: 53
          |--Access control:
              |--Allowed:
                  |--0.0.0.0/0
                  |--::/0
          |--Caching: enabled
          |--IPv4 resolution: enabled
          |--IPv6 resolution: disabled
          |--Verbosity level: 1/5
          |--Verbosity details level: 0/4
          |--Validation log level: 0/2
          |--Username: 
      |--Blacklist:
         |--Blocked categories: malicious
         |--Additional IP networks blocked: 13
      |--Update: every 24h0m0s
|--Firewall:
   |--Outbound subnets: 192.168.0.0/24
|--Log:
   |--Level: INFO
|--System:
   |--Process user ID: 1000
   |--Process group ID: 1000
   |--Timezone: REDACTED
|--Health:
   |--Server address: 127.0.0.1:9999
   |--Address to ping: github.com
   |--VPN:
      |--Initial duration: 6s
      |--Addition duration: 5s
|--HTTP control server:
   |--Listening port: 8000
   |--Logging: enabled
|--Public IP getter:
   |--Fetch period: 12h0m0s
   |--IP file: /tmp/gluetun/ip
|--Github version information: enabled
2021/09/24 16:39:47 INFO routing: default route found: interface eth0, gateway 172.17.0.1
2021/09/24 16:39:47 INFO routing: local ethernet link found: eth0
2021/09/24 16:39:47 INFO routing: local ipnet found: 172.17.0.0/16
2021/09/24 16:39:47 INFO routing: default route found: interface eth0, gateway 172.17.0.1
2021/09/24 16:39:47 INFO routing: adding route for 0.0.0.0/0
2021/09/24 16:39:47 INFO firewall: firewall disabled, only updating allowed subnets internal list
2021/09/24 16:39:47 INFO routing: default route found: interface eth0, gateway 172.17.0.1
2021/09/24 16:39:47 INFO routing: adding route for 192.168.0.0/24
2021/09/24 16:39:47 INFO TUN device is not available: open /dev/net/tun: no such file or directory; creating it...
2021/09/24 16:39:47 INFO firewall: enabling...
2021/09/24 16:39:47 INFO firewall: enabled successfully
2021/09/24 16:39:47 INFO dns over tls: using plaintext DNS at address 1.1.1.1
2021/09/24 16:39:47 INFO healthcheck: listening on 127.0.0.1:9999
2021/09/24 16:39:47 INFO http server: listening on :8000
2021/09/24 16:39:47 INFO firewall: setting VPN connection through firewall...
2021/09/24 16:39:47 INFO openvpn: OpenVPN 2.5.2 armv7-alpine-linux-musleabihf [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on May  4 2021
2021/09/24 16:39:47 INFO openvpn: library versions: OpenSSL 1.1.1l  24 Aug 2021, LZO 2.10
2021/09/24 16:39:47 INFO openvpn: TCP/UDP: Preserving recently used remote address: [AF_INET]86.106.103.27:1194
2021/09/24 16:39:47 INFO openvpn: UDP link local: (not bound)
2021/09/24 16:39:47 INFO openvpn: UDP link remote: [AF_INET]86.106.103.27:1194
2021/09/24 16:39:48 WARN openvpn: 'link-mtu' is used inconsistently, local='link-mtu 1633', remote='link-mtu 1634'
2021/09/24 16:39:48 WARN openvpn: 'comp-lzo' is present in remote config but missing in local config, remote='comp-lzo'
2021/09/24 16:39:48 INFO openvpn: [se-nl8.nordvpn.com] Peer Connection Initiated with [AF_INET]86.106.103.27:1194
2021/09/24 16:39:49 INFO openvpn: TUN/TAP device tun0 opened
2021/09/24 16:39:49 INFO openvpn: /sbin/ip link set dev tun0 up mtu 1500
2021/09/24 16:39:49 INFO openvpn: /sbin/ip link set dev tun0 up
2021/09/24 16:39:49 INFO openvpn: /sbin/ip addr add dev tun0 10.8.8.14/24
2021/09/24 16:39:49 INFO openvpn: Initialization Sequence Completed
2021/09/24 16:39:49 INFO dns over tls: downloading DNS over TLS cryptographic files
2021/09/24 16:39:50 INFO healthcheck: healthy!
2021/09/24 16:39:53 INFO dns over tls: downloading hostnames and IP block lists
2021/09/24 16:40:11 INFO dns over tls: init module 0: validator
2021/09/24 16:40:11 INFO dns over tls: init module 1: iterator
2021/09/24 16:40:11 INFO dns over tls: start of service (unbound 1.13.2).
2021/09/24 16:40:12 INFO dns over tls: generate keytag query _ta-4a5c-4f66. NULL IN
2021/09/24 16:40:13 INFO dns over tls: generate keytag query _ta-4a5c-4f66. NULL IN
2021/09/24 16:40:16 INFO dns over tls: ready
2021/09/24 16:40:18 INFO vpn: You are running on the bleeding edge of latest!
2021/09/24 16:40:19 INFO ip getter: Public IP address is 213.232.87.176 (Netherlands, North Holland, Amsterdam)

docker-compose.yml:

  gluetun:
    image: qmcgaw/gluetun
    container_name: gluetun
    restart: unless-stopped
    cap_add:
      - NET_ADMIN
    ports:
      - 4533:4533 #navidrome
    environment:
      - OPENVPN_USER=REDACTED
      - OPENVPN_PASSWORD=REDACTED
      - VPNSP=nordvpn
      - VPN_TYPE=openvpn
      - REGION=REDACTED
      - TZ=REDACTED
      - FIREWALL_OUTBOUND_SUBNETS=192.168.0.0/24

# navidrome (can be literally anything else)
  navidrome:
    image: deluan/navidrome:develop
    container_name: navidrome
    restart: unless-stopped
    environment:
      - PGID=1000
      - PUID=1000
    volumes:
      - dockervolume:/music:ro
    network_mode: service:gluetun
    depends_on:
      - gluetun

Nonetheless, I'd like to thank you for creating gluetun. I'd be more than happy to help you fix this issue if it is a gluetun bug. Hopefully it's just a misconfiguration on my side.

@qdm12
Owner

qdm12 commented Sep 24, 2021

Hey there! Thanks for the detailed issue!

It is a well-known Docker problem I need to work around. Let's keep this open for now, although there is at least one duplicate issue about this problem somewhere in the issues.

Note this only happens if gluetun is updated and uses a different image (afaik).

For now, you might want to have all your gluetun and connected containers in a single docker-compose.yml and docker-compose down && docker-compose up -d them (what I do).

I'm developing https://github.com/qdm12/deunhealth and should add a feature tailored for this problem soon (give it 1-5 days), feel free to subscribe to releases on that side repo. That way it would watch your containers and restart your connected containers if gluetun gets updated & restarted.

@rakbladsvalsen
Author

Thank you for the answer @qdm12.

It does indeed seem to be a Docker problem, just as you said, and unfortunately they seem a bit reluctant to discuss possible solutions for the issue. :(

For the time being, there's a temporary ugly, brutal, but 100% working fix. Maybe it would be worth mentioning in the wiki/docker-compose.yml example? There are some gotchas, though, since it completely replaces the original healthcheck command, and some images don't include either curl or wget. Currently I'm probing example.com every minute on the child containers attached to gluetun's network stack, and so far so good.

I just subscribed to deunhealth; it seems promising and probably even better than things like autoheal due to the network fix thing. I'll make sure to check it out in a week (or earlier, as you deem appropriate) and provide feedback/do some testing.

@qdm12
Owner

qdm12 commented Sep 27, 2021

Similar conversation in #504 to be concluded.

@ksbijvank

ksbijvank commented Oct 1, 2021

I have the same thing: when I restart gluetun, it doesn't want to start the containers within the same network_mode. The only difference is that I configured it with network_mode: 'container:VPN'.

I think when I restart or recreate the gluetun container it gets a different ID.

What would be the solution to this problem?

@oester

oester commented Oct 4, 2021

Stumbled across this issue while researching ways to restart dependent containers once gluetun is recreated with a new image (via Watchtower). https://github.com/qdm12/deunhealth seems like it might work, but I wanted to make sure I understand the use case.

I have a number of services with:
network_mode: container:gluetun

However, when the gluetun container restarts, the dependent containers don't actually get marked unhealthy; they just lose connectivity.

I'm wondering if you've updated deunhealth yet to include this function.

@qdm12
Owner

qdm12 commented Oct 4, 2021

No sorry, but I'll get to it soon.

Ideally, there would be a way to re-attach the disconnected containers to gluetun without restarting them (I guess with Docker's Go API, since I doubt the docker CLI supports such a thing). That would work by marking each connected container with a label to indicate this network re-attachment.

If there isn't, I'll set up something to cascade the restart from gluetun to connected containers, probably using labels to avoid any surprise (mark gluetun as a parent container with a unique id, and mark all connected containers as child containers with that same id).

@rakbladsvalsen
Author

For the time being, if anyone wants a dirty, cheap solution, here's my current setup:

  autoheal:
   ... snip ...
  literallyanything:
    image: blahblah
    container_name: blahblah
    network_mode: service:gluetun
    restart: unless-stopped
    healthcheck:
      test: "curl -sf https://example.com  || exit 1"
      interval: 1m
      timeout: 10s
      retries: 1

This will only work with containers where curl is already preinstalled. There are docker images that include wget but not curl, in which case you can replace test command with wget --no-verbose --tries=1 --spider https://example.com/ || exit 1. You can also use qdm12's deunhealth instead of autoheal.
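To make the swap concrete, here is the same stanza with the wget variant mentioned above (image and service names are placeholders, just as in the original snippet):

```yaml
  literallyanything:
    image: blahblah
    container_name: blahblah
    network_mode: service:gluetun
    restart: unless-stopped
    healthcheck:
      # wget-based probe for images that ship wget but not curl
      test: "wget --no-verbose --tries=1 --spider https://example.com/ || exit 1"
      interval: 1m
      timeout: 10s
      retries: 1
```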

@oester

oester commented Nov 10, 2021

Any progress or resolution to this, either in gluetun or deunhealth?

@qdm12
Owner

qdm12 commented Nov 10, 2021

I have bits and pieces for it, but I am moving country + visiting family + starting a new job right now, so it might take at least 2 weeks for me to finish it up, sorry about that. But it's at the top of my OSS things-to-do list, so it won't be forgotten 😉

@nfribeiro

I'd also like to thank you for creating gluetun and to say this is a very good project.
Any progress on this?

@pau1h

pau1h commented Mar 7, 2023

Any update on this by any chance?

@iewl

iewl commented Mar 8, 2023 via email

@Manfred73

Any news or progress on this issue?

@vdrover

vdrover commented May 9, 2023

following

@knorrre

knorrre commented May 26, 2023

Since I also have this problem, I would like to report it here and follow any progress. Thank you!

@karserasl

Having the same behavior: when gluetun is recreated, every other container in the same network_mode needs to be restarted as well.

@goluftwaffe

goluftwaffe commented Apr 8, 2024

This seems to still be a problem with no 100% satisfying solution. I also have the problem that when I restart the server, I have to manually docker compose up the gluetun stack, because otherwise the other services never launch, failing with the error cannot join network of a non running container.

@qdm12 is this something you are planning to have a satisfying solution for on your side, or should we be looking for solutions elsewhere? Docker Compose has pretty much said it's not their problem and directed us towards Moby, but I will admit I don't really have enough of an understanding of all the moving parts here to tell exactly what Moby would actually need to add/fix to make this work.

Let me know if there's something I can do to help or an issue I can upvote, but I don't really want to spend time reading about all the moving parts around docker to understand exactly who needs to do what if I don't have to 😅.

Thanks for gluetun, other than this one inconvenience it has been excellent for the last couple months.

I'm using this guy https://github.com/cascandaliato/docker-restarter and it has been great for me!

@vdrover

vdrover commented Apr 8, 2024

The health check workaround works flawlessly for me.

@ioqy

ioqy commented Apr 9, 2024

This seems to still be a problem with no 100% satisfying solution. I also have the problem that when I restart the server, I have to manually docker compose up the gluetun stack, because otherwise the other services never launch, failing with the error cannot join network of a non running container.

I built myself a systemd service which runs 30 seconds after the docker service has started and starts all containers with the cannot join network of a non running container error message: https://github.com/ioqy/docker-start-failed-gluetun-containers

@Enduriel

@ioqy This seems like a fundamentally better approach to me, thanks for the link.

@begunfx

begunfx commented May 1, 2024

This can easily be solved using a native healthcheck in Docker; there's no need for a third-party application to monitor the health of your containers. For example, if gluetun is restarted, dependent containers will lose network connectivity. To get around this, your healthcheck can periodically monitor external connectivity, then kill the main pid (1) if no connectivity, thus killing the container. If you use a restart: always configuration, Docker will then recreate the container, which reconnects it to gluetun. This will happen in a loop until connectivity is re-established to google.com in the example below. Here's a sample docker-compose configuration:

version: "3"

services:
  mycontainer:
    image: namespace/myimage
    container_name: mycontainer
    restart: always
    healthcheck:
      test: "curl -sfI -o /dev/null --connect-timeout 10 --retry 3 --retry-delay 10 --retry-all-errors https://www.google.com/robots.txt || kill 1"
      interval: 1m
      timeout: 1m
    network_mode: "service:gluetun"
    depends_on:
      - gluetun

  gluetun:
    image: qmcgaw/gluetun
    container_name: gluetun
    cap_add:
      - NET_ADMIN
    devices:
      - /dev/net/tun:/dev/net/tun
    ports:
      - 8888:8888/tcp # HTTP proxy
      - 8388:8388/tcp # Shadowsocks
      - 8388:8388/udp # Shadowsocks
      - 8080:8080/tcp # gluetun
    volumes:
      - ${INSTALL_DIRECTORY}/config/gluetun:/config
    environment:
      - VPN_ENDPOINT_IP=${VPN_ENDPOINT_IP}
      - VPN_ENDPOINT_PORT=${VPN_ENDPOINT_PORT}
      - VPN_SERVICE_PROVIDER=${VPN_SERVICE}
      - VPN_TYPE=wireguard
      - WIREGUARD_PUBLIC_KEY=${WIREGUARD_PUBLIC_KEY}
      - WIREGUARD_PRIVATE_KEY=${WIREGUARD_PRIVATE_KEY}
      - WIREGUARD_ADDRESSES=${WIREGUARD_ADDRESSES}
      - DNS_ADDRESS=${DNS_ADDRESS}
      - UPDATER_PERIOD=12h
    restart: always

⚠️ Note: This assumes you have curl available in the running container dependent on gluetun

I tried this but the containers didn't restart. Am I missing something? Do all the different services dependent on gluetun have to be in the same docker compose file for this to work?

I can confirm that "curl" is working in this container

This is an example of one of my containers compose files using this "work around"

version: "2.1"
services:
  flaresolverr:
    # DockerHub mirror flaresolverr/flaresolverr:latest
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - CAPTCHA_SOLVER=${CAPTCHA_SOLVER:-none}
      - TZ=America/Los Angeles
    network_mode: "container:gluetun"
    healthcheck:
     test: "curl -sfI -o /dev/null --connect-timeout 10 --retry 3 --retry-delay 10 --retry-all-errors https://www.google.com/robots.txt || kill 1"
     interval: 1m
     timeout: 1m
    restart: always

@vdrover

vdrover commented May 1, 2024

You're missing some spaces in your healthcheck. Indentation must be exact. I had the same issue until I fixed those indents on the healthcheck.

@begunfx

begunfx commented May 1, 2024

Thanks for the help/feedback @vdrover. I ran all my containers through vs code as docker compose files and ran "format document". Hopefully that should fix it. I'll restart gluetun and see if that makes a difference.

@begunfx

begunfx commented May 1, 2024

@qdm12. I would recommend adding the results of this thread to the wiki. It seems like a pretty important issue that should be flagged to users when setting up this project.

@ToFu2244

ToFu2244 commented May 2, 2024

I do not think it is a good idea to curl a public website every minute or so. This causes unnecessary traffic for you and for the website (even if they have the bandwidth, as Google surely does), especially considering that there may be multiple containers connected to gluetun, all doing the same checks every minute. I solved this by simply checking the healthcheck address of gluetun itself.

  • For this to work, you will need to set the health check to listen to 0.0.0.0:9999 (Wiki)
  • For the healthcheck command, I used netcat instead of curl: nc -z localhost 9999 || kill 1 (it's a bit faster than curl)
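On the gluetun side, that first bullet corresponds to a compose fragment along these lines (HEALTH_SERVER_ADDRESS is the variable named later in this thread and in the wiki; the rest of the service definition is elided):

```yaml
  gluetun:
    image: qmcgaw/gluetun
    environment:
      # listen on all interfaces so containers sharing the network
      # namespace can probe localhost:9999
      - HEALTH_SERVER_ADDRESS=0.0.0.0:9999
```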

@oester

oester commented May 2, 2024

Could you provide an example healthcheck?

@ToFu2244

ToFu2244 commented May 2, 2024

To use the example from above:

version: "2.1"
services:
  flaresolverr:
    # DockerHub mirror flaresolverr/flaresolverr:latest
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr
    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - CAPTCHA_SOLVER=${CAPTCHA_SOLVER:-none}
      - TZ=America/Los Angeles
    network_mode: "container:gluetun"
    healthcheck:
     test: "nc -z localhost 9999 || kill 1"
     interval: 1m
     timeout: 1m
    restart: always

This requires netcat (nc) to be present in the container. Since flaresolverr is connected with container:gluetun, the target server can be specified as localhost, and port 9999 is the one you have to configure with the environment variable of the gluetun container itself (HEALTH_SERVER_ADDRESS="0.0.0.0:9999"). But the healthcheck can be copied "as is" without modification to any container you connect to gluetun.

It can be tested from the Docker host by running the following command inside the container you have connected to gluetun:
sudo docker exec mycontainer nc -z -v localhost 9999

@cybermcm

cybermcm commented May 2, 2024

test: "nc -z localhost 9999 || kill 1"

@Babadabupi: excellent idea, thank you

@begunfx

begunfx commented May 2, 2024

@Babadabupi or someone else here: can you please provide some guidance on how to add Linux commands/tools so they are accessible in Docker? curl works, but netcat isn't found, and I can't seem to figure out how to make it accessible. I'm running Docker (and Portainer) on a Synology server running the latest DSM (7.2, I think). Thanks!

@cybermcm

cybermcm commented May 2, 2024

@begunfx: If you can't use the command inside the container, then nc isn't installed in it. You can add it (e.g. for a Debian-based container; if you have another Linux image, change bash and apt as necessary):
docker exec -it [containername] bash
apt-get update && apt-get install -y netcat
Just remember that if you update the container, the installed netcat will be gone. So you can write a bash script to do this, or build your own image with netcat installed by default.

@begunfx

begunfx commented May 2, 2024

Thanks for the response, insight and suggestions @cybermcm. I need to have netcat available to all my containers that connect to gluetun permanently. So what would you recommend for this? Is there a way to have a docker compose file execute a bash script? I found this link that talks about it a bit, but I'm not too clear if this is the best way to go: https://stackoverflow.com/questions/57840820/run-a-shell-script-from-docker-compose-command-inside-the-container

Or is there a way I can just install a container that has netcat in it and have other docker containers use it to run netcat? I am able to install Ubuntu as a docker container, but I'm not clear on how to share resources. I did set the Ubuntu container to use the gluetun container as its network, so all my containers that need access to it are in the same network, from what I understand (but this didn't work).

Update: I did find the following docker compose command that seems to do the trick to execute a shell script:
command: /bin/bash -c "init_project.sh"

I found it at this post:
https://forums.docker.com/t/cant-use-command-in-compose-yaml/127427

Something like that?

@ToFu2244

ToFu2244 commented May 3, 2024

For the time being, if anyone wants a dirty, cheap solution, here's my current setup:

  autoheal:
   ... snip ...
  literallyanything:
    image: blahblah
    container_name: blahblah
    network_mode: service:gluetun
    restart: unless-stopped
    healthcheck:
      test: "curl -sf https://example.com  || exit 1"
      interval: 1m
      timeout: 10s
      retries: 1

This will only work with containers where curl is already preinstalled. There are docker images that include wget but not curl, in which case you can replace test command with wget --no-verbose --tries=1 --spider https://example.com/ || exit 1. You can also use qdm12's deunhealth instead of autoheal.

Of course, if curl (or wget) is available, you can still use it to achieve the same end result. For example, with the commands mentioned by @rakbladsvalsen:
curl -fs localhost:9999 || kill 1
wget --no-verbose --tries=1 --spider localhost:9999 || kill 1

I would advise against installing anything inside the container. It is possible, but containers are meant to be ephemeral.

@begunfx

begunfx commented May 3, 2024

Thanks @Babadabupi. Using curl or wget when available makes things a lot easier than trying to add netcat. I added @qdm12's deunhealth container as well. Really appreciate the feedback and suggestions. Thanks to everyone on this thread!

@begunfx

begunfx commented May 3, 2024

Okay, so I added the deunhealth container and the healthcheck suggestions from here. For some reason, if I update the gluetun container, the other containers in the gluetun network stop but don't start again. Is it because the gluetun stack is restarting from an update and not from an unhealthy state? If so, is there a way to correct this? Or does it make more sense to leave it as is if the exit status is normal?

To add a little more to this: after the dependent containers stop, I have to re-run their stacks (2x) to get them to start up again and deploy. I'm assuming it's because something changed with gluetun when it restarted, possibly a network IP address etc.

This is my gluetun container setup:

services:
  gluetun:
    image: qmcgaw/gluetun:latest
    container_name: gluetun
    cap_add:
      - NET_ADMIN
    volumes:
      - /volume1/docker/gluetun:/gluetun
    environment:
      - deunhealth.restart.on.unhealthy=true
      - HEALTH_SERVER_ADDRESS=0.0.0.0:9999 # no quotes here, or they become part of the value
      - HEALTH_TARGET_ADDRESS="cloudflare.com:443"
      - VPN_SERVICE_PROVIDER=private internet access
      - OPENVPN_USER=${OPENVPN_USER}
      - OPENVPN_PASSWORD=${OPENVPN_PASSWORD}
      - SERVER_REGIONS=US California,Us Las Vegas,Us Seattle,US West,US West Streaming Optimized,US Silicon Valley
    ports:
      - 8191:8191 #flaresolverr

This is one of my dependent containers setup:

version: "2.1"
services:
  flaresolverr:
    # DockerHub mirror flaresolverr/flaresolverr:latest
    image: ghcr.io/flaresolverr/flaresolverr:latest
    container_name: flaresolverr

    environment:
      - LOG_LEVEL=${LOG_LEVEL:-info}
      - LOG_HTML=${LOG_HTML:-false}
      - CAPTCHA_SOLVER=${CAPTCHA_SOLVER:-none}
      - TZ=America/Los Angeles
    network_mode: "container:gluetun"
    healthcheck:
      test: "curl -fs localhost:9999 || kill 1"
      interval: 1m
      timeout: 1m
    restart: always

Just want to make sure I'm not missing something in my current setup. Thanks!

@GordonFreemanK

GordonFreemanK commented Jul 31, 2024

Is there a way to work around this that doesn't require restarting containers?
What if I create a network just for gluetun and the containers I want to connect through gluetun, set a fixed IP within the network for at least the gluetun container, then use iptables inside the dependent service to ensure outbound traffic always goes through that fixed IP?

@Xitee1

Xitee1 commented Aug 5, 2024

Wouldn't it be possible for gluetun to reconnect without restarting the container, to fix the problem?

Anyway, I have yet another workaround (because I don't like having custom health checks or having another container that restarts the other ones).
It's based on @ThorpeJosh's answer and on this reddit comment. It uses docker events to execute a command that restarts all containers depending on gluetun whenever the gluetun container's status changes to healthy.

Here's the script:

#!/bin/bash

# Delay to let everything start, to prevent restarting everything right after boot (after boot the container state will also change to "healthy")
sleep 120

# Listen for docker health_status events for the gluetun container. If its state changes to
# healthy (which means gluetun has a connection again), restart the dependent containers
# selected by the label filter.
docker events --filter 'event=health_status' | while read -r line; do
  if [[ ${line} = *"container health_status: healthy"* ]] && [[ ${line} = *"com.docker.compose.service=gluetun"* ]]; then
    docker restart $(docker ps -q --filter "label=com.docker.compose.depends_on=gluetun:service_started:true")
  fi
done
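The string matching in that loop can be factored into a small helper, which lets you check the filter logic without a running Docker daemon (the function name is mine; this is just a sketch of the same two substring tests):

```shell
#!/bin/sh
# Succeeds only for docker-events lines that report the compose service
# "gluetun" transitioning to healthy (same substring checks as above).
is_gluetun_healthy_event() {
  case "$1" in *"container health_status: healthy"*) ;; *) return 1 ;; esac
  case "$1" in *"com.docker.compose.service=gluetun"*) ;; *) return 1 ;; esac
  return 0
}
```

The main loop then becomes `if is_gluetun_healthy_event "$line"; then ... fi`.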

@enchained

enchained commented Aug 17, 2024

then kill the main pid (1) if no connectivity, thus killing the container
healthcheck: test: "...something... || kill 1"

What would be the best way to use a timeout before kill, the way docker stop -t does it?

docker stop has a default timeout of 10s, this means that it sends SIGTERM, waits 10s, then sends SIGKILL. My qbittorrent usually takes more than 10s to save state, which results in hours of restoring on the next start. If I stop it with -t 120, it takes about 20-40s to stop (it doesn't wait whole 120 which is convenient).

How would I achieve the same behavior in bash?

UPD: Looks like kill already sends only SIGTERM by default? Does that mean it has an infinite timeout, and the restart will happen once the process has fully finished exiting?

UPD2: Found this:
HEALTHCHECK --interval=5m --timeout=2m --start-period=45s \
  CMD curl -f --retry 6 --max-time 5 --retry-delay 10 --retry-max-time 60 "http://localhost:8080/health" || bash -c 'kill -s 15 -1 && (sleep 10; kill -s 9 -1)'

The important step to understand here is that the retry logic is self-contained in the curl command, the Docker retry here actually is mandatory but useless. Then if the curl HTTP request fails 3 times, then kill is executed. First it sends a SIGTERM to all the processes in the container, to allow them to gracefully stop, then after 10 seconds it sends a SIGKILL to completely kill all the processes in the container. It must be noted that when the PID1 of a container dies, then the container itself dies and the restart policy is invoked.
Gotchas: kill behaves differently in bash than in sh. In bash you can use -1 to signal all the processes with PID greater than 1 to die.

Could be worth adding --retry-connrefused as option to the curl command. Otherwise if the server isn't up for some reason curl will fail on first try
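Put together as a compose healthcheck, the pattern from UPD2 might look like this (illustrative values; assumes bash and curl exist in the container, probes gluetun's health server on localhost:9999 as suggested earlier in the thread, and allows 120 s of graceful shutdown before SIGKILL):

```yaml
    healthcheck:
      test: 'curl -fs --retry 3 --retry-delay 10 --retry-connrefused localhost:9999 || bash -c "kill -s 15 -1 && (sleep 120; kill -s 9 -1)"'
      interval: 1m
      # timeout must exceed the 120 s grace period so the kill branch can finish
      timeout: 3m
```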

@enchained

I have a container monitoring gluetun for ip leaks or dns failures as sometimes it locks up, leaks ip, etc.

@ThorpeJosh Could you please elaborate on this? What do you use for such monitoring?

@enchained

enchained commented Aug 17, 2024

Since gluetun is now able to auto-heal most minor issues without fully restarting itself, looks like this issue is still relevant only in those specific cases:

  • A. gluetun container was restarted manually - solved by using all-in-one compose for restarts, or see B for other solutions.
  • B. gluetun encountered major issue and was not able to auto-heal, so it restarted itself (recent port forwarding loop crashed is why I am here) - most of the solutions will help, with the exception of 1, 9 and 10.
  • C. gluetun container was recreated (watchtower) - see solutions: docker-restarter (7) and Notifiarr/dockwatch (11), or follow work in progress on deunhealth (6) and gluetun self-update feature (9).
  • D. host machine was rebooted and there's a cannot join network of a non running container error on start - see solutions: autoheal (5) and especially ioqy (10). Other solutions that could help in theory: depends_on (1), bash script on cron (2), unhealthy_container_restarter (4), deunhealth (6), docker-restarter (7), Notifiarr/dockwatch (11). Also, does anyone know whether using compose helps with this, or do services still start independently on reboot?

I'm collecting all the info I could find in one place. Please feel free to correct it, as I'm not very experienced in this. In the future I hope the gluetun wiki can include some of this information.

Basic info

Healthchecks:

Many solutions require a healthcheck, either set on all child containers, or on a single container whose main purpose is to oversee the network condition. The latter is especially useful if you have a distroless container that can't run a healthcheck, or if you have a lot of containers under gluetun. Solutions that do not require a healthcheck: bash script on cron (2), or the alternative solution based on 4.

Commonly, a curl of some website is performed for a healthcheck. But, as mentioned here, this causes unnecessary traffic for you and for the website, since every container fetches it every minute. Checking localhost:9999 (gluetun's health server) instead might be more effective in our case.

Depending on container image, some commands may not work. Collected everything from the thread:

  • nc -z localhost 9999 (nc is a bit faster than curl)
  • curl -sfI -o /dev/null --connect-timeout 10 --retry 3 --retry-delay 10 --retry-connrefused --retry-all-errors <URL> (retries avoid minor network issues)
  • wget --no-verbose --tries=1 --spider <URL>
  • maybe you can also try ping?
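As a sketch, a child container sharing gluetun's network namespace could run such a check against gluetun's health server in compose (assuming the image ships nc; service name and timings are illustrative):

```yaml
services:
  xyz:
    network_mode: "service:gluetun"
    healthcheck:
      # localhost here is gluetun's namespace, so this reaches the
      # health server on port 9999 without generating external traffic
      test: ["CMD-SHELL", "nc -z localhost 9999 || exit 1"]
      interval: 1m
      timeout: 10s
      retries: 3
```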

Exiting containers with grace

Some containers, like qbittorrent and databases, might be sensitive to the default docker stop/restart/compose down/etc. behaviour. Docker commands have a default timeout of 10 seconds, so if that's not enough for you, use the -t N argument to specify a grace period, i.e. the maximum number of seconds it will wait. I use -t 120 with qbittorrent, but it does not introduce a noticeable delay since it usually quits much faster. It can be applied to most docker commands related to stopping containers.

Some ways use kill 1 to shut down the container. With kill, SIGTERM is sent by default, so this might not be an issue, but I did not test it. Here I mentioned a way to somewhat reproduce docker stop's signalling while using kill:
kill -s 15 -1 && (sleep 10; kill -s 9 -1)

First it sends a SIGTERM to all the processes in the container, to allow them to gracefully stop, then after 10 seconds it sends a SIGKILL to completely kill all the processes in the container.

Notifications

As a bonus - I found dolce - it can notify you about container events to email, discord, telegram, slack, mattermost and apprise. Useful to track if the fix is working for you as intended.
One of the solutions - willfarrell/autoheal - also has notifications set up via webhook (to use with Discord etc.)

Solutions list, from basic to advanced:

Transparent in-house solutions:

1. Natively

  • (pending) --exit-on-unhealthy was planned, but something did not work out. But maybe someday.

  • (needs research) There's also existing support for restart on depends_on. It allows you to declare a dependency service which, when restarted, also triggers a restart of the configured service. This typically applies to containers sharing namespaces. But I'm not sure how useful it might be, because (from here):

If I understand well this feature, Docker won't restart the container if gluetun is unhealthy, but only if gluetun is restarted by a compose operation:

restart (https://docs.docker.com/compose/compose-file/05-services/#long-syntax-1): When set to true Compose restarts this service after it updates the dependency service. This applies to an explicit restart controlled by a Compose operation, and excludes automated restart by the container runtime after the container dies. Introduced in Docker Compose version 2.17.0.
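Assembled into a compose fragment, the long syntax from the quote above might look like this (a sketch for Compose v2.17.0 or newer; remember that restart: true only covers restarts driven by Compose operations, not runtime restarts):

```yaml
services:
  xyz:
    network_mode: "service:gluetun"
    depends_on:
      gluetun:
        # wait for gluetun's healthcheck before starting, and restart xyz
        # when a Compose operation restarts or updates gluetun
        condition: service_healthy
        restart: true
```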


2. Xitee1's workaround

It avoids having custom health checks or having another container that restarts the other ones.
It uses docker events to execute a command that restarts all containers that depend on gluetun whenever the gluetun container status changes to healthy.
It is a bash script that needs something on the host to run it, a cronjob for example. It looks like it needs you to set this on child services:

depends_on:
  gluetun:
    condition: service_started
    restart: true

or you can use any other docker compose related labels from docker inspect <container-name> for your script filter.

The script:

#!/bin/bash

# delay to let everything start in order to prevent restarting everything right after boot (because after boot the container state will also change to "healthy")
sleep 120

# Listen for docker "healthy" for the gluetun container. If state changes to healthy (which means gluetun has a connection again), restart the dependent containers based on the label filter.
docker events --filter 'event=health_status' | while read line; do if [[ ${line} = *"container health_status: healthy"* ]] && [[ ${line} = *"com.docker.compose.service=gluetun"* ]]; then docker restart $(docker ps -q --filter "label=com.docker.compose.depends_on=gluetun:service_started:true"); fi; done

3. Healthcheck || kill internally + restart always

Follow up your healthcheck with || kill 1, like suggested here and here. Then add restart: always policy to the same container. Then it will restart every time its healthcheck fails.

This one also doesn't need extra dependencies or access to the docker socket. But I think in some cases restarting via docker from the outside might be cleaner if the container requires something specific. You can also see the modified kill command under the Basic info - Exiting containers with grace section above.
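Put together, a minimal compose sketch of this approach (assuming gluetun's health server on port 9999 and an image that ships nc; service name and timings are illustrative):

```yaml
services:
  xyz:
    network_mode: "service:gluetun"
    restart: always
    healthcheck:
      # if gluetun's health server is unreachable, kill this container's
      # PID 1 so it exits and the restart policy brings it back up
      test: ["CMD-SHELL", "nc -z localhost 9999 || kill 1"]
      interval: 1m
      timeout: 10s
```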


The ones using some external docker image and docker socket:

4. Basic custom container to oversee others (from stackoverflow)

It has access to the docker socket, checks the list of all containers for unhealthy ones every 60 seconds, then runs docker restart. You can apply a stop timeout with -t if you need to, and customize it to do whatever you like.

unhealthy_container_restarter:
  image: docker:cli
  network_mode: none
  cap_drop:
    - ALL
  volumes: [ "/var/run/docker.sock:/var/run/docker.sock" ]
  command: [ "/bin/sh", "-c", "while true; do sleep 60; docker ps -q -f health=unhealthy | xargs --no-run-if-empty docker restart; done" ]
  restart: unless-stopped

In theory, if you have a lot of containers, you can attach this one to gluetun's network and have it check the connection to gluetun (localhost:9999) itself instead of putting healthchecks on every container, then restart everything on the same network except gluetun.


5. willfarrell/autoheal

This container keeps tabs on the health states of all containers, or only labeled ones. It uses the docker socket, checks at intervals, and has customizable timings (including the stop timeout I mentioned before). It also has a webhook (useful for Discord notifications). It might have issues with auto-updating containers (watchtower), so it's better to update and restart manually, or restart everything within the same compose at once.
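A minimal compose sketch (environment variable and label names as I recall them from autoheal's README - double-check against the project docs; the xyz service is illustrative):

```yaml
services:
  autoheal:
    image: willfarrell/autoheal
    restart: always
    environment:
      - AUTOHEAL_CONTAINER_LABEL=autoheal   # watch only labeled containers ("all" watches everything)
      - AUTOHEAL_DEFAULT_STOP_TIMEOUT=120   # grace period for sensitive apps like qbittorrent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  xyz:
    labels:
      - autoheal=true   # opt this container in to being restarted
```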


6. qdm12/deunhealth

For those wondering about the differences from willfarrell/autoheal, they're listed here. In short, it's safer because there's no OS (based on scratch) and no network. It streams events, so there is no check period: it automagically detects unhealthy containers at the same time as the Docker daemon does. It also needs a label added to a child container to work.

  • Faster issue detection via docker events makes it more sensitive than willfarrell/autoheal, which can lead to problems if you apply it to gluetun itself (or some other container with brief unhealthy blips), so periodic checks might suit some setups better. But I wouldn't apply deunhealth to gluetun anyway, since gluetun already heals and restarts fine on its own.
  • It also has issues with watchtower, there's a solution work-in-progress.
  • It also does not have a way to set a stop timeout, so I made a request.
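A minimal compose sketch (image and label name as I recall them from the deunhealth README - verify before use; the xyz service is illustrative):

```yaml
services:
  deunhealth:
    image: qmcgaw/deunhealth
    network_mode: "none"   # deunhealth needs no network, only the socket
    restart: always
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  xyz:
    labels:
      - deunhealth.restart.on.unhealthy=true   # opt-in label for restarts
```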

Its roadmap also has nice features like "Trigger mechanism such that a container restart triggers other restarts" and "Inject pre-build binary doing a DNS lookup to containers labeled for it and that do not have a healthcheck built in (useful for scratch based images without healthcheck especially)", so you might want to follow its releases in case qdm12 someday resumes working on it.


7. cascandaliato/docker-restarter

A container with access to the docker socket; it restarts containers based on events:

  • vpn container crashes or restarts on its own
  • vpn container gets replaced due to watchtower update (issue mentioned above)
  • torrent container becomes unhealthy

It has customizable timings, but I'm not sure whether it just checks periodically or also listens to docker events, since there are two different restart scenarios (dependency and unhealthy).
Currently there is no way to set a stop timeout, so I made a request.


9. Self update to avoid Docker restarts

(WIP) It was a planned gluetun feature to solve the watchtower issue.


10. ioqy/docker-start-failed-gluetun-containers

A systemd service (installed on the host OS) which runs 30 seconds after the docker service has started and starts all containers that failed with the cannot join network of a non running container error message. Mentioned here. It resolves the issue where, after a server reboot, you have to manually docker compose up the gluetun stack because the other services never launch, failing with cannot join network of a non running container.
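Purely as an illustration of the mechanism (this is not ioqy's actual unit file; the error-matching command is a hypothetical sketch), such a systemd service could look roughly like:

```ini
[Unit]
Description=Start containers that failed with "cannot join network"
After=docker.service
Requires=docker.service

[Service]
Type=oneshot
# give docker's own restart policies a head start
ExecStartPre=/bin/sleep 30
# start every container whose recorded error mentions the failed network join
ExecStart=/bin/bash -c 'for id in $(docker ps -aq); do docker inspect -f "{{.State.Error}}" "$id" | grep -q "cannot join network" && docker start "$id"; done; true'

[Install]
WantedBy=multi-user.target
```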


11. Notifiarr/dockwatch

Dockwatch is a container with docker socket access. It has a nice web ui to manage container updates and container-related notifications (via Notifiarr). It can auto-update or just check and notify. It can restart unhealthy containers, and automatically recognize if containers depend on specific network containers, for example Gluetun:

  • Restart Gluetun -> restart dependencies
  • Stop Gluetun -> stop dependencies
  • Update Gluetun -> re-create dependencies with updated network mode attached

Mentioned here.

@nathang21
Copy link

nathang21 commented Aug 17, 2024

That is a really nice summary thanks for organizing!

I just wanted to add another option that's been working for me:
#11 dockwatch: https://github.com/Notifiarr/dockwatch

It's a helpful tool similar to Watchtower, but it has native logic for VPN updates/restarts and specifically mentions gluetun. Somehow it knows, when gluetun updates, which other containers also need to be updated. In my case I have 3, and it appears to work correctly.

I use watchtower for all my auto updates except gluetun, and use dockwatch for gluetun updates. You can test it manually first with the UI before turning on auto updates.

@gmillerd
Copy link

How do people expect to restart the container ... the one their container stack's networking is routed through ... and still maintain connectivity? How were they doing this on bare metal? And suggesting patterns that upgrade and auto-restart containers - how is that a fix?

If you want docker to support some sort of load balanced / ha network_mode: service ... that's your RFE ... for docker.

@aaomidi
Copy link

aaomidi commented Aug 18, 2024

The issue is that after the container is restarted, the child containers do not regain connectivity.

No one here is talking about a HA network model.

@ThorpeJosh
Copy link

I have a container monitoring gluetun for ip leaks or dns failures as sometimes it locks up, leaks ip, etc.

@ThorpeJosh Could you please elaborate on this? What do you use for such monitoring?

I have a container attached to gluetun that fetches gluetun's public ip constantly. If the ip is leaked, or the internet is unreachable, then it restarts gluetun.

It's just a bash script that has access to the Docker daemon (via proxy). It also sends notifications/alerts via apprise, and for internet connection issues it has timeouts and retry-backoff strategies too.
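A minimal sketch of that idea (hypothetical names; the real script, IP endpoint, and restart logic are ThorpeJosh's own): compare the tunnel's current public IP against the home IP and flag a match as a leak.

```shell
#!/usr/bin/env bash
# Hypothetical leak-check sketch. HOME_IP is the ISP address that must
# never show up as the tunnel's public IP (RFC 5737 documentation range here).
HOME_IP="${HOME_IP:-203.0.113.7}"

# Print LEAK if the current public IP equals the home IP, OK otherwise.
check_ip() {
  local current="$1" home="$2"
  if [ "$current" = "$home" ]; then
    echo "LEAK"
  else
    echo "OK"
  fi
}

# In a real monitor this would loop inside a container attached to gluetun,
# e.g. (untested, endpoint is an assumption):
#   current=$(curl -sf --max-time 10 https://ifconfig.me) || current=""
#   [ "$(check_ip "$current" "$HOME_IP")" = "LEAK" ] && docker restart gluetun
```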

I've now moved to using opnsense firewall to block and log when gluetun leaks traffic outside of its wireguard tunnel.

@ioqy
Copy link

ioqy commented Aug 28, 2024

@ThorpeJosh did you ever encounter any leaks with gluetun?

Unless the firewall rules that gluetun is setting are wrong, a leak should not be possible, and your checks are unnecessary because gluetun already goes unhealthy when it loses its internet connection.

@ThorpeJosh
Copy link

@ThorpeJosh did you ever encounter any leaks with gluetun?

Unless the firewall rules that gluetun is setting are wrong, a leak should not be possible, and your checks are unnecessary because gluetun already goes unhealthy when it loses its internet connection.

Yes, but nothing in the last 9-12 months. There were a couple of instances in the past shortly after gluetun start up where attached services could get internet access outside of the tunnel. I remember reading similar gh issues from others at the time.
