Today, lemmy.amxl.com suffered an outage because the rootful Lemmy podman container crashed out, and wouldn’t restart.

Fixing it turned out to be more complicated than I expected, so I’m documenting the steps here in case anyone else has a similar issue with a podman container.

I tried restarting it, but got an unexpected error: the internal IP address (which I hand-assign to containers) was already in use, despite the fact that the container wasn’t running.

I create my Lemmy services with podman-compose, so I deleted them with podman-compose down and then re-created them with podman-compose up - that usually fixes things when they are really broken. But this time, I got a message like:

level=error msg="IPAM error: requested ip address 172.19.10.11 is already allocated to container ID 36e1a622f261862d592b7ceb05db776051003a4422d6502ea483f275b5c390f2"

The only problem was that the referenced container didn’t exist at all in the output of podman ps -a - in other words, podman thought the IP address was in use by a container it didn’t know anything about! The IP address had effectively been ‘leaked’.

After digging into the internals, and a few false starts trying to track down where the leaked allocation was kept, I found it in a BoltDB file at /run/containers/networks/ipam.db - that’s apparently the ‘IP allocation’ database. Now, the good thing about /run is that it is wiped on system restart - although I didn’t really want to reboot and restart all my containers just to fix Lemmy.
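As a sanity check before editing anything, you can dump that file’s bucket hierarchy read-only with a few lines of Go against the bbolt library. This is just a sketch of how I’d inspect it - the path and the bucket layout are simply what I found on my system, and it still holds a lock on the file while it runs, much like boltbrowser does:

```go
// Minimal read-only dump of podman's IPAM BoltDB, so you can spot a leaked
// allocation before editing anything. Needs go.etcd.io/bbolt and root
// privileges to read /run/containers/networks/ipam.db.
package main

import (
	"fmt"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

// dump recursively prints nested buckets and key/value pairs.
func dump(b *bolt.Bucket, indent string) error {
	return b.ForEach(func(k, v []byte) error {
		if v == nil {
			// A nil value means the key is a nested bucket, e.g. a network
			// name or a CIDR range.
			fmt.Printf("%s[bucket] %s\n", indent, k)
			return dump(b.Bucket(k), indent+"  ")
		}
		// Leaf entry, e.g. an IP mapped to a container ID.
		fmt.Printf("%s%s = %s\n", indent, k, v)
		return nil
	})
}

func main() {
	db, err := bolt.Open("/run/containers/networks/ipam.db", 0o600,
		&bolt.Options{ReadOnly: true, Timeout: 3 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			fmt.Printf("[bucket] %s\n", name)
			return dump(b, "  ")
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}
```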

BoltDB doesn’t come with a lot of tools, but you can install a TUI editor like this: go install github.com/br0xen/boltbrowser@latest.

I made a backup of /run/containers/networks/ipam.db just in case I screwed it up.

Then I ran sudo ~/go/bin/boltbrowser /run/containers/networks/ipam.db to open the DB (this locks the DB and stops any containers from starting or otherwise changing IP allocations until you exit).

I found the networks that were impacted and expanded the bucket for each one (BoltDB has a hierarchy of buckets, and eventually you get down to key/value pairs), then the bucket for the CIDR range the leaked IP was in. In that list, I found a record whose value was the ID of the container that didn’t actually exist. I pressed D to tell boltbrowser to delete that key/value pair. I also cleaned up under ids - where this time the key was the container ID that no longer existed - and repeated this for both networks my container was in.
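If you’d rather script the cleanup than drive a TUI, the same bbolt library can do the deletion. The sketch below walks every bucket in ipam.db and removes any entry whose key or value matches the stale container ID - which covers both the IP-to-container records and the entries under ids that I deleted by hand. It’s based purely on the layout I observed rather than any supported podman interface, so take a backup first and keep the affected containers down while it runs:

```go
// Remove every reference to a stale container ID from podman's IPAM BoltDB.
// Sketch only: back up /run/containers/networks/ipam.db first, and run as
// root. Needs go.etcd.io/bbolt.
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"time"

	bolt "go.etcd.io/bbolt"
)

// purge deletes entries in b (and its sub-buckets) that mention staleID,
// either as the key (the ids bucket) or as the value (the per-CIDR buckets).
// Keys are collected first because a bucket must not be modified while it is
// being iterated.
func purge(b *bolt.Bucket, staleID []byte) error {
	var doomed, nested [][]byte
	err := b.ForEach(func(k, v []byte) error {
		key := append([]byte(nil), k...) // copy; k is only valid during ForEach
		if v == nil {
			nested = append(nested, key) // nested bucket, recurse afterwards
		} else if bytes.Equal(k, staleID) || bytes.Equal(v, staleID) {
			doomed = append(doomed, key)
		}
		return nil
	})
	if err != nil {
		return err
	}
	for _, k := range doomed {
		fmt.Printf("deleting %s\n", k)
		if err := b.Delete(k); err != nil {
			return err
		}
	}
	for _, k := range nested {
		if err := purge(b.Bucket(k), staleID); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: ipam-purge <stale-container-id>")
	}
	staleID := []byte(os.Args[1])

	db, err := bolt.Open("/run/containers/networks/ipam.db", 0o600,
		&bolt.Options{Timeout: 3 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		// Collect top-level bucket names first, then clean each one, so no
		// bucket is modified while its parent is being iterated.
		var roots [][]byte
		if err := tx.ForEach(func(name []byte, _ *bolt.Bucket) error {
			roots = append(roots, append([]byte(nil), name...))
			return nil
		}); err != nil {
			return err
		}
		for _, name := range roots {
			if err := purge(tx.Bucket(name), staleID); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Run it as root with the container ID from the IPAM error message - in my case boltbrowser was enough, but this avoids clicking through the buckets by hand.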

I then exited out of boltbrowser with q.

After that, I brought my Lemmy containers back up with podman-compose up -d - and everything then worked cleanly.

    • JASN_DE@lemmy.world · 4 hours ago

      Works well for me, although with Docker. The labeling can be a bit non-intuitive at first, but it’s really solid.

      Also, I simply love autodiscovery.

    • Justin@lemmy.jlh.name · 4 hours ago

      I’ve inherited it on production systems before; automated service discovery and certificate renewal are definitely what admins should have in 2025. I thought the label/annotation system it used on Docker had some ergonomics/documentation issues, but nothing serious.

      It feels like it’s more meant for Docker/Podman though. On Kubernetes I use cert-manager and Gateway API+Project Contour. It does seem like Traefik has support for Gateway API too, so it’s probably a good choice for Kubernetes too?

        • Justin@lemmy.jlh.name · 3 hours ago

          Ah, interesting. What kind of customization are you using CoreDNS for? If you don’t have Ingress/Gateway API for your HTTP traffic, Traefik is likely a good option for adopting it.

          • horse_battery_staple@lemmy.world · 2 hours ago (edited)

            CoreDNS and an nginx reverse proxy are handling DNS, failover, and some other redirects. However, it’s not ideal, as it’s a custom implementation that a previous engineer set up.