pot: [BUG] Pot nomad logic leaves dying jails behind [suggested fix]

Describe the bug

When running pots under nomad, dying jails are left behind forever.

Example output from my test jailhost:

# jls -d
   JID  IP Address      Hostname                      Path
     1                  testwww1_accb2b46-b3ff-b1b4-2 /opt/pot/jails/testwww1_accb2b46-b3ff-b1b4-2de2-aa78efd8864a/m
     2                  testwww1_81126bd6-eea4-f624-f /opt/pot/jails/testwww1_81126bd6-eea4-f624-fc5b-dc1400624245/m
     3                  testwww1_1d6bc5f5-99bb-aab3-e /opt/pot/jails/testwww1_1d6bc5f5-99bb-aab3-e860-a6ab2a985017/m
     4                  testwww1_1d6bc5f5-99bb-aab3-e /opt/pot/jails/testwww1_1d6bc5f5-99bb-aab3-e860-a6ab2a985017/m
     5                  testwww1_95735a23-a906-b82d-4 /opt/pot/jails/testwww1_95735a23-a906-b82d-4255-4ab3b10858b1/m
     6                  testwww1_b2d5f3ec-e576-15e2-9 /opt/pot/jails/testwww1_b2d5f3ec-e576-15e2-965b-12183644e79a/m
     7                  testwww1_54719fef-9e0f-9c45-4 /opt/pot/jails/testwww1_54719fef-9e0f-9c45-4deb-2ec525ce6bf3/m
     8                  testwww1_4d7344e5-df72-c3d4-8 /opt/pot/jails/testwww1_4d7344e5-df72-c3d4-8988-44092068b6b3/m
     9                  testwww1_1fab71b1-d296-505c-a /opt/pot/jails/testwww1_1fab71b1-d296-505c-a08e-8060da916135/m
    10                  testwww1_102b9ef8-50f9-b896-f /opt/pot/jails/testwww1_102b9ef8-50f9-b896-f89d-fc4715def9e7/m
    11                  testwww1_108bc5ab-0ff4-9bd8-9 /opt/pot/jails/testwww1_108bc5ab-0ff4-9bd8-96b4-0400639c6de0/m
    12                  testwww1_e691ef0c-b490-8792-5 /opt/pot/jails/testwww1_e691ef0c-b490-8792-50d1-cf734f3c15cf/m
    13                  testwww1_d5ebb5cf-3f13-826c-c /opt/pot/jails/testwww1_d5ebb5cf-3f13-826c-c140-2242a03fe4d8/m
    14                  testwww1_754910e9-7b3b-359d-7 /opt/pot/jails/testwww1_754910e9-7b3b-359d-7d1c-e333c00b881a/m
    15                  testwww1_a4a40168-a950-91ba-2 /opt/pot/jails/testwww1_a4a40168-a950-91ba-2f44-cdb4995fc145/m
    16                  testwww1_3af58a84-79ce-fd90-8 /opt/pot/jails/testwww1_3af58a84-79ce-fd90-829b-a8b53b1d09cc/m
    17                  testwww1_3af58a84-79ce-fd90-8 /opt/pot/jails/testwww1_3af58a84-79ce-fd90-829b-a8b53b1d09cc/m
    18                  testwww1_882467c3-80c7-9d75-3 /opt/pot/jails/testwww1_882467c3-80c7-9d75-39c4-644c0f939cfb/m
    19                  testwww1_2ba86885-ce95-9347-3 /opt/pot/jails/testwww1_2ba86885-ce95-9347-35c5-c22c347714e3/m
    20                  testwww1_b6d1b9fe-57dc-3a30-6 /opt/pot/jails/testwww1_b6d1b9fe-57dc-3a30-649d-dfa21879f015/m
    21                  testwww1_8376faa2-6851-07b0-5 /opt/pot/jails/testwww1_8376faa2-6851-07b0-5307-9b32d3a4266a/m

To Reproduce

Steps to reproduce the behavior:

  • Create a pot that runs a couple of services (e.g., postgresql, nginx, and something else)
  • Start the pot using nomad
  • Stop the pot using nomad
  • Repeat the last two steps a couple of times
  • Observe the dying jails using jls -d (see the loop sketched after this list). Normally they would stick around for a couple of minutes, but here they stay around forever.
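
A small watch loop makes the “forever” part easy to see; something like this (just counting the jls -d output lines) never shows the count dropping:

        # jls -d lists dying jails in addition to the running ones; normally
        # the dying entries disappear after a few minutes, here the count
        # only ever grows
        while : ; do
                date
                jls -d | wc -l
                sleep 60
        done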

Expected behavior

Dying jails disappear after a while.

Additional context

I suspect the problem stems from the logic of how the nomad-pot-driver and pot interact.

The logic seems to be:

  1. nomad-pot-driver calls pot start
  2. pot start runs jail, using exec.start=/tmp/tinirc, which runs some program in the foreground (e.g. nginx, or in a more complex setup simply tail -f /dev/null, with the actual services running in the background)
  3. nomad stop service makes the nomad-pot-driver call pot stop
  4. pot stop removes the jail (which is still in the process of starting!), sleeps one second, then removes the epair interfaces.
  5. Meanwhile, the still-running pot start process finishes starting the jail (which was just stopped while it was still starting), sleeps for one second, then runs pot stop again and destroys the epair interfaces.

My suspicion is that this overlapping of start and stop leaks some resource (probably a network resource), which leaves the jails stuck in the “dying” state forever.
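
The overlap can also be provoked by hand, without nomad involved; something along these lines should hit the same race (pot name illustrative):

        pot start testwww1 &    # blocks while exec.start runs, so background it
        sleep 1                 # give jail -c a moment to create the jail
        pot stop testwww1       # tear the jail down while the start is still in progress
        wait                    # the backgrounded pot start now runs its own stop/cleanup
        jls -d                  # the jail should now linger in the dying state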

This is the code in question:

        jail -c -J "/tmp/${_pname}.jail.conf" $_param exec.start="$_cmd"
        sleep 1
        if ! _is_pot_running "$_pname" ; then
                start-cleanup "$_pname" "${_iface}"
                if [ "$_persist" = "NO" ]; then
                        return 0
                else
                        return 1
                fi
        fi

If I change

  • tinirc to be non-blocking (just starts some daemons) and
  • the code to start a jail like below:
        jail -c -J "/tmp/${_pname}.jail.conf" $_param exec.start="$_cmd"
        sleep 1
        if ! _is_pot_running "$_pname" ; then
                start-cleanup "$_pname" "${_iface}"
                if [ "$_persist" = "NO" ]; then
                        return 0
                else
                        return 1
                fi
        fi
        jexec "$_pname" tail -f /dev/null

then the jails left behind dying actually disappear after a while. My theory is that stopping a fully started jail prevents the resource leak (or that it is something in the code that destroys the interfaces).

For my images this works just fine (my /tmp/tinirc would end in “tail -f /dev/null” anyway, and the services I start in it keep the jail up and running).

In a more generalized setup, where users might just want to start one little thing, a different approach could be used, e.g.:

  1. Run “sleep 10&” in /tmp/tinirc
  2. Run the actual command instead of “tail -f /dev/null” using jexec

Running the actual command could also be done via tinirc, by making it accept a parameter. This would probably also keep it backwards compatible with old, blocking tinirc scripts (as the jexec would never be reached when running those).

Example jail start code (untested):

        jail -c -J "/tmp/${_pname}.jail.conf" $_param exec.start="$_cmd init"
        sleep 1
        if ! _is_pot_running "$_pname" ; then
                start-cleanup "$_pname" "${_iface}"
                if [ "$_persist" = "NO" ]; then
                        return 0
                else
                        return 1
                fi
        fi
        jexec "$_pname" $_cmd

Example tinirc script (untested):

export "PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/sbin:/bin"
export "HOME=/"
export "LANG=C.UTF-8"
export "MM_CHARSET=UTF-8"
export "PWD=/"
export "RC_PID=24"
export "NOMAD_GROUP_NAME=group1"
export "NOMAD_MEMORY_LIMIT=64"
export "NOMAD_CPU_LIMIT=200"
export "BLOCKSIZE=K"
export "NOMAD_TASK_NAME=www1"
export _POT_NAME=testwww1_1191b963-a72a-b6be-a3eb-0a6201b10ac2
export _POT_IP=10.192.0.16

case $1 in
  init)
    ifconfig epair0b inet 10.192.0.16 netmask 255.192.0.0
    route add default 10.192.0.1
    ifconfig lo0 inet 127.0.0.1 alias
    sleep 10&
    exit 0
    ;;
esac

# could also be given an explicit "run" option
exec /usr/local/bin/somecommand

The code isn’t complete this way, of course (it would break “normal” pots), but I hope the concept makes sense.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (1 by maintainers)

Most upvoted comments

Here is a proposal for how we can implement your fix without breaking anything. When I want pots to be treated differently, I use attributes. You can define an attribute and, if it is true, run the jexec "$_pname" tail -f /dev/null that you need.
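
A rough sketch of how that could look at the end of the jail start code, assuming a hypothetical attribute name (nomad-attach here) and a hypothetical helper that reads it; neither name is pot’s current API:

        # _attr_is_on is a hypothetical helper that checks whether the pot has
        # the attribute enabled (set beforehand via pot set-attribute); pots
        # without the attribute keep the current behaviour
        if _attr_is_on "$_pname" nomad-attach ; then
                # stay attached to the fully started jail, so that the later
                # pot stop tears down a jail that has finished starting
                jexec "$_pname" tail -f /dev/null
        fi

That way the extra jexec only runs for pots that opt in, and everything else behaves exactly as before.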