finit: condition not reasserted after crash.

If a service crashes, the ready condition does not get set again after the restart, causing other daemons that rely on it to be paused.

# Start Service 
<service/running> is set
<service/ready> is set

# Crash the service in the background using kill command or similar
<service/running> is cleared
<service/ready> is cleared 

service_retry() will eventually print "Successfully restarted crashing service"
svc_set_state() clears all the conditions but then only sets "running". 

Which results in:
<service/running> is set
<service/ready> is cleared

This also occurs for systemd ready flags (the use case that I care about):

<service/running> is set
<service/ready> is set

# Service crashes. 
<service/running> is cleared
<service/ready> is cleared 

# Daemon asserts READY=1
<service/ready> gets set

# Service restart timeout triggers 
svc_set_state() clears all the conditions and only reapplies <service/running>
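The sequence above can be modelled with a small toy program. This is only a sketch of the behaviour as I described it, not finit's code: `struct toy_svc`, the `COND_*` flags and all `toy_*` functions are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two conditions; not finit's real types. */
enum { COND_RUNNING = 1 << 0, COND_READY = 1 << 1 };

struct toy_svc {
    unsigned conds; /* bitmask of currently asserted conditions */
};

/* Models the daemon asserting READY=1. */
void toy_notify_ready(struct toy_svc *svc)
{
    assert(svc);
    svc->conds |= COND_READY;
}

/* Models the crash: every condition of the service is cleared. */
void toy_crash(struct toy_svc *svc)
{
    assert(svc);
    svc->conds = 0;
}

/* Models the reported behaviour: clear everything, reassert only "running". */
void toy_set_running(struct toy_svc *svc)
{
    assert(svc);
    svc->conds = 0;
    svc->conds |= COND_RUNNING;
}

bool toy_is_ready(const struct toy_svc *svc)
{
    return (svc->conds & COND_READY) != 0;
}
```

Calling `toy_crash()`, then `toy_notify_ready()`, then `toy_set_running()` leaves only the running flag set, which matches the end state I see: running is set and ready is cleared.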

The fix isn’t obvious to me on this one. I feel like the service is going to need to keep track of its ready state somewhere, or perhaps, if svc->notify is set, svc_set_state(RUNNING) should not clear the ready condition.
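A rough sketch of the "keep track of the ready state" idea, covering only the notify case. Everything here is an assumption for illustration: `struct fix_svc`, its `notify` and `ready_seen` fields, and the `fix_*` functions are hypothetical and do not reflect how finit actually stores this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical condition flags; not finit's real representation. */
enum { FIX_COND_RUNNING = 1 << 0, FIX_COND_READY = 1 << 1 };

struct fix_svc {
    bool notify;     /* service reports readiness itself (assumed field) */
    bool ready_seen; /* READY=1 received from the current process (assumed) */
    unsigned conds;  /* bitmask of asserted conditions */
};

/* A crash invalidates both the conditions and the remembered readiness. */
void fix_crash(struct fix_svc *svc)
{
    assert(svc);
    svc->conds = 0;
    svc->ready_seen = false;
}

/* Remember the notification as well as asserting the condition. */
void fix_notify_ready(struct fix_svc *svc)
{
    assert(svc);
    svc->ready_seen = true;
    svc->conds |= FIX_COND_READY;
}

/* Reassert "running", and "ready" too if it was already reported. */
void fix_set_running(struct fix_svc *svc)
{
    assert(svc);
    svc->conds = FIX_COND_RUNNING;
    if (svc->notify && svc->ready_seen)
        svc->conds |= FIX_COND_READY;
}
```

The point of clearing `ready_seen` on a crash is that a READY=1 from the previous process must not carry over to the restarted one.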

Edit: Wild thought. You could add an additional SVC_STATE too: `Halted -> Waiting -> Preparing (only valid for notify services) -> Running`
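To make the extra-state idea a bit more concrete, here is a minimal sketch of the transitions. The enum and `sketch_next()` are hypothetical and only follow the state names in my suggestion; they are not finit's actual state machine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states named after the suggestion; not finit's real enum. */
enum sketch_state { SK_HALTED, SK_WAITING, SK_PREPARING, SK_RUNNING };

/*
 * Advance one step. `notify` marks a service that reports readiness itself,
 * `ready` is true once READY=1 has arrived from the current process.
 */
enum sketch_state sketch_next(enum sketch_state cur, bool notify, bool ready)
{
    switch (cur) {
    case SK_HALTED:
        return SK_WAITING;
    case SK_WAITING:
        /* Only notify services pass through the extra state. */
        return notify ? SK_PREPARING : SK_RUNNING;
    case SK_PREPARING:
        /* Stay here until the daemon has reported that it is ready. */
        return ready ? SK_RUNNING : SK_PREPARING;
    case SK_RUNNING:
        return SK_RUNNING;
    }
    assert(!"unknown state");
    return cur;
}
```

With this shape a notify service would only enter Running once it has reported ready, so the transition to Running would have no earlier ready condition to clear.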

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

Messed around with MyLinux tonight to hopefully help reproduce this without having to deal with censoring logs. Running finit master.

Minimal code to reproduce:

# cat services.conf 
service notify:systemd /usr/bin/main
service <service/main/ready> /usr/bin/second

# cat main.c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

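/* Send READY=1 on the descriptor number found in NOTIFY_SOCKET, if any. */
void notify(void) {
    const char *sock = getenv("NOTIFY_SOCKET");
    int do_notify = 0;  /* 0 means there is no descriptor to notify on */

    if (sock)
        do_notify = atoi(sock);

    if (do_notify)
        write(do_notify, "READY=1\n", 8);
}

/* Re-send the ready notification on SIGHUP. */
void sig_handler(int signum) {
    (void)signum;
    notify();
}

int main(void) {
    signal(SIGHUP, sig_handler);
    notify();
    /* Idle forever so the service stays up until it is killed. */
    while (1) {
        sleep(10);
    }
}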

# cat second.c 
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    while (1) {
        sleep(10);
    }
}

Some logs that show off the behaviour.

# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
85   main     running  [--234-----] main
96   second   running  [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
96   second   on      <+service/main/ready>
# 
# 
# initctl reload
# 
# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
85   main     running  [--234-----] main
96   second   paused   [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
96   second   flux    <~service/main/ready>
# 
# kill -HUP 85
# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
85   main     running  [--234-----] main
96   second   running  [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
96   second   on      <+service/main/ready>

Now this is a little trickier to show over text, but if you spam initctl cond you can see ready get set and then cleared after the 2 or 5 second timeout.

# killall main
# killall main
# killall main

#### Condition goes down correctly.
# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
0    main     restart  [--234-----] main
0    second   waiting  [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
0    second   off     <-service/main/ready>

##### Condition gets set correctly
# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
132  main     running  [--234-----] main
133  second   running  [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
133  second   on      <+service/main/ready>

### After the X000ms timeout, the condition gets cleared
# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
132  main     running  [--234-----] main
0    second   waiting  [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
0    second   off     <-service/main/ready>

### If the ready notification gets retriggered it continues to work as normal
# kill -HUP 132
# initctl ; initctl cond
PID  IDENT    STATUS   RUNLEVELS    DESCRIPTION                                                                                         
132  main     running  [--234-----] main
138  second   running  [--234-----] second
PID  IDENT    STATUS  CONDITION (+ ON, ~ FLUX, - OFF)                                                                                   
138  second   on      <+service/main/ready>