rancher: Improve error logs in health check containers when the health check is paused because ipsec is not in the "active" state.

Rancher server - built from v1.6-development

Steps to reproduce the problem: upgrade the ipsec service.

When the ipsec service is being upgraded, the health check is paused, resulting in the following error messages:

12/14/2017 10:01:28 AMtime="2017-12-14T18:01:28Z" level=error msg="Failed to report status 25e575c6-3bd8-480a-aa17-c3dc6ff7cf04_e79ef58d-370e-4f1e-9b51-7538c96a391d_1=DOWN 1/2: Bad response from [http://104.198.156.63:8080/v2-beta/serviceevents], statusCode [409]. Status [409 Conflict]. Body: [{\"id\":\"320fc91a-0152-4592-abe8-2c5be68c8220\",\"type\":\"error\",\"links\":{},\"actions\":{},\"status\":409,\"code\":\"Conflict\",\"message\":\"Conflict\",\"detail\":null,\"baseType\":\"error\"}]"
12/14/2017 10:01:28 AMtime="2017-12-14T18:01:28Z" level=error msg="Failed to report status 25e575c6-3bd8-480a-aa17-c3dc6ff7cf04_98dac6e4-12c6-4c41-bbe5-a51f346e7910_1=DOWN: Bad response from [http://104.198.156.63:8080/v2-beta/serviceevents], statusCode [409]. Status [409 Conflict]. Body: [{\"id\":\"d0a91a13-3648-4a53-a622-1b432cd4dcfc\",\"type\":\"error\",\"links\":{},\"actions\":{},\"status\":409,\"code\":\"Conflict\",\"message\":\"Conflict\",\"detail\":null,\"baseType\":\"error\"}]"
12/14/2017 10:01:28 AMtime="2017-12-14T18:01:28Z" level=info msg="25e575c6-3bd8-480a-aa17-c3dc6ff7cf04_5441afe7-4bb8-4820-9814-cda2667cbc15_1=INIT"
12/14/2017 10:01:28 AMtime="2017-12-14T18:01:28Z" level=info msg="25e575c6-3bd8-480a-aa17-c3dc6ff7cf04_0dc1f85a-4fda-4080-b851-6dfc259602ec_1=INIT"
12/14/2017 10:01:30 AMtime="2017-12-14T18:01:30Z" level=error msg="Failed to report status 25e575c6-3bd8-480a-aa17-c3dc6ff7cf04_98dac6e4-12c6-4c41-bbe5-a51f346e7910_1=DOWN 1/2: Bad response from [http://104.198.156.63:8080/v2-beta/serviceevents], statusCode [409]. Status [409 Conflict]. Body: [{\"id\":\"c26d3887-ea3f-408b-8254-68033004cfa0\",\"type\":\"error\",\"links\":{},\"actions\":{},\"status\":409,\"code\":\"Conflict\",\"message\":\"Conflict\",\"detail\":null,\"baseType\":\"error\"}]"

These error messages could be improved to reflect the paused/upgrading state rather than surfacing a generic 409 Conflict error.
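For context on what "reflect the state" could mean: while the upgrade is in progress, the API can tell you what state the server thinks the ipsec service is in, and that is the information the log line could carry instead of the bare 409. A rough sketch of checking it by hand (the endpoint path, name filter, and API key variables are assumptions, not values from this issue):

# Assumed sketch: check the ipsec service's state/healthState while the 409s are being logged.
# <server> and <env-id> are placeholders; CATTLE_ACCESS_KEY/CATTLE_SECRET_KEY hold an environment API key pair.
curl -s -u "$CATTLE_ACCESS_KEY:$CATTLE_SECRET_KEY" \
    "http://<server>:8080/v2-beta/projects/<env-id>/services?name=ipsec" \
    | jq '.data[0] | {state, healthState}'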

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 4
  • Comments: 20

Most upvoted comments

I’ve faced similar issues.

Solution:

  • Review the addresses under Rancher UI -> Hosts and check whether the IP of each host is correct.
  • Add the host again, filling in Step 4 with the Docker server's IP.
  • Run the Docker agent with the Step 5 command (see the example after this comment).

In my case, our environment had been destroyed for some reason and we were recreating all the components. I had started the agents without specifying each host's IP, and the issue appeared as soon as the second host was set up. Checking the agent logs, I found that DETECTED_CATTLE_AGENT_IP was wrong, and the Rancher UI showed the same IP for every host, which might lead Rancher to try to edit the same record in the DB and cause the deadlock.

P.S. If you're facing the same issue, you can specify the IP explicitly as a quick fix, then contact your network administrator to troubleshoot the network settings.
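For reference, "specify the IP" here means setting CATTLE_AGENT_IP when running the agent; the Step 5 command with the Step 4 IP filled in looks roughly like this (the agent version, server address, and token below are placeholders, not values from this issue):

sudo docker run -e CATTLE_AGENT_IP="<host-private-ip>" \
    --rm --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher \
    rancher/agent:v1.2.8 \
    http://<rancher-server>:8080/v1/scripts/<registration-token>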

Not quite sure if another repro matters, but I can easily reproduce the issue using one physical (Ubuntu 17.10) machine running v1.6.15 on the host and two docker-machine VM nodes. The following commands create KVM VMs, but VirtualBox works equally well.

docker-machine create \
    --driver kvm \
    --kvm-cpu-count 2 \
    --kvm-memory 1092 \
    --kvm-boot2docker-url https://github.com/boot2docker/boot2docker/releases/download/v17.12.1-ce/boot2docker.iso \
    kvm-1712-1
    
docker-machine create \
    --driver kvm \
    --kvm-cpu-count 2 \
    --kvm-memory 1092 \
    --kvm-boot2docker-url https://github.com/boot2docker/boot2docker/releases/download/v17.12.1-ce/boot2docker.iso \
    kvm-1712-2    

Start from an empty environment and create the first custom host: "docker-machine ssh kvm-1712-1" and paste the command line from Rancher.

The services go up and everything is green.

Now add the second node.

ipsec/healthcheck cycle through Up/Down/Initializing/Unhealthy.

ipsec-router logs show entries such as:

3/22/2018 3:49:51 PMtime="2018-03-22T14:49:51Z" level=info msg="samonitor: expected SA for host: 172.17.42.1, but not found."
3/22/2018 3:49:51 PMtime="2018-03-22T14:49:51Z" level=error msg="samonitor: error initiating missing SA child-172.17.42.1: unsuccessful Initiate: CHILD_SA config 'child-172.17.42.1' not found"
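
To dig into those samonitor errors, one option (assuming the strongSwan swanctl CLI is present in the image, which I have not verified) is to list the connections and SAs that charon actually has loaded inside the ipsec router container:

# Sketch, not verified against the Rancher ipsec image: inspect strongSwan state.
docker exec -it <ipsec-router-container> swanctl --list-conns
docker exec -it <ipsec-router-container> swanctl --list-sas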

healthcheck logs show:

3/22/2018 3:55:35 PMtime="2018-03-22T14:55:35Z" level=error msg="Failed to report status 8734367f-1c29-4d3b-8214-e939fc06f058_5a30ec2b-3a98-481b-be9e-11138b7fbff0_1=DOWN: Bad response from [http://.../v1/serviceevents], statusCode [409]. Status [409 Conflict]. Body: [{\"id\":\"16d637fd-f0eb-487d-b94a-2c6511597b4b\",\"type\":\"error\",\"links\":{},\"actions\":{},\"status\":409,\"code\":\"Conflict\",\"message\":\"Conflict\",\"detail\":null,\"baseType\":\"error\"}]"

Updated the system packages (apt, since I’m on Ubuntu) and upgraded to:

  • Docker 17.12.0-ce
  • rancher-agent v1.2.8
  • rancher-server v1.6.14

I still have the problem.

I can afford to reboot the machine and restart docker, but unfortunately I can’t afford more than 3 nodes.

I’m using DHCP and ran into this issue as well (though my DHCP address has not changed).

Restarting the network stack in RancherOS (ROS) resolved the issue I was having with the health check service.
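For anyone else hitting this on RancherOS: the network stack there runs as a system service managed by system-docker, so the restart looks roughly like this (the service name is the RancherOS default; adjust if yours differs):

# Run on the affected RancherOS host.
sudo system-docker restart network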

FWIW, I have the same setup as @mjaverto: the server is a DNS entry that resolves to a private IP, and the hosts all have private IPs in the same subnet. I have rebuilt the hosts from bare VMs several times and re-registered them to no avail, and the server as well. More than one host will reproduce the issue, but exactly one host runs fine (it doesn’t matter which).