AzureTRE: Nexus container doesn't start causing Linux VM's to fail deployment
Describe the bug
When deploying Nexus, all the steps run as expected, but the Nexus container does not start.
After some debugging, I found the following in the /var/log/cloud-config-output.log file:
// OMITTED MESSAGES RELATED TO PACKAGES INSTALLATION
Processing triggers for mime-support (3.60ubuntu1) ...
Processing triggers for ureadahead (0.100.0-21) ...
Unable to find image 'sonatype/nexus3:latest' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
See 'docker run --help'.
Checking for Nexus admin password file...
ERROR - Timeout while waiting for nexus-data/admin.password to be created
[
{
"environmentName": "AzureCloud",
"id": "88888888-6258-ABCD-XXXX-UUUKKKKKKKKK",
"isDefault": true,
"name": "N/A(tenant level account)",
"state": "Enabled",
"tenantId": "abababab-XXXX-YYYY-4444-1234567890ab",
"user": {
"assignedIdentityInfo": "MSIResource-/subscriptions/11111111-2222-333-4444-5555555555/resourceGroups/rg-miguel07dev/providers/Microsoft.ManagedI
dentity/userAssignedIdentities/id-nexus-miguel07dev",
"name": "userAssignedIdentity",
"type": "servicePrincipal"
}
}
]
Getting cert and cert password from Keyvault...
Checking for nexus-data/keystores directory...
ERROR - Timeout while waiting for Nexus to create nexus-data/keystores
Checking for ./nexus_repos_config directory...
Found config file: /tmp/nexus_repos_config/almalinux_proxy_conf.json. Sending to Nexus...
Response received from Nexus: 000
Response received from Nexus: 000
Response received from Nexus: 000
// THIS MESSAGE REPEATED SEVERAL TIMES
The most important line is:
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
I’ve been trying some solutions and tests found on the Internet. The most effective one was to edit the file templates/shared_services/sonatype-nexus-vm/terraform/cloud-config.yaml so that the runcmd section looks like this:
01 runcmd:
02 - export DEBIAN_FRONTEND=noninteractive
03 # Give the Nexus process write permissions on the folder mounted as persistent volume
04 - chown -R 200 /etc/nexus-data
05 - systemctl restart docker.service
06 - sleep 60
07 # Run the nexus container with mapped volume for nexus config
08 - docker run -d -p 80:8081 -p 443:8443 -p 8083:8083 -v /etc/nexus-data:/nexus-data
09 --restart always
10 --name nexus
11 --log-driver local
12 sonatype/nexus3
13 # Reset the admin password of Nexus to the one created by TF and stored in KeyVault
14 - bash /tmp/reset_nexus_password.sh "${NEXUS_ADMIN_PASSWORD}"
15 # Invoke Nexus SSL configuration (which will also be run as CRON daily to renew cert)
16 - bash /etc/cron.daily/configure_nexus_ssl.sh
17 # Configure Nexus repositories
18 - bash /tmp/configure_nexus_repos.sh "${NEXUS_ADMIN_PASSWORD}"
This restarts the Docker daemon and adds a one-minute delay before the Nexus container is started (lines 05 and 06).
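As a possible alternative to the fixed delay on line 06, the wait could poll until the Docker daemon answers. This is only a sketch and not part of the AzureTRE template: `wait_for` is a hypothetical helper, and using `docker info` as the readiness probe is an assumption (it shows that the daemon responds, not that Docker Hub is reachable).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the AzureTRE templates): run a command
# repeatedly until it succeeds or the attempt budget is used up.
# Usage: wait_for <max_attempts> <delay_seconds> <command> [args...]
wait_for() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Success: stop polling immediately.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (assumption: a successful `docker info` means the daemon is ready):
#   wait_for 12 5 docker info || echo "ERROR - Docker daemon not ready" >&2
```

With 12 attempts and a 5-second delay, the worst case is about as long as the original `sleep 60`, but the script continues as soon as the daemon is up.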
Steps to reproduce
- Deploy Nexus shared service
- Check if Nexus container is running
About this issue
- State: closed
- Created a year ago
- Comments: 26 (21 by maintainers)
I think a more robust script run by cloud-init could help resolve some of these issues, perhaps also moving the files out of the /tmp folder.
@migldasilva rather than sleeping, maybe we could have a script that tries to pull and, if that fails, restarts Docker and tries again? The sleep feels very much like a hack.
I’d rather know what the underlying issue is, but I know that might take time.
The VM is currently running 18.04, maybe a first step would be to try with a newer version of Ubuntu.
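The pull-then-restart-then-retry idea suggested above could be sketched roughly as follows. This is an illustration, not the script the maintainers adopted: `pull_with_retry` is a hypothetical function, the `PULL_CMD` and `RESTART_CMD` variables are hypothetical override hooks, and the attempt count and delay are arbitrary defaults. The default commands (`docker pull`, `systemctl restart docker.service`) are the ones already used in this issue.

```shell
#!/usr/bin/env bash
# Sketch only: try to pull an image; on failure restart the Docker daemon,
# wait, and try again, up to a fixed number of attempts.
# Usage: pull_with_retry <image> [max_attempts] [delay_seconds]
pull_with_retry() {
  local image="$1" max_attempts="${2:-5}" delay="${3:-10}"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # PULL_CMD is a hypothetical hook; it defaults to `docker pull`.
    if ${PULL_CMD:-docker pull} "$image"; then
      return 0
    fi
    echo "Pull attempt $attempt of $max_attempts failed for $image" >&2
    if [ "$attempt" -lt "$max_attempts" ]; then
      # RESTART_CMD is a hypothetical hook; it defaults to the restart
      # command used in the workaround from the issue description.
      ${RESTART_CMD:-systemctl restart docker.service}
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "ERROR - could not pull $image after $max_attempts attempts" >&2
  return 1
}

# Example: pull first, then start the container only if the pull worked.
#   pull_with_retry sonatype/nexus3 5 10 || exit 1
```

Failing loudly when the pull never succeeds would also stop the later password and repository steps from running against a container that does not exist, which is what produced the repeated `000` responses in the log.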
Hmmm. Re-running cloud-init didn’t entirely fix my broken Nexus.
The service came up and the web UI was available, but a number of other things (for example, pip install on VMs within workspaces) failed because the configuration/setup had not run properly.
I’ve tried to redeploy and am hitting various other issues (docker run is fine, but other setup steps are not).
As mentioned by @marrobi, our hope is that this was caused by a transient issue (either related to the update of the nexus container on docker hub, or related to docker’s registry APIs). Local testing suggests that fresh deployments of the nexus server work fine.
Let’s keep an eye on this.
I’ve just confirmed that re-running the cloud-init scripts as described in the docs that @marrobi shared has successfully run on a VM that was previously seeing this issue, and that the nexus server is now working as expected.
@SvenAelterman this could be what you are hitting.