AzureTRE: Nexus container doesn't start, causing Linux VMs to fail deployment

Describe the bug

When deploying Nexus, all the steps run as expected, but the Nexus container does not start.

After some debugging, I found the following in the /var/log/cloud-config-output.log file:

// OMITTED MESSAGES RELATED TO PACKAGES INSTALLATION 

Processing triggers for mime-support (3.60ubuntu1) ...
Processing triggers for ureadahead (0.100.0-21) ...
Unable to find image 'sonatype/nexus3:latest' locally
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
See 'docker run --help'.
Checking for Nexus admin password file...
ERROR - Timeout while waiting for nexus-data/admin.password to be created
[
  {
    "environmentName": "AzureCloud",
    "id": "88888888-6258-ABCD-XXXX-UUUKKKKKKKKK",
    "isDefault": true,
    "name": "N/A(tenant level account)",
    "state": "Enabled",
    "tenantId": "abababab-XXXX-YYYY-4444-1234567890ab",
    "user": {
      "assignedIdentityInfo": "MSIResource-/subscriptions/11111111-2222-333-4444-5555555555/resourceGroups/rg-miguel07dev/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-nexus-miguel07dev",
      "name": "userAssignedIdentity",
      "type": "servicePrincipal"
    }
  }
]
Getting cert and cert password from Keyvault...
Checking for nexus-data/keystores directory...
ERROR - Timeout while waiting for Nexus to create nexus-data/keystores
Checking for ./nexus_repos_config directory...
Found config file: /tmp/nexus_repos_config/almalinux_proxy_conf.json. Sending to Nexus...
Response received from Nexus: 000
Response received from Nexus: 000
Response received from Nexus: 000

// THIS MESSAGE REPEATED SEVERAL TIMES

The most important line is:

docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": EOF.
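To check whether a VM hit the same failure, the log can be searched for that Docker daemon error. The following is a minimal sketch; the function name `find_docker_errors` is hypothetical, and the default path is the log file named above.

```shell
#!/usr/bin/env bash
# Sketch only: list Docker daemon errors recorded in the cloud-init output log.
# find_docker_errors is a hypothetical helper name, not part of AzureTRE.

# Usage: find_docker_errors [LOGFILE]
# Prints matching lines prefixed with their line number.
# Returns non-zero when no matching line is found (standard grep behaviour).
find_docker_errors() {
  local logfile="${1:-/var/log/cloud-config-output.log}"
  grep -nF 'docker: Error response from daemon' "$logfile"
}
```

Running `find_docker_errors` on the Nexus VM prints each matching line with its line number, which makes it easier to see which setup steps ran after the failed image pull.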

I’ve tried several solutions and tests found on the Internet. The most effective one was to edit the file templates/shared_services/sonatype-nexus-vm/terraform/cloud-config.yaml so that the runcmd section looks like this:

01 runcmd:
02  - export DEBIAN_FRONTEND=noninteractive
03  # Give the Nexus process write permissions on the folder mounted as persistent volume
04  - chown -R 200 /etc/nexus-data
05  - systemctl restart docker.service
06  - sleep 60
07  # Run the nexus container with mapped volume for nexus config
08  - docker run -d -p 80:8081 -p 443:8443 -p 8083:8083 -v /etc/nexus-data:/nexus-data
09    --restart always
10    --name nexus
11    --log-driver local
12    sonatype/nexus3
13  # Reset the admin password of Nexus to the one created by TF and stored in KeyVault
14  - bash /tmp/reset_nexus_password.sh "${NEXUS_ADMIN_PASSWORD}"
15  # Invoke Nexus SSL configuration (which will also be ran as CRON daily to renew cert)
16  - bash /etc/cron.daily/configure_nexus_ssl.sh
17  # Configure Nexus repositories
18  - bash /tmp/configure_nexus_repos.sh "${NEXUS_ADMIN_PASSWORD}"

This restarts the Docker daemon and adds a one-minute delay before the Nexus container is started (lines 05 and 06).

Steps to reproduce

Deploy Nexus shared service
Check if Nexus container is running
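The second step can be done from a shell on the Nexus VM. The sketch below assumes the container is named `nexus`, as in the `docker run` command above; `is_container_running` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: report whether a container with the given name is running.
# is_container_running is a hypothetical helper name.

# Usage: is_container_running NAME
# Returns 0 if a running container has exactly that name, non-zero otherwise.
is_container_running() {
  local name="$1"
  # The docker name filter matches substrings, so compare each result exactly.
  docker ps --filter "name=${name}" --filter "status=running" --format '{{.Names}}' \
    | grep -Fqx -- "$name"
}
```

For example, `is_container_running nexus && echo "running" || echo "not running"` prints `not running` on a VM where the image pull failed and the container was never created.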

About this issue

State: closed
Created a year ago
Comments: 26 (21 by maintainers)

Most upvoted comments

@marrobi Thanks for the comments. I do agree with you about avoiding “hacks”. 😄

I could work on such a script. On the other hand, Ubuntu 20.04 presents the same problem.

I think a more robust script run by cloud-init could help resolve some of these issues. It could also take into account moving the files away from the /tmp folder.

marrobi on Aug 4, 2023

@migldasilva rather than sleeping, maybe we could have a script that tries to pull and, if that fails, restarts and tries again? The sleep feels very much like a hack.
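A minimal sketch of that pull-and-retry idea, not a tested fix. It assumes "restarts" means restarting the Docker daemon with `systemctl restart docker.service`, as in the workaround above. The function name `pull_with_retry` and the `RECOVER_CMD` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: pull an image; if the pull fails, run a recovery command
# and try again, up to a maximum number of attempts.
# pull_with_retry and RECOVER_CMD are hypothetical names.

# Usage: pull_with_retry IMAGE [MAX_ATTEMPTS] [DELAY_SECONDS]
pull_with_retry() {
  local image="$1"
  local max_attempts="${2:-5}"
  local delay="${3:-10}"
  local attempt=1

  while true; do
    if docker pull "$image"; then
      echo "Pulled $image on attempt $attempt"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR - could not pull $image after $max_attempts attempts" >&2
      return 1
    fi
    echo "Pull failed (attempt $attempt of $max_attempts); recovering and retrying in ${delay}s" >&2
    # Recovery step between attempts; defaults to restarting the Docker daemon.
    ${RECOVER_CMD:-systemctl restart docker.service} || true
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

In the runcmd section, a call such as `pull_with_retry sonatype/nexus3 5 10` before `docker run` would replace the fixed `sleep 60`. The wait then lasts only as long as the registry stays unreachable, and the script fails with an explicit error if the pull never succeeds.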

I’d rather know what the underlying issue was, but I know that might take time.

The VM is currently running 18.04, maybe a first step would be to try with a newer version of Ubuntu.

marrobi on Aug 4, 2023

Hmmm. Re-running cloud-init didn’t entirely fix my broken Nexus.

The service came up and the web UI was available, but a number of other things (for example pip install on VMs within workspaces) failed due to configuration/setup not having properly run.

I’ve tried to redeploy and am hitting various other issues (docker run is fine, but other setup steps are not).

martinpeck on Aug 3, 2023

As mentioned by @marrobi, our hope is that this was caused by a transient issue (either related to the update of the nexus container on docker hub, or related to docker’s registry APIs). Local testing suggests that fresh deployments of the nexus server work fine.

Let’s keep an eye on this.

I’ve just confirmed that re-running the cloud-init scripts as described in the docs that @marrobi shared has successfully run on a VM that was previously seeing this issue, and that the nexus server is now working as expected.

martinpeck on Aug 3, 2023

@SvenAelterman this could be what you are hitting.

marrobi on Aug 3, 2023