skypilot: Issues with AWS Inferentia: Disk is not mounted, reuse of existing venv, job keeps queueing.

Here is my sky setup, using SkyPilot 0.4.1 and awscliv2 to submit. CC @mmcclean-aws & Team at Annapurna

...

resources:
  cloud: aws
  # AWS inferentia, including neuronx
  # https://github.com/skypilot-org/skypilot/issues/2686#issuecomment-1754067953
  instance_type: inf2.8xlarge
  # get pytorch image
  #      aws ec2 describe-images --region us-west-2 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch *' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))'

  image_id: ami-0a1063844e84bee6a
  # region: us-east-1 # ami-0c43538b49cfc5642 is the image for east-1
  region: us-west-2 # ami-0a1063844e84bee6a is the image for west-2

  disk_size: 1024
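
For context, a minimal sketch of how this gets submitted (the task file and cluster names below are placeholders I picked, not anything SkyPilot prescribes):

# launch the task defined by the YAML above; file/cluster names are placeholders
sky launch -c inf2-test task.yaml
# follow the setup/run logs of the latest job on that cluster
sky logs inf2-test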

Issues:

Background: I am trying to get this example running. I used the plain Neuron 2.15.9 + PyTorch AMI. https://github.com/aws-neuron/aws-neuron-samples/blob/4777b96bc32639242009bd5ecdea0a718a272348/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb

I got the model TinyLLama-1.1B running with the example. 💯 For more productive usage, we ran into some issues, especially when loading models > 10 GB:

Issue 1: Disk is not mounted. (Solved)

The root disk of the default requested instance is 35 GB. If I request disk_size: 1024, the extra space is not mounted, i.e. df -h still shows only 35 GB and the requested 1024 GB is nowhere to be seen. This caused problems with anything Llama-7B sized and above.
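
Before switching AMIs, this is the kind of check/workaround I would expect to help; it assumes an ext4 root filesystem on the usual NVMe device names, which may differ on other instance types, so treat it as a sketch:

# check whether the EBS volume is actually 1024 GB but the partition/filesystem was never grown
lsblk
df -h /
# grow partition 1 of the root device, then grow the ext4 filesystem to fill it
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1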

Solution:

https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Images:visibility=public-images;search=:huggingface-neuron;v=3;$case=tags:false\,client:false;$regex=tags:false\,client:false

The Hugging Face Neuron images did a better job.

  image_id: ami-04dd1be93bedbc674 # us-west-2
  region: us-west-2

Issue 2: Conflicting environments when installing into aws_neuron_venv_pytorch (solved with the HF image)

The dependencies on AWS Inferentia are somehow not straightforward, hence the AMI image. I noticed there are PermissionError issues when installing into the existing env. Beyond that, there are issues with parallel jobs that stay queued while going through the setup: | stage. Also, there are doubly activated envs, (base) (aws_neuron_venv_pytorch) michael@machine, which is quite confusing.

# part of setup: |
conda deactivate
# work around permission issues when installing into the preinstalled venv
sudo chmod -R 777 /opt/aws_neuron_venv_pytorch/*
source /opt/aws_neuron_venv_pytorch/bin/activate
pip install transformers-neuronx sentencepiece
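
A slightly narrower variant would be to take ownership of the venv instead of making it world-writable (same effect on a single-user machine; just a sketch, not something SkyPilot or the AMI requires):

# part of setup: |
conda deactivate
# take ownership of the preinstalled venv instead of chmod 777
sudo chown -R "$(whoami)" /opt/aws_neuron_venv_pytorch
source /opt/aws_neuron_venv_pytorch/bin/activate
pip install transformers-neuronx sentencepiece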

Any suggestions would be welcome.

About this issue

  • Original URL
  • State: closed
  • Created 6 months ago
  • Comments: 15

Most upvoted comments

After running this from a separate terminal, SkyPilot in the other terminal connected instantly. Nice. The issue is very transient; I tried to reproduce it and was really only able to once when starting up the instance.

ssh -T -i '~/.ssh/sky-key' ubuntu@54.187.141.180 -o StrictHostKeyChecking=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 uptime
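
Since it is so transient, a small probe loop like this (same key/IP as above, purely a sketch) could help catch when SSH actually starts answering:

# keep probing until SSH answers; prints a timestamp after each failed attempt
while ! ssh -T -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o ConnectTimeout=10 \
    -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes ubuntu@54.187.141.180 uptime; do
  date; sleep 5
done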

@concretevitamin Thanks for the support so far. I’ll downgrade to 0.4.1 and live with the non-starting job. I might open a new issue another time.