skypilot: Issues with AWS Inferentia: disk is not mounted, reuse of existing venv, job keeps queueing.
Here is my sky setup, using 0.4.1 and awscliv2 to submit. CC @mmcclean-aws & Team at Annapurna
...
resources:
  cloud: aws
  # AWS Inferentia, including neuronx
  # https://github.com/skypilot-org/skypilot/issues/2686#issuecomment-1754067953
  instance_type: inf2.8xlarge
  # get pytorch image
  # aws ec2 describe-images --region us-west-2 --owners amazon --filters 'Name=name,Values=Deep Learning AMI Neuron PyTorch *' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))'
  image_id: ami-0a1063844e84bee6a
  # region: us-east-1 # ami-0c43538b49cfc5642 is the image for east-1
  region: us-west-2 # ami-0a1063844e84bee6a is the image for west-2
  disk_size: 1024
Issues:
Background: I am trying to get this example running. I used the plain Neuron 2.15.9 + torch AMI.
https://github.com/aws-neuron/aws-neuron-samples/blob/4777b96bc32639242009bd5ecdea0a718a272348/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb
I got the model TinyLlama-1.1B running with the example. 💯 For more productive usage, we encountered some issues, especially when loading models > 10 GB:
Issue 1: Disk is not mounted. (Solved)
The root disk of the default requested instance is 35 GB. If I request disk_size: 1024, the additional space is not mounted: df -h shows only 35 GB, and the rest of the requested 1024 GB is missing. This caused problems with any Llama 7B+ model size.
Solution:
Switching to the hf image solved this:
image_id: ami-04dd1be93bedbc674 # us-west-2
region: us-west-2
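To confirm the symptom from Issue 1 on a running instance, the root filesystem size can be compared with the requested disk_size. The following is only a diagnostic sketch, assuming a POSIX df and awk are available on the image; the function names fs_size_gib and fs_at_least_gib and the 900 GiB threshold are hypothetical choices for illustration, not part of SkyPilot or the AMI.

```shell
# Print the size, in whole GiB, of the filesystem that holds the given path.
fs_size_gib() {
  # -P: portable one-line-per-filesystem output, -k: sizes in 1024-byte blocks
  df -Pk "$1" | awk 'NR==2 { print int($2 / 1048576) }'
}

# Succeed if the filesystem holding path $1 is at least $2 GiB large.
fs_at_least_gib() {
  [ "$(fs_size_gib "$1")" -ge "$2" ]
}

# Example: warn when / is far below the requested disk_size of 1024.
# The threshold of 900 is an assumed tolerance for filesystem overhead.
if ! fs_at_least_gib / 900; then
  echo "root filesystem is only $(fs_size_gib /) GiB" >&2
fi
```

Running this at the start of setup: | would surface the problem before a large model download fails halfway through.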
Issue 2: Conflicting environments when installing into aws_neuron_venv_pytorch (solved with hf-image)
The dependencies on AWS Inferentia are not straightforward, hence the AMI. I noticed PermissionError issues when installing into the existing env. Beyond that, there are issues with parallel jobs that are queued and go through the setup: | stage.
Also, two envs are activated at once, (base) (aws_neuron_venv_pytorch) michael@machine, which is quite confusing.
# part of setup: |
conda deactivate
# free permission issues when
sudo chmod -R 777 /opt/aws_neuron_venv_pytorch/*
source /opt/aws_neuron_venv_pytorch/bin/activate
pip install transformers-neuronx sentencepiece
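To see whether the environments are still stacked after the steps above, the shell's activation variables can be inspected. This is a minimal sketch, assuming the standard variables set by conda (CONDA_SHLVL, CONDA_DEFAULT_ENV) and by venv activation scripts (VIRTUAL_ENV); the function names active_env_report and only_venv_active are hypothetical helpers, not part of any tool mentioned here.

```shell
# Print which Python environments the current shell considers active.
# Unset values are reported as 0 (conda nesting level) or "none".
active_env_report() {
  echo "conda_shlvl=${CONDA_SHLVL:-0} conda_env=${CONDA_DEFAULT_ENV:-none} venv=${VIRTUAL_ENV:-none}"
}

# Succeed only if the venv at $1 is active and no conda env is stacked on it.
only_venv_active() {
  [ "${VIRTUAL_ENV:-}" = "$1" ] && [ "${CONDA_SHLVL:-0}" -eq 0 ]
}
```

For example, calling only_venv_active /opt/aws_neuron_venv_pytorch right before the pip install line would fail while (base) is still active, which makes the double activation visible in the setup log.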
Any suggestions would be welcome.
About this issue
- State: closed
- Created 6 months ago
- Comments: 15
After running this from a separate terminal, skypilot in the other terminal connected instantly. Nice. The issue is very transient: I tried to reproduce it and was only able to start up the instance once.
@concretevitamin Thanks for the support so far. I’ll downgrade to 0.4.1 and live with the non-starting job. I might open a new issue another time.