alpa: OOM while serving language models
Please describe the bug
Deployed Bloom-7b1 with Alpa and KubeRay, and hit an out-of-memory (OOM) error while Bloom-7b1 was serving inference:
:actor_name:DeviceMeshGroupManager
2022-10-26 11:54:12 | INFO | stdout | Load model alpa/bloom-7b1 ... (This can take several minutes for very large models)
2022-10-26 11:54:12 | INFO | stdout | - Compile executables for encoder_chunk_sizes=[1, 64].
2022-10-26 11:54:19,113 WARNING worker.py:1805 -- Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
2022-10-26 11:54:47 | INFO | stdout | elapsed: 34.99 second.
2022-10-26 11:54:47 | INFO | stdout | - Load parameters.
2022-10-26 11:55:02,213 ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.load_bloom_params_worker_func() (pid=501, ip=10.1.181.198, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f76f5f29b20>)
  File "/home/ray/src/llm-serving/examples/llm_serving/model/bloom_model.py", line 821, in load_bloom_params_worker_func
    load_param(param_prefix + "self_attention.query_key_value.bias",
  File "/home/ray/src/llm-serving/examples/llm_serving/model/bloom_model.py", line 797, in load_param
    self.put_buffers(uuid, datas)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 178, in put_buffers
    arys[batch_id][device_id] = (self.backend.buffer_from_pyval(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 24576 bytes.
2022-10-26 11:55:12,900 ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.put_buffers() (pid=501, ip=10.1.181.198, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f76f5f29b20>)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 178, in put_buffers
    arys[batch_id][device_id] = (self.backend.buffer_from_pyval(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 33554432 bytes.
2022-10-26 11:55:22,936 ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.put_buffers() (pid=501, ip=10.1.181.198, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f76f5f29b20>)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 178, in put_buffers
    arys[batch_id][device_id] = (self.backend.buffer_from_pyval(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 33554432 bytes.
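For reference, the failed allocations in the log above are tiny in absolute terms, which suggests device memory was already nearly exhausted when parameter loading began (my reading of the log, not a confirmed diagnosis). A quick conversion of the reported byte counts:

# Converting the byte counts reported by XLA in the log above.
print(24576 / 1024)          # 24.0 -> the first failed allocation is only 24 KiB
print(33554432 / 1024 ** 2)  # 32.0 -> the later ones are 32 MiB each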
Please describe the expected behavior
The model should load its parameters and serve completions without running out of GPU memory.
System information and environment
- OS Platform and Distribution: Linux Ubuntu 18.04.6 LTS
- Python version: 3.8.5
- CUDA version: 11.7
- NCCL version: 2.10.3
- cupy version: cupy-cuda111==11.2.0
- GPU model and memory: NVIDIA GeForce RTX 2080 Ti, 11 GB (see the rough weight-size estimate after this list)
- Alpa version: git+https://github.com/alpa-projects/alpa.git@a38bfde29e2c1ece5faf5bc59cc4189dde852091
- Ray version: 2.0.0
- KubeRay version: v0.3.0
- JAX version: jax==0.3.15, jaxlib==0.3.15+cuda111.cudnn805
- k8s version: 1.22
- 6-node cluster, with 2 GPUs on each node
- Use NFS for PVC
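For context, here is a back-of-the-envelope estimate of the weight footprint, under my assumption of fp16 parameters. It shows that Bloom-7b1 cannot fit on a single 11 GB card and its weights must be sharded across multiple GPUs:

# Rough weight-size estimate for Bloom-7b1, assuming fp16 (2 bytes per parameter).
num_params = 7.1e9
bytes_per_param = 2
weight_gib = num_params * bytes_per_param / 1024 ** 3
print(f"{weight_gib:.1f} GiB")  # ~13.2 GiB of weights alone, before activations/workspace
# A single RTX 2080 Ti has 11 GiB, so the weights must be spread over 2+ GPUs.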
To Reproduce
Steps to reproduce the behavior:
- Follow this link to install the KubeRay operator on the k8s cluster.
- Build the Docker image to capture the runtime environment. The Dockerfile and build context are detailed below.
- Prepare the YAML file for the RayJob by substituting <My Docker Image> with the actual image tag. You may also need to set up imagePullSecrets if necessary.
- Create the RayJob through kubectl apply -f <RayJob yaml>. The content of the YAML is given below.
- (Optional) Port-forward component-alpa-service-raycluster-xxx-head-svc on port 8265 to monitor the RayJob status through the Ray dashboard, where xxx is an auto-generated ID.
- Port-forward component-alpa-service-raycluster-xxx-head-svc on port 8899.
- Send a query to the endpoint, for example (an equivalent Python client is sketched right after this list):
  curl -d '{"prompt":"Hello world, ","max_tokens":"128","temperature":"0.7","top_p":"0.5","model":"default"}' localhost:8899/completions
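Equivalent to the curl command above, a minimal Python client sketch (it assumes the port-forward to port 8899 is active and that the requests package is installed):

import requests  # assumption: requests is available on the client machine

# Same request as the curl command above, sent to the port-forwarded service.
payload = {
    "prompt": "Hello world, ",
    "max_tokens": "128",
    "temperature": "0.7",
    "top_p": "0.5",
    "model": "default",
}
resp = requests.post("http://localhost:8899/completions", json=payload)
print(resp.status_code, resp.text)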
Screenshots
Code snippet to reproduce the problem
Additional information
- Here is the YAML for the RayJob:
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
annotations:
meta.helm.sh/release-name: component
meta.helm.sh/release-namespace: alpa-opt-service
creationTimestamp: "2022-10-28T15:40:49Z"
generation: 1
labels:
app.kubernetes.io/managed-by: Helm
name: component-alpa-service
namespace: alpa-opt-service
resourceVersion: "42668386"
selfLink: /apis/ray.io/v1alpha1/namespaces/alpa-opt-service/rayjobs/component-alpa-service
uid: fe665887-bbcc-4f57-9e09-3f96050e3688
spec:
entrypoint: python start.py --model $MODEL_NAME --path $MODEL_PATH --tokenizer $TOKENIZER_NAME
--torch-device $TORCH_DEVICE
rayClusterSpec:
headGroupSpec:
rayStartParams:
block: "true"
dashboard-host: 0.0.0.0
node-ip-address: $MY_POD_IP
num-gpus: "1"
object-store-memory: "100000000"
port: "6379"
replicas: 1
serviceType: ClusterIP
template:
metadata:
labels:
app.kubernetes.io/instance: component
app.kubernetes.io/name: kuberay
groupName: headgroup
rayCluster: raycluster-alpa-serving
rayNodeType: head
spec:
containers:
- env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: MODEL_NAME
value: alpa/bloom-7b1
- name: MODEL_PATH
value: /models
- name: TOKENIZER_NAME
value: bigscience/bloom-7b1
- name: TORCH_DEVICE
value: cpu
image: <My Docker Image>
imagePullPolicy: IfNotPresent
name: ray-head
ports:
- containerPort: 6379
name: gcs-server
protocol: TCP
- containerPort: 8265
name: dashboard
protocol: TCP
- containerPort: 10001
name: client
protocol: TCP
- containerPort: 8000
name: serve
protocol: TCP
- containerPort: 8899
name: alpa-service
protocol: TCP
resources:
limits:
cpu: "4"
memory: 16Gi
nvidia.com/gpu: "1"
requests:
cpu: "4"
memory: 16Gi
imagePullSecrets:
- name: <My secret>
rayVersion: 2.0.0
workerGroupSpecs:
- groupName: gpu
maxReplicas: 5
minReplicas: 2
rayStartParams:
block: "true"
node-ip-address: $MY_POD_IP
num-gpus: "1"
replicas: 1
template:
spec:
containers:
- env:
- name: RAY_DISABLE_DOCKER_CPU_WARNING
value: "1"
- name: TYPE
value: worker
- name: CPU_REQUEST
valueFrom:
resourceFieldRef:
containerName: machine-learning
resource: requests.cpu
- name: CPU_LIMITS
valueFrom:
resourceFieldRef:
containerName: machine-learning
resource: limits.cpu
- name: MEMORY_LIMITS
valueFrom:
resourceFieldRef:
containerName: machine-learning
resource: limits.memory
- name: MEMORY_REQUESTS
valueFrom:
resourceFieldRef:
containerName: machine-learning
resource: requests.memory
- name: MY_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
image: <My Docker Image>
imagePullPolicy: IfNotPresent
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- ray stop
name: machine-learning
resources:
limits:
cpu: "4"
memory: 16Gi
nvidia.com/gpu: "1"
requests:
cpu: "4"
memory: 16Gi
imagePullSecrets:
- name: <My secret>
initContainers:
- command:
- sh
- -c
- until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
do echo waiting for myservice; sleep 2; done
image: busybox:1.28
name: init-myservice
status:
dashboardURL: component-alpa-service-raycluster-8d9qc-head-svc.alpa-opt-service.svc.cluster.local:8265
endTime: "1970-01-01T00:00:00Z"
jobDeploymentStatus: Running
jobId: component-alpa-service-dw66f
jobStatus: RUNNING
message: Job is currently running.
rayClusterName: component-alpa-service-raycluster-8d9qc
rayClusterStatus:
availableWorkerReplicas: 2
desiredWorkerReplicas: 1
endpoints:
alpa-service: "8899"
client: "10001"
dashboard: "8265"
gcs-server: "6379"
serve: "8000"
lastUpdateTime: "2022-10-28T15:40:54Z"
maxWorkerReplicas: 5
minWorkerReplicas: 2
state: ready
startTime: "2022-10-28T15:40:58Z"
- Details of the Docker image
- The build context contains
- Dockerfile
- requirements.txt
- start.py
- Content of Dockerfile
FROM rayproject/ray:2.0.0-py38-cu111
RUN python -m pip install --no-cache-dir --upgrade pip
COPY requirements.txt .
RUN python -m pip install --no-cache-dir -r requirements.txt
RUN python -m pip install -f https://alpa-projects.github.io/wheels.html jaxlib==0.3.15+cuda111.cudnn805
RUN python -m pip install ray==2.0.0
COPY start.py .
- Content of requirements.txt
--extra-index-url https://download.pytorch.org/whl/cu111
torch
cupy-cuda111
git+https://github.com/alpa-projects/alpa.git@a38bfde29e2c1ece5faf5bc59cc4189dde852091
-e git+https://github.com/alpa-projects/alpa.git@a38bfde29e2c1ece5faf5bc59cc4189dde852091#egg=llm_serving&subdirectory=examples
transformers==4.23.1
fastapi
uvicorn
omegaconf
jinja2
- Content of start.py
import argparse
import ray
from alpa.serve import run_controller, CONTROLLER_NAME
from llm_serving.service.constants import (
NUM_BEAMS, NUM_RETURN_SEQ, USE_RECAPTCHA)
from llm_serving.launch_model_worker import LangaugeModelWorker
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="alpa/opt-125m")
parser.add_argument("--path", type=str, default="~/opt_weights/")
parser.add_argument("--host", type=str, default="0.0.0.0")
parser.add_argument("--port", type=str, default=8899)
parser.add_argument("--torch-device", type=str, default="cpu")
parser.add_argument("--tokenizer", type=str)
parser.add_argument("--no-recaptcha", action="store_true")
parser.add_argument("--register-name", type=str, default="default")
parser.add_argument("--ssl-keyfile", type=str)
parser.add_argument("--ssl-certfile", type=str)
args = parser.parse_args()
ray.init()
try:
controller = ray.get_actor(CONTROLLER_NAME)
except ValueError:
controller = run_controller(args.host, args.port, "/",
args.ssl_keyfile, args.ssl_certfile)
group_id = 0
controller.launch_mesh_group_manager.remote(group_id)
t = controller.register_model.remote(
args.register_name, LangaugeModelWorker,
(args.model, args.path, args.torch_device, args.tokenizer, NUM_BEAMS, NUM_RETURN_SEQ,
False if args.no_recaptcha else USE_RECAPTCHA),
override=True)
ray.get(t)
t = controller.create_replica.remote(args.register_name, group_id)
ray.get(t)
while True:
pass
About this issue
- State: closed
- Created 2 years ago
- Comments: 27 (23 by maintainers)
Hi @pascalwhoop, no worries!
num_pp_stages is the number of pipeline stages (pp stands for pipeline parallelism). At execution time, the hidden layers are divided among the pipeline stages (for example, layers 1-n go to stage 1, layers n-m to stage 2, and so on), and the model is parallelized (for inference or training) across that pipeline. For the pipeline to work effectively, the resources given to Alpa need to be divided to match the stages. All resources are managed by Ray (essentially all the GPUs), and Ray organizes them per node (a Ray cluster consists of a head node and worker nodes, each with some number of GPUs). A small illustrative sketch of the layer-to-stage split follows.
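As an illustration only (my own example, not Alpa's stage-construction code), splitting Bloom-7b1's 30 transformer blocks evenly across two pipeline stages would look like:

# Illustrative sketch: evenly partition transformer layers across pipeline stages.
# (Hypothetical example; Alpa's actual stage construction is more involved.)
num_layers = 30        # Bloom-7b1 has 30 transformer blocks
num_pp_stages = 2
per_stage = num_layers // num_pp_stages
stages = [list(range(i * per_stage, (i + 1) * per_stage)) for i in range(num_pp_stages)]
print(stages)  # stage 0 gets layers 0-14, stage 1 gets layers 15-29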
In language model serving (OPT/Bloom), the number of pipeline stages is determined by the code below: https://github.com/alpa-projects/alpa/blob/98df634fdf97c82f016195f74a4d4965420a7d17/examples/llm_serving/model/wrapper.py#L446-L449
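Roughly, the idea is to derive the stage count from the Ray cluster that Alpa sees. The snippet below is a hedged paraphrase of that idea, not the exact source at the link; the exact formula and the import location of get_global_cluster are my assumptions:

# Hedged paraphrase of the stage-count idea (see the linked wrapper.py lines for the
# real code). Assumes a Ray/Alpa cluster is already initialized and that
# get_global_cluster lives in alpa.device_mesh, as referenced below.
from alpa.device_mesh import get_global_cluster

cluster = get_global_cluster()
num_pp_stages = max(2, cluster.num_hosts)                # at least two stages
num_pp_stages = min(num_pp_stages, cluster.num_devices)  # never more stages than GPUs
print(num_pp_stages)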
I’m not an expert on Ray, so I’m not familiar with exactly how Ray computes these values (devices in a mesh, devices per node, etc.). For the Alpa side, please refer to this file in Alpa for the definition of get_global_cluster() and how its attributes are calculated (the file is long, but num_hosts and num_devices are what you should focus on; also note the assumptions made there, such as all nodes being identical). For the concepts of nodes and workers, please check the Ray documentation. If you would like to see examples of how to use Alpa to parallelize a model, please check the parallel-training section of the Alpa documentation.
Please feel free to ask me about any issue with serving the Bloom model on Alpa. For other issues with Alpa, I think Hao (zhisbug), Lianmin (merrymercy), and Zhuohan (zhuohan123) might have more insights as creators of the project.
Happy weekend~
A possible next step that might help in this case is to look into how Alpa plans the parallelization in this setup. As mentioned in issue #891, I feel that inspecting the parallelization strategy will be useful for debugging (to see how Alpa planned the parallelization and where the assertion failure comes from).