alpa: OOM while serving language models

Please describe the bug

Deployed Bloom-7b1 with Alpa and KubeRay. Got an OOM error while Bloom-7b1 was serving inference.

:actor_name:DeviceMeshGroupManager
2022-10-26 11:54:12 | INFO | stdout | Load model alpa/bloom-7b1 ... (This can take several minutes for very large models)
2022-10-26 11:54:12 | INFO | stdout |  - Compile executables for encoder_chunk_sizes=[1, 64].
2022-10-26 11:54:19,113	WARNING worker.py:1805 -- Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
2022-10-26 11:54:47 | INFO | stdout | elapsed: 34.99 second.
2022-10-26 11:54:47 | INFO | stdout |  - Load parameters.
2022-10-26 11:55:02,213	ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.load_bloom_params_worker_func() (pid=501, ip=10.1.181.198, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f76f5f29b20>)
  File "/home/ray/src/llm-serving/examples/llm_serving/model/bloom_model.py", line 821, in load_bloom_params_worker_func
    load_param(param_prefix + "self_attention.query_key_value.bias",
  File "/home/ray/src/llm-serving/examples/llm_serving/model/bloom_model.py", line 797, in load_param
    self.put_buffers(uuid, datas)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 178, in put_buffers
    arys[batch_id][device_id] = (self.backend.buffer_from_pyval(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 24576 bytes.
2022-10-26 11:55:12,900	ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.put_buffers() (pid=501, ip=10.1.181.198, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f76f5f29b20>)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 178, in put_buffers
    arys[batch_id][device_id] = (self.backend.buffer_from_pyval(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 33554432 bytes.
2022-10-26 11:55:22,936	ERROR worker.py:94 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.put_buffers() (pid=501, ip=10.1.181.198, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f76f5f29b20>)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/alpa/device_mesh.py", line 178, in put_buffers
    arys[batch_id][device_id] = (self.backend.buffer_from_pyval(
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 33554432 bytes.
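
For reference, here is a rough back-of-the-envelope estimate (my own arithmetic, assuming fp16 weights) of why the Bloom-7b1 parameters alone do not fit on a single 11 GB RTX 2080 Ti:

# Rough estimate only: activations, the KV cache, and framework overhead
# come on top of this, so the real per-GPU requirement is even higher.
num_params = 7.1e9        # Bloom-7b1 has roughly 7.1 billion parameters
bytes_per_param = 2       # fp16
weights_gib = num_params * bytes_per_param / 1024**3
print(f"~{weights_gib:.1f} GiB just for the weights")            # ~13.2 GiB

gpu_mem_gib = 11          # RTX 2080 Ti
min_gpus = -(-weights_gib // gpu_mem_gib)                        # ceiling division
print(f"needs at least {int(min_gpus)} such GPUs for the weights alone")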

Please describe the expected behavior

System information and environment

  • OS Platform and Distribution: Linux Ubuntu 18.04.6 LTS
  • Python version: 3.8.5
  • CUDA version: 11.7
  • NCCL version: 2.10.3
  • cupy version: cupy-cuda111==11.2.0
  • GPU model and memory: NVIDIA GeForce RTX 2080 Ti, 11GB
  • Alpa version: git+https://github.com/alpa-projects/alpa.git@a38bfde29e2c1ece5faf5bc59cc4189dde852091
  • Ray version: 2.0.0
  • KubeRay version: v0.3.0
  • JAX version: jax==0.3.15, jaxlib==0.3.15+cuda111.cudnn805
  • k8s version: 1.22
  • 6-node cluster, with 2 GPUs on each node
  • Use NFS for PVC

To Reproduce

Steps to reproduce the behavior:

  1. Follow this link to install the KubeRay operator on the k8s cluster.
  2. Build a Docker image to capture the runtime environment. I put the details of the Dockerfile and build context below.
  3. Prepare the YAML file for the RayJob by substituting <My Docker Image> with the image tag.
  4. You may also need to set up imagePullSecrets if necessary.
  5. Create the RayJob through kubectl apply -f <RayJob yaml>. I put the content of the YAML below.
  6. (Optional) Port-forward component-alpa-service-raycluster-xxx-head-svc on port 8265 to monitor the RayJob status through the Ray dashboard, where xxx is an auto-generated ID.
  7. Port-forward component-alpa-service-raycluster-xxx-head-svc on port 8899.
  8. Send a query to the endpoint with curl -d '{"prompt":"Hello world, ","max_tokens":"128","temperature":"0.7","top_p":"0.5","model":"default"}' localhost:8899/completions (a Python equivalent is sketched below).
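
For convenience, here is a Python equivalent of the curl command in step 8 (my own sketch; it assumes the requests package is installed and the port-forward to 8899 is active):

# Send the same completion request as the curl command above.
import requests

payload = {
    "prompt": "Hello world, ",
    "max_tokens": "128",
    "temperature": "0.7",
    "top_p": "0.5",
    "model": "default",
}
resp = requests.post("http://localhost:8899/completions", json=payload)
print(resp.status_code)
print(resp.text)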

Screenshots

Code snippet to reproduce the problem

Additional information

  • Here is the YAML for the RayJob:
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  annotations:
    meta.helm.sh/release-name: component
    meta.helm.sh/release-namespace: alpa-opt-service
  creationTimestamp: "2022-10-28T15:40:49Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
  name: component-alpa-service
  namespace: alpa-opt-service
  resourceVersion: "42668386"
  selfLink: /apis/ray.io/v1alpha1/namespaces/alpa-opt-service/rayjobs/component-alpa-service
  uid: fe665887-bbcc-4f57-9e09-3f96050e3688
spec:
  entrypoint: python start.py --model $MODEL_NAME --path $MODEL_PATH --tokenizer $TOKENIZER_NAME
    --torch-device $TORCH_DEVICE
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        block: "true"
        dashboard-host: 0.0.0.0
        node-ip-address: $MY_POD_IP
        num-gpus: "1"
        object-store-memory: "100000000"
        port: "6379"
      replicas: 1
      serviceType: ClusterIP
      template:
        metadata:
          labels:
            app.kubernetes.io/instance: component
            app.kubernetes.io/name: kuberay
            groupName: headgroup
            rayCluster: raycluster-alpa-serving
            rayNodeType: head
        spec:
          containers:
          - env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: MODEL_NAME
              value: alpa/bloom-7b1
            - name: MODEL_PATH
              value: /models
            - name: TOKENIZER_NAME
              value: bigscience/bloom-7b1
            - name: TORCH_DEVICE
              value: cpu
            image: <My Docker Image>
            imagePullPolicy: IfNotPresent
            name: ray-head
            ports:
            - containerPort: 6379
              name: gcs-server
              protocol: TCP
            - containerPort: 8265
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            - containerPort: 8000
              name: serve
              protocol: TCP
            - containerPort: 8899
              name: alpa-service
              protocol: TCP
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "4"
                memory: 16Gi
          imagePullSecrets:
          - name: <My secret>
    rayVersion: 2.0.0
    workerGroupSpecs:
    - groupName: gpu
      maxReplicas: 5
      minReplicas: 2
      rayStartParams:
        block: "true"
        node-ip-address: $MY_POD_IP
        num-gpus: "1"
      replicas: 1
      template:
        spec:
          containers:
          - env:
            - name: RAY_DISABLE_DOCKER_CPU_WARNING
              value: "1"
            - name: TYPE
              value: worker
            - name: CPU_REQUEST
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: requests.cpu
            - name: CPU_LIMITS
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: limits.cpu
            - name: MEMORY_LIMITS
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: limits.memory
            - name: MEMORY_REQUESTS
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: requests.memory
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            image: <My Docker Image>
            imagePullPolicy: IfNotPresent
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - ray stop
            name: machine-learning
            resources:
              limits:
                cpu: "4"
                memory: 16Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "4"
                memory: 16Gi
          imagePullSecrets:
          - name: <My secret>
          initContainers:
          - command:
            - sh
            - -c
            - until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local;
              do echo waiting for myservice; sleep 2; done
            image: busybox:1.28
            name: init-myservice
status:
  dashboardURL: component-alpa-service-raycluster-8d9qc-head-svc.alpa-opt-service.svc.cluster.local:8265
  endTime: "1970-01-01T00:00:00Z"
  jobDeploymentStatus: Running
  jobId: component-alpa-service-dw66f
  jobStatus: RUNNING
  message: Job is currently running.
  rayClusterName: component-alpa-service-raycluster-8d9qc
  rayClusterStatus:
    availableWorkerReplicas: 2
    desiredWorkerReplicas: 1
    endpoints:
      alpa-service: "8899"
      client: "10001"
      dashboard: "8265"
      gcs-server: "6379"
      serve: "8000"
    lastUpdateTime: "2022-10-28T15:40:54Z"
    maxWorkerReplicas: 5
    minWorkerReplicas: 2
    state: ready
  startTime: "2022-10-28T15:40:58Z"
  • Details of Docker image
    • The build context contains
      • Dockerfile
      • requirements.txt
      • start.py
    • Content of Dockerfile
      FROM rayproject/ray:2.0.0-py38-cu111
      
      RUN python -m pip install --no-cache-dir --upgrade pip
      COPY requirements.txt .
      RUN python -m pip install --no-cache-dir  -r requirements.txt
      RUN python -m pip install -f https://alpa-projects.github.io/wheels.html jaxlib==0.3.15+cuda111.cudnn805
      RUN python -m pip install ray==2.0.0
      
      COPY start.py .
      
    • Content of requirements.txt
      --extra-index-url https://download.pytorch.org/whl/cu111
      torch
      cupy-cuda111
      
      git+https://github.com/alpa-projects/alpa.git@a38bfde29e2c1ece5faf5bc59cc4189dde852091 
      -e git+https://github.com/alpa-projects/alpa.git@a38bfde29e2c1ece5faf5bc59cc4189dde852091#egg=llm_serving&subdirectory=examples
      transformers==4.23.1
      fastapi
      uvicorn
      omegaconf
      jinja2
      
    • Content of start.py
import argparse
import ray
from alpa.serve import run_controller, CONTROLLER_NAME
from llm_serving.service.constants import (
    NUM_BEAMS, NUM_RETURN_SEQ, USE_RECAPTCHA)
from llm_serving.launch_model_worker import LangaugeModelWorker

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="alpa/opt-125m")
    parser.add_argument("--path", type=str, default="~/opt_weights/")
    parser.add_argument("--host", type=str, default="0.0.0.0")
    parser.add_argument("--port", type=str, default=8899)
    parser.add_argument("--torch-device", type=str, default="cpu")
    parser.add_argument("--tokenizer", type=str)
    parser.add_argument("--no-recaptcha", action="store_true")
    parser.add_argument("--register-name", type=str, default="default")
    parser.add_argument("--ssl-keyfile", type=str)
    parser.add_argument("--ssl-certfile", type=str)
    args = parser.parse_args()

    ray.init()

    try:
        controller = ray.get_actor(CONTROLLER_NAME)
    except ValueError:
        controller = run_controller(args.host, args.port, "/",
                                    args.ssl_keyfile, args.ssl_certfile)

    group_id = 0
    controller.launch_mesh_group_manager.remote(group_id)
    t = controller.register_model.remote(
        args.register_name, LangaugeModelWorker,
        (args.model, args.path, args.torch_device, args.tokenizer, NUM_BEAMS, NUM_RETURN_SEQ,
         False if args.no_recaptcha else USE_RECAPTCHA),
        override=True)
    ray.get(t)
    t = controller.create_replica.remote(args.register_name, group_id)
    ray.get(t)

    while True:
        pass

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 27 (23 by maintainers)

Most upvoted comments

Hi @pascalwhoop, no worries!

num_pp_stages is the number of pipeline stages (pp stands for pipeline).

At execution time, the hidden layers are divided across the pipeline stages (e.g., layers 1-n go to stage 1, layers n-m to stage 2, and so on), and the model is parallelized (for inference/training) using this pipeline. To make the pipeline effective, the resources given to Alpa need to be divided to match each stage of the pipeline. All resources (basically all the GPUs) are managed by Ray, which operates on a per-node basis: a Ray cluster consists of a head node and worker nodes, each with a certain number of GPUs.
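
Purely as an illustration of what "dividing hidden layers over pipeline stages" means (a toy sketch of mine, not Alpa's actual partitioning code; Alpa's compiler decides the real assignment):

# Toy example: split num_layers transformer blocks into num_pp_stages
# contiguous chunks, one chunk per pipeline stage.
def split_layers(num_layers: int, num_pp_stages: int):
    base, rem = divmod(num_layers, num_pp_stages)
    stages, start = [], 0
    for s in range(num_pp_stages):
        size = base + (1 if s < rem else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

# e.g., 30 layers over 2 stages -> stage 0: layers 0-14, stage 1: layers 15-29
for stage_id, layers in enumerate(split_layers(30, 2)):
    print(f"stage {stage_id}: layers {layers[0]}-{layers[-1]}")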

In language model serving (OPT/Bloom), the number of pipeline stages is determined by the code below: https://github.com/alpa-projects/alpa/blob/98df634fdf97c82f016195f74a4d4965420a7d17/examples/llm_serving/model/wrapper.py#L446-L449

I’m not an expert on Ray, so I’m not familiar with how Ray computes these values (devices in a mesh, devices per node, etc.). For the Alpa side, please refer to this file in Alpa for the definition of get_global_cluster() and how its attributes are calculated (the file is long, but num_hosts and num_devices are what you should focus on; also note the assumptions made there, such as all nodes being identical). For the concepts of nodes and workers, please check the Ray documentation.
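
If it helps, here is a small sketch of how one might print those values from the head node (API names as I understand them for the pinned Alpa commit; please treat this as a hedged sketch rather than verified code):

# Attach to the running Ray cluster, let Alpa build its view of it,
# then print the quantities the stage calculation depends on.
import ray
import alpa

ray.init(address="auto")                  # connect to the existing cluster
alpa.init(cluster="ray")                  # let Alpa discover the GPUs

cluster = alpa.get_global_cluster()
print("num_hosts   =", cluster.num_hosts)       # GPU nodes managed by Ray
print("num_devices =", cluster.num_devices)     # total GPUs across those nodes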

If you would like to see some examples of how to use Alpa for parallelization, please check this section on parallel training in the Alpa documentation.

Please feel free to ask me about any issue with serving the Bloom model on Alpa. For other issues with Alpa, I think Hao (zhisbug), Lianmin (merrymercy), and Zhuohan (zhuohan123) might have more insights as creators of the project.

Happy weekend~

A possible next step that might help in this case is to look into how Alpa plans the parallelization in this setup. As mentioned in issue #891, I feel "Inspect the parallelization strategy" will be useful for debugging (to see how Alpa planned the parallelization and where the assertion failure comes from).