LLM-Shearing: AssertionError: Currently only supports dynamic loading from each domain for once.

When I use a single node, 8*A100 80G configuration, I find that an error occurs:

LLM-Shearing/llmshearing/datasets/streaming_dataset.py:46 in generate_work

   43 │   │   List[List[int]]: The epoch for each domain of data (num physical nodes,
   44 │   │   ranks per node, workers per rank, batches per worker, batch size).
   45 │   """
❱  46 │   assert epoch == 0, "Currently only supports dynamic loading from each domain for once."
   47 │   # Ensure that num_canonical_nodes has been set.
   48 │   if dataset.num_canonical_nodes is None:
   49 │   │   raise RuntimeError(f'`num_canonical_nodes` can never be None. ' +
AssertionError: Currently only supports dynamic loading from each domain for once.

If I delete the line `assert epoch == 0, "Currently only supports dynamic loading from each domain for once."`, it causes another error in:

# Currently only supports dynamically loading data from each domain for once.
# Issues could occur if one domain of data is used up.
while True:
    proportion = self.proportion
    stream_id = np.random.choice(range(self.num_streams), 1, p=proportion)[0].item()
    domain_sample_id = sample_ids_per_stream[stream_id]
    domain_sample_id = domain_sample_id[self.used_num_samples_per_stream[stream_id] \
                        % self.samples_per_stream[stream_id]]
    self.used_num_samples_per_stream[stream_id] += 1
    yield self[domain_sample_id]
---
IndexError: index 24 is out of bounds for axis 0 with size 24
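To illustrate the sampling logic involved, here is a minimal, self-contained sketch of the same loop (variable names simplified from the snippet above; `draw` is a hypothetical helper, and the comment about stale counts is one plausible reading of the IndexError, not a confirmed diagnosis):

```python
import numpy as np

# Each draw picks a stream according to `proportion`, then indexes into that
# stream's sample ids modulo its length, so a stream wraps around on reuse.
rng = np.random.default_rng(0)
proportion = [0.5, 0.5]
sample_ids_per_stream = [np.arange(10), np.arange(10, 34)]  # sizes 10 and 24
samples_per_stream = [len(s) for s in sample_ids_per_stream]
used = [0, 0]

def draw():
    stream_id = rng.choice(len(proportion), p=proportion)
    ids = sample_ids_per_stream[stream_id]
    # The modulo only stays in range while `samples_per_stream` matches the
    # real array lengths; if a later epoch re-partitions the ids so the two
    # disagree, an out-of-bounds index like "index 24 ... size 24" can occur.
    sample_id = ids[used[stream_id] % samples_per_stream[stream_id]]
    used[stream_id] += 1
    return int(sample_id)

draws = [draw() for _ in range(100)]  # more draws than samples: wraps safely here
```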

If I add `if world.is_local_leader and epoch == 0:`, the `SharedMemory` call in `_attach_work` raises an error:

  304 │   │   │   # Load the generated epoch shape from shared memory.
  305 │   │   │   name = _get_path(self._shm_prefix_int, EPOCH_SHAPE + f"_{stream_id}")
  306 │   │   │   size = ndim * np.int64().nbytes
❱ 307 │   │   │   shape_shm = SharedMemory(name=name, create=False, size=size, auto_cleanup=False)
  308 │   │   │   shape = tuple(np.ndarray(5, buffer=shape_shm.buf, dtype=np.int64))
  309 │   │   │
  310 │   │   │   # Attach to the generated epoch data in shared memory.
---
FileNotFoundError: [Errno 2] No such file or directory: '/000000_epoch_shape_0'
Exception ignored in atexit callback: <function Engine._close at 0x7fe6dc766a70>

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 17 (12 by maintainers)

Most upvoted comments

There are two ways to use a fixed data loading proportion!

The first way:

  • dynamic: false
  • split: wikipedia (make sure this directory contains mds files)

This setup allows you to load data from a single data folder of mds files.
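A hypothetical config excerpt for this first setup (the `dynamic` and `split` keys come from the list above; the surrounding structure is assumed, not taken from the repo):

```yaml
train_loader:
  dataset:
    dynamic: false
    split: wikipedia   # this directory must contain .mds shard files
```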

The second way:

  • dynamic: true
  • update_type: constant
  • set_names: specify the set names
  • proportion: specify the loading proportion

This setup allows you to load data from multiple data folders of mds files and use a constant proportion.
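A hypothetical config excerpt for this second setup (keys come from the list above; the set names, proportions, and surrounding structure are placeholder assumptions):

```yaml
train_loader:
  dataset:
    dynamic: true
    update_type: constant
    set_names: [cc, github, book, wiki]   # example domain names
    proportion: [0.6, 0.1, 0.1, 0.2]      # one entry per set, summing to 1
```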

You can refer to the callback function of dynamic loading here: https://github.com/princeton-nlp/LLM-Shearing/blob/main/llmshearing/callbacks/dynamic_loading_callback.py#L32