LLM-Shearing: AssertionError: Currently only supports dynamic loading from each domain for once.
When I use a single-node, 8×A100 80G configuration, the following error occurs:
LLM-Shearing/llmshearing/datasets/streaming_dataset.py:46 in generate_work

    43         List[List[int]]: The epoch for each domain of data (num physical nodes,
    44         ranks per node, workers per rank, batches per worker, batch size).
    45     """
  ❱ 46     assert epoch == 0, "Currently only supports dynamic loading from each domain for once."
    47     # Ensure that num_canonical_nodes has been set.
    48     if dataset.num_canonical_nodes is None:
    49         raise RuntimeError(f'`num_canonical_nodes` can never be None. ' +
AssertionError: Currently only supports dynamic loading from each domain for once.
If I delete `assert epoch == 0, "Currently only supports dynamic loading from each domain for once."`, another error occurs in:
# Currently only supports dynamically loading data from each domain for once.
# Issues could occur if one domain of data is used up.
while True:
    proportion = self.proportion
    stream_id = np.random.choice(range(self.num_streams), 1, p=proportion)[0].item()
    domain_sample_id = sample_ids_per_stream[stream_id]
    domain_sample_id = domain_sample_id[self.used_num_samples_per_stream[stream_id]
                                        % self.samples_per_stream[stream_id]]
    self.used_num_samples_per_stream[stream_id] += 1
    yield self[domain_sample_id]
---
IndexError: index 24 is out of bounds for axis 0 with size 24
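The modulo in the snippet above should normally wrap the counter, so this IndexError suggests the counter is being applied to sample ids that were only generated for epoch 0. For illustration only (the size 24 comes from the error message, and the variable names are simplified), the failure mode reduces to indexing past a fixed-size id array without the wrap:

```python
import numpy as np

# Hypothetical: a domain's sample ids are generated once, with 24 entries.
sample_ids = np.arange(24)
used_num_samples = 24  # counter value after the domain is fully consumed

# Without wrapping the counter back into range, the next lookup runs off
# the end of the array:
try:
    sample_ids[used_num_samples]
except IndexError as err:
    print(err)  # index 24 is out of bounds for axis 0 with size 24
```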
If I add `if world.is_local_leader and epoch == 0:`, the `SharedMemory` call in `_attach_work` raises an error:
    304             # Load the generated epoch shape from shared memory.
    305             name = _get_path(self._shm_prefix_int, EPOCH_SHAPE + f"_{stream_id}")
    306             size = ndim * np.int64().nbytes
  ❱ 307             shape_shm = SharedMemory(name=name, create=False, size=size, auto_cleanup=False)
    308             shape = tuple(np.ndarray(5, buffer=shape_shm.buf, dtype=np.int64))
    309
    310             # Attach to the generated epoch data in shared memory.
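This FileNotFoundError is consistent with a non-leader rank attaching (`create=False`) to a shared-memory segment that was never created, because creation was gated behind the added `epoch == 0` condition. A minimal standard-library illustration (the segment name below is made up for the demo):

```python
import uuid
from multiprocessing import shared_memory

# Hypothetical segment name that no process has created.
name = f"demo_epoch_shape_{uuid.uuid4().hex}"

try:
    # Attaching to a nonexistent segment fails the same way as in the
    # traceback above: the backing file under /dev/shm does not exist.
    shared_memory.SharedMemory(name=name, create=False)
except FileNotFoundError as err:
    print(type(err).__name__)  # FileNotFoundError
```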
---
FileNotFoundError: [Errno 2] No such file or directory: '/000000_epoch_shape_0'
Exception ignored in atexit callback: <function Engine._close at 0x7fe6dc766a70>
About this issue
- State: closed
- Created 8 months ago
- Comments: 17 (12 by maintainers)
There are two ways to use a fixed data loading proportion!
The first way:
The second way:
You can refer to the callback function of dynamic loading here: https://github.com/princeton-nlp/LLM-Shearing/blob/main/llmshearing/callbacks/dynamic_loading_callback.py#L32
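As a rough sketch of the "fixed proportion" idea (the class and method names below are hypothetical, not the repository's API; the real dynamic version lives in the linked dynamic_loading_callback.py), a fixed proportion simply means never re-estimating the per-domain weights after initialization:

```python
class FixedProportionLoader:
    """Hypothetical sketch: sample domains with a constant proportion."""

    def __init__(self, proportion):
        # e.g. two domains weighted 0.7 / 0.3; weights must sum to 1.
        assert abs(sum(proportion) - 1.0) < 1e-9
        self.proportion = list(proportion)

    def update_proportion(self, per_domain_losses):
        # The dynamic-loading callback would recompute self.proportion from
        # the per-domain losses here; for fixed loading, leave it unchanged.
        pass


loader = FixedProportionLoader([0.7, 0.3])
loader.update_proportion([1.25, 0.91])  # losses are ignored
print(loader.proportion)  # [0.7, 0.3]
```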