ColossalAI: [BUG]: Memory consumption by fp16 is not normal

๐Ÿ› Describe the bug

When I use PyTorch's native AMP, GPU memory usage is much lower than with ColossalAI. Why? The config is:

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

fp16 = dict(
    mode=AMP_TYPE.TORCH,
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    # weight_decay=1e-2,
)

| model | dataset | machines | batch size | gradient accumulation steps | ZeRO | speed | GPU memory | optimizer | tensor_placement_policy | setup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 12.43 it/s (24%, 2089/8549, 02:51<08:39) | 8703 MB | HybridAdam | – | single machine + Engine |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 11.17 it/s (19%, 1599/8549, 02:24<10:21) | 5769 MB | HybridAdam | – | single machine + w/o Engine + PyTorch native fp16 |

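For reference, the "w/o Engine + PyTorch native fp16" row presumably corresponds to a plain torch.cuda.amp training loop like the minimal sketch below. The tiny model, random data, and loss are placeholders standing in for ir18 and the private dataset, not code from the original report:

import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model/data; only the AMP mechanics matter for the comparison.
model = torch.nn.Linear(512, 10).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scaler = GradScaler()

for _ in range(10):
    images = torch.randn(64, 512, device="cuda")
    labels = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()         # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients, then steps in fp32
    scaler.update()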

Environment

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 26 (11 by maintainers)

Most upvoted comments

The command is:

colossalai run --nproc_per_node 1 train_debug.py --config_dir $PATH_TO_YOUR_CONFIG
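For context, a rough sketch of how train_debug.py presumably wires the config into the legacy ColossalAI engine API is shown below. The --config_dir argument handling, the tiny model, and the random dataset are assumptions, not the user's actual script:

import argparse
import torch
import colossalai
from colossalai.nn.optimizer import HybridAdam
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument("--config_dir", type=str)
args = parser.parse_args()

# Reads the config file, including the fp16 and optimizer dicts shown above.
colossalai.launch_from_torch(config=args.config_dir)

# Placeholder model/data standing in for ir18 and the private dataset.
model = torch.nn.Linear(512, 10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = HybridAdam(model.parameters(), lr=0.001)
train_loader = DataLoader(
    TensorDataset(torch.randn(640, 512), torch.randint(0, 10, (640,))),
    batch_size=64,
)

# initialize() wraps everything into an Engine; with mode=AMP_TYPE.TORCH the
# engine applies torch.cuda.amp (autocast + GradScaler) around forward/backward.
engine, train_loader, _, _ = colossalai.initialize(model, optimizer, criterion, train_loader)

engine.train()
for img, label in train_loader:
    img, label = img.cuda(), label.cuda()
    engine.zero_grad()
    loss = engine.criterion(engine(img), label)
    engine.backward(loss)
    engine.step()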