ColossalAI: [BUG]: Memory consumption by fp16 is not normal

๐Ÿ› Describe the bug

When I use PyTorch's native AMP, GPU memory usage is much lower than with ColossalAI. Why? The config is:

from colossalai.amp import AMP_TYPE
from colossalai.zero.shard_utils import TensorShardStrategy
from colossalai.nn.optimizer import HybridAdam

fp16 = dict(
    mode=AMP_TYPE.TORCH,
)

optimizer = dict(
    type=HybridAdam,
    lr=0.001,
    # weight_decay=1e-2,
)

| model | dataset | machines | batch size | gradient accumulation steps | ZeRO | speed | GPU memory | optimizer | tensor_placement_policy | setup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 12.43 it/s (24%, 2089/8549, 02:51<08:39) | 8703 MB | HybridAdam | – | single machine + Engine |
| ir18 | private dataset | 1 | 64 | 1 | no ZeRO | 11.17 it/s (19%, 1599/8549, 02:24<10:21) | 5769 MB | HybridAdam | – | single machine + w/o Engine + PyTorch native fp16 |

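For reference, the "w/o Engine + PyTorch native fp16" row presumably corresponds to a plain torch.cuda.amp training loop like the minimal sketch below. The tiny model, random data, and loss are placeholders standing in for ir18 and the private dataset, not code from the original report:

import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model/data; only the AMP mechanics matter for the comparison.
model = torch.nn.Linear(512, 10).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scaler = GradScaler()

for _ in range(10):
    images = torch.randn(64, 512, device="cuda")
    labels = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with autocast():                      # forward pass runs in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()         # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients, then steps in fp32
    scaler.update()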

Environment

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 26 (11 by maintainers)

Most upvoted comments

The command is:

colossalai run --nproc_per_node 1 train_debug.py --config_dir $PATH_TO_YOUR_CONFIG
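For context, a rough sketch of how train_debug.py presumably wires the config into the legacy ColossalAI engine API is shown below. The --config_dir argument handling, the tiny model, and the random dataset are assumptions, not the user's actual script:

import argparse
import torch
import colossalai
from colossalai.nn.optimizer import HybridAdam
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument("--config_dir", type=str)
args = parser.parse_args()

# Reads the config file, including the fp16 and optimizer dicts shown above.
colossalai.launch_from_torch(config=args.config_dir)

# Placeholder model/data standing in for ir18 and the private dataset.
model = torch.nn.Linear(512, 10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = HybridAdam(model.parameters(), lr=0.001)
train_loader = DataLoader(
    TensorDataset(torch.randn(640, 512), torch.randint(0, 10, (640,))),
    batch_size=64,
)

# initialize() wraps everything into an Engine; with mode=AMP_TYPE.TORCH the
# engine applies torch.cuda.amp (autocast + GradScaler) around forward/backward.
engine, train_loader, _, _ = colossalai.initialize(model, optimizer, criterion, train_loader)

engine.train()
for img, label in train_loader:
    img, label = img.cuda(), label.cuda()
    engine.zero_grad()
    loss = engine.criterion(engine(img), label)
    engine.backward(loss)
    engine.step()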