llm-foundry: Loss won't converge using the LION optimizers
Environment
Collecting system information...
---------------------------------
System Environment Report
Created: 2023-06-12 23:25:33 CST
---------------------------------
PyTorch information
-------------------
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.17
Python version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB
GPU 4: NVIDIA A100-PCIE-40GB
GPU 5: NVIDIA A100-PCIE-40GB
GPU 6: NVIDIA A100-PCIE-40GB
GPU 7: NVIDIA A100-PCIE-40GB
Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.13.1
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl defaults
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2023.1.0 h6d00ec8_46342 defaults
[conda] mkl-service 2.4.0 py310h5eee18b_1 defaults
[conda] mkl_fft 1.3.6 py310h1128e8f_1 defaults
[conda] mkl_random 1.2.2 py310h1128e8f_1 defaults
[conda] numpy 1.24.3 py310h5f9d8c6_1 defaults
[conda] numpy-base 1.24.3 py310hb5e798b_1 defaults
[conda] pytorch 1.13.1 py3.10_cuda11.7_cudnn8.5.0_0 pytorch
[conda] pytorch-cuda 11.7 h778d358_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 0.13.1 py310_cu117 pytorch
[conda] torchmetrics 0.11.3 pypi_0 pypi
[conda] torchtext 0.14.1 pypi_0 pypi
[conda] torchvision 0.14.1 py310_cu117 pytorch
Composer information
--------------------
Composer version: 0.14.1
Composer commit hash: None
Host processor model name: AMD EPYC 7352 24-Core Processor
Host processor core count: 48
Number of nodes: 1
Accelerator model name: NVIDIA A100-PCIE-40GB
Accelerators per node: 1
CUDA Device Count: 8
To reproduce
I’m training a 1B MPT model; the Adam optimizer worked quite well, but the LION optimizers didn’t.
LION config:
optimizer:
  name: decoupled_lionw
  lr: 1e-4
  betas:
  - 0.9
  - 0.99
  weight_decay: 0.0
  outlier_threshold: 5
These values follow the LION paper's suggestions except for weight_decay: the paper used 1e-2, but Composer warned that value was too large, and it didn't yield a better loss in early runs.
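For what it's worth, the LION paper's general guidance is to start from a working AdamW configuration, shrink the learning rate by roughly 3-10x, and grow the weight decay by the same factor so that the effective decay (lr × weight_decay) stays roughly constant. A minimal sketch of that conversion, using placeholder AdamW values rather than the ones from this run:

# Rule-of-thumb conversion from a working AdamW config to LION:
# shrink lr by ~3-10x and grow weight_decay by the same factor so the
# effective decay (lr * weight_decay) stays roughly constant.
adamw_lr, adamw_wd = 2e-4, 1e-2   # hypothetical working AdamW settings
factor = 5                        # anywhere in the 3-10 range

lion_lr = adamw_lr / factor       # 4e-5
lion_wd = adamw_wd * factor       # 5e-2

# sanity check: the effective decay is preserved
assert abs(adamw_lr * adamw_wd - lion_lr * lion_wd) < 1e-12
print(f"lion lr={lion_lr:.1e}, weight_decay={lion_wd:.1e}")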
On a custom corpus with a custom tokenizer, decoupled_adamw worked quite well, but the LION optimizers failed to converge anywhere close to Adam's loss.
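As a sanity check that is independent of any fused/triton kernel, here is a minimal pure-PyTorch sketch of the LION update as described in the paper. This is not llm-foundry's DecoupledLionW code, just the reference math; note that Composer's "decoupled" optimizers may scale weight decay differently from the lr-scaled decay used here:

import torch

class ReferenceLion(torch.optim.Optimizer):
    # Reference LION step, per the paper:
    #   u = sign(beta1 * m + (1 - beta1) * g)
    #   p = p - lr * (u + weight_decay * p)   # weight decay applied to p, scaled by lr
    #   m = beta2 * m + (1 - beta2) * g
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
        super().__init__(params, dict(lr=lr, betas=betas, weight_decay=weight_decay))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, (b1, b2), wd = group["lr"], group["betas"], group["weight_decay"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                m = state["momentum"]
                # sign of the interpolated momentum, then decoupled weight decay
                update = (b1 * m + (1 - b1) * p.grad).sign_()
                p.add_(update + wd * p, alpha=-lr)
                # EMA update of the momentum buffer
                m.mul_(b2).add_(p.grad, alpha=1 - b2)
        return loss

Running a few hundred steps of this on a small model should help separate a hyperparameter problem from an issue in the fused implementation.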
Expected behavior
Converge to a loss close to Adam's.
Additional context
Is there a bug in LION or do you have any suggestions regarding hyper-parameter tuning with LION? Thanks.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 23 (3 by maintainers)
Yes, I guess some numerical errors/losses accumulate in triton's implementation.
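One way to probe that hypothesis without touching the kernel itself is a standalone drift check: run the same LION momentum math in bf16 and fp32 on identical synthetic gradients, then measure how far the low-precision momentum buffer wanders and how often the resulting sign(...) update disagrees. This is a toy sketch, not the actual triton code path:

import torch

torch.manual_seed(0)
beta1, beta2 = 0.9, 0.99
steps, n = 2000, 4096

m32 = torch.zeros(n, dtype=torch.float32)
m16 = torch.zeros(n, dtype=torch.bfloat16)
flipped = 0

for _ in range(steps):
    g = torch.randn(n) * 1e-3                          # synthetic gradients
    u32 = (beta1 * m32 + (1 - beta1) * g).sign()
    u16 = (beta1 * m16 + (1 - beta1) * g.bfloat16()).sign()
    flipped += (u32 != u16.float()).sum().item()       # sign disagreements this step
    m32.mul_(beta2).add_(g, alpha=1 - beta2)
    m16.mul_(beta2).add_(g.bfloat16(), alpha=1 - beta2)

rel_drift = ((m16.float() - m32).norm() / m32.norm()).item()
print(f"relative momentum drift: {rel_drift:.3e}")
print(f"sign flips: {flipped} / {steps * n} ({100 * flipped / (steps * n):.3f}%)")

Because the parameter update is purely the sign of the interpolated momentum, small errors only matter when the momentum is near zero, but a low-precision buffer can still drift enough over many steps to change the update direction for a noticeable fraction of parameters.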