llm-foundry: Loss won't converge using the LION optimizers

Environment

Collecting system information...
---------------------------------
System Environment Report        
Created: 2023-06-12 23:25:33 CST
---------------------------------

PyTorch information
-------------------
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.17

Python version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB
GPU 4: NVIDIA A100-PCIE-40GB
GPU 5: NVIDIA A100-PCIE-40GB
GPU 6: NVIDIA A100-PCIE-40GB
GPU 7: NVIDIA A100-PCIE-40GB

Nvidia driver version: 515.65.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.13.1
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==0.13.1
[pip3] torchmetrics==0.11.3
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] blas                      1.0                         mkl    defaults
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2023.1.0         h6d00ec8_46342    defaults
[conda] mkl-service               2.4.0           py310h5eee18b_1    defaults
[conda] mkl_fft                   1.3.6           py310h1128e8f_1    defaults
[conda] mkl_random                1.2.2           py310h1128e8f_1    defaults
[conda] numpy                     1.24.3          py310h5f9d8c6_1    defaults
[conda] numpy-base                1.24.3          py310hb5e798b_1    defaults
[conda] pytorch                   1.13.1          py3.10_cuda11.7_cudnn8.5.0_0    pytorch
[conda] pytorch-cuda              11.7                 h778d358_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                0.13.1              py310_cu117    pytorch
[conda] torchmetrics              0.11.3                   pypi_0    pypi
[conda] torchtext                 0.14.1                   pypi_0    pypi
[conda] torchvision               0.14.1              py310_cu117    pytorch


Composer information
--------------------
Composer version: 0.14.1
Composer commit hash: None
Host processor model name: AMD EPYC 7352 24-Core Processor
Host processor core count: 48
Number of nodes: 1
Accelerator model name: NVIDIA A100-PCIE-40GB
Accelerators per node: 1
CUDA Device Count: 8

To reproduce

I’m training a 1B MPT model; the Adam optimizer worked quite well, but the LION optimizers didn’t.

LION config:

optimizer:
  name: decoupled_lionw
  lr: 1e-4
  betas:
    - 0.9
    - 0.99
  weight_decay: 0.0
  outlier_threshold: 5

These values follow the LION paper’s suggestions except for weight_decay: the paper used 1e-2, but Composer warned that it was too large, and it didn’t yield better loss in my early runs.
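For reference, here’s a minimal sketch of the LION update rule as described in the paper (not Composer’s exact DecoupledLionW implementation); the helper name and the tensor-list arguments are illustrative only:

import torch

# Minimal sketch of the LION update rule from the paper; not Composer's
# DecoupledLionW. `params`, `grads`, and `momenta` are assumed to be
# matching lists of plain tensors.
@torch.no_grad()
def lion_step(params, grads, momenta, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
    beta1, beta2 = betas
    for p, g, m in zip(params, grads, momenta):
        # The update is the sign of an interpolation between the momentum and
        # the current gradient, so every coordinate moves by exactly +/- lr.
        update = torch.sign(beta1 * m + (1 - beta1) * g)
        # Weight decay is added to the update and scaled by lr (AdamW-style).
        p.add_(update + weight_decay * p, alpha=-lr)
        # The momentum is refreshed with the current gradient after the step.
        m.mul_(beta2).add_(g, alpha=1 - beta2)

Because the sign update gives every weight a step of exactly ±lr, the paper recommends a learning rate several times (roughly 3–10×) smaller than AdamW’s, with weight decay scaled up correspondingly.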

On a custom corpus with a custom tokenizer, decoupled_adamw worked quite well, but the LION optimizers failed to converge anywhere close to Adam.

[attached screenshot, 2023-06-12 at 10:33 AM]

Expected behavior

Converge to a loss close to Adam’s.

Additional context

Is there a bug in LION or do you have any suggestions regarding hyper-parameter tuning with LION? Thanks.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 23 (3 by maintainers)

Most upvoted comments

Yes, you’re right! I’m using attn_impl: triton for all experiments; torch is significantly slower. Not sure whether flash will be any faster or more stable.

Yes, torch is slow, but it works; I’ve never seen problems with it.

So I think the problem may be in this area, maybe even some hardware mismatch.

Yes, I suspect some numerical error / loss of precision accumulates in triton’s implementation.
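For illustration, here’s a generic sketch (plain PyTorch, not llm-foundry or triton code) of how this kind of drift can be measured: compare an fp16 attention computation against an fp32 reference on random inputs. The shapes and the CUDA device below are illustrative assumptions.

import torch

# Generic sketch: how far does a plain fp16 attention computation drift from
# an fp32 reference on random inputs? Shapes/device are illustrative.
torch.manual_seed(0)
q, k, v = (torch.randn(8, 1024, 64, device="cuda") for _ in range(3))

def attention(q, k, v):
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

ref = attention(q, k, v)                               # fp32 reference
out = attention(q.half(), k.half(), v.half()).float()  # fp16 computation
print("max abs diff vs fp32:", (ref - out).abs().max().item())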