vit-pytorch: Model doesn't converge
We are trying to apply this method to a medical dataset with about 70K images (224×224 resolution) across 5 classes. However, our training does not converge: we tried a range of learning rates (e.g. 3e-3, 3e-4), but none of them worked. Our model currently reaches 45% accuracy after 100 epochs of training, whereas the typical accuracy reported for this dataset is around 85-90%. Is there anything else we should tune?
Also, here is our configuration:
```python
import torch
from linformer import Linformer
from vit_pytorch.efficient import ViT

device = 'cuda' if torch.cuda.is_available() else 'cpu'

batch_size = 64
epochs = 400
lr = 3e-4
gamma = 0.7
seed = 42

efficient_transformer = Linformer(
    dim=128,
    seq_len=49 + 1,   # 7x7 patches + 1 cls-token
    depth=4,
    heads=8,
    k=64
)

# Visual Transformer
model = ViT(
    dim=128,
    image_size=224,
    patch_size=32,
    num_classes=5,
    transformer=efficient_transformer,  # nn.Transformer(d_model=128, nhead=8),
    channels=1,
).to(device)
```
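For completeness, here is a simplified sketch of the training loop (assuming the standard Adam + StepLR setup from the vit-pytorch example notebook, not our exact training code; `train_loader` stands in for the actual data loader):

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# step_size=1 -> learning rate is multiplied by gamma (0.7) after every epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=gamma)

for epoch in range(epochs):
    model.train()
    for data, label in train_loader:  # train_loader: DataLoader over the training images (assumed)
        data, label = data.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(data), label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```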
Thank you very much!
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 2
- Comments: 22 (7 by maintainers)
@liberbey Hey Ahmet! One of the pitfalls of transformers is choosing settings that make the dimension per head too small. The dimension per head should be at least 32, and ideally 64. It is calculated as `dim // heads`, so in your case each head has a dimension of only 128 // 8 = 16. Try increasing the dimension to 256 and increasing the sequence length (decrease the patch size to 16); I would be very surprised if it does not work. (A sketch of such a configuration follows below.)

Did you use a special learning rate scheduler? The loss curve on my own dataset also shows an unusual shape, check here. It seems that ViT is hard to train.
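Concretely, the suggested change (dim=256, so each of the 8 heads gets 256 // 8 = 32 dimensions, and patch_size=16, so the sequence grows to 14×14 = 196 patches plus the cls token) would look roughly like this; a sketch of the configuration, not a tested setting:

```python
from linformer import Linformer
from vit_pytorch.efficient import ViT

efficient_transformer = Linformer(
    dim=256,           # 256 // 8 heads = 32 dims per head
    seq_len=196 + 1,   # (224 // 16)**2 = 14x14 patches + 1 cls-token
    depth=4,
    heads=8,
    k=64
)

model = ViT(
    dim=256,
    image_size=224,
    patch_size=16,     # smaller patches -> longer sequence
    num_classes=5,
    transformer=efficient_transformer,
    channels=1,
).to(device)
```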
@lucidrains Thanks again! We will try to find a larger dataset. By the way, these are validation results, not test results, so we wondered whether there could be another problem with our approach: we expected the test results to be bad because we are not using a pretrained model, but not the validation results. Also, do you have any suggestions about the dramatic drop around the 80th epoch?
@lucidrains We have changed the parameters as follows:
But our model still does not converge. Here are the results:
Do you have any other suggestions? Thanks again!