meshgpt-pytorch: The MeshTransformer does not generate coherent results
I have trained the MeshTransformer on 200 different meshes from the chair category of ShapeNet, after decimating them and filtering for meshes with fewer than 400 vertices and faces. The MeshTransformer reached a loss very close to 0.
But when I call the `generate` method of the MeshTransformer, I get very bad results.
From left to right: ground truth, autoencoder output, and the MeshTransformer-generated mesh at temperatures 0, 0.1, 0.7, and 1. This was done with meshgpt-pytorch version 0.3.3.
Note: the MeshTransformer was not conditioned on text or anything else, so the output is not supposed to look exactly like the sofa, but it barely looks like a chair at all. You can guess at the backrest and the legs, but that's it.
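For reference, this is roughly how I am calling `generate` in all of these experiments (a minimal sketch; the constructor values and checkpoint paths are placeholders, and the `temperature` keyword is what I believe the version I am running accepts):

```python
import torch
from meshgpt_pytorch import MeshAutoencoder, MeshTransformer

# rebuild the models with the same hyperparameters used in training
# (the values and checkpoint paths below are placeholders, not my exact ones)
autoencoder = MeshAutoencoder(num_discrete_coors = 128)
autoencoder.load_state_dict(torch.load('autoencoder.ckpt.pt'))

transformer = MeshTransformer(autoencoder, dim = 512, max_seq_len = 8192)
transformer.load_state_dict(torch.load('mesh_transformer.ckpt.pt'))
transformer.eval()

# unconditional sampling at the temperatures shown above
for temperature in (0.0, 0.1, 0.7, 1.0):
    with torch.no_grad():
        face_coords, face_mask = transformer.generate(temperature = temperature)
    # face_coords holds the sampled face vertex coordinates; face_mask marks the valid faces
```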
Initially I thought there might have been an error with the KV cache, so here are the results with `cache_kv=False`:
And this one is with meshgpt-pytorch version 0.2.11:
When I trained on a single chair with a version before 0.2.11, the `generate` method was able to create a coherent chair (from left to right: ground truth, autoencoder output, `meshtransformer.generate()`).
Why are the generated results so bad even though the transformer loss was very low?
I have uploaded the autoencoder and MeshTransformer checkpoints (on version 0.3.3), as well as 10 data samples, here: https://file.io/nNsfTyHX4aFB
Also, a quick question: why rewrite the transformer from scratch instead of using the HuggingFace GPT-2 transformer?
The transformer loss needs to be near 0.01 or 0.001, and the autoencoder loss can be around 0.25 to 0.35 or lower.
I did some testing using 10 3D chair meshes. I applied augmentation so each chair got 3 variations, then I duplicated each variation 500 times, so the total dataset size is 3000. I encoded the meshes using text as well, but just with the same word 'chair'; still, this proves that text-conditioned generation works.
After 22 minutes training the encoder (0.24 loss) and then 2.5 hours training the transformer (0.0048 loss), I got the result below. To generate complete models, I think a loss of about 0.0001 should be good. I trained the transformer with different learning rates, but in total there were 30 epochs, i.e. 30 x 3000 = 90,000 steps.
The training is very slow at the end; I might need to up the transformer's dim size to 1024.
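For reference, the augment-then-duplicate preparation looks roughly like this (just a sketch; the random scale/jitter augmentation and the sample dict layout are my own simplifications, not the exact code I used):

```python
import random
import torch

def augment(vertices: torch.Tensor) -> torch.Tensor:
    # cheap illustrative augmentation: random uniform scale plus a small jitter
    scale = random.uniform(0.9, 1.1)
    jitter = torch.randn_like(vertices) * 0.005
    return vertices * scale + jitter

def build_dataset(meshes, num_variations = 3, num_duplicates = 500, text = 'chair'):
    # meshes: list of (vertices, faces) tensor pairs
    samples = []
    for vertices, faces in meshes:
        for _ in range(num_variations):
            sample = dict(vertices = augment(vertices), faces = faces, texts = text)
            # duplicate each variation so every epoch sees it many times
            samples.extend([sample] * num_duplicates)
    return samples

# e.g. 10 chairs, 3 variations each, 500 duplicates per variation
# samples = build_dataset(chair_meshes)
```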
I provided the text "chair" and looped the generation over different temperature values from 0 to 1.0 with 0.1 as the step size.
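The sweep itself is just a loop over `generate` (a sketch; it assumes the model was built with `condition_on_text = True` and that `texts`, `temperature`, and `cond_scale` are accepted by `generate` in the version you are on):

```python
import torch

# assumes `transformer` is the trained, text-conditioned MeshTransformer
generated = {}
with torch.no_grad():
    for step in range(11):                    # temperatures 0.0, 0.1, ..., 1.0
        temperature = round(step * 0.1, 1)
        face_coords, face_mask = transformer.generate(
            texts = ['chair'],
            temperature = temperature,
            cond_scale = 1.,                  # > 1. pushes the sample harder toward the text
        )
        generated[temperature] = (face_coords, face_mask)
```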

The issue is this: not enough data.
In the PolyGen and MeshGPT papers they stress that they didn't have enough training data and used only 28,000 mesh models. They needed to augment those with, let's say, 20 augmentations, which means they effectively trained on 560,000 mesh models. But since it seems like you are not using text, you can try feeding the transformer a prompt of 10-30 connected faces of a model and see what happens (like in the paper); it should act as an autocomplete.
The loss of the transformer should be below 0.0001 for successful generations.
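A rough sketch of the face-prompt idea (the `prompt` keyword and the exact code layout it expects are assumptions on my end; I believe the codes come from `autoencoder.tokenize`, but double-check the `generate` signature in your version; `autoencoder`, `transformer`, `vertices`, and `faces` are assumed to come from your existing setup):

```python
import torch

# tokenize a ground-truth mesh into the codes the transformer was trained on
with torch.no_grad():
    codes = autoencoder.tokenize(vertices = vertices, faces = faces)

# keep only the codes that belong to the first ~20 faces as the prompt
num_prompt_faces = 20
codes_per_face = codes.shape[-1] // faces.shape[1]   # assumes codes are flattened per face
prompt = codes[..., :num_prompt_faces * codes_per_face]

# ask the transformer to "autocomplete" the rest of the mesh from the prompt
face_coords, face_mask = transformer.generate(prompt = prompt, temperature = 0.1)
```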
Here is a rough idea of how much data you should use: 20 augmentations * 100 duplicates of each augmentation * 200 models = 400,000 samples per dataset.
I recommend that you create a trainer that uses epochs instead of steps (or take a look at the one in my fork), since printing out 400k steps will slow down the training.
Then train on this for a day or two, and use a large batch size (up to 64) to promote generalization.
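Something along these lines for the epoch-based loop (a generic sketch, not the trainer from my fork; `collate_meshes` is a placeholder for however you batch variable-size meshes, and `samples` is the list built above):

```python
from torch.optim import Adam
from torch.utils.data import DataLoader

loader = DataLoader(samples, batch_size = 64, shuffle = True, collate_fn = collate_meshes)
optimizer = Adam(transformer.parameters(), lr = 1e-4)

num_epochs = 30
for epoch in range(num_epochs):
    total_loss = 0.
    for batch in loader:
        loss = transformer(
            vertices = batch['vertices'],
            faces = batch['faces'],
            texts = batch['texts'],
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    # one progress line per epoch instead of one per step
    print(f'epoch {epoch + 1}/{num_epochs} - mean loss {total_loss / len(loader):.5f}')
```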
In the paper they used 28,000 3D models. Let's say they generated 10 augmentations per model and then used 10 duplicates, since it's more effective to train with a big batch size of 64; with only a small number of models per dataset it will not train effectively and you will waste GPU parallelism. That means 10 * 10 = 100 copies per model, and 100 * 28,000 = 2,800,000 samples.
I want to stress this: overfitting on 1 model = super easy. Training a model to be general enough for many different models = very hard.
GPT-2 is quite old and there have been improvements since, so it's not a very good choice anymore.
This hyperparameter doesn't actually improve results; it just gives better alignment to the text description (if the model is not following it).
Success! 😃 I let it run during the night and got this result. I think I might modify the encoder parameters; I used a dim of 768 for the transformer, which gives 236M parameters. I'll try decreasing and increasing the parameter count for the encoder.
- Autoencoder training: 6 h, 210 epochs, 756,000 steps, 0.9 loss
- Transformer training: 3 h, 20 epochs, 0.00496 loss
- Variations: 3, 100 examples each = 3600 examples/steps per epoch
When changing `cond_scale`, it gave me this error:
Around 0.35
The autoencoder is able to reconstruct the input accurately, so I don't understand why the MeshTransformer is not able to create a coherent chair. Do you also get similar results?