meshgpt-pytorch: The MeshTransformer does not generate coherent results
I have trained the MeshTransformer on 200 different meshes from the chair category of ShapeNet, after decimating them and filtering for meshes with fewer than 400 vertices and faces. The MeshTransformer reached a loss very close to 0.
But when I call the `generate` method of the MeshTransformer, I get very bad results.
From left to right: ground truth, autoencoder output, and the MeshTransformer-generated mesh at temperatures 0, 0.1, 0.7, and 1. This was done with meshgpt-pytorch version 0.3.3.
Note: the MeshTransformer was not conditioned on text or anything else, so the output is not supposed to look exactly like the sofa, but it barely looks like a chair at all. You can guess at the backrest and the legs, but that's it.
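For reference, this is roughly how I am calling `generate` in all of these experiments (a minimal sketch; the constructor values and checkpoint paths are placeholders, and the `temperature` keyword is what I believe the version I am running accepts):

```python
import torch
from meshgpt_pytorch import MeshAutoencoder, MeshTransformer

# rebuild the models with the same hyperparameters used in training
# (the values and checkpoint paths below are placeholders, not my exact ones)
autoencoder = MeshAutoencoder(num_discrete_coors = 128)
autoencoder.load_state_dict(torch.load('autoencoder.ckpt.pt'))

transformer = MeshTransformer(autoencoder, dim = 512, max_seq_len = 8192)
transformer.load_state_dict(torch.load('mesh_transformer.ckpt.pt'))
transformer.eval()

# unconditional sampling at the temperatures shown above
for temperature in (0.0, 0.1, 0.7, 1.0):
    with torch.no_grad():
        face_coords, face_mask = transformer.generate(temperature = temperature)
    # face_coords holds the sampled face vertex coordinates; face_mask marks the valid faces
```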
Initially I thought there might have been an error with the KV cache, so here are the results with `cache_kv=False`:
And this one is with meshgpt-pytorch version 0.2.11:
When I trained on a single chair with a version before 0.2.11, the `generate` method was able to create a coherent chair (from left to right: ground truth, autoencoder output, `meshtransformer.generate()`).
Why are the generated results so bad even though the transformer loss was very low?
I have uploaded the autoencoder and MeshTransformer checkpoints (on version 0.3.3), as well as 10 data samples, here: https://file.io/nNsfTyHX4aFB
Also, a quick question: why rewrite the transformer from scratch instead of using the HuggingFace GPT-2 transformer?
The transformer loss needs to be near 0.01 or 0.001, and the autoencoder loss can be around 0.25 to 0.35 or lower.
I did some testing using 10 3D chair meshes. I applied augmentation so each chair got 3 variations, then I duplicated each variation 500 times, so the total dataset size is 3000. I encoded the meshes using text as well, but just with the same word 'chair'; still, this proves that text-conditioned generation works.
After 22 minutes training the encoder (0.24 loss) and then 2.5 hours training the transformer (0.0048 loss), I got the result below. To generate complete models, I think a loss of about 0.0001 should be good. I trained the transformer with different learning rates, but in total there were 30 epochs, i.e. 30 x 3000 = 90,000 steps.
The training is very slow at the end; I might need to up the transformer's dim size to 1024.
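For reference, the augment-then-duplicate preparation looks roughly like this (just a sketch; the random scale/jitter augmentation and the sample dict layout are my own simplifications, not the exact code I used):

```python
import random
import torch

def augment(vertices: torch.Tensor) -> torch.Tensor:
    # cheap illustrative augmentation: random uniform scale plus a small jitter
    scale = random.uniform(0.9, 1.1)
    jitter = torch.randn_like(vertices) * 0.005
    return vertices * scale + jitter

def build_dataset(meshes, num_variations = 3, num_duplicates = 500, text = 'chair'):
    # meshes: list of (vertices, faces) tensor pairs
    samples = []
    for vertices, faces in meshes:
        for _ in range(num_variations):
            sample = dict(vertices = augment(vertices), faces = faces, texts = text)
            # duplicate each variation so every epoch sees it many times
            samples.extend([sample] * num_duplicates)
    return samples

# e.g. 10 chairs, 3 variations each, 500 duplicates per variation
# samples = build_dataset(chair_meshes)
```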
I provided the text "chair" and looped the generation over different temperature values from 0 to 1.0 with 0.1 as the step size.
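The sweep itself is just a loop over `generate` (a sketch; it assumes the model was built with `condition_on_text = True` and that `texts`, `temperature`, and `cond_scale` are accepted by `generate` in the version you are on):

```python
import torch

# assumes `transformer` is the trained, text-conditioned MeshTransformer
generated = {}
with torch.no_grad():
    for step in range(11):                    # temperatures 0.0, 0.1, ..., 1.0
        temperature = round(step * 0.1, 1)
        face_coords, face_mask = transformer.generate(
            texts = ['chair'],
            temperature = temperature,
            cond_scale = 1.,                  # > 1. pushes the sample harder toward the text
        )
        generated[temperature] = (face_coords, face_mask)
```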

The issue is this: not enough data.
In the PolyGen and MeshGPT papers they stress that they didn't have enough training data and used only 28,000 mesh models. They needed to augment those with, let's say, 20 augmentations, which means they effectively trained on 560,000 mesh models. But since it seems like you are not using text, you can try feeding the transformer a prompt of 10-30 connected faces of a model and see what happens (like in the paper); it should act as an autocomplete.
The loss of the transformer should be below 0.0001 for successful generations.
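A rough sketch of the face-prompt idea (the `prompt` keyword and the exact code layout it expects are assumptions on my end; I believe the codes come from `autoencoder.tokenize`, but double-check the `generate` signature in your version; `autoencoder`, `transformer`, `vertices`, and `faces` are assumed to come from your existing setup):

```python
import torch

# tokenize a ground-truth mesh into the codes the transformer was trained on
with torch.no_grad():
    codes = autoencoder.tokenize(vertices = vertices, faces = faces)

# keep only the codes that belong to the first ~20 faces as the prompt
num_prompt_faces = 20
codes_per_face = codes.shape[-1] // faces.shape[1]   # assumes codes are flattened per face
prompt = codes[..., :num_prompt_faces * codes_per_face]

# ask the transformer to "autocomplete" the rest of the mesh from the prompt
face_coords, face_mask = transformer.generate(prompt = prompt, temperature = 0.1)
```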
Here is a rough idea of how much data you should use: 20 augmentations * 100 duplicates of each augmentation * 200 models = 400,000 samples per dataset.
I recommend that you create a trainer that uses epochs instead of steps (or take a look at the one in my fork), since printing out 400k steps will slow down the training.
Then train on this for a day or two, and use a large batch size (up to 64) to promote generalization.
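Something along these lines for the epoch-based loop (a generic sketch, not the trainer from my fork; `collate_meshes` is a placeholder for however you batch variable-size meshes, and `samples` is the list built above):

```python
from torch.optim import Adam
from torch.utils.data import DataLoader

loader = DataLoader(samples, batch_size = 64, shuffle = True, collate_fn = collate_meshes)
optimizer = Adam(transformer.parameters(), lr = 1e-4)

num_epochs = 30
for epoch in range(num_epochs):
    total_loss = 0.
    for batch in loader:
        loss = transformer(
            vertices = batch['vertices'],
            faces = batch['faces'],
            texts = batch['texts'],
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()
    # one progress line per epoch instead of one per step
    print(f'epoch {epoch + 1}/{num_epochs} - mean loss {total_loss / len(loader):.5f}')
```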
In the paper they used 28,000 3D models. Let's say they generated 10 augmentations per model and then used 10 duplicates, since it's more effective to train with a big batch size of 64; with only a small number of models per dataset it will not train effectively and you will waste GPU parallelism. That means 10 * 10 = 100 copies per model, and 100 * 28,000 = 2,800,000 samples.
I want to stress this: overfitting on 1 model = super easy. Training a model to be general enough for many different models = very hard.
GPT-2 is quite old and there have been improvements since, so it's not a very good choice anymore.
This hyperparameter doesn't actually improve results; it just gives better alignment to the text description (if the model is not following it).
Success! 😃 I let it run during the night and got this result. I think I might modify the encoder parameters; I used a dim of 768 for the transformer, which gives 236M parameters. I'll try decreasing and increasing the parameter count for the encoder.
- Autoencoder training: 6 h, 210 epochs, 756,000 steps, 0.9 loss
- Transformer training: 3 h, 20 epochs, 0.00496 loss
- Variations: 3, 100 examples each = 3600 examples/steps per epoch
When changing `cond_scale`, it gave me this error:
Around 0.35
The autoencoder is able to reconstruct the input accurately, so I don't understand why the MeshTransformer is not able to create a coherent chair. Do you also get similar results?