InvokeAI: Prompts with * are processed incorrectly

Describe your environment: AWS p3.2xlarge, 16 GB GPU

Describe the bug: After fine-tuning the model with textual inversion, prompts containing the placeholder token are processed incorrectly. “A pencil sketch of *” never returns a pencil sketch, but “a pencil sketch of yellow hat” does.

To Reproduce: Training images (attached): hat1, hat2, hat3

configs/stable-diffusion/v1-finetune.yaml is unchanged except for the init word, which I set to “hat”.
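For context, the placeholder token and the init word both live in the personalization section of that file. A minimal sketch of the relevant fields, assuming the layout of the textual-inversion config shipped with this repo (the exact nesting may differ slightly in your copy):

    model:
      params:
        personalization_config:
          target: ldm.modules.embedding_manager.EmbeddingManager
          params:
            # token(s) the learned embedding is bound to; "*" by default
            placeholder_strings: ["*"]
            # only seeds the initial embedding vector ("hat" for this run)
            initializer_words: ["hat"]
            per_image_tokens: false
            num_vectors_per_token: 1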

  1. Fine-tune the model with python main.py --base configs/stable-diffusion/v1-finetune.yaml -t --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt -n michal_hat --gpus 0, --data_root ./train/michal_hat/
  2. Wait for 5000 steps.
  3. Run the prompts “A pencil sketch of *” and “A pencil sketch of yellow baseball hat”. Notice how the “pencil sketch” part is not picked up in the first case, but is picked up in the second.
    Result for “A pencil sketch of *”: with_pt (attached). Result for “A pencil sketch of yellow baseball hat”: no_pt (attached).

Another example is “a photo of Obama in *” vs. “a photo of Obama in yellow baseball hat”.
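For reference, this is a minimal sketch of how the learned embedding can be loaded for prompting with the legacy dream.py CLI (via --embedding_path); the embeddings.pt path below is just a placeholder for wherever this training run saved it:

    # load the learned embedding, then use the placeholder token in a prompt
    python scripts/dream.py --embedding_path ./logs/michal_hat/checkpoints/embeddings.pt
    # at the interactive prompt:
    # dream> A pencil sketch of *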

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

I’ve been trying to get textual inversion working locally and am struggling to find a straightforward answer to how placeholder token naming and usage works.

On the inversion docs page there’s no mention of specifying a placeholder token in the training script. There’s an -n value (which I think only affects the project name and/or the file-system location the run ends up in), and there’s an init_word, but I think that’s just a category for the thing, not the placeholder token itself (i.e., you might use “dog” for init_word if you were training a specific dog). I don’t think it’s used as the placeholder, because the default placeholder is *, which is specified in the YAML fine-tuning file that the script reads its settings from.

Further down that page there are examples like a photo of a * which seem to indicate that using the asterisk as your placeholder when writing a prompt should work.

Separately, the concepts doc page seems to indicate that the generic * placeholder is somewhat outdated and that, in part to avoid conflicts between multiple embeds, the <trigger-phrase> approach is preferred instead.

First question … which of these approaches is better/more accurate?

Related: when I train a model and leave the YAML file at its default (i.e., * as the placeholder token value), training runs fine … but then I’m unable to actually get any generated photos with my trained embed when I use * in prompts. On the other hand, if I update the placeholder value in the YAML to something like <joe-dog> or whatever, the main.py training script crashes (I’m on an M1, if that matters) with an error like “RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn” (most of the stack trace is in the PyTorch Lightning trainer).
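For clarity, the edit I’m attempting amounts to roughly this, in the personalization block of the fine-tune YAML (field names assumed from the default config; <joe-dog> and “dog” are just my example values):

    placeholder_strings: ["<joe-dog>"]   # token I would then use in prompts
    initializer_words: ["dog"]           # rough category of the subject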

I noticed a comment on another issue speculating that placeholder tokens might be limited to certain single-character strings? If this is correct, I’m not sure whether it’s enforced by the script in Invoke itself, by an underlying library, or something else.

Any pointers appreciated.

Dreambooth is one of the things I really want to get to work on M1.

I’ve tried training on selfies of myself with regular Textual Inversion and the results were meh. There’s a resemblance, but it’s not really me. When I tried Dreambooth, it was as you say: a few hacks plus the model unfrozen. At the end of Epoch 0 is when RAM jumped from 60 GB to 120 GB or so, and then it crashed. I’ll try to fix that, but I’ll also give the HuggingFace version a look.

@Any-Winter-4079 Just thought I’d feed back something that has been uncovered. The ‘Dreambooth’ implementations most people have been using are actually nothing more than Textual Inversion with the model unfrozen (and a few other hacks). That’s essentially the key to getting better results, but it’s also why it requires so much VRAM. The recent HuggingFace Diffusers version of Dreambooth follows the original paper and implements prior preservation, which is missing from the unfrozen TI versions.

@Any-Winter-4079 Thank you, much appreciated. There’s no need to remove those amazing images you’ve documented; I was just going to reply with my Dreambooth results as a comparison. It’s actually very helpful to me to see someone else working with the same training data 😃

@lkewis I’ve removed the images in the zip from the Pull Request https://github.com/invoke-ai/InvokeAI/pull/814 and deleted a Reddit post I had just created documenting the result (it had no training images, but it still redirected here).

Do you want me to remove the training part from here too? https://github.com/invoke-ai/InvokeAI/issues/517#issuecomment-1257216030 It was basically a screenshot from the Reddit post you shared: https://www.reddit.com/r/StableDiffusion/comments/xia53p/textual_inversion_results_trained_on_my_3d/ If you want, after class I can try re-training with some other images and update all of the comments in #517 (I think it was 4 comments) with new images.

@Any-Winter-4079 Hey, I’m really sorry, but I’ve just noticed that in that other thread you’ve taken my images and started distributing them as a training example, exactly as I have done. I would have appreciated it if you’d asked me first, since that digital human is something I created from scratch and it has a purpose in a couple of projects I’m working on.

EDIT- not mad btw, figured this would happen at some point just didn’t expect this fast

@lkewis I got Textual Inversion to work on M1 thanks to your guide, after fixing the NaN-on-M1 error. Have you done any more experiments? It’d be good to create a guide with our findings along the way (for example, a good val/loss_simple_ema value to stop training at, the different results with various num_vectors_per_token settings, and the importance of the input images: pose, lighting, etc.).

Oh hey!! Thanks for the gold, btw. I’ve actually been pretty involved with Dreambooth recently as well. I got invited to a Discord server, then Corridor Crew mentioned it in a video, I was made a mod, and I’ve had to deal with 1000 people suddenly joining in the past 24+ hours.

It’s been optimised to run on a 24 GB VRAM GPU now, so I’ve been playing with that locally, and also trying to combine Dreambooth and Textual Inversion: do a small, fast 500-step training first with Dreambooth, then train Textual Inversion on top of that checkpoint; it finds your subject very fast, so it requires fewer steps as well. Results are starting to improve a little, but this needs a lot more experiments.

I wrote about my own experiments here, but there are no ‘winning formulas’ yet; everyone is still testing this stuff: https://www.reddit.com/r/StableDiffusion/comments/xia53p/textual_inversion_results_trained_on_my_3d/