Overview
There's an interesting shift happening right now in image generation models. Historically, most (all?) popular models were diffusion models. Recently, models like Nano Banana and GPT-Image are Transformer models. These are very different architectures and handle image generation quite differently. I won't get into the how here, but it's fascinating and worth looking into. I raise it here because the two kinds of model have very different trade-offs.
Diffusion models are extraordinarily creative and can often make beautiful-looking images that "vibe" really well, as they look quite smooth. They're quite bad when it comes to text and iteration. So if you want to create a gorgeous landscape, this is your model of choice; if you want to make an infographic, it's not great. Diffusion models also don't handle broader context as well, since they treat the prompt as a strict description of the image. These models tend to be faster and have more predictable cost as well.
Transformer models, on the other hand, are pretty good at making creative pictures, but really shine at image iteration and text. The model has a clear sense of what is in the image, so if you want to modify or remove a specific object, it's quite likely to succeed, and won't alter other parts of the image in the process. They're also capable of taking in broader context like chat history. This all comes with the downside that they're slower and cost is token-based and therefore less predictable.
In short:
- For one-shot generation of a picture without text, Diffusion models are generally the better choice
- For modifying images, using history, or generating images with text, Transformer models are the better choice.
Options
Right now we don't have a way to differentiate between the two aside from specifying a model when building a prompt. When the model inference system is used, there's no way to provide a preference. I'm considering whether we should, and whether one of the following options could be useful:
```php
// As part of prompt termination
$prompt->generateImageResult(); // use first available image model
$prompt->generateTransformerImageResult(); // use first available transformer image model
$prompt->generateDiffusionImageResult(); // use first available diffusion image model

// As a fluent option
$prompt->usingDiffusionModel(); // model must be a diffusion model
$prompt->usingTransformerModel(); // model must be a transformer model
$prompt->usingDiffusionModelPreference(); // prefers a diffusion model, but will accept whatever
$prompt->usingTransformerModelPreference(); // prefers a transformer model, but will accept whatever
```

This would mean that we'd need a way of differentiating between diffusion and transformer image models. We could do something like add a new `ImageGenerationModelInterface::getModelType()` method, or perhaps distinct interfaces that extend that interface.
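To make the distinct-interfaces option a bit more concrete, here's a rough sketch of how it could work. All of the names below (`DiffusionImageModelInterface`, `TransformerImageModelInterface`, `resolveModel()`) are hypothetical; none of this exists in the codebase yet.

```php
<?php

// Shared base interface for any image generation model (assumed shape).
interface ImageGenerationModelInterface
{
    public function generateImage(string $prompt): string;
}

// Marker interfaces: a model advertises its architecture by implementing one.
interface DiffusionImageModelInterface extends ImageGenerationModelInterface {}
interface TransformerImageModelInterface extends ImageGenerationModelInterface {}

/**
 * Returns the first model of the preferred type, falling back to the first
 * available image model of any type ("preference" rather than "requirement"
 * semantics).
 *
 * @param ImageGenerationModelInterface[] $models
 * @param class-string $preferredInterface
 */
function resolveModel(array $models, string $preferredInterface): ?ImageGenerationModelInterface
{
    foreach ($models as $model) {
        if ($model instanceof $preferredInterface) {
            return $model;
        }
    }

    // No model of the preferred type: accept whatever is available.
    return $models[0] ?? null;
}
```

Under this sketch, the strict `usingDiffusionModel()` variant would be the same `instanceof` check but throwing when no matching model exists, while `usingDiffusionModelPreference()` would use the fallback behaviour above.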
These are my thoughts so far! I welcome input!