Control, See: Guided Text-to-Image Generation via In-context Reference Representation

Sourav Ghosh

Abstract

Guided text-to-image generation is the task of generating images from textual descriptions while constraining the output with a set of references. Typically, the reference consists of a set of RGB images depicting the desired characteristics of the generated image. Applications include sketch-to-image generation [1], style transfer [2], and character consistency [3]. However, it is relatively difficult to maintain comparable conformity to a reference when it is provided in the form of text. In this work, I explore guided generation of images from textual descriptions with in-context reference representation. To demonstrate the effectiveness of the approach, I present a series of images generated using DALL·E 3 [4] and showcase its ability to retain character consistency and preserve scene genre across multiple generations.

The following catalogue primarily presents a key-frame storyboard in cinematic style for a short story in one particular genre: horror. Additionally, a curated set of images generated in a different style (cartoon) and an alternate genre (Rom-Com, Indo-Western) is also showcased. Finally, in keeping with the theme of the content, a few blooper outputs are included. The album is divided into multiple sections, each containing a set of images generated from a single prompt. Prompts are designed to contain (a) a thematic description: style of the image and desired genre, (b) a scene description: background, foreground, etc., and (c) detailed character descriptions: facial features, attire, pose, etc.

In this work, two protagonist characters and a key supporting character are used. The two protagonists are university students, one male and one female. The male character, Bob, is described as a tall, lean, fair-skinned individual with a sharp jawline who wears a yellow t-shirt and denims. He also carries a blue bag as an accessory. The female character, Alice, is described in the prompt as having dual-toned hair (blue and red) and a fair complexion. She wears a gray hoodie.
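The prompt structure described above can be illustrated with a minimal sketch. It assumes that "in-context reference representation" amounts to repeating the same textual character description verbatim in every prompt; the exact prompt wording used for the images is not given here, so the `Character` class, the `build_prompt` helper, and the output layout are all hypothetical. The sketch only assembles strings and does not call any image generation service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Character:
    """Hypothetical container for a reusable textual character reference."""
    name: str
    description: str  # appearance and attire, repeated verbatim in every prompt


# Character details are taken from the text; the phrasing is illustrative only.
BOB = Character(
    name="Bob",
    description=(
        "a tall, lean, fair-skinned male university student with a sharp "
        "jawline, wearing a yellow t-shirt and denims, carrying a blue bag"
    ),
)
ALICE = Character(
    name="Alice",
    description=(
        "a fair-complexioned female university student with dual-toned "
        "(blue and red) hair, wearing a gray hoodie"
    ),
)


def build_prompt(theme: str, genre: str, scene: str,
                 characters: list[Character]) -> str:
    """Join (a) thematic, (b) scene, and (c) character descriptions."""
    thematic = f"Style: {theme}. Genre: {genre}."
    cast = " ".join(f"{c.name} is {c.description}." for c in characters)
    return f"{thematic} Scene: {scene} Characters: {cast}"


# Two of the scene descriptions from the storyboard.
scenes = [
    "Bob enjoys a movie with Alice.",
    "A horrified Bob runs out of the classroom where Alice is waiting for him.",
]

# The same character references are embedded in every prompt.
prompts = [build_prompt("cinematic", "horror", s, [ALICE, BOB]) for s in scenes]
for p in prompts:
    print(p)
```

Keeping the character descriptions in one place and reusing them unchanged means only the scene text varies between prompts, which is one plausible way to encourage consistent characters across generations.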

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"Alice and Bob pack their belongings in a university classroom after their exam."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"Bob enjoys a movie with Alice."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"The duo returns to the university building as Bob has forgotten to collect an essential item; Alice is annoyed and decides to wait outside the building."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"Bob looks for the item in his locker using his phone as a flashlight, when he hears a noise coming from the classroom."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"Bob is shocked to see himself and Alice taking the test along with others in the darkness of the classroom."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"He witnesses an eerie figure behind his doppelganger."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"A horrified Bob runs out of the classroom where Alice is waiting for him."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"After listening to Bob's ridiculous story, Alice decides to investigate as she does not believe a word."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"Alice and Bob are taken aback by a sudden beam of light."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"Bob faces stark horror as the light he assumed to be coming from the watchman's torch blurs his vision; he feels the presence of a ghostly woman."

Prompt summary: {theme: cinematic, genre: horror}. Scene:

"He looks for some sanity in Alice but is petrified as he now faces an Alice (or her doppelganger?) whose dark-covered face bears an unsettling grin."

Prompt summary: {theme: cartoon/comics, genre: horror}. Scene:

"The two protagonists are startled by the watchman's flashlight."

Prompt summary: {theme: cartoon/comics, genre: Rom-Com, Indo-Western}. Scene:

"Alice and Bob discuss their plans together in the classroom after their exam."

Prompt summary: {theme: cinematic, genre: horror}.

Blooper reel of images generated in the theme of duality (dual-tone hair, doppelganger, duplicity).

References
[1] Voynov, A., Aberman, K., & Cohen-Or, D. (2023, July). Sketch-guided text-to-image diffusion models. In ACM SIGGRAPH 2023 conference proceedings (pp. 1-11).
[2] Wang, H., Xing, P., Huang, R., Ai, H., Wang, Q., & Bai, X. (2024). InstantStyle-plus: Style transfer with content-preserving in text-to-image generation. arXiv preprint arXiv:2407.00788.
[3] Avrahami, O., Hertz, A., Vinker, Y., Arar, M., Fruchter, S., Fried, O., Cohen-Or, D., & Lischinski, D. (2024, July). The chosen one: Consistent characters in text-to-image diffusion models. In ACM SIGGRAPH 2024 conference papers (pp. 1-12).
[4] DALL·E 3. (n.d.). OpenAI. https://openai.com/index/dall-e-3/ (accessed on 09 Mar, 2025).