
Two Google image models, two jobs: a working prompt guide for Nano Banana Pro and Nano Banana 2

Google ships two image models on Gemini 3 now: Pro for hero shots, NB2 for everything else. Prompting habits carried over from the original Nano Banana actively hurt; here's the working stack.

Ben Honda · 10 min read · AI research · human-curated

Google now ships two image models on Gemini 3. Nano Banana Pro for hero shots, Nano Banana 2 for everything else. Both are reasoning models with the LLM in front of the image generator, which means most of the prompt habits from the original Nano Banana, or from your Stable Diffusion days, actively hurt now. Use real prose, not tag soup. Name the camera. Stay in-thread to edit. Ask for “visible pores, not airbrushed.” Generate at 2K, ship at served size.

Two models, two jobs

Google released Nano Banana Pro (gemini-3-pro-image-preview) in November 2025, and Nano Banana 2 (gemini-3.1-flash-image-preview) in February 2026. Despite the name, they’re a long way from the original Nano Banana that shipped on Gemini 2.5 Flash Image. They’re not diffusion models. They’re transformer-based image generators with a Gemini 3 reasoning model sitting in front, planning the scene before anything renders.

Pro is the slower, more expensive one. It can call live Google Search, mix up to fourteen reference images, render legible text in multiple languages, and generally handle anything you'd put in a hero or on a brand-critical surface. Nano Banana 2 is roughly twice as fast, cheaper, and runs the same architecture tuned for throughput.

Practical rule: Pro for the few hero images per page; NB2 for everything else.
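If you're calling these from code, that rule is one branch. A minimal sketch with the google-genai Python SDK: the model IDs are the preview names above, the response_modalities config follows the documented Gemini image-output pattern, and the exact fields are worth verifying against current docs.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

HERO_MODEL = "gemini-3-pro-image-preview"      # Pro: heroes, brand-critical
BULK_MODEL = "gemini-3.1-flash-image-preview"  # NB2: everything else

def generate_image(prompt: str, hero: bool = False) -> bytes:
    response = client.models.generate_content(
        model=HERO_MODEL if hero else BULK_MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )
    # The image comes back as an inline_data part alongside any text parts.
    for part in response.candidates[0].content.parts:
        if part.inline_data:
            return part.inline_data.data
    raise RuntimeError("no image part in response")
```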

| Capability | Nano Banana Pro | Nano Banana 2 | Original Nano Banana |
|---|---|---|---|
| Resolutions | 1K / 2K / 4K | 0.5K / 1K / 2K / 4K | ~1K |
| Aspect ratios | 1:1, 3:2, 2:3, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9 | Same, plus 1:4, 4:1, 1:8, 8:1 banners | Limited |
| Reference images | Up to 14 | Up to 14 | ~3 |
| Text rendering | State of the art, multilingual | ~87–96% on benchmarks | Weak |
| Search grounding | Yes | Yes | No |
| Cost per image | ~$0.039–$0.151 by resolution | Cheaper | Cheapest |

4K costs roughly 2.3× a 1K image and is overkill for most LCP-sensitive web pages. Generate at 2K, then re-encode to AVIF or WebP at the actual served size.
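That last step is a few lines of Pillow. A sketch, assuming a 2K PNG from the API, WebP as the served format (AVIF needs the pillow-avif-plugin extra), and a 1280px served width:

```python
from PIL import Image  # pip install Pillow

def reencode_for_web(src: str, dst: str, served_width: int = 1280) -> None:
    """Downscale a 2K generation to its served size and re-encode as WebP."""
    img = Image.open(src)
    scale = served_width / img.width
    img = img.resize((served_width, round(img.height * scale)), Image.LANCZOS)
    img.save(dst, "WEBP", quality=80, method=6)

reencode_for_web("hero_2k.png", "hero.webp")
```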

What changes when the model can think

The thing that breaks every old habit is this: there’s an LLM between your prompt and the image. It plans before it renders, it can call tools, and it parses prose. A useful framing: Flux is an image generator that happens to include a VLM; Nano Banana Pro is a reasoning system that happens to output images.

Three consequences fall out of that.

The first is that natural language beats tag soup. Google's own Ultimate prompting guide for Nano Banana is explicit about it: drop the "4k, masterpiece, ultrarealistic, trending on artstation" spam. The reasoning front-end parses prose. Tag lists from the Stable Diffusion 1.5 era read to it as low-information noise, and they actively pull the output toward a generic average. Descriptive sentences carry more signal.

The second is that specificity gets compressed into latent priors. “Hasselblad X2D, 135mm, f/2.8” triggers the model’s medium-format and portrait priors: higher dynamic range, tighter compression, denser texture in the skin. “Professional camera” doesn’t trigger anything. The same goes for film stocks. Kodak Portra 400, Fujifilm Pro 400H, Cinestill 800T, Arri Alexa colour science: each name pulls the output toward a real-world look the model has seen.

The third is that “visible pores, not airbrushed” is the single most-cited anti-plastic phrase. Multiple community prompt libraries converge on that exact wording. The negation suppresses a strong “beautify” prior baked into training data. Nothing else seems to do the same thing as reliably.

A template you can copy

A starting shape that works for portraits, products, and lifestyle shots:

A [shot type] of [specific subject with materials/age/wardrobe], [doing what],
in [setting with time of day]. Lit by [specific lighting setup].
Shot on [camera body] with a [focal length] lens at [aperture], [DOF descriptor].
[Colour/film stock/grade]. Visible [texture cues].
Aspect ratio [X:Y]. Photorealistic. No text, no watermark.

Filled in for a SaaS landing page hero:

A medium three-quarter shot of a 34-year-old product designer in a charcoal merino sweater, leaning over a sketch on a walnut desk, in a sunlit Brooklyn studio at 9am. Lit by soft window light from camera-left with a subtle bounce on the right. Shot on a Sony A7IV with an 85mm f/1.4 at f/2.0, shallow depth of field with the background falling into a clean bokeh. Kodak Portra 400 colour science, natural skin tones. Visible skin pores, baby hairs at the temple, fine fabric weave on the sweater. Aspect ratio 16:9. Photorealistic editorial photography. No text, no watermark.

The product version of the same shape is shorter. No human means no pore-cluster or asymmetry instructions, but you keep the named camera, the lens, the aperture (f/8 to f/11 for product), and one specific surface noun (concrete, linen, seamless paper) instead of “minimal background.”
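If you generate at any volume, keep the shape in code instead of muscle memory. A purely illustrative builder; the function and field names are mine, not an API:

```python
def photo_prompt(shot: str, subject: str, action: str, setting: str,
                 lighting: str, camera: str, lens: str, aperture: str,
                 dof: str, grade: str, textures: str, ratio: str = "16:9") -> str:
    # Mirrors the template above, one slot per bracket.
    return (
        f"A {shot} of {subject}, {action}, in {setting}. Lit by {lighting}. "
        f"Shot on {camera} with a {lens} lens at {aperture}, {dof}. "
        f"{grade}. Visible {textures}. "
        f"Aspect ratio {ratio}. Photorealistic. No text, no watermark."
    )
```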

Edit in-thread, not by re-rolling

This is the change most likely to surprise you if you came from the original Nano Banana. The Gemini 3 image preview API attaches encrypted “thought signatures” to each turn, and on gemini-3-pro-image-preview and gemini-3.1-flash-image-preview those signatures are strictly required on subsequent edits. Missing them returns a 400 error. The practical effect is that staying in the same conversation thread means the model literally remembers the composition logic. Starting a fresh conversation throws all of that away.

The phrasing the Google Cloud guide recommends, and that the Gemini 3 Pro Image docs echo, is to ask for one change at a time and explicitly anchor everything else:

Using this image, [single change]. Keep everything else exactly the same:
preserve composition, lighting direction, colour grade, and subject features.

If you treat each prompt as a re-roll instead, you’ll watch the subject drift between turns. The fix is one sentence in the prompt and a habit of staying in-thread.
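In the Python SDK, staying in-thread means the chats interface, which re-sends the conversation history (thought signatures included, as far as the current docs describe) on every turn. A sketch:

```python
from google import genai
from google.genai import types

client = genai.Client()
chat = client.chats.create(
    model="gemini-3-pro-image-preview",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

chat.send_message("A medium three-quarter shot of a 34-year-old product designer ...")
# One change per turn; anchor everything else explicitly.
chat.send_message(
    "Using this image, swap the charcoal sweater for navy. "
    "Keep everything else exactly the same: preserve composition, "
    "lighting direction, colour grade, and subject features."
)
```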

The anti-AI-look stack

Things to weave into any realistic prompt, on top of the camera and film stock:

  • Natural skin texture with visible pores, not airbrushed, not waxy.
  • Subtle skin imperfections: faint freckles, slight asymmetry, fine flyaway hairs.
  • Natural catchlights in both eyes.
  • Imperfect, lived-in environment. (Kills the sterile-stock-photo vibe.)
  • Subject slightly off-centre, with look-room to the left.

Negatives don’t use the Stable Diffusion (plastic skin:1.4) syntax. Nano Banana parses them as natural-language instructions. Append at the end:

Avoid: plastic skin, waxy appearance, airbrushed look, doll-like eyes, fused fingers, deformed hands, distorted background text, oversaturation, HDR halos, motion blur on a still subject. No text, no watermark, no logos.

These reduce the failure rate. They don't guarantee anything. The model treats them as instructions, not strict tokens, so run a few generations and prune any negative that backfires into hallucinated artifacts of its own.
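One way to make that pruning cheap: keep the negatives as data, not baked into prompt strings. Illustrative only:

```python
# Drop any entry that starts backfiring on your briefs.
NEGATIVES = [
    "plastic skin", "waxy appearance", "airbrushed look", "doll-like eyes",
    "fused fingers", "deformed hands", "distorted background text",
    "oversaturation", "HDR halos", "motion blur on a still subject",
]

def with_negatives(prompt: str) -> str:
    return f"{prompt} Avoid: {', '.join(NEGATIVES)}. No text, no watermark, no logos."
```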

Multiple references, named characters

Two practices, one principle. Both rely on the model anchoring on named entities and repeated descriptors, not on a hidden seed.

The multi-image formula from the Google docs:

[Reference images attached] + [Relationship instruction] + [New scenario]

For example: Using Reference 1 (the model) and Reference 2 (the silk gown), generate a high-end editorial shot of the model wearing the gown on a Milan rooftop at golden hour. The relationship instruction is the part most people skip; without it, the model doesn’t know which reference is doing what.
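In the API, references are just extra items in contents, and the SDK accepts PIL images directly. A sketch of the Milan example, relationship instruction included:

```python
from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3-pro-image-preview",
    contents=[
        Image.open("model.jpg"),  # Reference 1
        Image.open("gown.jpg"),   # Reference 2
        "Using Reference 1 (the model) and Reference 2 (the silk gown), "
        "generate a high-end editorial shot of the model wearing the gown "
        "on a Milan rooftop at golden hour.",
    ],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)
```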

For character consistency across a series, generate a brand or character sheet first (“…showing front, three-quarter, and side views on a neutral background”), then in every subsequent prompt name the character and re-state three to five anchor traits:

Mia, short auburn hair, freckles, denim jacket. [New scene.]

Reference-image anchoring is more reliable than text-only naming. Past five named characters in a single scene, anchoring degrades. Ambience AI’s hands-on testing puts the practical ceiling at roughly five named characters and fourteen named objects per series, though that’s anecdotal and worth re-testing on your own briefs.

Search grounding is the other lever Pro adds. It can call Google Image Search before generating, useful when you need a real bird species, a specific city skyline, or current product packaging. Trigger phrase: Use image search to find accurate references of [X], then create… The caveat is that community testing keeps catching it producing good-looking but factually wrong infographics, so verify any image that carries data.

Web-specific tactics

The bits that matter once the image is leaving your laptop.

For aspect ratios, use 16:9 for hero and OG images, 4:5 for product detail pages, 1:1 for cards, and 9:16 for mobile and social. The banner ratios (1:8, 8:1, 1:4, 4:1) are NB2-only. Specify the ratio in the prompt and in the API parameter where one exists; both surfaces respect it.
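On the API side, the hedged version of that parameter in the Python SDK looks roughly like this; image_config is where the SDK exposes aspect ratio and resolution tier for Gemini image models, but treat the exact field names as assumptions to verify for the preview IDs:

```python
from google.genai import types

config = types.GenerateContentConfig(
    response_modalities=["TEXT", "IMAGE"],
    # Field names assumed from the Gemini image docs; verify for previews.
    image_config=types.ImageConfig(aspect_ratio="16:9", image_size="2K"),
)
```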

Generate at 2K by default. 4K is overkill outside print or specific hero contexts and ships at 4 to 8 MB raw, so never serve those directly. Re-encode to AVIF or WebP at the actual served size.

For hero shots that need to work on both light and dark themes, prompt for soft, mid-key lighting against a neutral mid-grey background. The model has no native “transparent on dark or light” mode for photos, but the mid-key range reads cleanly under either theme.

Put any text you want in the image inside "quotation marks", or the model paraphrases. Pro renders quoted strings reliably now, including multilingual; this is the biggest single improvement over v1.

For brand work, lock a logo, two key colours, a palette swatch image, a face or character sheet, and a style reference. Feed that bundle as references on every brand image. Multi-image conditioning is the single most reliable consistency lever.

Failure modes worth knowing

Some of these will surprise you. Small faces in crowds and wide shots still smear; the model produces “crowd soup” past a certain distance, so don’t put critical subjects in the deep background of a wide shot. Hands have improved but not been solved. Holding small objects, intertwined fingers, and complex grips still fail occasionally; negate them explicitly with no fused fingers, no deformed hands.

IP and celebrity refusals are silent. A prompt naming a copyrighted character or a public figure returns finishReason: OTHER with null content, which looks like a transport failure. Reword to remove the named IP.

Preview API instability is real. Multiple aggregator blogs report 30 to 45 per cent peak-hour failure rates on the Pro preview during launch quarters, and Google itself acknowledges the preview-status caveat. If you're shipping a production flow that depends on real-time generation, build a fallback chain: Pro to NB2 to a cached prior generation. Catch the failure, don't surface it to the user.
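A sketch of that chain: the finish-reason check catches the silent refusals described above, and load_cached_prior_generation is a hypothetical stand-in for whatever cache you keep.

```python
from google import genai
from google.genai import types

client = genai.Client()
CHAIN = ["gemini-3-pro-image-preview", "gemini-3.1-flash-image-preview"]

def generate_with_fallback(prompt: str) -> bytes:
    config = types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"])
    for model in CHAIN:
        try:
            response = client.models.generate_content(
                model=model, contents=prompt, config=config,
            )
        except Exception:
            continue  # transport or availability failure: try the next model
        candidate = response.candidates[0]
        if candidate.finish_reason != types.FinishReason.STOP:
            continue  # silent refusal (finishReason: OTHER) lands here
        for part in candidate.content.parts:
            if part.inline_data:
                return part.inline_data.data
    return load_cached_prior_generation(prompt)  # hypothetical cache lookup
```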

Major edits break realism. Day to night, full background swap, blending three images: all produce uncanny lighting mismatches if you don’t tell the model to match the new environment. Add: match the new lighting direction and colour temperature to the new environment, including cast shadows and rim light.

Default safety thresholds are now OFF on Gemini 2.5 and 3 unless you set them explicitly (Google’s safety docs were updated to reflect this in January 2026). Old tutorials telling you to manually relax filters are out of date. If you’re getting a refusal, check finishReason first; the prompt is rarely the problem.

Every output gets a SynthID watermark. Plan accordingly for any “looks like stock” use case.

Framings the model treats as permission to render imperfectly

These aren’t policy bypasses. They’re contextual setups the model interprets as license to render the kind of imperfection that reads as real, and they produce more convincing results than asking for “photorealistic” or “high quality”:

  • Documentary photography. Behind-the-scenes still. Candid photojournalism. Suppresses the model’s tendency toward composed, stage-lit perfection.
  • iPhone snapshot, slight motion blur, mixed indoor lighting. Convincing “real person took this” energy. Useful for UGC-style web imagery.
  • Editorial portrait for The New Yorker / Wired / Monocle. Triggers higher-fidelity skin and lighting priors than “professional photo.”
  • Shot on a disposable camera with direct flash. Forces grain, harsh shadow falloff, and colour casts that read as authentic.
  • Photograph from [year], with era-specific clothing and tech. Era-grounded realism beats generic “vintage” every time.

Where each model sits in the wider field

For most of early 2026, Nano Banana 2 sat at the top of the LM Arena image leaderboard with an Elo around 1,360. By May 2026, OpenAI's GPT Image 2 had taken the top spot by a record margin (1,512 Elo, a +242 lead), but the qualitative picture for working web imagery hasn't changed much. NB2 still leads on multilingual text rendering and search-grounded composition. FLUX.2 Pro is the strongest open, self-hostable option. Midjourney v7 is still the aesthetic leader for editorial mood work, but its text rendering is weak (~71 per cent accuracy) and there's no first-class API.

The hybrid stack a lot of teams use: Midjourney for concept exploration, Nano Banana Pro for the chosen hero with brand text, NB2 for variants and bulk, Photoshop or Firefly for the final pixel polish.

A closing note

Models update silently. If a technique here stops working, particularly content-policy edges, search-grounding behaviour, or aspect-ratio support, assume the model changed, not your prompt. Re-baseline quarterly.
