Context
This project was developed as part of the KIMUVA program. The goal: build a repeatable, cinematic AI video workflow that takes a creative concept through to a finished 30-second advertisement — visual footage, voice-over, and sound effects included.
Tools used: Google Veo 3.1 Fast · Adobe Firefly · Gemini Nano Banana · Suno · ElevenLabs · Adobe Premiere Pro
Success metrics:
- Match rate vs. reference image per frame
- Artefact count per shot
- VO words-per-second vs. perceived calm
Character System — Waitress
A consistent character ("Mila") anchors the entire spot. Continuity is achieved through locked identity markers, not by re-using the same source image.
- Identity markers: hair colour/length, skin tone, age group, posture, micro-expressions
- Wardrobe base: café apron + neutral blouse + subtle accessory (repeatable across scenes)
- Continuity markers: name tag, cup style, characteristic gesture — half-smile, eyes left toward camera
- Controlled variability: lighting and colour palette may shift; anatomy, silhouette and props stay locked
- Prompt hygiene: precise descriptions; no contradictions; max 1–2 central props
"Mila, late-20s café server; fair skin with light freckles across nose; hazel eyes; chestnut-brown hair in a low ponytail with a few baby hairs; subtle natural makeup; oval face; small silver stud earrings and thin chain necklace; brown leather watch on right wrist; crisp white oxford shirt, sleeves rolled; navy cotton apron tied in a center knot; calm, kind, attentive; signature details: tiny beauty mark under left cheekbone; slightly frayed apron strap; gentle, reassuring smile."
Cinematography Prompt — Opening Shot
The opening shot is generated from a single fully specified prompt.
[Figure: two outputs generated from the same prompt]
Prompt (excerpt): Wide Shot through closed café double-door glass from the street. Straight-on, eye level. 50 mm spherical, f/2, 1/100, ISO 200, Pro-Mist 1/8, CPL. Interior 2700K warm amber vs exterior 5600K cool blue/gray. Negative: teal door, bright daytime sun, oversized sign blocking face, motion blur, hands on counter outside.
Prompt Anatomy
Every shot follows the same six-part template:
| Element | Content |
|---|---|
| Frame | Shot type, camera angle, composition rule |
| Camera | Focal length (50–65 mm), shallow DOF, max 1 movement |
| Action | 1 clear subject action (set cup down, gentle glance) |
| Lighting | Key / fill / rim + practicals; softness, direction, contrast ratio |
| Palette & Texture | 3–5 colour accents; wet asphalt, porcelain, matte cotton |
| Constraints | No logos/text; negative prompts for unwanted geometry |
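The six-part template can be kept honest with a small helper that refuses to render a prompt when any element is missing. This is a minimal sketch, not the tooling used in the project: the class name `ShotPrompt`, its field names, and the example values are all illustrative assumptions, loosely based on the opening shot.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ShotPrompt:
    """One shot described with the six template elements (names are illustrative)."""
    frame: str
    camera: str
    action: str
    lighting: str
    palette_texture: str
    constraints: str

    def render(self) -> str:
        # Join the six elements in template order; an empty element raises
        # instead of being silently skipped.
        parts = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if not value:
                raise ValueError(f"missing template element: {f.name}")
            parts.append(value.rstrip("."))
        return ". ".join(parts) + "."


# Hypothetical example values, loosely based on the opening shot.
opening = ShotPrompt(
    frame="Wide shot through closed café double-door glass, straight-on, eye level",
    camera="50 mm, shallow depth of field, static camera",
    action="Mila sets a cup down on the counter",
    lighting="Warm interior practicals against cool exterior daylight",
    palette_texture="Amber, navy, white; wet asphalt, porcelain, matte cotton",
    constraints="No logos or text. Negative: motion blur, teal door",
)
print(opening.render())
```

Keeping the elements as separate fields makes it easy to vary one of them (for example the lighting) between takes while the rest stays locked.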
Era Characters — 80s to Today
The commercial spans three time periods, each requiring its own character set with locked continuity parameters.
80s characters: roller skater (quad rollerskates, high-waist shorts, pastel windbreaker) · boombox carrier (colourful track jacket, cassette recorder) · skater (bleached print tee, flannel tied at waist) · office couple (shoulder-pad blazer, trenchcoat)
90s–Today characters: couple under umbrella (oversized denim jacket, plaid shirt) · runner with AirPods (turquoise functional shirt, fitness band) · e-scooter rider (quilted vest, helmet) · pedestrian with smartphone (olive parka, screen glow in rain)
Each era character has locked continuity: clothing palette, props, movement rhythm and hairstyle stay consistent across all shots they appear in.
Colour & Lighting
The commercial shifts from warm café interior to rainy street exterior, requiring a deliberate colour temperature transition:
- Interior: 3000K warm amber — pendant bokeh, glazed tile reflections, steam haze
- Exterior: ≈5600K cool blue/grey — wet pavement reflections, diffuse sky light, muted saturation
- Correction notes: protect skin tones; soft halation; lift blacks gently; narrow lantern cone outdoors
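The size of the interior-to-exterior shift can be expressed in mireds (one million divided by the colour temperature in kelvin), a scale commonly used for filters and white-balance offsets. The calculation below is an illustrative sketch using the two values listed above; the project does not document working in mireds.

```python
def mired(kelvin: float) -> float:
    """Convert a colour temperature in kelvin to mireds (10^6 / K)."""
    if kelvin <= 0:
        raise ValueError("colour temperature must be positive")
    return 1_000_000 / kelvin


INTERIOR_K = 3000  # warm amber interior, from the list above
EXTERIOR_K = 5600  # cool blue/grey exterior, from the list above

# A negative shift means the image moves toward the cooler end of the scale.
shift = mired(EXTERIOR_K) - mired(INTERIOR_K)
print(
    f"interior {mired(INTERIOR_K):.0f} mired, "
    f"exterior {mired(EXTERIOR_K):.0f} mired, "
    f"shift {shift:.0f} mired"
)
```

With these inputs the shift is about -155 mired, which is a large move and one reason the correction notes call for protecting skin tones across the transition.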
Artefact Minimisation
- Overloaded prompts → split into "1 subject + 1 action"
- Ambiguous prop lists → detailed foreground / midground / background inventory
- Phantom limbs / metal rods → add to negative prompt
- Unstable composition → lock with reference image; run edge scans and tangent checks
For the coffee grinder rotation, keyframe checkpoints were specified at t=15%, 35%, 55%, 75% and 92% of the clip, with a final elastic settle of less than 5°. The prompt gave only the reference angle at each checkpoint, not a description of the full arc.
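To review a generated take against these checkpoints, it helps to know which frames they fall on. The sketch below converts the percentages into times and frame indices. The 4-second take length and the 24 fps frame rate are assumptions made for illustration; neither is stated for this shot.

```python
CHECKPOINTS_PCT = [15, 35, 55, 75, 92]  # checkpoint positions from the text above
TAKE_SECONDS = 4  # assumption: length of one generated take
FPS = 24          # assumption: the frame rate is not stated in the write-up


def checkpoint_frames(percentages, seconds, fps):
    """Map each checkpoint percentage to (time in seconds, nearest frame index)."""
    total_frames = seconds * fps
    result = []
    for pct in percentages:
        if not 0 <= pct <= 100:
            raise ValueError(f"checkpoint out of range: {pct}%")
        result.append((pct * seconds / 100, round(pct * total_frames / 100)))
    return result


frames = checkpoint_frames(CHECKPOINTS_PCT, TAKE_SECONDS, FPS)
for pct, (time_s, frame) in zip(CHECKPOINTS_PCT, frames):
    print(f"t={pct:>2}%  {time_s:.2f} s  frame {frame}")
```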
Time Transition Troubleshooting
The camera moves from inside the café, through the glass, to the street in a single continuous shot. What made this work was specifying a dolly movement rather than a zoom, with explicit parallax instructions.
Camera prompt (required): Slow forward dolly-in along the optical axis — not a zoom, no pan/tilt/roll. From t=0→90%, move forward ~40 cm; horizon steady. At t≈70–80% the window frame is completely out of view so the street is fully revealed. Show parallax: chalkboard/bike/cars scale up ~15–20%.
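The parallax figures in the camera prompt can be sanity-checked with a simple pinhole-camera model, in which apparent size is inversely proportional to distance. A zoom would enlarge everything by the same factor; a dolly enlarges near objects more than far ones, which is the parallax cue the prompt asks for. The sketch below is an illustration under that simplified model, and the function names are hypothetical.

```python
DOLLY_M = 0.40  # forward move from the camera prompt (~40 cm)


def scale_after_dolly(distance_m: float, dolly_m: float = DOLLY_M) -> float:
    """Apparent size multiplier for an object initially distance_m from the camera."""
    if distance_m <= dolly_m:
        raise ValueError("object must stay in front of the camera")
    return distance_m / (distance_m - dolly_m)


def distance_for_scale(scale: float, dolly_m: float = DOLLY_M) -> float:
    """Initial distance at which a dolly of dolly_m produces the given scale-up."""
    if scale <= 1:
        raise ValueError("scale must be greater than 1 for a forward move")
    return dolly_m * scale / (scale - 1)


for target in (1.15, 1.20):
    d = distance_for_scale(target)
    print(f"{target:.2f}x scale-up -> object starts about {d:.2f} m away")
```

Under this model a 15–20% scale-up from a 40 cm move corresponds to objects that start roughly 2.4 to 3.1 m from the camera, so the stated numbers describe foreground street elements rather than distant background.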
Packshot Design
The product (AURUM coffee tin) was designed and placed in-scene using Gemini Nano Banana, then animated with Google Veo 3.1 Fast.
- Nano Banana: package design generation, colour and font customisation, environment compositing
- Veo 3.1 Fast: motion graphics and product reveal animation
Sound Design
- Diegetic sounds: quiet room ambience · soft street noise · cup set down · coffee grinder hum · steam hiss
- Music: Sweet Silence (0.75×) — slow, dreamy, piano-dominated ballad; generated in Suno v4.5-all (6:08)
- Dynamics: low base level, soft transient peaks only — no constant background noise
Voice-Over System
Cadence card:
| Timecode | Content |
|---|---|
| :00–:08 | Scene introduction |
| :10–:18 | Product value |
| :20–:26 | Sensory line |
| :27–:30 | Brand name / CTA (very quiet) |
Process: draft → trim to duration → cut ~40% for headroom → map to cadence card
TTS settings (ElevenLabs): speed slightly below normal · softer voice onset · soft sibilants — Voice: Jessica, Eleven Multilingual v2
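The trim and headroom steps can be turned into a word budget per cadence slot. The sketch below is one reading of the process, not the project's actual tooling: the speaking rate of 2.5 words per second is an assumed placeholder, and "cut ~40% for headroom" is interpreted as keeping 60% of the words that would fill the slot.

```python
from fractions import Fraction
from math import floor

# Cadence card from the table above: (content, start second, end second).
CADENCE = [
    ("Scene introduction", 0, 8),
    ("Product value", 10, 18),
    ("Sensory line", 20, 26),
    ("Brand name / CTA", 27, 30),
]
WORDS_PER_SECOND = Fraction(5, 2)  # assumption: calm delivery rate, not from the source
HEADROOM = Fraction(40, 100)       # "cut ~40% for headroom"


def word_budget(start_s: int, end_s: int) -> int:
    """Whole words that fit the slot after reserving headroom."""
    return floor((end_s - start_s) * WORDS_PER_SECOND * (1 - HEADROOM))


def fits(line: str, start_s: int, end_s: int) -> bool:
    """True if the draft line stays within the slot's word budget."""
    return len(line.split()) <= word_budget(start_s, end_s)


for content, start_s, end_s in CADENCE:
    print(f":{start_s:02d}-:{end_s:02d}  {content}: up to {word_budget(start_s, end_s)} words")
```

Exact fractions are used instead of floats so the budgets do not shift by one word because of rounding.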
Editing — Shot List
| # | Shot | Timecode (s) |
|---|---|---|
| 1 | Café Opening | 0.00–2.50 |
| 2 | 80s Street Energy | 2.50–6.00 |
| 3 | First Drop | 6.00–8.00 |
| 4 | Quick Camera Sweep Through Time | 8.00–11.00 |
| 4a | Grinding the Beans | 11.00–13.00 |
| 5 | Street in the 90s — Rain | 13.00–16.00 |
| 6 | Steam Cleaning | 16.00–19.00 |
| 7 | The Street Today | 19.00–22.00 |
| 8 | Latte Art Finish | 22.00–25.50 |
| 9 | Packshot | 25.50–30.00 |
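A shot list like this is easy to break when one cut point moves, so a quick check that the shots are contiguous and add up to 30 seconds is useful. The sketch below assumes the timecodes are decimal seconds; the function name `shot_durations` is hypothetical.

```python
# Shot list from the table above: (number, name, start second, end second).
SHOTS = [
    ("1", "Café Opening", 0.0, 2.5),
    ("2", "80s Street Energy", 2.5, 6.0),
    ("3", "First Drop", 6.0, 8.0),
    ("4", "Quick Camera Sweep Through Time", 8.0, 11.0),
    ("4a", "Grinding the Beans", 11.0, 13.0),
    ("5", "Street in the 90s, Rain", 13.0, 16.0),
    ("6", "Steam Cleaning", 16.0, 19.0),
    ("7", "The Street Today", 19.0, 22.0),
    ("8", "Latte Art Finish", 22.0, 25.5),
    ("9", "Packshot", 25.5, 30.0),
]
TOTAL_SECONDS = 30.0


def shot_durations(shots, total):
    """Return {shot number: duration} after checking for gaps, overlaps and total length."""
    expected_start = 0.0
    durations = {}
    for number, name, start, end in shots:
        if start != expected_start:
            raise ValueError(f"shot {number} ({name}) starts at {start}, expected {expected_start}")
        if end <= start:
            raise ValueError(f"shot {number} ({name}) has no positive duration")
        durations[number] = end - start
        expected_start = end
    if expected_start != total:
        raise ValueError(f"timeline ends at {expected_start}, expected {total}")
    return durations


durations = shot_durations(SHOTS, TOTAL_SECONDS)
for number, seconds in durations.items():
    print(f"shot {number}: {seconds:.1f} s")
```

The longest shot in this list is the packshot at 4.5 seconds, and no shot is shorter than 2 seconds.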
Outcome
A repeatable end-to-end workflow: Concept → Character system → Prompt engineering → Generation → Sound & VO → Edit.
Results: higher image accuracy; consistent colour palette with rainy-mood transitions; reduced artefact count; VO speaking rate matched to the calm editing pace.
What to repeat:
- Multi-take sequences (2 × 4-second takes per shot)
- Alternative lens focal lengths per scene
- Locked colour palette as a generation anchor
Final checklist: Composition ✓ | Colour palette ✓ | Artefacts ✓ | SFX background ✓ | VO ✓