Here is a production problem that did not exist two years ago: you need to generate a single video clip that works in both landscape (16:9 for YouTube) and portrait (9:16 for Instagram Reels, YouTube Shorts, TikTok). You are using an AI video model: Gemini Veo, Sora, Runway, whatever ships next month. You can only afford to generate each clip once. The same footage must survive a center crop that removes about two-thirds of the original frame's width.
This is not a hypothetical. I produce daily video content this way: seven cinematic clips per day, generated in 16:9 landscape, then center-cropped to 9:16 portrait in editing. One generation, two formats, zero reshoots. Over the course of iterating across dozens of clips, a pattern emerged in how AI video models respond to compositional instructions — and it generalizes well beyond my specific use case.
The first instinct is to tell the model what to do:
This works. Roughly. The model understands "centered" and places the subject near the middle of the frame. But "near the middle" is not the same as "survives a 68% width reduction." On a 1920x1080 frame, a full-height 9:16 center crop keeps only the middle 608 pixels or so (1080 × 9/16 = 607.5), running from roughly pixel 656 to pixel 1264. If the subject's shoulder extends left to pixel 650, it gets cut. If atmospheric elements that sell the scene live in the outer thirds, they vanish entirely.
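The crop arithmetic is easy to check directly. The following is a minimal sketch, assuming a 1920x1080 source and a full-height center crop; the function name `center_crop_window` is my own and not part of any editing tool.

```python
from fractions import Fraction


def center_crop_window(src_w, src_h, target_w, target_h):
    """Full-height center crop of a src_w x src_h frame to target_w:target_h.

    Returns (x_left, x_right, kept_width) in source pixels.
    """
    kept = Fraction(src_h * target_w, target_h)  # width that survives the crop
    if kept > src_w:
        raise ValueError("target is wider than the source; crop height instead")
    x_left = (src_w - kept) / 2
    return float(x_left), float(x_left + kept), float(kept)


# Assumed source resolution: 1920x1080 (16:9), cropped to 9:16.
x_left, x_right, kept = center_crop_window(1920, 1080, 9, 16)
removed = 1 - kept / 1920

print(f"kept width: {kept} px (from x={x_left} to x={x_right})")
print(f"width removed: {removed:.1%}")
```

In practice an editor rounds the window to whole pixels. With ffmpeg's `crop=w:h:x:y` filter, for instance, something like `crop=608:1080:656:0` would be a reasonable rounding of the values above.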
The imperative instruction fails because it tells the model what without telling it why. The model has no mental model of what will happen to its output after generation. It optimizes for the frame it can see — the full 16:9 canvas — and distributes visual interest across the entire width, as any good composition would.
The improvement came from giving the model the full context of the downstream transformation:
Three things changed. First, the model now knows why centering matters — the footage will be physically cropped. This is not an aesthetic preference; it is a production constraint with real consequences. Second, the "middle third" gives a concrete spatial rule instead of a vague "centered." Third, and most importantly, the instruction reframes the left and right edges from "wasted space" to "expendable atmosphere" — giving the model explicit permission to put beautiful but non-essential content there. The edges become a visual gift to the landscape version without threatening the portrait version.
Generative video models (and image models, and language models) are context-completion machines. They do not follow instructions the way a camera operator follows a director. They generate the most likely completion of the context they are given. When the context includes a production constraint with a clear rationale, the model integrates that constraint into its generative process more deeply than when it receives a bare imperative.
This resembles the way "explain your reasoning step by step" tends to be more effective than "be accurate" in language models. It is not that the model suddenly becomes more capable. A plausible explanation is that the richer context activates more relevant patterns in the model's learned representations. On this view, a model that has seen thousands of examples of center-weighted compositions designed for multi-format output will generate better multi-format compositions when it recognizes the pattern from the prompt.
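For multi-format video specifically, the "middle third" provides a clean mental model and a close approximation of the real crop: a full-height 9:16 window keeps about 31.6% of a 16:9 frame's width, just under a true third. Divide the 16:9 frame into three vertical strips: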
Left third — expendable atmosphere. Fog, architecture edges, environmental depth. Beautiful in landscape, invisible in portrait. This is where you let the cinematographer in the model do its most extravagant work, because nothing here needs to survive the crop.
Middle third — the story. Every subject, every piece of action, every visual element that carries narrative or emotional weight lives here. This strip is the portrait version. Compose as if this were the only frame that exists.
Right third — expendable atmosphere. Same as the left. Symmetry is not required — asymmetric atmospheric framing often produces more cinematic landscape compositions precisely because the model is not trying to balance visual weight across the full frame.
This framework applies to any aspect ratio transformation where the target is narrower than the source. A 4:3 to 1:1 crop. A 2.39:1 cinema crop to 16:9. The principle is the same: identify what survives, tell the model what survives, and explicitly grant permission to fill the rest with atmosphere.
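To "identify what survives" for a given pair of aspect ratios, the surviving share of the width is simply the target ratio divided by the source ratio. A small sketch, with `surviving_fraction` as an illustrative helper name of my own:

```python
from fractions import Fraction


def surviving_fraction(src_ratio, target_ratio):
    """Share of the source width kept by a full-height center crop.

    Ratios are width/height, e.g. Fraction(16, 9) for 16:9.
    """
    src, tgt = Fraction(src_ratio), Fraction(target_ratio)
    if tgt > src:
        raise ValueError("target must be narrower than (or equal to) the source")
    return tgt / src


cases = {
    "16:9 -> 9:16": (Fraction(16, 9), Fraction(9, 16)),
    "4:3 -> 1:1": (Fraction(4, 3), Fraction(1)),
    "2.39:1 -> 16:9": (Fraction("2.39"), Fraction(16, 9)),
}

for name, (src, tgt) in cases.items():
    share = float(surviving_fraction(src, tgt))
    print(f"{name}: {share:.1%} of the width survives")
```

The "middle third" wording fits the 16:9 to 9:16 case, where just under a third survives. For the gentler crops (roughly three quarters of the width kept in the other two cases), the prompt should name the actual surviving region rather than a third.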
Across approximately fifty clips generated with the v1 imperative instruction and thirty with the v2 contextual instruction, several patterns emerged:
Subject drift. With v1, subjects occasionally drifted to the left or right third during camera movement sequences (slow pans, tracking shots). The model understood "centered" as a starting position, not a sustained constraint. The v2 instruction, by framing the middle third as the zone where "all key subjects and action" must live, produced more consistent centering throughout the clip's duration.
Edge-loaded storytelling. V1 clips sometimes placed secondary narrative elements — a glowing symbol, a reaching hand, a significant object — in the outer edges. These elements were visible in 16:9 and lost in 9:16, creating a portrait version that felt incomplete. The v2 instruction explicitly maps "visual storytelling" to the middle third, which reduced this failure mode.
Atmospheric courage. Counterintuitively, the v2 instruction produced more visually ambitious landscape versions. By giving the model explicit permission to use the edges for atmosphere that "can be lost without affecting the story," the model appeared to treat those regions as a creative playground. More volumetric fog. More architectural detail. More environmental scale. The permission to be expendable freed the edges from having to be conservative.
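Subject drift in particular can be checked mechanically if per-frame subject positions are available. The sketch below is hypothetical: it assumes horizontal subject extents `(x_min, x_max)` per frame from some detector or manual annotation (the sample values are invented for illustration, not measurements from my clips), and flags frames where the subject leaves the 9:16 crop window.

```python
def frames_outside_crop(boxes, src_w=1920, src_h=1080, target=(9, 16)):
    """Return indices of frames whose subject extends past the center-crop window.

    boxes: list of (x_min, x_max) horizontal subject extents, in source pixels.
    """
    kept = src_h * target[0] / target[1]  # surviving width in pixels
    left = (src_w - kept) / 2
    right = left + kept
    return [
        i for i, (x_min, x_max) in enumerate(boxes)
        if x_min < left or x_max > right
    ]


# Invented example: the subject starts centered, then drifts during a pan.
boxes = [(800, 1100), (700, 1000), (640, 940), (900, 1300)]
flagged = frames_outside_crop(boxes)
print("frames where the subject is clipped in portrait:", flagged)
```

A check like this would make a v1 versus v2 comparison less impressionistic, though it was not part of the workflow described here.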
The underlying principle — describe downstream transformations in the prompt — applies wherever AI-generated output will be processed, cropped, reformatted, or embedded after generation:
Image generation for responsive web design. If a hero image will be displayed at 16:9 on desktop and center-cropped to 1:1 on mobile, the image generation prompt should say so. "This image will be displayed as a wide banner on desktop and center-cropped to a square on mobile. Place the key visual elements in the center square of the composition."
Text generation for multi-channel distribution. If a piece of copy will be truncated to 280 characters for X/Twitter and displayed in full on a blog, the prompt can encode this: "The first sentence must be a complete, compelling statement that works as a standalone tweet. The full piece can expand to three paragraphs."
Code generation for multiple environments. If generated code will run in both Node.js and browser environments, describing both targets in the prompt produces more robust output than asking for "universal JavaScript."
In every case, the model performs better when it knows the full lifecycle of its output, not just the immediate format.
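One way to make this habit systematic is to keep the downstream steps as data and append them to every prompt. The following is a sketch under stated assumptions: `Downstream` and `with_downstream_context` are hypothetical names of my own, the base prompt is invented, and the example wording is taken from the responsive-web case above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Downstream:
    """One post-generation step, stated in plain language for the model."""
    what_happens: str       # the transformation and why it exists
    what_must_survive: str  # the compositional rule that follows from it


def with_downstream_context(base_prompt, steps):
    """Append each downstream step to the prompt as its own paragraph."""
    paragraphs = [base_prompt.strip()]
    for step in steps:
        paragraphs.append(f"{step.what_happens} {step.what_must_survive}")
    return "\n\n".join(paragraphs)


hero_crop = Downstream(
    what_happens=(
        "This image will be displayed as a wide banner on desktop "
        "and center-cropped to a square on mobile."
    ),
    what_must_survive=(
        "Place the key visual elements in the center square of the composition."
    ),
)

# Hypothetical base prompt, for illustration only.
prompt = with_downstream_context("A lighthouse at dusk, soft fog.", [hero_crop])
print(prompt)
```

Keeping the transformation and its consequence together in one record mirrors the point of this note: the constraint travels with its rationale.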
This research note is based on one model family (Gemini Veo), one use case (daily recovery content), and one operator (me). The sample size is meaningful for practical workflow development but not for rigorous claims. The comparison between v1 and v2 instructions was not controlled — later clips also benefited from improved prompt engineering in other dimensions (mood language, camera movement specificity, film grain instructions) that evolved in parallel.
The middle-third framework assumes center cropping, which is the most common but not the only transformation. Offset crops, dynamic crops that track subjects, and AI-powered reframing tools may change the optimal prompt strategy. As reframing tools improve, the problem this note addresses may become less relevant — or it may become more relevant, as prompt-encoded compositional intent gives reframing tools better signals to work with.