AI-Generated Ads · How to Maintain Quality at Scale

Every team running AI-generated ads at scale eventually hits the same wall: volume is easy, quality is not. The technology can produce a thousand variants in an afternoon. Most of them will not ship, because most of them will not get past the person who has to approve them, and the ones that do ship will not necessarily perform. That is the actual quality problem of AI-generated ads in 2026, and almost every framework for solving it gets the definition of quality wrong from the start.

This piece lays out the math behind what quality actually means inside an AI ad operation, the two-part definition Hi Luca uses to score it, and why solving for one part without the other is what burns out the paid social teams running these systems.

AI ad production pipeline showing the approval gate and the performance gate as two separate filters

The definition of quality that actually holds at scale

Most teams running AI-generated ads define quality as “does this look good?” — a craft standard inherited from the agency model. That standard was fine when a creative team produced eight ads per quarter. It collapses immediately when the same team is approving four hundred variants per month, because “does this look good?” is not a falsifiable scoring function.

The working definition of quality for AI-generated ads has two components, both of which have to be satisfied for an ad to be considered high quality:

Approval-grade quality. Does the ad get approved by the person whose job it is to approve ads? Not in the abstract — concretely, by the specific human who owns the brand voice and the legal and compliance posture for this account. If the ad would not survive that person's review, it does not matter how technically polished it is.
Performance-grade quality. Does the ad, once shipped, actually move the delivery metric the campaign is being optimised for? If it gets approved but consistently underperforms, it is high-quality artwork and a low-quality ad.

An ad is high quality only when both gates are satisfied. The system that produces AI-generated ads at scale has to be engineered around scoring both, simultaneously, before the ad is ever shown to a human reviewer. That is the part most production systems skip, and the gap is where most of the wasted budget goes. We unpacked the broader version of this argument in the uncomfortable truth about variant generation — psychological diversity matters, but only inside the constraint that both quality gates remain satisfied.
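The two-gate rule can be sketched in a few lines, assuming each variant already carries two pre-launch scores between 0 and 1. The class name, field names, and threshold values here are illustrative assumptions, not Hi Luca's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantScore:
    """Pre-launch scores for one ad variant (hypothetical fields)."""
    approval_confidence: float      # estimated chance the approver accepts it, 0..1
    performance_probability: float  # estimated chance it beats the baseline, 0..1


def passes_both_gates(score: VariantScore,
                      approval_threshold: float = 0.8,
                      performance_threshold: float = 0.6) -> bool:
    """An ad counts as high quality only if BOTH gates clear their thresholds."""
    return (score.approval_confidence >= approval_threshold
            and score.performance_probability >= performance_threshold)


# Approved-looking artwork that is unlikely to perform.
polished_but_weak = VariantScore(approval_confidence=0.95, performance_probability=0.30)
# Likely performer that the approver would reject.
strong_but_risky = VariantScore(approval_confidence=0.40, performance_probability=0.90)
# Clears both gates.
ships = VariantScore(approval_confidence=0.90, performance_probability=0.75)

for name, score in [("polished_but_weak", polished_but_weak),
                    ("strong_but_risky", strong_but_risky),
                    ("ships", ships)]:
    print(name, passes_both_gates(score))
```

The point of the conjunction is that a very high score on one gate cannot compensate for a miss on the other, which is why the sketch uses a logical AND rather than an average of the two scores.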

The approval gate · why most AI systems fail it

Approval-grade quality is not about brand guidelines. Every AI system claims to respect brand guidelines. The actual variable is whether the system knows who is going to approve the ad and what that person specifically cares about — because brand guidelines as written and brand standards as enforced by a specific approver are almost never the same document.

Consider a typical multi-product fintech account. The brand guidelines document says certain things about tone, certain things about visual treatment, certain things about regulatory disclosures. The approver — the marketing lead who signs off on every ad before launch — has additional unwritten rules. Never use the word “guarantee.” Never lead with the interest rate in the first three seconds. Never show a model without a hand visible. None of those rules are in the brand document. All of them will get an ad rejected.

An AI ad production system that scores quality only against the brand document will produce ads that the AI system thinks are on-brand and that the approver rejects. The team absorbing that gap is the team manually reviewing rejected variants every morning — usually for three hours before getting to the work they were hired to do. Salesforce's 2025 State of Marketing report documents that marketing teams running AI creative production lose an average of 22% of generated assets at the approval gate — a tax that does not appear on the production cost line.

The fix is not better brand guidelines. The fix is structurally different. The system has to learn the approver's actual decision pattern, not just the documented one, and the scoring step has to run against that pattern before the asset is ever rendered.
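One simple way to surface an approver's unwritten rules is to count how often each creative attribute appears in rejected versus approved variants. The sketch below assumes every past variant has been tagged with attribute labels; the tag names, the minimum sample size, and the rejection-rate cutoff are all hypothetical, and a production system would likely need something richer than tag counting.

```python
from collections import Counter


def learn_rejection_signals(history, min_seen=3, min_rate=0.8):
    """Return tags whose presence is strongly associated with rejection.

    history: iterable of (tags, approved) pairs, where tags is a set of
    attribute labels and approved is the approver's actual decision.
    """
    seen, rejected = Counter(), Counter()
    for tags, approved in history:
        seen.update(tags)
        if not approved:
            rejected.update(tags)
    return {
        tag: rejected[tag] / seen[tag]
        for tag in seen
        if seen[tag] >= min_seen and rejected[tag] / seen[tag] >= min_rate
    }


# Hypothetical decision log for the fintech example above.
history = [
    ({"uses_guarantee", "rate_in_first_3s"}, False),
    ({"uses_guarantee"}, False),
    ({"uses_guarantee", "testimonial"}, False),
    ({"testimonial"}, True),
    ({"testimonial", "rate_in_first_3s"}, False),
    ({"testimonial"}, True),
    ({"rate_in_first_3s"}, False),
]

signals = learn_rejection_signals(history)
print(signals)
```

In this toy log, the two attributes that never survived review are surfaced as rejection signals, while "testimonial" is not, because it was approved half the time. Neither rule needs to appear in the brand document for the pattern to show up in the decisions.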

The performance gate · why approval alone is not enough

The second gate is the one that gets shortchanged by every AI system optimising purely for approval. An ad that passes the approver but underperforms in market burns budget without producing learning. Approval without performance is artwork. Performance without approval is a policy violation. Both fail.

The performance gate has to be scored at production time, not at post-launch read-out. Scoring it at read-out is just measurement, not quality control — by the time the data comes in, the budget is already spent. The working AI ad system scores performance likelihood at the variant generation step, using prior performance data from the same account, the same audience, and adjacent product lines, to estimate the probability that a given variant will land above the campaign performance baseline.
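A minimal sketch of production-time performance scoring, assuming the only input is the delivery metric of comparable past variants from the same account. It estimates the probability of beating the baseline as a smoothed win rate (the posterior mean of a Beta-Bernoulli model with a uniform prior). The function name, the smoothing choice, and the sample numbers are assumptions for illustration, not the scoring method the article's system uses.

```python
def performance_probability(prior_results, baseline,
                            prior_wins=1.0, prior_losses=1.0):
    """Estimate the chance a new variant lands above the campaign baseline.

    prior_results: metric values (e.g. click-through rates) from comparable
    past variants. The prior_* pseudo-counts keep the estimate away from
    0 or 1 when there is little history.
    """
    wins = sum(1 for result in prior_results if result > baseline)
    losses = len(prior_results) - wins
    return (wins + prior_wins) / (wins + losses + prior_wins + prior_losses)


# Hypothetical click-through rates from six comparable past variants.
comparable_ctrs = [0.031, 0.034, 0.018, 0.040, 0.029, 0.027]
campaign_baseline = 0.025

estimate = performance_probability(comparable_ctrs, campaign_baseline)
print(f"Estimated probability of beating baseline: {estimate:.2f}")
```

With no history at all the estimate falls back to 0.5, which is one way of making "we do not know yet" explicit instead of reporting a confident score on thin data.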

Google's Think with Google measurement research places the performance gap between the highest-scoring and lowest-scoring variant in a typical paid social campaign at 4-7x. Producing volume without performance scoring means a team is, on average, paying three to four times the cost-per-result it could achieve if performance scoring ran at production time. We modelled the dollar version of this gap in detail in AI ads · the system Meta actually rewards.

Diagram comparing AI ad production with and without performance scoring at the variant generation step

The Hi Luca quality algorithm · approval × performance, scored pre-launch

Hi Luca's quality control system is built around the two-gate definition above. The system produces variants in three coordinated steps, with quality scored at each step before the variant ever reaches a human reviewer.

Step one · Approver modelling. The system maintains a structured representation of each account's actual approver — not just the brand document but the approver's historical decisions: which variants they accepted, which they rejected, and what changed between the two. The approver model is editable by the marketing lead; it is not a black box that the operator has to fight.
Step two · Variant generation under approval constraint. The system generates variants designed to land inside the approver model from the first draft. It does not generate freely and then filter; it generates inside the constraint. The variants that emerge are pre-scored against the approver model with a confidence score, and any variant scoring below the confidence threshold is iterated automatically before the operator sees it.
Step three · Performance scoring against account history. Each surviving variant is then scored against the account's prior performance data — same audience, same product, same platform. The variants below a performance probability threshold are flagged. The operator sees only variants that have passed both gates with confidence.
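The three steps above can be sketched as a single round of control flow. This is an illustration under stated assumptions: the scoring and revision functions are passed in as hypothetical callables, the thresholds are made up, and generation inside the approver constraint is represented only by the revise-until-confident loop, not by a real generator.

```python
from typing import Callable, Iterable, List, Tuple


def run_two_gate_round(
    drafts: Iterable[str],
    score_approval: Callable[[str], float],
    revise: Callable[[str], str],
    score_performance: Callable[[str], float],
    approval_threshold: float = 0.8,
    performance_threshold: float = 0.6,
    max_revisions: int = 2,
) -> Tuple[List[str], List[str]]:
    """Return (shown_to_operator, flagged_for_low_performance)."""
    shown, flagged = [], []
    for variant in drafts:
        # Step two: iterate automatically while approval confidence is low.
        for _ in range(max_revisions):
            if score_approval(variant) >= approval_threshold:
                break
            variant = revise(variant)
        if score_approval(variant) < approval_threshold:
            continue  # still below threshold: never reaches the operator
        # Step three: score the survivor against account history.
        if score_performance(variant) >= performance_threshold:
            shown.append(variant)
        else:
            flagged.append(variant)
    return shown, flagged


# Toy stand-ins for the approver model, reviser, and performance model.
def toy_approval(text: str) -> float:
    return 0.1 if "guarantee" in text.lower() else 0.9


def toy_revise(text: str) -> str:
    # Case-sensitive on purpose, so one draft below cannot be repaired.
    return text.replace("guarantee", "aim for")


def toy_performance(text: str) -> float:
    return 0.7 if "save" in text.lower() else 0.4


shown, flagged = run_two_gate_round(
    ["We guarantee lower fees", "Save on every transfer",
     "A new way to bank", "Guaranteed returns"],
    toy_approval, toy_revise, toy_performance,
)
print("shown:", shown)
print("flagged:", flagged)
```

Capping the revision loop matters in practice: a variant that cannot be brought inside the approver model after a few attempts is dropped silently, so the operator's queue only ever contains work that has passed the approval gate.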

The output of this pipeline is a smaller set of variants per round than a free-generation system would produce — typically four to eight variants instead of forty — and a higher proportion of them ship, with a higher hit rate once shipped. The operator does not spend three hours every morning triaging rejections. The marketing lead does not have to absorb the cognitive load of reviewing forty assets to approve eight.

Three-step variant generation pipeline · approver modelling, generation under approval constraint, performance scoring

The math of fewer experimentation loops

The structural advantage of the two-gate quality model is that it collapses the number of experimentation loops required to land a winning variant. A free-generation system that scores only at human review needs three to four rejection-and-revision cycles per campaign to converge on a variant that ships and performs. The two-gate model converges in one cycle, sometimes two.

The difference looks like the following table, modelled on a mid-size paid social campaign with a four-week run window:

Operating model	Variants generated	Variants shipped	Experimentation loops	Time to first winning variant
Manual creative team	6-8	4-6	1	14-18 days
AI-assisted (no quality scoring)	40-60	15-25	3-4	10-12 days
Two-gate quality scoring	8-12	8-12	1-2	4-6 days

The headline number is not the variant count. The headline number is the time to first winning variant. At 4-6 days to a first winner against 14-18 for a manual creative team, a team operating with two-gate quality scoring runs roughly three times the experimentation cadence per quarter, with the same budget and the same human team. The compounding effect over a year is roughly an order of magnitude in audience insight per dollar spent.
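The cadence claim can be checked with back-of-envelope arithmetic on the table. The sketch below assumes the midpoint of each time-to-first-winner range, a 91-day quarter, and back-to-back experimentation cycles; all three are simplifying assumptions made for illustration.

```python
# Time to first winning variant, in days, taken from the table above.
TIME_TO_FIRST_WINNER_DAYS = {
    "Manual creative team": (14, 18),
    "AI-assisted (no quality scoring)": (10, 12),
    "Two-gate quality scoring": (4, 6),
}
QUARTER_DAYS = 91  # assumption: 13 weeks, cycles run back to back


def midpoint(bounds):
    low, high = bounds
    return (low + high) / 2


def cycles_per_quarter(bounds):
    return QUARTER_DAYS / midpoint(bounds)


cadence = {model: cycles_per_quarter(bounds)
           for model, bounds in TIME_TO_FIRST_WINNER_DAYS.items()}
two_gate = cadence["Two-gate quality scoring"]

for model, cycles in cadence.items():
    print(f"{model}: {cycles:.1f} cycles per quarter, "
          f"two-gate advantage {two_gate / cycles:.1f}x")
```

On these assumptions the two-gate model runs about 3.2 times the cadence of the manual team and about 2.2 times the cadence of AI-assisted production without quality scoring, so the "three times" figure holds against the manual baseline specifically.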

The companion piece on the broader system economics of this shift is in our deep-dive on AI marketing tools versus AI marketing systems — the performance gate is one of the dimensions where the system model beats the tool model most decisively.

Where two-gate quality scoring fails

Honest version: the two-gate model has failure modes. Three of them show up consistently in teams six months into the operating model.

Approver drift. The approver's decision pattern changes — new regulation, new product positioning, a change in legal counsel — and the approver model lags behind. Variants start getting rejected at higher rates than the system predicts. The fix is a quarterly approver model audit, run by the marketing lead, surfaced by the system as a scheduled review.
Performance baseline rot. The performance scoring step uses prior campaign data. If the account has a quiet quarter, or pivots its product strategy, the baseline becomes stale. Performance scores stay confident, but the underlying assumption is wrong. The fix is recency-weighting the baseline and flagging confidence-decay explicitly to the operator.
Over-pruning at the generation step. The approval-constraint step can be tuned too tightly, producing variants that are technically perfectly on-approver but psychologically too similar to each other. The fix is monitoring cluster diversity in the surviving variant set and loosening the approval threshold if cluster diversity drops below a minimum bound.
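The recency-weighting fix for baseline rot can be sketched with exponential decay. This assumes each historical observation carries an age in days and a metric value; the half-life, the staleness cutoff, and the function names are hypothetical choices, not parameters taken from Hi Luca's system.

```python
def recency_weighted_baseline(observations, half_life_days=30.0):
    """Return (baseline, effective_weight) from (age_days, value) pairs.

    Each observation's weight halves every half_life_days. The summed
    weight doubles as a rough confidence signal: it shrinks as data ages.
    """
    weights = [0.5 ** (age / half_life_days) for age, _ in observations]
    total = sum(weights)
    if total == 0:
        raise ValueError("no usable observations")
    baseline = sum(w * value
                   for w, (_, value) in zip(weights, observations)) / total
    return baseline, total


def is_stale(effective_weight, min_effective_weight=1.0):
    """Flag confidence decay explicitly instead of scoring on old data."""
    return effective_weight < min_effective_weight


# Hypothetical click-through rates: one from today, one a month old.
fresh = [(0, 0.030), (30, 0.020)]
# Same account after a quiet quarter: nothing newer than four months.
quiet = [(120, 0.020), (150, 0.030)]

for label, data in [("fresh", fresh), ("quiet", quiet)]:
    baseline, weight = recency_weighted_baseline(data)
    print(f"{label}: baseline={baseline:.4f} "
          f"effective_weight={weight:.3f} stale={is_stale(weight)}")
```

The useful property is that the baseline and its confidence are computed together, so a quiet quarter lowers the effective weight and trips the staleness flag even though the baseline number itself still looks plausible.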

None of these failure modes invalidates the model. All of them argue for treating the quality algorithm as a living system: tuned, audited, and recalibrated on a regular cadence, the same way a finance team treats a forecasting model.
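Two of the recurring audits can be expressed as small health checks: one for approver drift and one for over-pruning. The sketch assumes the system logs its predicted approval confidence next to the approver's actual decision, and it uses word overlap between variants as a crude stand-in for psychological diversity. Both function names and the choice of metrics are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean


def approval_drift(predicted_confidences, decisions):
    """Gap between predicted and observed approval rate.

    A large positive gap means the approver is rejecting more than the
    approver model expects, which suggests the model has gone stale.
    """
    observed = mean([1.0 if approved else 0.0 for approved in decisions])
    return mean(predicted_confidences) - observed


def mean_pairwise_jaccard_distance(variants):
    """Average word-level dissimilarity across all variant pairs (0..1)."""
    token_sets = [set(text.lower().split()) for text in variants]
    distances = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        distances.append(1 - len(a & b) / len(union) if union else 0.0)
    return mean(distances) if distances else 0.0


# Hypothetical audit inputs.
drift = approval_drift([0.9, 0.85, 0.9, 0.95], [True, False, False, True])
diversity = mean_pairwise_jaccard_distance(
    ["save on fees today", "save on fees now"]
)
print(f"approval drift: {drift:.2f}")
print(f"variant diversity: {diversity:.2f}")
```

A drift well above zero would be a reason to bring the approver model audit forward, and a diversity score near zero would be a reason to loosen the approval threshold. What counts as "too high" or "too low" is an account-level judgment that this sketch does not try to set.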

Dashboard view showing approver confidence and performance probability scoring side by side

Why this matters for the production layer specifically

Hi Luca's Ads Assembly system is built around the two-gate model described above. The approver layer is structured per account, persistent across sessions, and editable by the marketing lead without engineering involvement. The performance scoring layer pulls from the account's historical delivery data — what shipped, what got served, what got suppressed, what landed above baseline — and runs the prediction step at variant generation, not at read-out.

The teams running Ads Assembly do not stop reviewing variants. They do not stop checking approval. They do less of both, because the system pre-scores both gates with confidence the operator can see, and the assets that reach the human reviewer have already passed the harder filter. The marketing lead approves variants instead of triaging them. The performance read-out converges in days instead of weeks. The agency P&L improves in the part of the spreadsheet that gets argued about most: the cost per validated creative learning.

Forrester's 2025 State of Marketing AI projects that by 2027, the gap between teams operating two-gate quality control and teams operating free-generation systems will be roughly two-to-one on cost-per-validated learning. The teams adopting the two-gate model now are the ones that will sit on the favourable side of that gap. The teams adopting it in 2027 will start from the unfavourable side, after their competitors have already compounded the advantage for two years. The practitioner-side data on this differential is consistent with what Meta publishes through its official newsroom (about.fb.com): teams operating with pre-scored variant generation see measurable lift on the campaigns Meta's delivery system rewards.

For the operating-system view of the broader shift this fits inside, see The Creative Agency of 2028 — the two-gate quality model is one of three load-bearing components of the operating layer that distinguishes the agencies tripling their margins from the ones that are not.