When an LLM produces wrong output, the first instinct is to fix the prompt.
Add more instructions. Be more specific. Add examples.
This works sometimes.
But there is a class of errors where the prompt is not the problem. The problem is in the data model the LLM is reasoning about.
Here is a real case from a production marketing AI system.
The symptom
The system detects marketing performance issues and writes human-readable recommendations.
One issue type - call it FATIGUED_OR_RED_MEDIA - covered two distinct situations:
Creative fatigue. Ads seen too many times by the same audience, leading to declining engagement.
Low-quality or disapproved media. Ads being suppressed by the platform because of quality signals, policy flags, or audience rejection rates.
The problem: the LLM was regularly confusing the two.
A recommendation for a creative being suppressed for quality reasons would describe it as "fatigued" - pointing the team toward refreshing the creative, when the actual fix was reviewing the quality signals.
The inverse also happened. Fatigued creatives got described with language about "platform suppression."
Either way, the team was sent in the wrong direction.
The wrong fix
The first attempt was a prompt change:
IMPORTANT: Do not describe frequency-based fatigue as platform suppression.
Do not describe platform-suppressed media as fatigued.
If the issue is frequency, use the word "fatigue".
If the issue is quality/rejection, use "quality signals" or "disapproval".
This helped on straightforward cases.
It failed on edge cases where the evidence was mixed or the context was long.
Prompt instructions are not machine code. They are suggestions with varying compliance.
The deeper issue: the prompt instruction was trying to do the work that the data model was supposed to do.
FATIGUED_OR_RED_MEDIA was ambiguous by design. The original detection logic could not reliably distinguish the two situations, so it combined them into one issue type.
The LLM was being asked to distinguish what the data model refused to distinguish.
The real fix
Split FATIGUED_OR_RED_MEDIA into two distinct issue types:
- FATIGUED_MEDIA: frequency-driven fatigue (high avg frequency, declining CTR over time)
- RED_MEDIA_CLASSIFICATION: platform quality/rejection signals (low quality score, disapproval flags, high audience rejection rate)
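The criteria in parentheses translate naturally into two independent detectors. The following is a minimal sketch, not the production detection logic: `CreativeMetrics`, `detect_issues`, and the `MAX_REJECTION_RATE` threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CreativeMetrics:
    """Hypothetical raw metrics for one creative; not the production schema."""
    avg_frequency: float
    frequency_benchmark: float
    ctr_delta_7d: float  # relative change, e.g. -0.20 for a 20% drop
    quality_score: float
    quality_benchmark: float
    audience_rejection_rate: float
    disapproval_flags: list[str] = field(default_factory=list)


# Illustrative threshold only; a real value would be tuned per platform.
MAX_REJECTION_RATE = 0.05


def is_fatigued(m: CreativeMetrics) -> bool:
    # Frequency-driven: shown more often than the benchmark AND CTR is falling.
    return m.avg_frequency > m.frequency_benchmark and m.ctr_delta_7d < 0


def is_red(m: CreativeMetrics) -> bool:
    # Platform-driven: any single quality or rejection signal is enough.
    return (
        m.quality_score < m.quality_benchmark
        or bool(m.disapproval_flags)
        or m.audience_rejection_rate > MAX_REJECTION_RATE
    )


def detect_issues(m: CreativeMetrics) -> list[str]:
    # The detectors run independently, so a creative with both problems
    # yields two separate issues instead of one blended issue.
    issues = []
    if is_fatigued(m):
        issues.append("FATIGUED_MEDIA")
    if is_red(m):
        issues.append("RED_MEDIA_CLASSIFICATION")
    return issues
```

The point of the sketch is the shape, not the thresholds: a creative that is both overexposed and flagged by the platform produces two issues, each of which gets its own recommendation.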
Each issue type now has distinct detection criteria, different evidence fields, and different recommended actions.
from dataclasses import dataclass


@dataclass
class FatiguedMediaEvidence:
    avg_frequency: float
    frequency_benchmark: float
    ctr_delta_7d: float
    engagement_rate_delta_7d: float
    # No quality_score, no disapproval_flags


@dataclass
class RedMediaEvidence:
    quality_score: float
    quality_benchmark: float
    disapproval_flags: list[str]
    audience_rejection_rate: float
    # No frequency fields

The LLM prompt for FATIGUED_MEDIA does not mention platform suppression because the evidence object does not contain suppression signals.
The LLM prompt for RED_MEDIA_CLASSIFICATION does not mention frequency because the evidence object does not contain frequency data.
With the other type's signals absent from its input, the LLM has nothing to conflate.
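One way to get this property is to render the prompt mechanically from the evidence object, so a field that does not exist on the dataclass cannot appear in the prompt. This is a sketch under assumptions: `build_prompt` and the `PROMPT_INTROS` wording are hypothetical, and the production prompts are not shown in this post.

```python
from dataclasses import asdict, dataclass


@dataclass
class FatiguedMediaEvidence:
    # Repeated from above so the sketch runs on its own.
    avg_frequency: float
    frequency_benchmark: float
    ctr_delta_7d: float
    engagement_rate_delta_7d: float


# Hypothetical per-issue instructions, written for this example only.
PROMPT_INTROS = {
    "FATIGUED_MEDIA": "Write a recommendation for a creative showing audience fatigue.",
    "RED_MEDIA_CLASSIFICATION": "Write a recommendation for a creative with platform quality problems.",
}


def build_prompt(issue_type: str, evidence) -> str:
    # Only fields that exist on the evidence object reach the prompt, so the
    # model never sees signals that belong to the other issue type.
    lines = [f"- {name}: {value}" for name, value in asdict(evidence).items()]
    return PROMPT_INTROS[issue_type] + "\nEvidence:\n" + "\n".join(lines)


evidence = FatiguedMediaEvidence(
    avg_frequency=6.2,
    frequency_benchmark=3.0,
    ctr_delta_7d=-0.18,
    engagement_rate_delta_7d=-0.12,
)
prompt = build_prompt("FATIGUED_MEDIA", evidence)
```

Because the prompt body is derived from the dataclass, adding a quality field to the fatigue prompt would require changing the schema first, which is a much more visible change than editing a prompt string.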
Deterministic post-processing as a guardrail
Even with the split data model, I added one more check.
A deterministic keyword validation runs on LLM output before it writes to the database:
import re

FATIGUE_TERMS = {"fatigued", "fatigue", "frequency", "overexposed"}
SUPPRESSION_TERMS = {"suppressed", "disapproved", "rejected", "quality score", "policy"}


def find_terms(terms: set[str], text: str) -> set[str]:
    # Match whole words or phrases, so multi-word terms ("quality score")
    # and words followed by punctuation ("fatigued.") are both caught.
    return {t for t in terms if re.search(rf"\b{re.escape(t)}\b", text)}


def validate_recommendation(
    issue_type: str,
    recommendation_text: str,
) -> tuple[bool, str | None]:
    text_lower = recommendation_text.lower()
    if issue_type == "FATIGUED_MEDIA":
        found = find_terms(SUPPRESSION_TERMS, text_lower)
        if found:
            return False, f"Suppression language in fatigue recommendation: {sorted(found)}"
    if issue_type == "RED_MEDIA_CLASSIFICATION":
        found = find_terms(FATIGUE_TERMS, text_lower)
        if found:
            return False, f"Fatigue language in suppression recommendation: {sorted(found)}"
    return True, None

If the check fails, the recommendation is flagged for review or regeneration.
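The regeneration path could be as simple as a bounded retry loop that falls back to human review. The sketch below is an assumption about how that might be wired, not the production code: `generate_with_validation`, the `generate` callable, and the default of three attempts are all hypothetical. The validator is passed in, so any function with the same signature as the check above would fit.

```python
from typing import Callable, Optional

# Same shape as the keyword check: (issue_type, text) -> (ok, reason).
Validator = Callable[[str, str], tuple[bool, Optional[str]]]


def generate_with_validation(
    issue_type: str,
    generate: Callable[[], str],
    validate: Validator,
    max_attempts: int = 3,
) -> tuple[Optional[str], list[str]]:
    """Return (accepted_text, failure_reasons).

    accepted_text is None when every attempt failed validation.
    """
    reasons: list[str] = []
    for _ in range(max_attempts):
        text = generate()
        ok, reason = validate(issue_type, text)
        if ok:
            return text, reasons
        reasons.append(reason or "unknown validation failure")
    # Out of attempts: the caller should route this issue to human review.
    return None, reasons
```

Returning the failure reasons alongside the text keeps the retries observable, so a recommendation that only passed on the third attempt still leaves a trace.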
After the data model split, this check fires near zero times in normal operation. But the flag rate is a useful signal for prompt quality. If it starts rising, something changed.
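Tracking that flag rate does not need much infrastructure. Here is a minimal sketch using a rolling window; `FlagRateMonitor`, the window size, and the alert threshold are hypothetical placeholders rather than values from the production system.

```python
from collections import deque


class FlagRateMonitor:
    """Tracks the share of flagged recommendations over the last `window` checks."""

    def __init__(self, window: int = 500, alert_threshold: float = 0.02):
        # Both defaults are placeholders, not production values.
        self.results = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, passed: bool) -> None:
        # Once the deque is full, the oldest result drops out automatically.
        self.results.append(passed)

    @property
    def flag_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        # Wait for a full window so one early failure does not trigger an alert.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.flag_rate > self.alert_threshold
```

Each call to the validator would be followed by `monitor.record(ok)`, and `should_alert()` can be polled by whatever alerting is already in place.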
The general pattern
A few things this case makes clear.
Ambiguous issue types produce ambiguous outputs. If your data model has catch-all categories, your LLM will produce catch-all recommendations. The fix is in the schema, not the prompt.
Evidence objects prevent cross-contamination. When the evidence object for an issue type only contains fields relevant to that type, the LLM cannot reason about irrelevant signals. This is more effective than telling the LLM what to ignore.
Deterministic post-processing is cheap insurance. A keyword check on the output catches the rare cross-contamination that slips through.
Prompts handle language, not logic. When you find yourself writing long prompt instructions to compensate for an underspecified data model, the data model is the problem.
What changed
After splitting the issue type:
- FATIGUED_MEDIA recommendations consistently describe frequency trends, audience saturation, and creative refresh direction
- RED_MEDIA_CLASSIFICATION recommendations consistently describe quality signals, platform feedback, and remediation steps
- Post-processing validation fires near zero times
The creative team now gets recommendations that point to the correct fix without having to re-diagnose the issue themselves.