Grade Any AI Output — Know If It's Actually Good or Just Sounds Good
AI can sound confident while being wrong. This prompt turns AI into its own quality checker.
Use one AI to judge outputs from multiple models. Get better answers than any single AI can provide alone.
You are a multi-model arbitration expert. I'm going to give you the same question answered by different AI models (or different prompting approaches). Your job: determine which answer is actually best, and synthesize a superior final answer.

The question: [THE ORIGINAL QUESTION]
Answer A: [PASTE FIRST AI'S ANSWER, and label which model/approach]
Answer B: [PASTE SECOND AI'S ANSWER]
Answer C (optional): [PASTE THIRD AI'S ANSWER]

Perform:
1. INDIVIDUAL SCORING: Rate each answer on accuracy (1-10), depth (1-10), actionability (1-10), clarity (1-10)
2. DISAGREEMENT ANALYSIS: Where do the answers contradict each other? Who's right and why?
3. UNIQUE CONTRIBUTIONS: What does each answer provide that the others miss?
4. BLIND SPOTS: What did ALL answers miss? (This is often the most valuable part)
5. SYNTHESIS: Create the definitive answer by combining the best elements from all responses, fixing errors, and filling gaps
6. CONFIDENCE LEVEL: How confident should I be in the synthesized answer? What still needs human verification?

Be ruthlessly objective. Don't favor any model. The goal is truth, not diplomacy.
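If you collect answers in a script rather than pasting them by hand, the template can be filled in programmatically. The sketch below is one possible way to do that, not part of the original prompt: the function name build_arbitration_prompt, the (label, text) input shape, and the shortened step summaries are all assumptions made for illustration.

```python
from string import ascii_uppercase

ROLE = (
    "You are a multi-model arbitration expert. I'm going to give you the same "
    "question answered by different AI models (or different prompting approaches). "
    "Your job: determine which answer is actually best, and synthesize a superior "
    "final answer."
)

# Step names come from the prompt; the summaries are abbreviated for this sketch.
STEPS = [
    "INDIVIDUAL SCORING: rate each answer on accuracy, depth, actionability, clarity (1-10 each)",
    "DISAGREEMENT ANALYSIS: where do the answers contradict each other, and who is right?",
    "UNIQUE CONTRIBUTIONS: what does each answer provide that the others miss?",
    "BLIND SPOTS: what did ALL answers miss?",
    "SYNTHESIS: combine the best elements, fix errors, fill gaps",
    "CONFIDENCE LEVEL: how confident should I be, and what needs human verification?",
]


def build_arbitration_prompt(question, answers):
    """Fill the template. `answers` is a list of (label, text) pairs,
    where label names the model or prompting approach (hypothetical shape)."""
    if len(answers) < 2:
        raise ValueError("need at least two answers to compare")
    lines = [ROLE, "", f"The question: {question}"]
    # Letter the answers A, B, C, ... in the order they were given.
    for letter, (label, text) in zip(ascii_uppercase, answers):
        lines.append(f"Answer {letter} ({label}): {text}")
    lines += ["", "Perform:"]
    lines += [f"{i}. {step}" for i, step in enumerate(STEPS, start=1)]
    lines += ["", "Be ruthlessly objective. Don't favor any model. The goal is truth, not diplomacy."]
    return "\n".join(lines)
```

Calling build_arbitration_prompt("What is the market size?", [("GPT-4", "..."), ("Claude", "...")]) returns a single string you can send to whichever model acts as the judge.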
SCORING:
Answer A (GPT-4): Accuracy 8, Depth 9, Actionability 6, Clarity 8
Answer B (Claude): Accuracy 9, Depth 7, Actionability 9, Clarity 9
DISAGREEMENT: A says market size is $4.2B, B says $3.8B. B is likely correct; A's figure includes adjacent markets.
BLIND SPOTS: Neither answer discussed regulatory risk, which is critical for this industry.
SYNTHESIS: [Combined best-of-both answer with regulatory section added]
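The four scores per answer can be rolled up into a single number for quick comparison. The sketch below uses the sample scores above and weights the four criteria equally. Equal weighting is an assumption: the prompt does not say how to combine scores, and for a high-stakes question you may want accuracy to count for more. The function names are hypothetical.

```python
# Sample scores copied from the example output above.
scores = {
    "Answer A (GPT-4)": {"accuracy": 8, "depth": 9, "actionability": 6, "clarity": 8},
    "Answer B (Claude)": {"accuracy": 9, "depth": 7, "actionability": 9, "clarity": 9},
}


def mean_score(criteria):
    """Unweighted mean of one answer's criterion scores (assumed weighting)."""
    return sum(criteria.values()) / len(criteria)


def rank_answers(all_scores):
    """Return (label, mean) pairs sorted from highest mean to lowest."""
    ranked = [(label, mean_score(c)) for label, c in all_scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


for label, mean in rank_answers(scores):
    print(f"{label}: {mean:.2f}")
# prints:
# Answer B (Claude): 8.50
# Answer A (GPT-4): 7.75
```

A single averaged number hides the trade-off visible in the raw scores (A is deeper, B is more actionable), so treat it as a tiebreaker rather than a replacement for the disagreement and blind-spot analysis.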
Multi-model comparison exploits the fact that different AI models have different strengths, training biases, and failure modes. Having one model judge the others' outputs against explicit criteria surfaces contradictions and gaps that a single answer would hide, so the synthesized result is usually more balanced and better vetted than any one model's output.
Use for high-stakes decisions where you want maximum confidence — strategic recommendations, technical architecture choices, or creative directions. Ideal when you have access to multiple AI models and the question is important enough to warrant cross-validation.
You receive a synthesized best answer that combines the strongest elements from multiple model outputs, with a clear rationale for why specific reasoning was selected. Blind spots from individual models get caught and corrected.