AI Mastery · Intermediate

AI Output Quality Grader — Evaluate and Score Any AI Response Objectively

Create a rubric to evaluate AI outputs objectively — useful for comparing models, testing prompts, or building quality assurance into AI workflows.

Copy & Paste this prompt
You are a quality assurance specialist for AI outputs. Help me build a grading system for evaluating AI responses.

What I'm Evaluating:
- Task type: [Writing / Analysis / Code / Creative / Research / Customer service / etc.]
- The prompt that generated the output: [PASTE IT]
- The AI's response: [PASTE IT or describe it]
- What 'excellent' looks like for this task: [DESCRIBE YOUR GOLD STANDARD]
- What I'll use this rubric for: [Comparing models / Testing prompts / QA process / Training]

Build my evaluation system:

**1. SCORING RUBRIC**
Create a task-specific rubric:
| Dimension | Weight | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
|-----------|--------|----------|--------------|---------------|
| Accuracy | | | | |
| Completeness | | | | |
| Relevance | | | | |
| Clarity | | | | |
| [Task-specific] | | | | |

**2. GRADE THIS OUTPUT**
Apply the rubric to the response I provided:
| Dimension | Score | Reasoning |
|-----------|-------|-----------|

Overall weighted score: X/5

**3. SPECIFIC FEEDBACK**
- What this response does WELL (be specific)
- What's MISSING or wrong (be specific)
- What would make this a 5/5 (exact improvements needed)

**4. REWRITTEN 'GOLD STANDARD'**
Show me what a perfect response would look like for this prompt.

**5. PROMPT IMPROVEMENT**
If the output scored low — what prompt changes would improve results?
- Changes to wording
- Additional context needed
- Constraints to add
- Format specifications

**6. COMPARISON FRAMEWORK**
If comparing multiple AI responses to the same prompt:
- Side-by-side scoring table
- Winner by dimension
- Overall recommendation
- When to use each model/approach

**7. AUTOMATED QA TEMPLATE**
A reusable template I can use to quickly evaluate any AI output:
- 5 yes/no questions for quick pass/fail
- Scoring shorthand for detailed evaluation
- Red flags that mean 'do not use this output'

Be brutally honest in grading. I want to improve, not feel good about mediocre outputs.
#AI-evaluation #quality-assurance #prompt-testing #rubric #benchmarking

Works with

ChatGPT · Claude · Any

💡 Pro Tips

  • Grade your AI outputs before using them — 60% of first-draft AI content needs revision
  • Ask follow-up: 'Rewrite this output to score 5/5 on the rubric you just created'
  • Use this rubric to A/B test different prompts for the same task — data beats intuition
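
The A/B testing tip can be made concrete with a few lines of Python. This is only a sketch: the scores are made-up placeholders for overall rubric scores you would collect yourself, and the `compare_prompts` name and the `min_gap` threshold are assumptions for illustration, not part of the prompt.

```python
from statistics import mean


def compare_prompts(scores_a, scores_b, min_gap=0.3):
    """Compare two prompts by the mean overall rubric score of their outputs.

    min_gap is an arbitrary, illustrative threshold: smaller differences
    are reported as "no clear winner" rather than as a win.
    """
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    gap = mean_b - mean_a
    if abs(gap) < min_gap:
        verdict = "no clear winner"
    else:
        verdict = "B" if gap > 0 else "A"
    return {"mean_a": mean_a, "mean_b": mean_b, "verdict": verdict}


# Hypothetical overall scores (1-5) for four outputs from each prompt.
prompt_a_scores = [2.7, 3.1, 2.9, 3.0]
prompt_b_scores = [3.8, 3.5, 4.1, 3.6]

result = compare_prompts(prompt_a_scores, prompt_b_scores)
print(f"A: {result['mean_a']:.2f}  B: {result['mean_b']:.2f}  -> {result['verdict']}")
```

Scoring several outputs per prompt, rather than one, keeps a single lucky or unlucky response from deciding the comparison.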

✨ Example Output

📊 RUBRIC FOR: Marketing Email Copy

| Dimension | Weight | 1 (Poor) | 3 (Adequate) | 5 (Excellent) |
|-----------|--------|----------|----------|---------------|
| Hook/Subject | 25% | Generic, no curiosity | Decent, would open | Irresistible, must-open |
| Relevance | 20% | Generic blast | Segment-aware | Personally resonant |
| CTA Clarity | 20% | Buried or vague | Clear but boring | Compelling, specific |
| Tone/Voice | 15% | Corporate/AI-sounding | Professional | Sounds like a human I'd reply to |
| Brevity | 10% | Wall of text | Reasonable length | Every word earns its place |
| Persuasion | 10% | Lists features | Shows benefits | Creates urgency + desire |

📝 GRADING YOUR OUTPUT:
| Dimension | Score | Reasoning |
|-----------|-------|-----------|
| Hook/Subject | 2/5 | Opens with 'I hope this finds you well': instant delete |
| Relevance | 3/5 | Mentions their industry but not their specific pain |
| CTA Clarity | 4/5 | Clear ask with a specific time |
| Tone/Voice | 2/5 | Too formal; reads like AI wrote it |
| Brevity | 3/5 | Could cut 40% without losing meaning |
| Persuasion | 2/5 | All features, no emotion or urgency |

🏆 OVERALL: 2.7/5 (needs significant revision)

✅ QUICK QA CHECKLIST:
- [ ] Would I open this email? (honest answer)
- [ ] Is there ONE clear action to take?
- [ ] Is it specific to this recipient, rather than something that could apply to anyone?
- [ ] Does it sound like a human wrote it?
- [ ] Would I forward this to a friend?
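
The overall score in the example is a weighted average of the per-dimension scores. A minimal Python sketch of that arithmetic, using the weights and scores from the two tables above (the `weighted_score` helper is an illustrative name, not something the prompt defines):

```python
# Weights (in percent) and scores copied from the example rubric above.
WEIGHTS = {
    "Hook/Subject": 25,
    "Relevance": 20,
    "CTA Clarity": 20,
    "Tone/Voice": 15,
    "Brevity": 10,
    "Persuasion": 10,
}
SCORES = {
    "Hook/Subject": 2,
    "Relevance": 3,
    "CTA Clarity": 4,
    "Tone/Voice": 2,
    "Brevity": 3,
    "Persuasion": 2,
}


def weighted_score(weights, scores):
    """Return the weighted average, on the same 1-5 scale as the scores."""
    if set(weights) != set(scores):
        raise ValueError("weights and scores must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[dim] * scores[dim] for dim in weights) / total_weight


overall = weighted_score(WEIGHTS, SCORES)
print(f"Overall: {overall:.1f}/5")  # prints "Overall: 2.7/5"
```

Dividing by the total weight means the helper still works if your weights do not add up to exactly 100.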

🧠 Why This Works

Without a rubric, evaluating AI output is subjective — 'this feels good' isn't a system. This prompt creates objective, repeatable scoring criteria specific to YOUR task, then applies them. It's the difference between 'I think this is okay' and 'This scores 3.2/5 on my rubric, and here's exactly what would make it a 5.'

📅 When to Use This Prompt

When testing different prompts to find what works best, when comparing AI models for a specific task, when building QA processes for AI-generated content, or when you need to explain to stakeholders why one AI output is better than another.
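
When comparing models, a "winner by dimension" view is easy to derive once each response has been graded. The sketch below assumes two placeholder models (`model_a`, `model_b`) with invented scores on the four default rubric dimensions; the `winners_by_dimension` helper is hypothetical.

```python
def winners_by_dimension(scores_by_model):
    """Map each dimension to the model(s) with the top score; ties keep all."""
    # Assumes every model was graded on the same set of dimensions.
    dimensions = list(next(iter(scores_by_model.values())))
    winners = {}
    for dim in dimensions:
        best = max(scores[dim] for scores in scores_by_model.values())
        winners[dim] = sorted(
            model for model, scores in scores_by_model.items() if scores[dim] == best
        )
    return winners


# Invented 1-5 scores for two placeholder models on the same prompt.
SCORES_BY_MODEL = {
    "model_a": {"Accuracy": 4, "Completeness": 3, "Relevance": 4, "Clarity": 5},
    "model_b": {"Accuracy": 5, "Completeness": 3, "Relevance": 3, "Clarity": 4},
}

for dim, models in winners_by_dimension(SCORES_BY_MODEL).items():
    print(f"{dim}: {' / '.join(models)}")
```

Reporting ties explicitly, instead of picking one model arbitrarily, makes it clearer to stakeholders where the models genuinely differ.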

🎯 What You'll Get

A task-specific rubric with weighted dimensions, objective scoring of your AI output, specific improvement suggestions, and a reusable QA template. You'll stop accepting mediocre AI outputs and start systematically improving them.
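
If you want to apply the reusable QA template inside a workflow, one possible shape is a small pass/fail gate. Everything specific here is an assumption for illustration: the `quick_qa` name, the rule that any red flag fails the output, the threshold of four "yes" answers out of five, and the sample answers and red flags.

```python
def quick_qa(checklist, red_flags, min_yes=4):
    """Return "pass" or "fail" for one AI output.

    Any raised red flag fails the output outright; otherwise it passes
    only if at least min_yes checklist questions were answered yes.
    """
    if any(red_flags.values()):
        return "fail"
    yes_count = sum(1 for answer in checklist.values() if answer)
    return "pass" if yes_count >= min_yes else "fail"


# Illustrative answers for a marketing email draft (True means yes).
checklist = {
    "Would I open this email?": False,
    "Is there one clear action to take?": True,
    "Is it specific to this recipient?": False,
    "Does it sound like a human wrote it?": False,
    "Would I forward this to a friend?": False,
}
# Hypothetical red flags; replace with the ones the prompt generates for your task.
red_flags = {
    "Contains an unverified factual claim": False,
    "Ignores a stated constraint": False,
}

print(quick_qa(checklist, red_flags))  # prints "fail"
```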
