Create a Bulletproof Data Cleaning Checklist

Systematically find and fix messy data — missing values, duplicates, outliers, and format issues

Copy & Paste this prompt
I have a dataset that needs cleaning before analysis.

Dataset info:
- Source: [WHERE THE DATA CAME FROM — CRM, survey, API, manual entry, CSV export]
- Size: [ROWS] rows × [COLUMNS] columns
- Key columns: [LIST THE IMPORTANT ONES]
- Known issues: [ANY PROBLEMS YOU ALREADY KNOW ABOUT]
- Tool I am using: [EXCEL / PYTHON / R / SQL / GOOGLE SHEETS]

Create a complete data cleaning plan:

1. INITIAL ASSESSMENT
   - What to check first (shape, types, head/tail)
   - Exact commands/formulas for my tool
   - Expected vs actual column types

2. MISSING DATA AUDIT
   - How to find missing/null/blank values per column
   - For each column: recommended handling strategy
     (drop / fill with mean / fill with mode / flag / interpolate)
   - How to decide which strategy fits

3. DUPLICATE DETECTION
   - How to find exact duplicates
   - How to find near-duplicates (fuzzy matching)
   - Decision framework: keep first / keep latest / merge

4. OUTLIER ANALYSIS
   - Statistical methods to detect outliers (IQR, Z-score)
   - For each numeric column: what counts as suspicious
   - When to remove vs. investigate vs. keep outliers

5. FORMAT STANDARDIZATION
   - Date formats → one standard
   - Text → consistent casing, trim whitespace
   - Categories → consolidate similar values
   - Numbers → handle currency symbols, commas, units

6. VALIDATION RULES
   - Cross-column consistency checks
   - Range checks (age > 0, percentage ≤ 100)
   - Referential integrity (do all IDs exist?)

7. DOCUMENTATION
   - Cleaning log template: what was changed, why, how many rows affected
   - Before/after summary statistics

Give me the actual code/formulas, not just concepts.

Works with

ChatGPT, Claude, Gemini

💡 Pro Tips

  • Always make a backup before cleaning — work on a copy, never the original
  • Clean columns in dependency order — fix IDs before fixing foreign key references
  • Document every cleaning step — future you will forget why you removed those 47 rows
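
The backup and documentation tips can be combined in a small helper. The following is a minimal sketch using only the Python standard library; the `CleaningLog` class, the sample rows, and the step names are hypothetical and only illustrate the idea of logging what changed, why, and how many rows were affected.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class CleaningLog:
    """Records what each cleaning step changed and why."""
    entries: list = field(default_factory=list)

    def apply(self, rows, step, reason, func):
        """Run func on rows, record the row counts, and return the result."""
        result = func(rows)
        self.entries.append({
            "step": step,
            "reason": reason,
            "rows_before": len(rows),
            "rows_after": len(result),
            "rows_removed": len(rows) - len(result),
        })
        return result


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def drop_missing_total(rows):
    """Remove rows whose total is blank."""
    return [row for row in rows if row["total"] != ""]


# Hypothetical sample data standing in for the original export.
original = [
    {"order_id": "1001", "total": "25.00"},
    {"order_id": "1001", "total": "25.00"},  # exact duplicate
    {"order_id": "1002", "total": ""},       # missing total
]

working = copy.deepcopy(original)  # work on a copy, never the original
log = CleaningLog()
working = log.apply(working, "drop_exact_duplicates",
                    "same order exported twice", drop_exact_duplicates)
working = log.apply(working, "drop_missing_total",
                    "total is required for revenue analysis", drop_missing_total)

for entry in log.entries:
    print(entry)
```

Because every step goes through `log.apply`, the log doubles as the "what was changed, why, how many rows affected" record, and `original` stays untouched for before/after comparisons.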

✨ Example Output

Dataset: Customer orders export (CSV from Shopify)
Tool: Python (pandas)

INITIAL ASSESSMENT:
import pandas as pd

df = pd.read_csv("orders.csv")
print(df.shape)           # → (5712, 23)
print(df.dtypes)          # → found 3 columns typed as object that should be numeric
print(df.isnull().sum())  # → 6 columns have missing values

MISSING DATA AUDIT:
| Column | Missing | % | Strategy |
| email | 23 | 0.4% | Flag — likely guest checkout |
| shipping_address | 5 | 0.09% | Drop — cannot ship without |
| phone | 1,247 | 22% | Keep as-is — optional field |
| discount_code | 3,891 | 68% | Fill with "none" — absence is meaningful |
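
An audit table like the one above could be produced along these lines. This is a dependency-free sketch (standard library rather than pandas) on a tiny inline sample; the sample rows, the `MISSING_TOKENS` set, and the `email_missing` flag column are assumptions made for illustration, not part of the real export.

```python
import csv
import io

# Hypothetical stand-in for orders.csv; column names follow the table above.
SAMPLE_CSV = """order_id,email,phone,discount_code
1001,a@example.com,,SAVE10
1002,,555-0100,
1003,c@example.com,,
1004,d@example.com,,
"""

# Assumed markers for "missing"; adjust to whatever the source system emits.
MISSING_TOKENS = {"", "na", "n/a", "null", "nan"}


def is_missing(value):
    """True if the cell is empty or holds one of the assumed missing markers."""
    return value is None or value.strip().lower() in MISSING_TOKENS


def missing_audit(rows):
    """Return {column: (missing_count, missing_percent)} for every column."""
    total = len(rows)
    audit = {}
    for column in rows[0].keys():
        count = sum(1 for row in rows if is_missing(row[column]))
        audit[column] = (count, round(100 * count / total, 1))
    return audit


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
audit = missing_audit(rows)
for column, (count, percent) in audit.items():
    print(f"{column}: {count} missing ({percent}%)")

# Apply two of the strategies from the table: flag and fill.
for row in rows:
    row["email_missing"] = is_missing(row["email"])  # flag, keep the row
    if is_missing(row["discount_code"]):
        row["discount_code"] = "none"                # absence is meaningful
```

The audit runs before any filling, so the counts describe the raw data; the strategy per column is still a judgment call that the code only records, not decides.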

DUPLICATE DETECTION:
Found 12 exact duplicates → removed (kept first occurrence)
Found 34 potential near-duplicates (same email + date + total) → flagged for manual review
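
The two checks above might look roughly like this. It is a standard-library sketch on hypothetical rows; the field names and the email normalization (trim plus lowercase) are assumptions, and the near-duplicate rule is the "same email + date + total" key mentioned above rather than true fuzzy matching.

```python
from collections import defaultdict

# Hypothetical rows; field names are assumptions for illustration.
orders = [
    {"order_id": "1", "email": "ann@example.com", "date": "2024-03-01", "total": "40.00"},
    {"order_id": "1", "email": "ann@example.com", "date": "2024-03-01", "total": "40.00"},
    {"order_id": "2", "email": " Ann@Example.com", "date": "2024-03-01", "total": "40.00"},
    {"order_id": "3", "email": "bob@example.com", "date": "2024-03-02", "total": "15.50"},
]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def flag_near_duplicates(rows):
    """Group order IDs sharing normalized email + date + total; return groups of 2+."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["email"].strip().lower(), row["date"], row["total"])
        groups[key].append(row["order_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


deduped = drop_exact_duplicates(orders)
to_review = flag_near_duplicates(deduped)
print(len(orders) - len(deduped), "exact duplicates removed")
print("flag for manual review:", to_review)
```

Near-duplicates are only flagged, not removed, since two identical-looking orders on the same day can be legitimate.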

OUTLIERS:
order_total: IQR method flagged 89 orders over ,340 → investigated, 3 were data entry errors (extra zero), rest legitimate bulk orders
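
For reference, the IQR rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. A minimal standard-library sketch follows; the sample totals are made up, the multiplier `k=1.5` is the conventional default, and quartiles are computed with `statistics.quantiles`, whose default method may give slightly different cut points than pandas or Excel.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical order totals; the last one looks like an extra-zero typo.
order_totals = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 5000.0]

lower, upper = iqr_bounds(order_totals)
suspicious = flag_outliers(order_totals)
print(f"fences: {lower} to {upper}")
print("investigate:", suspicious)
```

Flagged values are candidates for investigation, not automatic deletion: as in the example above, most may turn out to be legitimate bulk orders.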

🧠 Why This Works

This prompt walks through the main categories of data quality issues (completeness, uniqueness, consistency, accuracy, and validity) in a fixed order, so no category gets skipped. It asks for reproducible cleaning steps rather than ad-hoc fixes, which makes your data pipeline easier to audit and rerun.

📅 When to Use This Prompt

Use before any analysis, model training, or reporting when working with a new dataset or after data ingestion from external sources. Critical when merging datasets from multiple systems, inheriting legacy databases, or preparing data for machine learning where garbage-in-garbage-out applies.
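
When merging data from multiple systems, two quick checks tend to catch most problems: IDs that reference a record that does not exist, and values outside their allowed range. Below is a small standard-library sketch; the `customers` and `orders` rows, the column names, and the rules are all hypothetical.

```python
# Hypothetical tables from two systems; column names are assumptions.
customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]
orders = [
    {"order_id": "1", "customer_id": "C1", "quantity": 2, "discount_pct": 10},
    {"order_id": "2", "customer_id": "C9", "quantity": 1, "discount_pct": 0},
    {"order_id": "3", "customer_id": "C2", "quantity": 0, "discount_pct": 120},
]


def orphaned_ids(child_rows, parent_rows, key):
    """Return key values used in child_rows that do not exist in parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)


def range_violations(rows, rules):
    """rules maps column -> predicate; return (order_id, column) pairs that fail."""
    return [
        (row["order_id"], column)
        for row in rows
        for column, is_valid in rules.items()
        if not is_valid(row[column])
    ]


rules = {
    "quantity": lambda q: q > 0,
    "discount_pct": lambda p: 0 <= p <= 100,
}

orphans = orphaned_ids(orders, customers, "customer_id")
violations = range_violations(orders, rules)
print("customer IDs with no matching customer:", orphans)
print("range violations:", violations)
```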

🎯 What You'll Get

You get a step-by-step checklist covering initial assessment, missing value strategies, duplicate detection rules, outlier handling, format standardization, validation rules, and a cleaning log template, each with code or formulas for the tool you named (Excel, Python, R, SQL, or Google Sheets).
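
As an illustration of what the format standardization step can look like in Python, here is a hedged standard-library sketch. The accepted date formats, the `CATEGORY_MAP` entries, and the assumption that slash dates are month-first are all hypothetical and would need to match your actual data.

```python
from datetime import datetime

# Assumed input formats; slash dates are treated as month/day/year here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

# Hypothetical mapping of spelling variants to one canonical category.
CATEGORY_MAP = {"usa": "US", "u.s.": "US", "united states": "US"}


def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    text = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable: leave for manual review


def parse_amount(text):
    """Strip a dollar sign and thousands separators, then convert to float."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)


def normalize_category(text):
    """Collapse whitespace and casing, then map known variants to one value."""
    key = " ".join(text.split()).lower()
    return CATEGORY_MAP.get(key, text.strip())


print(to_iso_date("03/01/2024"))
print(parse_amount("$1,234.50"))
print(normalize_category("  United   States "))
```

Returning `None` for unparseable dates, rather than guessing, keeps ambiguous values visible so they can be reviewed and recorded in the cleaning log.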
