How do I use the "Handle Missing Data Values" prompt?

Copy the prompt template, fill in the variables (DATASET_TYPE, DATASET_SIZE, MISSING_PERCENTAGE, MISSING_VARIABLES, ANALYSIS_GOAL, DOMAIN_CONTEXT), and paste it into ChatGPT, Claude, or Gemini.

What AI tools work with this data prompt?

This prompt works with ChatGPT, Claude, and Gemini. Fill in the variables, then copy and paste the template into any of these AI assistants.

Tips for getting the best results?

Always analyze the missingness pattern before choosing an imputation method - different patterns require different approaches
For prediction tasks, consider using algorithms that handle missing values natively (like XGBoost) as an alternative to imputation
Validate your imputation by comparing statistical properties of imputed vs. original data distributions
Document your missing data assumptions and perform sensitivity analyses to test how different approaches affect your final results
For critical analyses, consider multiple imputation techniques that account for uncertainty in the imputed values

data intermediate

Handle Missing Data Values

Get AI-powered strategies for handling missing data values in datasets. Learn imputation techniques, deletion methods, and best practices.

Works with: ChatGPT, Claude, Gemini

Prompt Template

You are an expert data scientist specializing in data preprocessing and missing data analysis. I need comprehensive guidance on handling missing data values in my dataset.

**Dataset Information:**
- Dataset type: [DATASET_TYPE]
- Size: [DATASET_SIZE]
- Missing data percentage: [MISSING_PERCENTAGE]
- Key variables with missing values: [MISSING_VARIABLES]
- Analysis goal: [ANALYSIS_GOAL]
- Domain context: [DOMAIN_CONTEXT]

Please provide a detailed strategy that includes:

1. **Missing Data Pattern Analysis**: Assess whether the data is Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) based on the context provided.

2. **Recommended Approaches**: For each variable with missing data, suggest the most appropriate handling method from:
   - Deletion methods (listwise, pairwise)
   - Imputation techniques (mean/median/mode, regression, KNN, multiple imputation, model-based)
   - Advanced methods (ML-based imputation, interpolation)

3. **Implementation Steps**: Provide step-by-step instructions for implementing your recommended approach, including any statistical tests or validation methods.

4. **Impact Assessment**: Explain how each approach might affect the analysis results and what assumptions are being made.

5. **Code Examples**: Include practical code snippets in Python/R for implementing the recommended solutions.

6. **Quality Checks**: Suggest methods to validate the imputation quality and assess potential bias introduced.

Consider the trade-offs between data completeness, accuracy, and the specific requirements of the intended analysis.

Variables to Customize

[DATASET_TYPE]

Type of dataset (e.g., survey data, time series, experimental data, observational study)

Example: customer survey data with demographic and satisfaction scores

[DATASET_SIZE]

Number of rows and columns in the dataset

Example: 5,000 rows, 25 columns

[MISSING_PERCENTAGE]

Overall percentage of missing data and per-variable percentages

Example: 15% overall, with income (30%), age (5%), satisfaction_score (20%)

[MISSING_VARIABLES]

Specific variables/columns that have missing values

Example: income, age, education_level, satisfaction_score, purchase_frequency

[ANALYSIS_GOAL]

The intended analysis or modeling objective

Example: predict customer churn using logistic regression and random forest models

[DOMAIN_CONTEXT]

Business or research domain and any relevant context

Example: e-commerce platform analyzing customer behavior for retention strategies
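
For anyone who prefers to fill in the placeholders with a script rather than by hand, the substitution can be sketched as below. This is a minimal illustration, not part of the original prompt page: the `fill_template` helper is hypothetical, and the template string is a shortened excerpt of the full prompt.

```python
import re


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each [NAME] token with values[NAME]; raise if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for [{name}]")
        return values[name]

    # Placeholders are upper-case names in square brackets, e.g. [DATASET_TYPE]
    return re.sub(r"\[([A-Z_]+)\]", substitute, template)


# Shortened excerpt of the prompt template, for illustration only
template = (
    "- Dataset type: [DATASET_TYPE]\n"
    "- Size: [DATASET_SIZE]\n"
    "- Missing data percentage: [MISSING_PERCENTAGE]"
)

# Values taken from the examples listed for each variable
values = {
    "DATASET_TYPE": "customer survey data with demographic and satisfaction scores",
    "DATASET_SIZE": "5,000 rows, 25 columns",
    "MISSING_PERCENTAGE": "15% overall, with income (30%), age (5%), satisfaction_score (20%)",
}

filled = fill_template(template, values)
print(filled)
```

Raising on a missing value, rather than leaving the token in place, is a deliberate choice here: an unfilled `[PLACEHOLDER]` sent to the assistant tends to produce generic advice.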

Example Output

Based on your customer survey dataset, here's my comprehensive missing data strategy:

**1. Missing Data Pattern Analysis:**
The income variable (30% missing) likely follows a MAR pattern, with missingness depending on observed age/education demographics. Age (5% missing) appears MCAR, consistent with random survey non-response. Satisfaction scores (20% missing) might be MNAR, since dissatisfied customers are less likely to respond.

**2. Recommended Approaches:**
- **Income**: Multiple imputation using age, education, and purchase behavior as predictors
- **Age**: Median imputation within demographic segments
- **Satisfaction_score**: Model-based imputation with sensitivity analysis for MNAR assumption

**3. Implementation Steps:**
```python
# enable_iterative_imputer must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Iterative (MICE-style) imputation for income, with age and education as predictors.
# Assumes education is already numerically encoded.
predictors = ['income', 'age', 'education']
iterative_imputer = IterativeImputer(random_state=42)
imputed = iterative_imputer.fit_transform(df[predictors])
df['income'] = imputed[:, 0]  # keep only the imputed income column
```

**4. Impact Assessment:**
Multiple imputation preserves uncertainty and relationships between variables, which is crucial for churn prediction accuracy. Provided the MAR assumption holds, this approach maintains statistical power while introducing minimal bias.

**5. Quality Checks:**
- Compare imputed vs. observed value distributions
- Cross-validation with different imputation methods
- Sensitivity analysis on final model performance
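
A note on the example output above, separate from the sample response itself: a single `IterativeImputer` run produces one completed dataset, whereas multiple imputation draws several. The sketch below shows one way this could be done with scikit-learn by setting `sample_posterior=True` and varying the seed. The synthetic data, the column order (income, age, education) and the choice of five imputations are all assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic stand-in data; columns are income, age, education (numeric codes)
n = 200
age = rng.normal(40, 10, n)
education = rng.integers(1, 5, n).astype(float)
income = 1000 * age + 5000 * education + rng.normal(0, 5000, n)
X = np.column_stack([income, age, education])

# Remove roughly 30% of the income values
mask = rng.random(n) < 0.3
X_missing = X.copy()
X_missing[mask, 0] = np.nan

# Draw m completed datasets; sample_posterior=True makes each draw stochastic
m = 5
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_missing)
    for seed in range(m)
]

# Pool a simple estimate (mean income) across the imputed datasets
income_means = np.array([c[:, 0].mean() for c in completed])
pooled_mean = income_means.mean()
between_imputation_var = income_means.var(ddof=1)

print(f"Pooled mean income: {pooled_mean:.0f}")
print(f"Between-imputation variance of the mean: {between_imputation_var:.1f}")
```

The spread of the estimate across the completed datasets is what reflects imputation uncertainty. A full analysis would fit the downstream model on each dataset and combine within- and between-imputation variance, which this sketch does not do.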

Pro Tips for Best Results

Always analyze the missingness pattern before choosing an imputation method - different patterns require different approaches
For prediction tasks, consider using algorithms that handle missing values natively (like XGBoost) as an alternative to imputation
Validate your imputation by comparing statistical properties of imputed vs. original data distributions
Document your missing data assumptions and perform sensitivity analyses to test how different approaches affect your final results
For critical analyses, consider multiple imputation techniques that account for uncertainty in the imputed values
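
The first and third tips can be sketched in a few lines of pandas. The example below is an illustration under stated assumptions: the data is synthetic, the `age` and `income` column names are hypothetical, and median imputation stands in for whatever method is actually chosen.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(40, 12, n),
    "income": rng.lognormal(10.5, 0.5, n),
})

# Synthetic MAR-style mechanism: older respondents skip the income question more often
p_missing = 1 / (1 + np.exp(-(df["age"] - 50) / 5))
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# 1. Per-column missing percentage (useful for filling in [MISSING_PERCENTAGE])
missing_pct = df.isna().mean().mul(100).round(1)

# 2. Pattern check: does income missingness relate to an observed variable?
status = df["income"].isna().map({True: "missing", False: "observed"})
age_by_status = df["age"].groupby(status).mean()

# 3. Validation: compare summary statistics before and after imputation
imputed = df["income"].fillna(df["income"].median())
summary = pd.DataFrame({
    "observed": df["income"].dropna().describe(),
    "after_median_imputation": imputed.describe(),
})

print(missing_pct)
print(age_by_status)
print(summary.loc[["mean", "std", "50%"]])
```

A clear gap in mean age between the two groups suggests the data is not MCAR. The drop in standard deviation after median imputation is the kind of distortion that comparing distributions is meant to reveal.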