
Create Advanced Outlier Detection Script

Generate production-ready Python scripts for detecting outliers in datasets using multiple statistical methods and visualizations.

Works with: ChatGPT, Claude, Gemini

Prompt Template

Create a comprehensive Python script for outlier detection in datasets. The script should be production-ready and include multiple detection methods.

**Dataset Information:**
- Dataset type: [DATASET_TYPE]
- Target columns for analysis: [TARGET_COLUMNS]
- Dataset size: [DATASET_SIZE]
- File format: [FILE_FORMAT]

**Requirements:**
1. Implement at least 3 different outlier detection methods: IQR, Z-score, and Isolation Forest
2. Include statistical summary and visualization of detected outliers
3. Provide options to either remove, cap, or flag outliers
4. Add data validation and error handling
5. Generate a comprehensive report of findings
6. Include configurable thresholds and parameters
7. Add logging functionality for tracking changes
8. Create before/after comparison visualizations

**Technical Specifications:**
- Use pandas, numpy, matplotlib, seaborn, and sklearn
- Include docstrings and type hints
- Make the script modular with separate functions
- Add command-line interface support
- Include unit tests for key functions
- Handle missing values appropriately
- Support both univariate and multivariate outlier detection

The script should be well-documented, efficient, and suitable for integration into data preprocessing pipelines. Include example usage and parameter explanations.

Variables to Customize

[DATASET_TYPE]

Type of dataset being analyzed

Example: customer transaction data with numerical features

[TARGET_COLUMNS]

Specific columns to analyze for outliers

Example: transaction_amount, account_balance, monthly_spending

[DATASET_SIZE]

Approximate size of the dataset

Example: 50,000 rows, 15 columns

[FILE_FORMAT]

Format of the input data file

Example: CSV

Example Output

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from scipy import stats
import logging
from typing import Tuple, List, Dict


class OutlierDetector:
    def __init__(self, contamination: float = 0.1, z_threshold: float = 3,
                 iqr_multiplier: float = 1.5):
        self.contamination = contamination
        self.z_threshold = z_threshold
        self.iqr_multiplier = iqr_multiplier
        self.outlier_indices = {}

    def detect_iqr_outliers(self, data: pd.Series) -> List[int]:
        """Detect outliers using Interquartile Range method"""
        Q1 = data.quantile(0.25)
        Q3 = data.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - self.iqr_multiplier * IQR
        upper_bound = Q3 + self.iqr_multiplier * IQR
        # Comparisons with NaN evaluate to False, so missing values are never flagged.
        return data[(data < lower_bound) | (data > upper_bound)].index.tolist()

    def detect_zscore_outliers(self, data: pd.Series) -> List[int]:
        """Detect outliers using Z-score method"""
        # Score only the non-missing values and index that same cleaned series,
        # so the boolean mask and the data always have equal length.
        clean = data.dropna()
        z_scores = np.abs(stats.zscore(clean.to_numpy()))
        return clean.index[z_scores > self.z_threshold].tolist()
```
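The example output stops after the IQR and Z-score methods. A minimal sketch of how the third method the prompt asks for, Isolation Forest, might be written for multivariate detection follows. The function name `detect_isolation_forest_outliers`, its parameters, and the choice to drop rows with missing values are illustrative assumptions, not part of the original output.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def detect_isolation_forest_outliers(
    df: pd.DataFrame,
    columns: list,
    contamination: float = 0.1,
    random_state: int = 42,
) -> list:
    """Return index labels of rows that Isolation Forest marks as outliers.

    Hypothetical helper: rows with missing values in `columns` are skipped
    (an assumption; imputing them first is another reasonable choice).
    """
    # IsolationForest does not accept NaN, so keep only complete rows.
    features = df[columns].dropna()
    model = IsolationForest(contamination=contamination, random_state=random_state)
    labels = model.fit_predict(features)  # -1 marks an outlier, 1 an inlier
    return features.index[labels == -1].tolist()


if __name__ == "__main__":
    # Synthetic demo data: 200 ordinary rows plus one extreme row.
    rng = np.random.default_rng(0)
    demo = pd.DataFrame({
        "transaction_amount": rng.normal(100, 10, 200),
        "account_balance": rng.normal(5000, 300, 200),
    })
    demo.loc[200] = [10_000.0, 500_000.0]
    flagged = detect_isolation_forest_outliers(
        demo, ["transaction_amount", "account_balance"], contamination=0.05
    )
    print(f"{len(flagged)} rows flagged")
```

Because `contamination` sets the share of rows the model labels as outliers, the method always flags roughly that fraction, even on clean data. The returned labels are best treated as candidates for review.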

Pro Tips for Best Results

  • Test the script on a sample of your data first to optimize thresholds and parameters
  • Consider domain expertise when choosing detection methods - some outliers might be valid extreme values
  • Visualize outliers before removal to understand their distribution and potential business impact
  • Save outlier detection results and parameters for reproducibility and audit trails
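  • Tune hyperparameters such as the Isolation Forest contamination rate with cross-validation when you have labeled outliers to score against; the method itself is unsupervised, so without labels rely on sampling and manual review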
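The first tip, trying thresholds on a sample before committing to one, can be sketched as a small sweep over IQR multipliers. The helper name `iqr_outlier_counts` and the multiplier values are assumptions chosen for illustration.

```python
import pandas as pd


def iqr_outlier_counts(data: pd.Series, multipliers=(1.5, 2.0, 3.0)) -> dict:
    """Count how many values fall outside the IQR fences for each multiplier."""
    clean = data.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    counts = {}
    for k in multipliers:
        # A value is flagged when it lies below Q1 - k*IQR or above Q3 + k*IQR.
        mask = (clean < q1 - k * iqr) | (clean > q3 + k * iqr)
        counts[k] = int(mask.sum())
    return counts


if __name__ == "__main__":
    # Hypothetical sample: values 1..100 plus one extreme entry.
    sample = pd.Series(list(range(1, 101)) + [1000])
    for k, n in iqr_outlier_counts(sample).items():
        print(f"multiplier={k}: {n} flagged")
```

A larger multiplier widens the fences, so the count can only stay the same or fall. Comparing counts across multipliers on a sample gives a quick sense of how sensitive the result is to the threshold.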
