Metadata-Version: 2.4
Name: mixsad-anomaly-detection
Version: 0.1.0
Summary: An implementation of the MixSAD algorithm for anomaly detection in mixed-feature data.
Author-email: Your Name <you@example.com>
Project-URL: Homepage, https://github.com/your-username/mixsad
Project-URL: Issues, https://github.com/your-username/mixsad/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scikit-learn

MixSAD: High-Performance Fraud Detection
This project implements a supervised learning pipeline for fraud detection. It was originally based on the unsupervised MixSAD algorithm and has since been reworked to use a direct supervised approach, which lets it achieve high accuracy and recall on fraud detection tasks.

The current implementation is optimized to run on the Kaggle Credit Card Fraud Detection dataset.

Core Approach: Supervised Prediction
The key to the model's high performance is its shift from unsupervised anomaly detection to a direct supervised classification strategy.

Supervised Feature Engineering: The pipeline trains a LogisticRegression model on the labeled data. Its main purpose is to produce a single predictive feature, the fraud_score, which is the model's estimated probability that a transaction is fraudulent.

Threshold-Based Prediction: Instead of using a complex secondary model, predictions are made by applying a simple probability threshold to the fraud_score. Any transaction with a score greater than or equal to the threshold is classified as fraud.

This direct approach is effective and transparent, because a single threshold value controls the model's sensitivity to fraud.
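The two steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the package's own code: the synthetic data, the variable names, and the 0.5 threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the Kaggle dataset):
# 200 legitimate rows centred at 0 and 20 fraudulent rows centred at 3.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(3.0, 1.0, size=(20, 2)),
])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(20, dtype=int)])

# Step 1: fit a logistic regression on the labeled data.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Step 2: the fraud score is the predicted probability of the fraud class (label 1).
fraud_score = model.predict_proba(X)[:, 1]

# Step 3: classify a row as fraud when its score is >= the threshold.
threshold = 0.5  # illustrative value
predictions = (fraud_score >= threshold).astype(int)
```

For brevity the sketch scores the same rows it was trained on. In practice the scores would normally be evaluated on a held-out split.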

Project Structure
mixsad/: The main package source code, including the pipeline, preprocessor, feature_engineer, and prediction_builder.

examples/: Contains the run_on_kaggle_data.py script demonstrating how to use the package.

pyproject.toml: The package configuration file.

README.md: This file.

Setup and Installation
Local Setup

Clone the repository and navigate into it.

Create a virtual environment with python -m venv venv, then activate it.

Install requirements: pip install -r requirements.txt

Install the package in editable mode: pip install -e .

Usage
Download the Dataset:

Download the "Credit Card Fraud Detection Dataset" from Kaggle.

Rename the file to credit_card_fraud.csv and place it in the project's root directory.
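Before running the example, it can help to confirm that the file is in place and contains both classes. The helper below is a hypothetical sanity check that uses only the standard library. It assumes the label column is named Class, as in the commonly used Kaggle credit card fraud CSV; adjust label_column if your file differs.

```python
import csv
from collections import Counter


def class_counts(path, label_column="Class"):
    """Count how many rows carry each label value in a CSV file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return Counter(row[label_column] for row in reader)


# Example usage (assumes the file sits in the project's root directory):
# counts = class_counts("credit_card_fraud.csv")
# print(counts)
```

A heavily imbalanced count, with far fewer fraud rows than legitimate ones, is expected for this kind of data.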

Run the Example:
Execute the example script to see the model in action:

python examples/run_on_kaggle_data.py

Fine-Tuning for High Performance 🎯
For fraud detection, missing a real case of fraud (low recall) is usually much worse than flagging a legitimate transaction for review (low precision). The primary way to fine-tune this model is by adjusting the probability threshold.

Adjusting the Prediction Threshold

The run method of the pipeline accepts a threshold parameter.

A higher threshold (e.g., 0.7) makes the model more conservative. It will only flag transactions it is very confident are fraudulent. This leads to higher precision but lower recall.

A lower threshold (e.g., 0.3) makes the model more sensitive. It will also flag transactions that have only a moderate estimated chance of being fraudulent. This leads to higher recall but lower precision.

The examples/run_on_kaggle_data.py script demonstrates this principle by running the pipeline with two different thresholds to show how the threshold directly affects the precision-recall trade-off.
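The trade-off can be reproduced by hand on a tiny example. The sketch below is independent of the package: the precision_recall helper and the scores and labels are made up for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when score >= threshold means fraud."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up scores and labels for illustration (1 = fraud, 0 = legitimate).
scores = [0.95, 0.80, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.7, 0.3):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.7: precision=1.00, recall=0.67
# threshold=0.3: precision=0.60, recall=1.00
```

At 0.7 every flagged transaction is fraud, but the fraud case scored 0.40 is missed. At 0.3 all three fraud cases are caught, at the cost of two false alarms.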

# The example script shows how to adjust the threshold
# to meet the goal of >90% recall for fraud.
pipeline.run(df_features, true_labels, threshold=0.30)

By adjusting this single parameter, you can configure the model to meet the specific business requirements of your fraud detection system.
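One way to turn a business requirement such as "at least 90% recall" into a threshold is to scan the observed scores from strictest to most permissive and stop at the first one that meets the target. The function below is a hypothetical helper, not part of the package, and the data is made up for illustration.

```python
def highest_threshold_for_recall(scores, labels, target_recall):
    """Return the largest observed score that, used as the threshold,
    reaches at least target_recall. Returns None if there are no fraud
    labels or the target cannot be met."""
    total_fraud = sum(1 for y in labels if y == 1)
    if total_fraud == 0:
        return None
    # Try candidate thresholds from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        caught = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s >= threshold
        )
        if caught / total_fraud >= target_recall:
            return threshold
    return None


# Made-up scores and labels for illustration (1 = fraud, 0 = legitimate).
scores = [0.95, 0.80, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(highest_threshold_for_recall(scores, labels, 0.9))  # 0.4
```

A threshold chosen this way should be picked on a validation split rather than on the data used for the final evaluation, otherwise the reported recall will be optimistic.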

