Metadata-Version: 2.4
Name: featsel
Version: 0.1.0
Summary: Feature selection pipeline for high-dimensional data with a focus on genomics and bioinformatics
Author-email: Dror Meirovich <dror@drorm.com>
Maintainer-email: Dror Meirovich <dror@drorm.com>
License: MIT
Project-URL: Homepage, https://github.com/drormeir/featsel
Project-URL: Documentation, https://github.com/drormeir/featsel#readme
Project-URL: Repository, https://github.com/drormeir/featsel
Project-URL: Bug Reports, https://github.com/drormeir/featsel/issues
Keywords: feature-selection,machine-learning,genomics,bioinformatics,high-dimensional-data,gene-expression
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: matplotlib>=3.4.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: pyyaml>=5.4.0
Requires-Dist: tqdm>=4.60.0
Provides-Extra: dev
Requires-Dist: jupyter>=1.0.0; extra == "dev"
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Provides-Extra: optuna
Requires-Dist: optuna>=3.0.0; extra == "optuna"
Dynamic: license-file

# featsel

Feature selection pipeline for high-dimensional data with a focus on genomics and bioinformatics.

## The Problem

Modern datasets often contain thousands of features but relatively few samples. This is especially common in:

- **Genomics**: Gene expression profiles with 20,000+ genes per patient
- **Text analysis**: Document classification with large vocabularies
- **Sensor data**: IoT and industrial monitoring systems

Training models on such data tends to cause overfitting, long computation times, and poor interpretability. Feature selection addresses this by keeping the most predictive variables and discarding noisy or redundant ones.

## What This Project Does

This project implements a feature selection pipeline that:

- Applies multiple feature selection methods (filter, wrapper, and embedded approaches)
- Compares their effectiveness on classification tasks
- Scales efficiently through parallelization
- Produces interpretable results for domain experts

The primary use case is predicting breast cancer molecular subtypes from gene expression data, but the pipeline generalizes to other high-dimensional classification problems.
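
As a rough illustration of the three method families listed above, the sketch below compares one filter, one wrapper, and one embedded selector using scikit-learn. It does not use the `featsel` API. The synthetic data, the choice of estimators, and the value of `K` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a "many features, few samples" dataset (assumption).
X, y = make_classification(
    n_samples=120, n_features=300, n_informative=10, n_redundant=0,
    random_state=0,
)

K = 20  # number of features each selector keeps (arbitrary choice)

selectors = {
    # Filter: score every feature independently with an ANOVA F-test.
    "filter": SelectKBest(score_func=f_classif, k=K),
    # Wrapper: repeatedly refit a model and drop its weakest 20% of features.
    "wrapper": RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=K, step=0.2),
    # Embedded: keep the K features with the highest random-forest importance.
    "embedded": SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf, max_features=K,
    ),
}


def evaluate(selector):
    """Mean 5-fold CV accuracy of scaling -> selection -> logistic regression."""
    pipe = make_pipeline(StandardScaler(), selector,
                         LogisticRegression(max_iter=1000))
    return cross_val_score(pipe, X, y, cv=5).mean()


scores = {name: evaluate(sel) for name, sel in selectors.items()}
for name, score in scores.items():
    print(f"{name:>8}: mean CV accuracy = {score:.3f}")
```

The selector sits inside the pipeline so that it is refit on the training portion of every fold. Selecting features on the full dataset before cross-validation would leak label information and inflate the scores.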

## Project Structure

```
├── configs/            # Dataset configuration files (YAML)
├── docs/               # Report chapters (Markdown)
├── notebooks/          # Jupyter notebooks for analysis
├── featsel/            # Main Python package
├── datasets/           # Input data (one subfolder per dataset)
├── figures/            # Generated plots
└── references/         # Project proposal and papers
```

## Installation

```bash
# Clone the repository
git clone https://github.com/drormeir/featsel.git
cd featsel

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Installation from PyPI is planned but not yet available:

```bash
pip install featsel
```

## Usage

Run the pipeline with a configuration file:

```bash
python -m featsel.run --config configs/scanb.yaml
```

To use your own dataset, create a config file (see `configs/scanb.yaml` as a template) and a data folder in `datasets/`.

## Datasets

The pipeline is dataset-agnostic. Each dataset needs:

- A subfolder in `datasets/` with `features.csv` and `metadata.csv`
- A YAML config file in `configs/`
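
Before running the pipeline on a new dataset, it can help to confirm that this layout is in place. The helper below is a minimal sketch using only the standard library. The function name `missing_dataset_files` is hypothetical and not part of `featsel`, and it assumes the config file is named after the dataset folder, as with `configs/scanb.yaml` and `datasets/scanb/`.

```python
from pathlib import Path

# File names required inside each dataset subfolder (taken from the list above).
REQUIRED_FILES = ("features.csv", "metadata.csv")


def missing_dataset_files(project_root, name):
    """Return the expected paths, relative to project_root, that do not exist yet.

    Assumes the config is called configs/<name>.yaml (a naming assumption).
    """
    root = Path(project_root)
    expected = [Path("datasets") / name / fname for fname in REQUIRED_FILES]
    expected.append(Path("configs") / f"{name}.yaml")
    return [p.as_posix() for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    # Run from the repository root; an empty list means the layout is complete.
    print(missing_dataset_files(".", "scanb"))
```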

### SCAN-B Breast Cancer (included config)

- Gene expression measurements (thousands of features)
- PAM50 molecular subtype labels (Basal, LumA, LumB, HER2, Normal)
- Clinical metadata (ER status, survival data)

**Note**: Data files are not included in the repository due to their size. Download them from [TBD] and place them in `datasets/scanb/`.

## Status

This project is part of an M.Sc. thesis at Reichman University, supervised by Dr. Ben Galili.
