Metadata-Version: 2.1
Name: subsampwinner
Version: 0.0.8
Summary: A package for feature selection using Subsampling Winner Algorithm
Home-page: https://github.com/wdai0/subsamp
Author: Wei Dai
Author-email: wdai@gmu.edu
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26.4
Requires-Dist: scipy>=1.13.0
Requires-Dist: statsmodels>=0.14.2
Requires-Dist: mpi4py>=3.1.6

# subsamp: Feature Selection with the Subsampling Winner Algorithm (SWA)

SubsampWinner is a Python package that implements the Subsampling Winner Algorithm (SWA) for feature selection in high-dimensional datasets.
It includes a robust double assurance procedure to enhance stability and reliability in feature selection.

## Features

- Subsampling Winner Algorithm (SWA) for efficient feature selection;
- Double Assurance procedure for improved stability;
- Support for both homoskedastic and heteroskedastic data;
- Parallel processing capabilities for improved performance;
- Flexible parameter tuning and multiple testing correction methods.
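
To make the last point concrete, here is a minimal, package-independent sketch of what a multiple testing correction does. It compares the Bonferroni and Benjamini-Hochberg rules on a made-up list of p-values. The helper names `bonferroni` and `benjamini_hochberg` and the p-values are illustrative assumptions, not part of the SubsampWinner API, and the corrections the package actually exposes may differ.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values for six candidate features (illustration only).
p_values = [0.001, 0.009, 0.02, 0.03, 0.20, 0.74]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print("Bonferroni:        ", bonf)
print("Benjamini-Hochberg:", bh)
```

On these values Bonferroni keeps only the first feature, while Benjamini-Hochberg keeps the first four, which illustrates why the choice of correction affects how many features survive.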

## Installation

You can install SubsampWinner using pip:

```bash
pip install subsampwinner
```

## Quick Start

This example generates a dataset with **80 samples and 100 features**.
Only the first 10 features have nonzero coefficients, ranging from 0.5 to 5, so the example shows how the subsampling winner algorithm handles different levels of signal strength.
The output includes the indices of the selected features and a summary of the final model.

The double assurance procedure is then run to further enhance the stability of the feature selection.

```python
# Setup
import numpy as np
from subsampwinner.subsamp import subsamp
from subsampwinner.SubsampDoubleAssurance import SubsampDoubleAssurance
from subsampwinner.GenerateData import generate_heteroskedastic_data

# Generate sample data
n, p = 80, 100
beta0 = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5])
beta0_index = np.arange(len(beta0))
beta = np.zeros(p)
beta[beta0_index] = beta0
gamma = np.zeros(p)

X, y, _, _ = generate_heteroskedastic_data(n, p, hetero_func=lambda x: 1.2,
    beta=beta, gamma=gamma, type='diagonal')

# Initialize and run SWA
swa = subsamp(s=25, m=1000, qnum=15)
swa.fit(X, y)
```

We obtain the following selected feature indices, shifted by 1 so that features are numbered from 1:

```python
# selected variables
selected_features = [selected_var + 1 for selected_var in swa.finalists]

print("Selected features:", selected_features)
```
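
Because the data are simulated, the true support is known (the first 10 features), so a selection can be checked against it. The sketch below uses a hypothetical helper, `selection_metrics`, and a made-up selection rather than real package output. It is one simple way to score a result, not something the package provides.

```python
def selection_metrics(selected, true_support):
    """Compare selected feature indices with the known true support.

    Both arguments are iterables of 0-based feature indices.
    """
    selected, true_support = set(selected), set(true_support)
    true_pos = selected & true_support
    return {
        "true_positives": sorted(true_pos),
        "false_positives": sorted(selected - true_support),
        "missed": sorted(true_support - selected),
        "recall": len(true_pos) / len(true_support) if true_support else 0.0,
        "precision": len(true_pos) / len(selected) if selected else 0.0,
    }


# Features 0..9 carry signal in the Quick Start data.
true_support = range(10)
# Hypothetical selection for illustration only (not actual package output).
example_selected = [2, 3, 4, 5, 6, 7, 8, 9, 42]

metrics = selection_metrics(example_selected, true_support)
print(metrics)
```

With the Quick Start objects, `selection_metrics(swa.finalists, beta0_index)` would give the same comparison, assuming `swa.finalists` holds 0-based indices as the `+ 1` conversion above suggests.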

The final model can be summarized as follows:

```python
# Print a summary of the final model
print(swa.final_model.summary())
```

We verify the stability of the feature selection by running the double assurance procedure.

```python
# Run Double Assurance procedure
sda = SubsampDoubleAssurance(m=1000)
results = sda.double_assurance(X, y, s0=26, T=0.9, I_max=20, init_range=0.3, r=0.75)
```
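
As a package-independent illustration of what "stability" means here, the sketch below measures how much two selections overlap using the Jaccard similarity. The helper `jaccard` and both example selections are hypothetical. This is not how the double assurance procedure works internally, only a simple way to compare the feature sets returned by two runs.

```python
def jaccard(a, b):
    """Jaccard similarity of two index collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


# Made-up selections from two runs (illustration only).
run_1 = [2, 3, 4, 5, 6, 7, 8, 9]
run_2 = [3, 4, 5, 6, 7, 8, 9, 42]

similarity = jaccard(run_1, run_2)
print(f"Jaccard similarity between runs: {similarity:.3f}")
```

A value close to 1 means the two runs selected nearly the same features, while a value close to 0 means the selections barely overlap.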
