Metadata-Version: 2.1
Name: plinkformatter
Version: 0.1.80
Summary: 
Author: nick-sebasco
Author-email: nicksebasco.jax@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: joblib (>=1.4.2,<2.0.0)
Requires-Dist: numpy (>=1.26.4,<2.0.0)
Requires-Dist: pandas (>=2.2.2,<3.0.0)
Requires-Dist: pyarrow (>=21.0.0,<22.0.0)
Requires-Dist: pytest (>=8.2.2,<9.0.0)
Requires-Dist: scipy (>=1.13.1,<2.0.0)
Description-Content-Type: text/markdown

# PLINKFORMATTER

`plinkformatter` transforms genotype and phenotype inputs into PLINK-compatible
artifacts for downstream linear mixed-model workflows (primarily PyLMM).

This repository is based on the original R workflow implemented in:

- `plinkformatter/IGNORE_misc/pyLMM_utils.R`
- `tests/IGNORE_MISC/hao_v2/pyLMM_analysis_NonDO.R`

The Python implementation keeps the same core workflow while improving
maintainability, testability, and performance for large PED/MAP datasets.

## Prerequisites

- Python 3.9+ (the package requires `>=3.9,<4.0`)
- Poetry
- PLINK 2.0 (required for PLINK integration tests and full pipeline runs)

Install dependencies:

```bash
poetry install
```

If `plink2` is not on your `PATH`, set the `PLINK2_PATH` environment variable to the full path of the binary.

PowerShell example:

```powershell
$env:PLINK2_PATH = "C:\path\to\plink2.exe"
```
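
The lookup order described above (explicit `PLINK2_PATH` first, then `PATH`) can be sketched in Python as follows. This is an illustration only: `resolve_plink2` is a hypothetical helper, not a function documented by `plinkformatter`, and the package's own resolution logic may differ.

```python
import os
import shutil
from typing import Optional


def resolve_plink2(env_var: str = "PLINK2_PATH") -> Optional[str]:
    """Return a path to the plink2 executable, or None if none is found.

    Hypothetical helper: the name and behavior are assumptions for
    illustration, not part of the plinkformatter API.
    """
    explicit = os.environ.get(env_var)
    if explicit:
        # An explicitly configured path takes priority over PATH.
        return explicit
    # Fall back to a PATH lookup; shutil.which returns None when nothing matches.
    return shutil.which("plink2")
```

A caller could skip PLINK-dependent steps when the function returns `None`.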

## Running Tests

Run all commands from the repository root.

### 1) Fast test pass (no external PLINK dependency)

```bash
poetry run pytest -q tests/test_generate_pheno_plink_fast.py
```

### 2) PLINK utility tests

These tests exercise PLINK-facing behavior and may require a working `plink2` binary.

```bash
poetry run pytest -q tests/test_plink_utils.py
```

### 3) Full test suite

```bash
poetry run pytest -q tests
```

### Test output flags

- `-q`: quiet output (compact summary)
- `-s`: show stdout/stderr (`print`, logs written to console)

Example:

```bash
poetry run pytest -s -q tests/test_generate_pheno_plink_fast.py
```

## Workflow Parity with R

The Python pipeline follows the same logical stages as the R scripts:

1. Extract/normalize phenotype rows for selected measure IDs.
2. Generate per-measure, sex-specific `.ped/.map/.pheno` files.
3. Build `.bed/.bim/.fam` with PLINK.
4. Align `.pheno` ordering to `.fam` and recompute rank-Z on retained samples.
5. Compute kinship (PyLMM3 or PLINK-based path).
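
Stage 4 can be illustrated with a minimal, standard-library sketch. It assumes a rank-based inverse normal transform with average ranks for ties and a `(rank - 0.5) / n` quantile offset; the exact formula used by the package is not stated here, and `rank_z` and `align_to_fam` are hypothetical names.

```python
from statistics import NormalDist
from typing import Dict, List, Tuple


def rank_z(values: List[float]) -> List[float]:
    """Rank-based inverse normal transform (average ranks for ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j across the run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    # Assumed convention: the 0.5 offset keeps quantiles strictly inside (0, 1).
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


def align_to_fam(fam_ids: List[str], pheno: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order phenotypes like the .fam file, then rank-Z the retained samples."""
    retained = [iid for iid in fam_ids if iid in pheno]
    z = rank_z([pheno[iid] for iid in retained])
    return list(zip(retained, z))
```

Recomputing the transform after alignment matters because samples dropped by PLINK would otherwise still influence the ranks of the samples that remain.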

## Performance Design (Large PED Files)

The main difference from older dataframe-heavy approaches is how
`generate_pheno_plink_fast.py` handles the PED file:

- It does **not** load the full PED into a pandas DataFrame.
- It builds a compact byte-offset index (`strain -> file position`) once.
- It seeks directly to needed PED rows and writes outputs in a streaming manner.

This keeps memory usage low and scales better on large genotype files than
calling `pandas.read_csv()` on the full PED content.

## Publishing to PyPI

1. Update version:

```bash
poetry version patch
```

2. Build:

```bash
poetry build
```

3. Configure repository and token:

```bash
poetry config repositories.pypi https://upload.pypi.org/legacy/
poetry config pypi-token.pypi pypi-YourActualTokenHere
```

4. Publish:

```bash
poetry publish
```

