Metadata-Version: 2.1
Name: emu-popgen
Version: 1.1
Summary: EM-PCA for inferring population structure in the presence of missingness
Home-page: https://github.com/Rosemeis/emu
Author: Jonas Meisner
Author-email: meisnerucph@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cython>3.0.0
Requires-Dist: numpy>2.0.0

# EMU
EMU is software for performing principal component analysis (PCA) on genetic datasets in the presence of missingness. EMU can handle both random and non-random missingness by modelling it directly through a truncated SVD approach. EMU uses binary PLINK files as input.

### Citation
Please cite our paper in *Bioinformatics*: https://doi.org/10.1093/bioinformatics/btab027

## Installation
```bash
# Build and install via PyPI
pip install emu-popgen

# Download source and install via pip
git clone https://github.com/Rosemeis/emu.git
cd emu
pip install .

# Download source and install in new Conda environment
git clone https://github.com/Rosemeis/emu.git
cd emu
conda env create -f environment.yml
conda activate emu

# You can now run the program with the `emu` command
```

## Quick usage
### Running EMU
Provide `emu` with the file prefix of the binary PLINK files (`.bed`, `.bim`, `.fam`).
```bash
# Check help message of the program
emu -h

# Model and extract 2 eigenvectors using the EM-PCA algorithm
emu --bfile test --eig 2 --threads 64 --out test.emu
```
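For intuition only, the idea of modelling missingness through a truncated SVD can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the implementation used by `emu`: the function name `em_pca`, its parameters, the dense floating-point matrix with `NaN` for missing genotypes, and the centering by per-site mean are all hypothetical choices made for the example.

```python
import numpy as np


def em_pca(G, k, n_iter=100, tol=1e-7):
    """Toy EM-style PCA for a matrix with missing entries (hypothetical helper).

    G : (n_samples, n_sites) float array, np.nan marks missing genotypes.
    k : number of components to keep in the truncated SVD.
    """
    mask = np.isnan(G)
    # Center each site on the mean of its observed genotypes
    # (assumes every site has at least one observed genotype).
    site_mean = np.nanmean(G, axis=0)
    # Missing entries start at the site mean, i.e. 0 after centering.
    X = np.where(mask, 0.0, G - site_mean)

    for _ in range(n_iter):
        # Fit a rank-k model to the currently filled-in matrix.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k]
        # Overwrite only the missing entries with the model's prediction.
        X_new = np.where(mask, low_rank, X)
        change = np.sqrt(np.mean((X_new - X) ** 2))
        X = X_new
        if change < tol:
            break

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]


# Small synthetic example: 30 samples, 100 sites, about 20% missing genotypes.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(30, 100)).astype(float)
G[rng.random(G.shape) < 0.2] = np.nan
U, s, Vt = em_pca(G, k=2)
```

The sketch recomputes a full SVD in every iteration, which is only practical for small matrices.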

### Memory efficient implementation
A more memory-efficient implementation is also available. It is based on the randomized SVD algorithm and uses custom matrix multiplications that can handle decomposed matrices. Only the factor matrices and the 2-bit genotype matrix are kept in memory.
```bash
# Example run using '--mem' argument
emu --mem --bfile test --eig 2 --threads 64 --out test.emu.mem
```
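To illustrate why a randomized SVD suits this setting, the sketch below accesses the matrix only through two multiplication callbacks, so the matrix itself never has to be stored in dense form. It is a generic textbook-style randomized SVD with power iterations, not the code used by `emu`: the names `randomized_svd`, `matmul`, `rmatmul`, `n_oversample` and `n_power` are assumptions made for the example.

```python
import numpy as np


def randomized_svd(matmul, rmatmul, n_cols, k, n_oversample=10, n_power=4, seed=0):
    """Approximate rank-k SVD of a matrix A given only as products (hypothetical helper).

    matmul(X)  must return A @ X    for X of shape (n_cols, l).
    rmatmul(X) must return A.T @ X  for X of shape (n_rows, l).
    """
    rng = np.random.default_rng(seed)
    l = k + n_oversample
    # Random test matrix used to sample the range of A.
    omega = rng.standard_normal((n_cols, l))
    Q, _ = np.linalg.qr(matmul(omega))
    # Power iterations sharpen the subspace; QR keeps it numerically stable.
    for _ in range(n_power):
        Q, _ = np.linalg.qr(rmatmul(Q))
        Q, _ = np.linalg.qr(matmul(Q))
    # Project A onto the small subspace: B = Q.T @ A has only l rows.
    B = rmatmul(Q).T
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]


# Example with a matrix kept as two factors (W @ H) instead of a dense array.
rng = np.random.default_rng(1)
W = rng.standard_normal((40, 3))
H = rng.standard_normal((3, 60))
U, s, Vt = randomized_svd(
    matmul=lambda X: W @ (H @ X),        # A @ X without forming A
    rmatmul=lambda X: H.T @ (W.T @ X),   # A.T @ X without forming A
    n_cols=60,
    k=3,
)
```

Because only the callbacks touch the data, they could in principle be backed by a compact genotype representation plus factor matrices, which is the general idea behind the `--mem` mode described above.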
