Metadata-Version: 2.4
Name: raep
Version: 0.0.4
Summary: Random Forest Enzyme Prediction
Author-email: DHY <dhy.scut@outlook.com>
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scikit-learn
Requires-Dist: xgboost
Requires-Dist: joblib

# RAEP: Rapid Enzyme/Non-Enzyme Prediction

[![Python Version](https://img.shields.io/badge/Python-3.8%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

RAEP (Rapid Enzyme/Non-Enzyme Prediction) is an efficient enzyme/non-enzyme prediction tool for protein sequences, built on multi-physicochemical property features and the XGBoost machine learning algorithm.

## Features

* **Efficient Prediction**: Achieves fast and accurate enzyme/non-enzyme classification using optimized feature extraction and the XGBoost model.
* **Multi-Mode Support**: Supports single-sequence prediction, multi-sequence batch prediction, and FASTA file batch prediction.
* **Rich Feature Set**: Utilizes multi-physicochemical property pseudo-amino acid composition (Pseudo-AAC), CTD features, and windowed amino acid composition.
* **User-Friendly**: Offers a concise Python API that is easy to integrate into existing projects.
* **Multi-Process Optimization**: Runs feature extraction in parallel across multiple processes to speed up work on large-scale datasets.
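
To make the feature vocabulary above more concrete, here is a minimal sketch of plain amino acid composition, the simplest member of this feature family. It is an illustration only and not RAEP's internal implementation; the function name `amino_acid_composition` and the residue ordering are assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids in a fixed (alphabetical) order.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue as a 20-element list."""
    counts = Counter(sequence)
    # Only standard residues contribute to the total.
    total = sum(counts[aa] for aa in STANDARD_AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in STANDARD_AMINO_ACIDS]


features = amino_acid_composition("MKVLVVLVLLAVLVLA")
print(len(features))  # one value per standard amino acid
```

Pseudo-AAC and CTD descriptors extend this idea with sequence-order and physicochemical information; RAEP computes all of its features internally, so users do not need to call anything like this themselves.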

## Installation Methods

### Install from PyPI (Recommended)

```bash
pip install raep
```

### Install from Source Code

```bash
# Clone the repository (if one is available)
git clone <repository-url>
cd RAEP

# Install in development (editable) mode
pip install -e .
```

## Dependencies

- joblib: Used for model saving/loading and parallel processing.
- numpy: Numerical computation.
- pandas: Data processing.
- scikit-learn: Machine learning utilities and evaluation metrics.
- xgboost: Implementation of the gradient boosting tree algorithm.

## Basic Usage

### Import and Initialization

```python
from raep_package import RAEP

# Default initialization (uses the built-in model)
predictor = RAEP()

# Initialize with a custom model path
# predictor = RAEP(model_path="path/to/your/model.pkl")
```

### Single Sequence Prediction

```python
# Predict a single protein sequence
sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
prediction, probability = predictor.predict_sequence(sequence)

print(f"Prediction: {'Enzyme' if prediction == 1 else 'Non-enzyme'}")
print(f"Probabilities: non-enzyme={probability[0]:.4f}, enzyme={probability[1]:.4f}")
```

### Batch Prediction for Multiple Sequences

```python
# Batch prediction for multiple sequences
sequences = [
    "MASMTGGQQMGRGSEF",
    "MKVLVVLVLLAVLVLA",
    "MDEKTTGWRGGHVA"
]

results = predictor.predict_sequences(sequences)

for i, (pred, prob) in enumerate(results, 1):
    print(f"Sequence {i}: {'Enzyme' if pred == 1 else 'Non-enzyme'} (enzyme probability: {prob[1]:.4f})")
```

### Batch Prediction for FASTA Files

```python
# Batch prediction from a FASTA file
fasta_path = "test_sequences.fasta"
results = predictor.predict_fasta(fasta_path)

print(f"Predictions ({len(results)} sequences):")
for i, (pred, prob) in enumerate(results, 1):
    print(f"Sequence {i}: {'Enzyme' if pred == 1 else 'Non-enzyme'} (enzyme probability: {prob[1]:.4f})")
```

## API Reference

### RAEP Class

#### Initialization

```python
RAEP(model_path=None)
```

**Purpose**: Instantiates the RAEP predictor, loads the model, and initializes the feature extraction parameters (e.g., `LAG=10`, `W=0.05`) so that all subsequent predictions use the same settings.

- **Parameters**:
  - `model_path`: Optional, path to a custom model file. If not provided, the built-in `enzyme_xgb_model.pkl` model is used.

#### Methods

##### predict_sequence

```python
predict_sequence(sequence)
```

**Purpose**: Predicts whether a single protein sequence is an enzyme.

- **Parameters**:

  - `sequence`: String, the protein sequence to be predicted.
- **Return Value**:

  - Tuple `(prediction, probability)`:
    - `prediction`: Integer, the predicted class (0 = Non-enzyme, 1 = Enzyme).
    - `probability`: List of two floats giving the probabilities that the sequence is a non-enzyme and an enzyme, respectively.

##### predict_sequences

```python
predict_sequences(sequences)
```

**Purpose**: Performs batch prediction for multiple protein sequences.

- **Parameters**:

  - `sequences`: List containing multiple protein sequence strings.
- **Return Value**:

  - List where each element is a tuple `(prediction, probability)` corresponding to the prediction result of the input sequence.
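
Because each result is a bare `(prediction, probability)` tuple, it can help to pair the results with their input sequences before reporting them. The sketch below assumes the results are returned in input order, as the description above suggests. The helper name `summarize_results` is hypothetical, and the sample `results` values are made-up placeholders rather than real RAEP output.

```python
def summarize_results(sequences, results):
    """Pair each input sequence with its predicted label and enzyme probability."""
    if len(sequences) != len(results):
        raise ValueError("expected exactly one result per input sequence")
    rows = []
    for seq, (pred, prob) in zip(sequences, results):
        rows.append({
            "sequence": seq,
            "is_enzyme": pred == 1,                # 1 = Enzyme, 0 = Non-enzyme
            "enzyme_probability": float(prob[1]),  # index 1 holds the enzyme probability
        })
    return rows


sequences = ["MASMTGGQQMGRGSEF", "MKVLVVLVLLAVLVLA"]
# Placeholder values standing in for predictor.predict_sequences(sequences).
results = [(1, [0.2, 0.8]), (0, [0.9, 0.1])]

rows = summarize_results(sequences, results)
for row in rows:
    print(row)
```

With a real predictor, replace the placeholder list with `predictor.predict_sequences(sequences)`.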

##### predict_fasta

```python
predict_fasta(fasta_path)
```

**Purpose**: Performs batch prediction for protein sequences from a FASTA file.

- **Parameters**:

  - `fasta_path`: String, path to the FASTA file.
- **Return Value**:

  - List where each element is a tuple `(prediction, probability)` corresponding to the prediction result of the sequences in the FASTA file.
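
The returned list carries no record identifiers, so a caller who wants to label each prediction has to read the FASTA headers separately. The following is a minimal sketch of that step. It assumes `predict_fasta` returns one result per record in file order, which the documentation above does not state explicitly; the helper `fasta_record_ids` and the record names in the demo are hypothetical.

```python
import io


def fasta_record_ids(handle):
    """Return the ID (first word of each '>' header line) for every record."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


# Demo on an in-memory FASTA snippet; with a real file, pass open(fasta_path).
demo = io.StringIO(
    ">seq1 example protein\nMASMTGGQQMGRGSEF\n>seq2\nMKVLVVLVLLAVLVLA\n"
)
ids = fasta_record_ids(demo)
print(ids)

# Hypothetical pairing with RAEP output (not executed here):
# results = predictor.predict_fasta(fasta_path)
# for seq_id, (pred, prob) in zip(ids, results):
#     print(seq_id, pred, prob[1])
```

If the number of IDs and the number of results differ, the ordering assumption does not hold and the pairing should not be trusted.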

## Usage Examples

Refer to the `example_usage.py` file in the project, which contains complete usage examples:

```bash
python example_usage.py
```

## Notes

1. **Sequence Format Requirements**: Input sequences should contain only the uppercase single-letter codes of the 20 standard amino acids.
2. **Sequence Length**: The tool handles sequences of any length, but very short sequences (e.g., < 10 amino acids) may reduce prediction accuracy.
3. **Multi-Process Processing**: Feature extraction runs in multiple processes by default and adjusts automatically to the number of CPU cores in the system.
4. **Model File**: Ensure the model file exists and is accessible, especially when using a custom model path.
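
The first two notes can be checked before any sequence reaches the predictor. The sketch below is one possible pre-check and is not part of the RAEP API; the names `invalid_residues` and `check_sequence` are hypothetical, and the default `min_length=10` simply mirrors the threshold mentioned in note 2.

```python
# Uppercase single-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return the sorted list of characters that are not standard residues."""
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)


def check_sequence(sequence, min_length=10):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    bad = invalid_residues(sequence)
    if bad:
        problems.append(f"non-standard characters: {''.join(bad)}")
    if len(sequence) < min_length:
        problems.append(f"shorter than {min_length} residues")
    return problems


print(check_sequence("MASMTGGQQMGRGSEF"))
print(check_sequence("mkXB"))
```

Lowercase input is reported as invalid here; calling `sequence.upper()` first is a reasonable alternative if the data source is known to use lowercase letters.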

## Troubleshooting

### Common Issues

1. **Failed to import the RAEP package**: Ensure the package is correctly installed in the current Python environment.
2. **Model loading failure**: Verify that the model file path is correct and the file exists at the specified location.
3. **Prediction errors**: Check if the input sequence format is valid and contains only standard amino acid characters.
4. **Performance issues**: For extremely large datasets, process the sequences in batches to avoid running out of memory.
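
For the batching advice in item 4, a minimal sketch follows. The helper names `chunked` and `predict_in_chunks` and the default `chunk_size` are assumptions for illustration; the prediction function is passed in as an argument, so the snippet runs without RAEP installed.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(predict_fn, sequences, chunk_size=1000):
    """Apply `predict_fn` chunk by chunk and concatenate the results in order."""
    results = []
    for chunk in chunked(sequences, chunk_size):
        results.extend(predict_fn(chunk))
    return results


# Stand-in for predictor.predict_sequences so the example is self-contained.
def fake_predict(batch):
    return [len(seq) for seq in batch]


lengths = predict_in_chunks(fake_predict, ["MAS", "MKVL", "MDEKT"], chunk_size=2)
print(lengths)
```

With a real predictor, the call would look like `predict_in_chunks(predictor.predict_sequences, sequences, chunk_size=1000)`; a suitable chunk size depends on sequence lengths and available memory.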

### Getting Help

If you encounter any problems, please contact the author:

- Author: DHY
- Email: dhy.scut@outlook.com

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

The development of this project is supported by several open-source tools, especially machine learning libraries such as XGBoost and scikit-learn.

*Version: 0.0.4*
*Last updated: 2025*
