Metadata-Version: 2.1
Name: nucleoseeker
Version: 0.1.2
Summary: Precision filtering of RNA databases to curate high-quality datasets
Home-page: https://github.com/theuutkarsh/nucleoseeker
Author: Utkarsh Upadhyay
Author-email: u.upadhyay@fz-juelich.de
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE

# NucleoSeeker - Precision filtering of RNA databases to curate high-quality datasets

**Note**: NucleoSeeker currently supports Unix-based systems, including macOS.

## Dependencies

NucleoSeeker relies on a few external command-line tools. Before running the software, ensure these tools are properly installed on your system.

| Dependency       | Minimum Version | Installation Guide                                                    |
|------------------|-----------------|-----------------------------------------------------------------------|
| **Clustal Omega**| `1.2.4`         | [Clustal Omega Setup Instructions](http://www.clustal.org/omega/INSTALL)|
| **Infernal**     | `1.1.5`         | [Infernal Setup Instructions](http://eddylab.org/infernal/)           |
| **Emboss**       | `6.6.0`         | [Emboss Setup Instructions](http://emboss.open-bio.org/html/adm/ch01s01.html) (**Optional**) |
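
Before going further, it can help to confirm that the tools are reachable from your shell. The following is a minimal sketch, not part of NucleoSeeker itself. It assumes the executables are named `clustalo` (Clustal Omega), `cmscan` and `cmpress` (Infernal), and `embossversion` (EMBOSS). It checks only that they are on `PATH`, not that they meet the minimum versions in the table.

```python
import shutil

# Executable names are assumptions about a typical installation:
# `clustalo` (Clustal Omega), `cmscan`/`cmpress` (Infernal),
# `embossversion` (EMBOSS, optional).
REQUIRED_TOOLS = ["clustalo", "cmscan", "cmpress"]
OPTIONAL_TOOLS = ["embossversion"]


def find_missing(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found.")
    for tool in find_missing(OPTIONAL_TOOLS):
        print("Optional tool not found:", tool)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what you want.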

### Quick Installation Instructions
### Install Clustal Omega
* Clustal Omega version supported `1.2.4`

```
  wget http://www.clustal.org/omega/clustal-omega-1.2.4.tar.gz
  tar zxf clustal-omega-1.2.4.tar.gz
  cd clustal-omega-1.2.4
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # install Clustal Omega under the chosen prefix

  # or use this

  sudo apt-get install clustalo

```


### Install Emboss (Optional)
* **NOTE** - *Emboss is very slow; unless you are experimenting, we don't recommend using it. Clustal Omega should be sufficient for most use cases.*
* Emboss version supported `6.6.0`

### Install Infernal

* Infernal version supported `1.1.5`

```
  wget http://eddylab.org/software/infernal/infernal.tar.gz
  tar zxf infernal.tar.gz
  cd infernal-1.1.5
  ./configure --prefix /your/install/path
  make
  make check                 # optional: run automated tests
  make install               # optional: install Infernal programs, man pages

  # or use this

  sudo apt-get install infernal infernal-doc

```


### Get Rfam.cm file ready

* To use this tool, you need the Rfam covariance model file, which is available for download at https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz. The file must also be compressed and indexed with the `cmpress` command from `Infernal` (installed above). If you don't have the file yet, use the commands below:

```
  mkdir -p rfam
  cd rfam
  wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz -O Rfam.cm.gz
  gunzip Rfam.cm.gz
  cmpress Rfam.cm
```
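
To confirm that `cmpress` finished successfully, you can check for the index files it writes next to the model. This is a small sketch under the assumption that `cmpress` produces four files with the suffixes `.i1f`, `.i1i`, `.i1m`, and `.i1p`; the helper name `missing_cmpress_files` is hypothetical and not part of NucleoSeeker.

```python
from pathlib import Path

# Suffixes of the index files that Infernal's `cmpress` writes
# alongside the covariance model file (assumed, see lead-in).
CMPRESS_SUFFIXES = (".i1f", ".i1i", ".i1m", ".i1p")


def missing_cmpress_files(cm_path):
    """Return names of the CM file or its index files that are absent."""
    cm_path = Path(cm_path)
    expected = [cm_path] + [Path(str(cm_path) + s) for s in CMPRESS_SUFFIXES]
    return [p.name for p in expected if not p.is_file()]
```

For the directory created above, `missing_cmpress_files("rfam/Rfam.cm")` should return an empty list once `cmpress` has run.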

### Generating new dataset

After the preliminary steps, install `NucleoSeeker` as follows:

```
# We recommend setting up a virtual env when using this tool

python3 -m venv nucleoseeker_env
source nucleoseeker_env/bin/activate
pip install nucleoseeker
```

Once the environment is ready, you can generate a dataset with the following command:

```
export DATA_PATH=/your/desired/path/to/save/the/dataset
nucleoseeker \
        --dataset_name test_dataset \
        --rfam_cm_path your/rfam/path \
        --exptl_method "X-RAY DIFFRACTION" \
        --resolution 3.6 \
        --year_range 2019 \
        --dend 500 \
        --save 1

```
After running this command, a directory named `test_dataset` will be created in the `DATA_PATH` directory with the following structure:

```
      DATA_PATH
      ├── pdb_files
      └── test_dataset
          ├── files
          ├── sequences
          ├── clean_tblout.tblout
          ├── cmscan.out
          ├── combined.fasta
          ├── fam_pdb_chain.csv
          ├── final.fasta
          ├── raw_experimental_RNA_0_500.csv
          ├── sequence_identity_mat_clustal.csv
          └── tblout.tblout

```
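
The listing above can be turned into a quick post-run check. The sketch below is not part of NucleoSeeker, and the helper name `missing_outputs` is hypothetical. It assumes that every file in the listing is produced by a full run; which files actually appear may depend on the options you pass. `raw_experimental_RNA_0_500.csv` is left out because its name appears to depend on the requested result range.

```python
from pathlib import Path

# File and directory names copied from the directory listing above.
EXPECTED_FILES = [
    "clean_tblout.tblout",
    "cmscan.out",
    "combined.fasta",
    "fam_pdb_chain.csv",
    "final.fasta",
    "sequence_identity_mat_clustal.csv",
    "tblout.tblout",
]
EXPECTED_DIRS = ["files", "sequences"]


def missing_outputs(dataset_dir):
    """Return the expected files and directories absent from `dataset_dir`."""
    dataset_dir = Path(dataset_dir)
    missing = [n for n in EXPECTED_FILES if not (dataset_dir / n).is_file()]
    missing += [n + "/" for n in EXPECTED_DIRS if not (dataset_dir / n).is_dir()]
    return missing
```

A typical call would be `missing_outputs(Path(os.environ["DATA_PATH"]) / "test_dataset")` after `import os`; an empty list means everything in the listing was found.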

- **raw_experimental_RNA_0_500.csv**: Data for the first 500 results from the PDB database.
  
- **combined.fasta**: Sequences used in sequence identity calculation by Clustal Omega and Emboss, obtained after applying StructureLevelFilter and PDBFilter on the raw data.
  
- **sequence_identity_mat_clustal.csv**: Sequence identity matrix obtained from Clustal Omega and Emboss tools.
  
- **final.fasta**: Final sequences in fasta format; the output if family analysis is not required.
  
- **cmscan.out, tblout.tblout, clean_tblout.tblout**: Output files of the Infernal tool.
  
- **fam_pdb_chain.csv**: Mapping of family and PDB chain, obtained after family search by Infernal; the final output for family analysis.
  
- **test_dataset/files**: Directory containing dataframes and lists for structures at each filter level.
  
- **test_dataset/sequences**: Directory containing sequences for all final structures in individual fasta files.
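
To inspect `final.fasta` or `combined.fasta` without extra dependencies, a plain-Python FASTA reader is enough. This is a minimal sketch that assumes standard FASTA formatting (a `>` header line followed by one or more sequence lines); the function name `read_fasta` is hypothetical, and records that share a header overwrite one another.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping header text to sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]  # drop the leading '>'
                records[header] = []
            elif header is not None:
                records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}
```

For example, `len(read_fasta("final.fasta"))` gives the number of sequences that survived filtering, assuming you run it from inside the `test_dataset` directory.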

For some simple examples, see the Jupyter notebook in the `examples` directory of the GitHub repository [here](https://github.com/theuutkarsh/nucleoseeker/blob/main/examples/simple_examples.ipynb).




