Metadata-Version: 2.1
Name: CoMBCR
Version: 0.2.1
Summary: A python lib for CoMBCR
Home-page: https://github.com/deepomicslab/CoMBCR.git/
Author: Yiping Zou
Author-email: yipingzou2-c@my.cityu.edu.hk
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE.md

# CoMBCR
## Introduction
CoMBCR is an innovative B-cell embedding method designed to integrate multi-modal data from B cells, particularly BCRs and gene expressions, within a co-learning framework. By accepting paired BCR sequences and gene expression profiles as input, CoMBCR effectively integrates these two modalities to produce joint representations for each B cell, focusing specifically on the heavy chain of BCRs. 
## Prerequisites
CoMBCR is implemented in Python and requires a GPU for acceleration.

We recommend the versions of the following packages:  
- PyTorch (2.4.1)  
- Transformers (4.41.2)  
- NumPy (1.26.4)  
- Pandas (2.2.3)  
- Scikit-learn (1.5.1)  
- huggingface_hub (install with `python3 -m pip install huggingface_hub`)

Please install the following packages if you want to use the visualization functions:
- anndata (0.9.2)
- scanpy (1.9.8)
- matplotlib (3.6.3)


## Installation
Install CoMBCR using pip:

```
pip3 install CoMBCR
```
Then, install the default pre-trained encoder (this code only needs to be run once, right after installing CoMBCR):
```python
from CoMBCR.utils import download_BCRencoder
download_BCRencoder()
```
## Tutorial
We provide a [tutorial](./tutorial.ipynb) for the usage of CoMBCR. The following usage section is for the current version of CoMBCR.

Please refer to [tutorial_pair](./tutorial_pair.ipynb) if you want to use paired chains. Note that using paired chains roughly doubles the computational cost and, in our current tests, does not significantly improve performance.

## Usage
> ### Prepare input data
> CoMBCR integrates BCRs and gene expression, and it requires three input files: **a BCR sequences file, a gene expression file, and a file containing BCR embeddings** generated by a BCR encoder (e.g., AntiBERTa, ESM2).  
> - Ensure each file includes an index column labeled "barcode," serving as a unique identifier for each cell.   
> - Verify that the cells are aligned in the same order across all three files.
>> #### BCR sequences file
>> This CSV file should include an index column named "barcode" and columns labeled "fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3" and "fwr4". The file should resemble the example shown below: ![](images/BCRs.png)
>> #### Gene expression file
>> Normalization and log-transformation are recommended. Batch effect removal is advisable if applicable. We suggest using the top 5,000 highly variable genes, though you can select input genes according to your criteria.
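
As a rough illustration of the preprocessing recommended above, the sketch below scales raw counts per cell, log-transforms them, and keeps the most variable genes. It rests on assumptions: `preprocess_counts`, `n_top_genes`, and `target_sum` are hypothetical names that are not part of CoMBCR, and ranking genes by plain variance is a simpler stand-in for a dedicated highly-variable-gene method such as the one in Scanpy.

```python
import numpy as np
import pandas as pd


def preprocess_counts(counts: pd.DataFrame, n_top_genes: int = 5000,
                      target_sum: float = 1e4) -> pd.DataFrame:
    """Normalize, log-transform, and keep the most variable genes.

    `counts` is a cells x genes table of raw counts indexed by barcode.
    """
    # Scale every cell to the same total count (all-zero cells stay zero).
    totals = counts.sum(axis=1).replace(0, 1)
    normalized = counts.div(totals, axis=0) * target_sum

    # log(1 + x) transform.
    logged = np.log1p(normalized)

    # Rank genes by variance across cells and keep the top ones.
    # NOTE: plain variance is a simplification, not a full HVG method.
    n_keep = min(n_top_genes, logged.shape[1])
    top_genes = logged.var(axis=0).nlargest(n_keep).index
    return logged.loc[:, top_genes]


# Toy example: 4 cells x 3 genes, keep the 2 most variable genes.
toy = pd.DataFrame(
    [[10, 0, 5], [0, 10, 5], [10, 0, 5], [0, 10, 5]],
    index=pd.Index(["c1", "c2", "c3", "c4"], name="barcode"),
    columns=["geneA", "geneB", "geneC"],
)
processed = preprocess_counts(toy, n_top_genes=2)
print(processed.shape)
```

If you keep a number of genes other than 5,000, pass the matching value as `encoderprofile_in_dim` when running CoMBCR.
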
>> #### Original BCR embeddings file
>> Please clone this GitHub repository or download "runberta.py" from it. This script generates the original BCR embeddings required for computing pairwise BCR distances in the CoMBCR framework. We recommend using our default pre-trained encoder, though any encoder can be used to encode BCRs. 
>> ```
>> python3 runberta.py --datapath "exampledata/example_bcr.csv" --outdir "example_outdir" --outfilename "antiberta_embedding.csv"
>> ```
>> The script writes the original BCR embeddings to a file named "antiberta_embedding.csv" in the output directory given by `--outdir`.
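
Before training, it may be worth checking the three inputs against the requirements listed above: a "barcode" index, the seven BCR region columns, and identical cell order. The following is a minimal sketch using pandas; `check_inputs` and `BCR_COLUMNS` are hypothetical helpers written for this illustration and are not part of the CoMBCR package.

```python
import pandas as pd

BCR_COLUMNS = ["fwr1", "cdr1", "fwr2", "cdr2", "fwr3", "cdr3", "fwr4"]


def check_inputs(bcr: pd.DataFrame, rna: pd.DataFrame, emb: pd.DataFrame) -> None:
    """Raise ValueError if the three tables violate the input requirements."""
    tables = [("BCR", bcr), ("gene expression", rna), ("BCR embedding", emb)]
    for name, table in tables:
        if table.index.name != "barcode":
            raise ValueError(f"{name} file: index column must be named 'barcode'")
        if not table.index.is_unique:
            raise ValueError(f"{name} file: barcodes must be unique")

    missing = [col for col in BCR_COLUMNS if col not in bcr.columns]
    if missing:
        raise ValueError(f"BCR file is missing columns: {missing}")

    # The same cells must appear in the same order in all three files.
    if not (bcr.index.equals(rna.index) and bcr.index.equals(emb.index)):
        raise ValueError("barcodes are not in the same order across files")


# With real data, load each file with the barcode column as the index, e.g.
# bcr = pd.read_csv("exampledata/example_bcr.csv", index_col="barcode")
barcodes = pd.Index(["cell1", "cell2"], name="barcode")
bcr = pd.DataFrame({col: ["AA", "CC"] for col in BCR_COLUMNS}, index=barcodes)
rna = pd.DataFrame({"geneA": [0.1, 0.2]}, index=barcodes)
emb = pd.DataFrame({"0": [0.5, 0.6]}, index=barcodes)
check_inputs(bcr, rna, emb)  # passes silently when everything is aligned
```
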
> ### Quick run
>> To quickly run CoMBCR, use the following code:  
>> ```python
>> from CoMBCR.CoMBCR import CoMBCR_main
>> bcremb, gexemb = CoMBCR_main(bcrpath="exampledata/example_bcr.csv", 
>>            rnapath="exampledata/example_rna.csv", 
>>            bcroriginal="exampledata/example_bcrori.csv", 
>>            outdir="example_outdir",
>>            epochs=1,  # Adjust the number of epochs here. Default is 200.
>>            batch_size=32,
>>            encoderprofile_in_dim=5000)
>> ```
>> This code returns numpy arrays for BCR embeddings and gene expression embeddings, and outputs "bcrembedding.csv" and "gexembedding.csv" in the specified output directory.  

> ### Parameters of CoMBCR
>> | Parameter | Description |
>> | ------------- | ------------- |
>> | **bcrpath** | (Required) The path to the BCR sequences file.|
>> | **rnapath** | (Required) The path to the gene expression file.|
>> | **bcroriginal**| (Required) The path to the BCR original embedding file.|
>> |**outdir**|(Required) The directory where the best checkpoint file and the output embeddings will be stored.|
>> |**checkpoint**|Default is "best_network.pth". This parameter specifies the name of the saved checkpoint.|
>> |**lr**|Default is 1e-6.|
>> |**lam**| Default is 1e-1. Intra-modal contrastive loss weight (α in paper).|
>> |**batch_size** | Default is 256.|
>> |**epochs** | Default is 200.|
>> |**patience**| Default is 15, the patience for early stopping.|
>> |**save_epoch** | Default is None. If specified (e.g., 150), saves the model at that epoch and exits training. By default, uses early stopping strategy.|
>> |**lr_step** | Default is [50,100]. These are the milestones for the MultiStepLR setting, which adjusts the learning rate at specified epochs.|
>> |**encoderprofile_in_dim**| Default is 5000. Adjust this parameter if the number of input genes differs from 5000.|
>> |**separatebatch**|Default is False. If set to True, BCRs from different samples will be treated as distinct BCRs. Make sure your BCR input file contains a "sample" column if you enable this option. |
>> |**user_defined_cluster**|Default is False. If set to True, the model utilizes custom cluster labels specified in the "cluster_label" column of the BCR input file for intra-modal contrastive learning.|
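
To make the interaction between `lr` and `lr_step` concrete, the sketch below computes the learning rate that a MultiStepLR-style schedule would use at a given epoch. The decay factor `gamma=0.1` is the PyTorch default for `MultiStepLR` and is only assumed here, since this README does not state the factor CoMBCR uses. `lr_at_epoch` is a hypothetical helper, not part of the package.

```python
from bisect import bisect_right


def lr_at_epoch(epoch, base_lr=1e-6, milestones=(50, 100), gamma=0.1):
    """Learning rate in effect at `epoch` (0-based) for a multi-step schedule."""
    # Count the milestones that are <= epoch; each one multiplies the rate by gamma.
    passed = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** passed


# Print the rate before, at, and after each default milestone.
for epoch in (0, 49, 50, 100, 199):
    print(epoch, lr_at_epoch(epoch))
```

Under these assumptions, the rate stays at `lr` until epoch 50, drops by a factor of ten there, and drops by another factor of ten at epoch 100.
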

## Visualization
We provide functions to interpret the optimization performance and visualize the output embeddings.

### 1. Optimization performance

Use `plot_training_loss` to visualize the optimization process. This function plots three key loss components:

- **Cross-Modal Loss (L_cross)**: Measures cross-modal alignment. A decrease indicates the model is learning the correspondence between BCR and GEX modalities.
- **Profile Loss (L_p)**: Measures preservation of GEX intrinsic structure. A decrease indicates biological variation is being retained.
- **BCR Loss (L_b)**: Measures preservation of BCR intrinsic structure. A decrease indicates clonal relationships are being maintained.

**Example:**

```python
from CoMBCR.visualization import plot_training_loss

# Visualize training progress
# Mode: 'earlystopping' (default) or 'save_epoch'
fig = plot_training_loss(
    log_path='example_outdir/CoMBCR.pth.log', # Path to the log file
    mode='earlystopping',
    save_path='training_loss.png'  # Optional: save figure
)
```
<details>
<summary><b>Key Parameters</b></summary>

- `log_path` (required): Path to the training log file (e.g., `'output/CoMBCR.pth.log'`)
- `mode` (default: `'earlystopping'`): Set to `'save_epoch'` if you designated a specific epoch to save
- `save_path` (default: `None`): Path to save the output figure. If `None`, the figure is displayed but not saved

</details>
<details>
<summary><b>Output Figure</b></summary>

<img src="images/loss_visualization.png" width="80%">

</details>


### 2. Joint Embedding Visualization

Use `create_joint_embedding_adata` to process the output embeddings into a Scanpy-compatible AnnData object for downstream analysis.

**Example:**
```python
from CoMBCR.visualization import create_joint_embedding_adata
import scanpy as sc

# 1. Create AnnData with joint embeddings
adata = create_joint_embedding_adata(
    bcr_emb_path="example_outdir/Embeddings/bcr_embeddings.csv",
    gex_emb_path="example_outdir/Embeddings/gex_embeddings.csv",
    metadata="example_outdir/annotation.csv",  # Optional
)

# 2. Visualize using standard Scanpy workflow
sc.pl.umap(adata, color='celltypes', title='CoMBCR Joint Embedding')
```
<details>
<summary><b>Key Parameters</b></summary>

- `bcr_emb_path` (str): Path to the generated BCR embedding CSV file.
- `gex_emb_path` (str): Path to the generated GEX embedding CSV file.
- `metadata` (str or pd.DataFrame): Optional. Path to your annotation file (or a DataFrame). The index should match the barcodes.
</details>
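
Because the metadata index should match the barcodes, it can help to align the annotation table yourself before passing it as a DataFrame. The sketch below is one way to do that with pandas. `align_metadata` is a hypothetical helper, and whether `create_joint_embedding_adata` performs a similar reordering internally is not stated here.

```python
import pandas as pd


def align_metadata(metadata: pd.DataFrame, barcodes: pd.Index) -> pd.DataFrame:
    """Reorder `metadata` to follow `barcodes`; fail if any barcode is missing."""
    missing = barcodes.difference(metadata.index)
    if len(missing) > 0:
        raise ValueError(
            f"{len(missing)} barcodes have no annotation: {list(missing[:5])}"
        )
    return metadata.loc[barcodes]


# Toy example: annotation rows in a different order than the embeddings.
emb_barcodes = pd.Index(["cell1", "cell2", "cell3"], name="barcode")
annotation = pd.DataFrame(
    {"celltypes": ["Naive B", "Plasma", "Memory B"]},
    index=pd.Index(["cell3", "cell2", "cell1"], name="barcode"),
)
aligned = align_metadata(annotation, emb_barcodes)
print(aligned["celltypes"].tolist())
```
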

<details>
<summary><b>Output</b></summary>

The returned `adata` object is structured for flexible analysis:
- `adata.X`: Stores the **Joint Embeddings**. Use this for clustering and global visualization.
- `adata.obsm['CoMBCR_bcr']`: Stores the CoMBCR-BCR Embeddings.
- `adata.obsm['CoMBCR_gex']`: Stores the CoMBCR-GEX Embeddings.
</details>

<details>
<summary><b>Output Figure</b></summary>

<table>
  <tr>
    <td><img src="images/joint_visualization1.png" width="400"></td>
    <td><img src="images/joint_visualization2.png" width="400"></td>
  </tr>
</table>

</details>

<details>
<summary><b>Visualize CoMBCR's Individual Modalities</b></summary>

Sometimes, users may want to inspect CoMBCR-BCR embeddings or CoMBCR-GEX embeddings. To visualize these embeddings separately, use `create_sub_embedding_adatas`:
```python
from CoMBCR.visualization import create_sub_embedding_adatas
import scanpy as sc

# Extract separate embeddings from joint AnnData
bcr_adata, gex_adata = create_sub_embedding_adatas(
    adata,
    compute_umap=True
)

# Visualize BCR embedding
sc.pl.umap(bcr_adata, color='v_genes', title='BCR Embedding')

# Visualize GEX embedding
sc.pl.umap(gex_adata, color='celltypes', title='GEX Embedding')

```
</details>

<!-- ## Acknowledgements
The code was based in part on the source code of [UniTCR](https://github.com/bm2-lab/UniTCR/tree/main). -->
## Questions
If you encounter issues installing or using CoMBCR, please feel free to open an issue or contact me via [email](mailto:yipingzou2-c@my.cityu.edu.hk).

