Metadata-Version: 2.1
Name: vembed
Version: 0.23
Summary: Package providing methods to create Vector Embeddings from Strings, calculate similarities between lists of Strings, and Generate Visualizations such as Heatmaps from simple Lists.
Home-page: https://github.com/kuro337/vembed
Author: kuro337
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.5
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sentence_transformers
Requires-Dist: torch
Requires-Dist: transformers
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: seaborn

# vembed



<br/>

Library to generate and serialize Vector Embeddings, extract Semantic Similarity, and create Visualizations.

<hr/>

#### String to Embeddings 
<br/>

- Convert a String to a Vector Embedding

```py
from vembed import string_to_embedding

input_string = "This is a test sentence."
embedding = string_to_embedding(input_string)
print(embedding)
```
<br/>

- Use Batching to Convert Several Strings to their `Vector Float` Representation *Efficiently*.

```py
from vembed import lists_to_embeddings

embeddings = lists_to_embeddings(["Convert to a List[Float]", "Another String","More Strings!"])
print(embeddings)  # Output (one vector per input string): [[0.123, 0.456, ...], [0.789, 0.012, ...], ...]
```
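
The internals of `lists_to_embeddings` are not shown here, so the sketch below only illustrates the general idea behind batching: the input list is split into fixed-size groups so that a model can encode each group in a single call. The helper `chunked` and the `batch_size` parameter are hypothetical names used for illustration and are not part of `vembed`.

```py
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = ["Convert to a List[Float]", "Another String", "More Strings!"]

# Hypothetical batch size of 2: the three strings become two batches.
batches = list(chunked(sentences, batch_size=2))
print(batches)
```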
<br/>

#### Serialization

Utilities to convert embeddings into Serializable Formats for Transferring Embeddings over Network Calls.

- `Protobuf` Serializable Format to use with `gRPC` Services
- `JSON` Serialization for usage with `REST` APIs


```py
from vembed import lists_to_embeddings, embeddings_to_proto_format, embeddings_to_json_format

embeddings = lists_to_embeddings(["CSV,Row,1" , "CSV,Row,2"])

# Convert to a Protobuf Serializable Format to send over a gRPC Service
proto_embedding = embeddings_to_proto_format(embeddings)

# Convert to a JSON String for usage with REST APIs
json_embedding = embeddings_to_json_format(embeddings)
```
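
As a rough illustration of why a JSON form is convenient for REST payloads, the sketch below round-trips a list of float vectors through Python's standard `json` module. It does not call `vembed`; the toy vectors and the `"embeddings"` key are assumptions, and the actual output format of `embeddings_to_json_format` may differ.

```py
import json

# Toy vectors standing in for real embeddings (values are illustrative only).
embeddings = [[0.25, -0.5, 1.0], [0.0, 0.75, -1.5]]

# Serialize to a JSON string; the "embeddings" key is an assumed schema.
payload = json.dumps({"embeddings": embeddings})

# The receiving side parses the string back into nested lists of floats.
restored = json.loads(payload)["embeddings"]
print(restored == embeddings)
```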

<hr/>

### Similarity 

Extracting Similarity Between Entities

```text
Negative - Opposite direction - dissimilar
Zero     - Orthogonal - no commonality
Positive - Similar - larger values mean stronger similarity
```

#### Cosine Similarity 

  - Ranges from `-1` to `1`

  - Recommended when context (`Direction`) matters and frequency (`Magnitude`) does not


- Use Case for Cosine Similarity 
  
  - Here, `Direction` - **thematic orientation** (*climate change, agriculture*) is what matters

  - `Cosine Similarity` is useful here because we want to find documents discussing similar topics `(Direction)`, irrespective of document length or the frequency of specific words `(Magnitude)`


```py
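# Usage
# Assumes calculate_similarities is exported at the package top level, like the functions above
from vembed import calculate_similarities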

queries = ["Climate change effects on agriculture"]
data = [
    "Effects of climate change on wheat production",
    "Agriculture in developing countries",
    "Climate change and its impact on global food security",
    "Advances in agricultural technology"
]

# Calculate cosine similarities
cos_df, _ = calculate_similarities(queries, data, sorted=True, print_results=True)
```
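
To make the direction-versus-magnitude point concrete, here is a minimal pure-Python sketch of cosine similarity on toy 2D vectors. It is independent of `vembed`; the function name `cosine_similarity` and the vectors are chosen only for illustration.

```py
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: result is (approximately) 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors share nothing: result is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite direction: result is (approximately) -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Doubling a vector's length leaves the score unchanged, which is why cosine similarity ignores how long a document is or how often a word repeats.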

#### Dot Product Similarity 

  - Unbounded: can be any real number

  - Recommended when both the `Magnitude` (Frequency) and the `Direction` (Relevancy) of the vectors matter, and the vectors are on a similar scale

- Use Case for Dot Product 
  
  - `Direction` (Types of Articles) and `Magnitude` (Frequency of Reading Habits) are both important.

```py
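# Usage
# Assumes calculate_similarities is exported at the package top level, like the functions above
from vembed import calculate_similarities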

user_reading_profile = ["Read many articles on machine learning", "Occasionally reads about space exploration"]
article_options = [
    "Latest trends in machine learning",
    "Beginner's guide to space travel",
    "In-depth analysis of neural networks",
    "Recent discoveries in astronomy"
]

# Calculate dot product similarities
_, dot_df = calculate_similarities(user_reading_profile, article_options, sorted=True, print_results=True)
```
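
The effect of magnitude on the dot product can be seen in a small pure-Python sketch. The vectors below are toy values, not real embeddings, and `dot_product` is a helper written for this illustration rather than a `vembed` function.

```py
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of a and b."""
    return sum(x * y for x, y in zip(a, b))


profile = [3.0, 1.0]         # toy "interest" vector
article = [1.0, 0.0]         # toy article vector
scaled_article = [2.0, 0.0]  # same direction as `article`, twice the magnitude

base_score = dot_product(profile, article)
scaled_score = dot_product(profile, scaled_article)

# Doubling the magnitude of one vector doubles the dot product.
print(base_score, scaled_score)  # prints: 3.0 6.0
```

Unlike cosine similarity, the score grows with vector length, so the dot product rewards both relevance and intensity.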


- Calculating Similarity

```py
# Usage
# Assumes calculate_similarities is exported at the package top level, like the functions above
from vembed import calculate_similarities

queries = ["What is the capital of France?", "How is the weather today?"]
data = [
    "Paris is the capital of France.",
    "The weather is sunny.",
    "Berlin is the capital of Germany.",
    "It is raining in Berlin.",
]

# Calculate similarities and Print Results
cos_df, dot_df = calculate_similarities(
    queries, data, sorted=True, print_results=True
)
```

### Visualization for Relevance 

<hr/>

- Create a visualization to display the Similarities using a Heatmap.

```py
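# Usage
# Assumes both functions are exported at the package top level, like the functions above
from vembed import calculate_similarities, plot_similarities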

customer_feedback = [
    "Loved the recent update",
    "The app is user-friendly",
    "Facing issues after the update",
    "The new interface is great",
]
themes = [
    "positive feedback",
    "negative feedback",
    "app interface",
    "app functionality",
]

# Heatmap of Both Cosine and Dot Product
cos_df, dot_df = calculate_similarities(customer_feedback, themes, sorted=True)
plot_similarities(cos_df, dot_df, save_path="customer_feedback_similarity.png")

# Heatmap of Only Cosine Similarity
cos_df, _ = calculate_similarities(customer_feedback, themes, sorted=True)
plot_similarities(cos_df, None, save_path="customer_feedback_similarity.png")

# Heatmap of Only Dot Product Similarity
_, dot_df = calculate_similarities(customer_feedback, themes, sorted=True)
plot_similarities(None, dot_df, save_path="customer_feedback_similarity.png")

# View customer_feedback_similarity.png to see the Heatmap
```

<hr/>

Build and Run Locally from Source

```bash
git clone git@github.com:kuro337/vembed.git
cd vembed

# Create Isolated Virtual Env
python3 -m venv venv
source venv/bin/activate

# Install Deps
pip install -e .

# Run Tests
chmod +x RUN_TESTS.sh
./RUN_TESTS.sh
```
<hr/>

Dependencies
- `sentence_transformers`
- `torch`
- `transformers`
- `pandas`
- `matplotlib`
- `seaborn`

*Note: This package depends on `PyTorch`, which pulls in large `NVIDIA CUDA` libraries.*

```bash
# Check Disk Allocation for Packages
du -h venv | sort -hr | head -n 10

# Example output (truncated):
# 2.8G    venv/lib/python3.11/site-packages/nvidia
# 1.4G    venv/lib/python3.11/site-packages/torch
# 1.3G    venv/lib/python3.11/site-packages/torch/lib
# 1.2G    venv/lib/python3.11/site-packages/nvidia/cudnn/lib
# 1.2G    venv/lib/python3.11/site-packages/nvidia/cudnn
# 596M    venv/lib/python3.11/site-packages/nvidia/cublas

# Checking System Cache

# Show pip cache location
pip cache dir # /home/user/.cache/pip

# Getting Top Folders from Cache by Size
du -h /home/user/.cache/pip | sort -hr | head -n 10

# List Cached Files
pip cache list

# Remove Cached Files
pip cache purge

# Installing Packages without Cache
pip install --no-cache-dir <package_name>
```
<hr/>

Author: [kuro337](https://github.com/kuro337)
