Metadata-Version: 2.1
Name: embedders
Version: 0.0.1
Summary: High-level API for creating sentence and token embeddings
Home-page: https://github.com/code-kern-ai/embedders
Author: Johannes Hötter
Author-email: johannes.hoetter@kern.ai
License: UNKNOWN
Keywords: kern,machine learning,representation learning,python
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: blis (==0.7.7)
Requires-Dist: catalogue (==2.0.7)
Requires-Dist: certifi (==2021.10.8)
Requires-Dist: charset-normalizer (==2.0.12)
Requires-Dist: click (==8.0.4)
Requires-Dist: cymem (==2.0.6)
Requires-Dist: filelock (==3.6.0)
Requires-Dist: gensim (==4.1.2)
Requires-Dist: huggingface-hub (==0.5.1)
Requires-Dist: idna (==3.3)
Requires-Dist: Jinja2 (==3.1.1)
Requires-Dist: joblib (==1.1.0)
Requires-Dist: langcodes (==3.3.0)
Requires-Dist: MarkupSafe (==2.1.1)
Requires-Dist: murmurhash (==1.0.7)
Requires-Dist: nltk (==3.7)
Requires-Dist: numpy (==1.22.3)
Requires-Dist: packaging (==21.3)
Requires-Dist: pathy (==0.6.1)
Requires-Dist: Pillow (==9.1.0)
Requires-Dist: preshed (==3.0.6)
Requires-Dist: pydantic (==1.8.2)
Requires-Dist: pyparsing (==3.0.8)
Requires-Dist: PyYAML (==6.0)
Requires-Dist: regex (==2022.4.24)
Requires-Dist: requests (==2.27.1)
Requires-Dist: sacremoses (==0.0.49)
Requires-Dist: scikit-learn (==1.0.2)
Requires-Dist: scipy (==1.8.0)
Requires-Dist: sentence-transformers (==2.2.0)
Requires-Dist: sentencepiece (==0.1.96)
Requires-Dist: six (==1.16.0)
Requires-Dist: smart-open (==5.2.1)
Requires-Dist: spacy (==3.2.4)
Requires-Dist: spacy-legacy (==3.0.9)
Requires-Dist: spacy-loggers (==1.0.2)
Requires-Dist: srsly (==2.4.3)
Requires-Dist: thinc (==8.0.15)
Requires-Dist: threadpoolctl (==3.1.0)
Requires-Dist: tokenizers (==0.12.1)
Requires-Dist: torch (==1.11.0)
Requires-Dist: torchvision (==0.12.0)
Requires-Dist: tqdm (==4.64.0)
Requires-Dist: transformers (==4.18.0)
Requires-Dist: typer (==0.4.1)
Requires-Dist: typing-extensions (==4.2.0)
Requires-Dist: urllib3 (==1.26.9)
Requires-Dist: wasabi (==0.9.1)

# embedders
With embedders, you can easily convert your text into sentence- or token-level embeddings within a few lines of code. Use cases for this include similarity search between texts, information extraction such as named entity recognition, or basic text classification.

## How to install
You can install this library either by running `pip install embedders`, or by cloning this repository and running `pip install -r requirements.txt` inside the cloned directory.

## Example
*Calculating sentence embeddings*
```python
from embedders.classification.contextual import TransformerSentenceEmbedder
from embedders.classification.reduce import PCASentenceReducer

corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    # ... more texts
]

embedder = TransformerSentenceEmbedder("bert-base-cased")
embeddings = embedder.encode(corpus) # contains a list of shape [num_texts, embedding_dimension]

# if the dimension is too large, you can also apply dimensionality reduction
reducer = PCASentenceReducer(embedder)
embeddings_reduced = reducer.fit_transform(corpus)
```
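
One use case named in the introduction is similarity search between texts. The following is a minimal sketch of that idea using only the standard library. It assumes the sentence embeddings behave like a list of equal-length float vectors; `cosine_similarity`, `most_similar`, and the toy vectors are hypothetical names and values chosen for illustration, not part of embedders or output from it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query_index, embeddings):
    """Return (index, score) of the text closest to embeddings[query_index]."""
    query = embeddings[query_index]
    scores = [
        (i, cosine_similarity(query, vector))
        for i, vector in enumerate(embeddings)
        if i != query_index  # skip the query itself
    ]
    return max(scores, key=lambda pair: pair[1])


# Hypothetical 3-dimensional vectors standing in for embedder.encode(corpus)
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
best_index, best_score = most_similar(0, toy_embeddings)
```

With real embeddings you would pass the result of `embedder.encode(corpus)` instead of `toy_embeddings`, assuming it can be iterated as rows of floats. For larger corpora a vectorized implementation would likely be preferable to this pure-Python loop.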

*Calculating token embeddings*
```python
from embedders.extraction.count_based import CharacterTokenEmbedder
from embedders.extraction.reduce import PCATokenReducer

corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    # ... more texts
]

embedder = CharacterTokenEmbedder("en_core_web_sm")
embeddings = embedder.encode(corpus) # contains a list of ragged shape [num_texts, num_tokens (text-specific), embedding_dimension]

# if the dimension is too large, you can also apply dimensionality reduction
reducer = PCATokenReducer(embedder)
embeddings_reduced = reducer.fit_transform(corpus)
```
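
Because token embeddings are ragged (each text has its own number of tokens), they cannot be stacked into one rectangular array directly. The sketch below shows one common way to handle this, averaging the token vectors of each text into a single vector. It assumes the nested-list layout described in the comment above; `mean_pool` and the toy values are hypothetical and not part of embedders.

```python
def mean_pool(token_vectors):
    """Average a [num_tokens, embedding_dimension] list into one vector."""
    if not token_vectors:
        return []  # a text without tokens yields an empty vector
    dimension = len(token_vectors[0])
    return [
        sum(vector[d] for vector in token_vectors) / len(token_vectors)
        for d in range(dimension)
    ]


# Hypothetical ragged token embeddings: 2 texts with 3 and 2 tokens, dimension 2
toy_token_embeddings = [
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    [[4.0, 2.0], [2.0, 4.0]],
]

# The number of tokens differs per text, while the embedding dimension is fixed
tokens_per_text = [len(text) for text in toy_token_embeddings]

# One fixed-size vector per text after pooling
pooled = [mean_pool(text) for text in toy_token_embeddings]
```

Mean pooling discards token order and position, so it suits text-level tasks rather than token-level tasks such as named entity recognition, where the per-token vectors should be kept.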

## How to contribute
Currently, the best way to contribute is to open issues describing the kinds of transformations you would like to see, and to star this repository :-)


