Metadata-Version: 2.1
Name: finalfusion
Version: 0.7.1
Summary: Finalfusion in Python
Home-page: https://github.com/finalfusion/finalfusion-python
Author: Sebastian Pütz <seb.puetz@gmail.com>, Daniël de Kok <me@danieldk.eu>
License: BlueOak-1.0.0
Project-URL: Documentation, https://finalfusion-python.readthedocs.io/en/0.7.1
Project-URL: Finalfusion, https://finalfusion.github.io
Keywords: embeddings word2vec finalfusion finalfrontier fasttext glove
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: toml
Requires-Dist: dataclasses

# finalfusion-python
[![Documentation Status](https://readthedocs.org/projects/finalfusion-python/badge/?version=latest)](https://finalfusion-python.readthedocs.io/en/0.7.1/?badge=0.7.1)

## Introduction

`finalfusion` is a Python package for reading, writing and using
[finalfusion](https://finalfusion.github.io) embeddings. It also
supports other commonly used embedding formats such as fastText, GloVe
and word2vec.

The Python package supports the same types of embeddings as the
[finalfusion-rust crate](https://docs.rs/finalfusion/):

* Vocabulary:
  * No subwords
  * Subwords
* Embedding matrix:
  * Array
  * Memory-mapped
  * Quantized
* Norms
* Metadata

## Installation

The finalfusion module is
[available](https://pypi.org/project/finalfusion/#files) on PyPI for Linux,
macOS and Windows. You can use `pip` to install the module:

~~~shell
$ pip install --upgrade finalfusion
~~~

## Installing from source

Building from source depends on `Cython`. If you install the package using
`pip`, you don't need to explicitly install the dependency since it is
specified in `pyproject.toml`.

~~~shell
$ git clone https://github.com/finalfusion/finalfusion-python
$ cd finalfusion-python
$ pip install .
~~~

If you want to build wheels from source, `wheel` needs to be installed.
It's then possible to build wheels through:

~~~shell
$ python setup.py bdist_wheel
~~~

The wheels can be found in `dist`.

## Package Usage

### Basic usage

~~~python
import finalfusion
# loading from different formats
w2v_embeds = finalfusion.load_word2vec("/path/to/w2v.bin")
text_embeds = finalfusion.load_text("/path/to/embeds.txt")
text_dims_embeds = finalfusion.load_text_dims("/path/to/embeds.dims.txt")
fasttext_embeds = finalfusion.load_fasttext("/path/to/fasttext.bin")
fifu_embeds = finalfusion.load_finalfusion("/path/to/embeddings.fifu")

# serialization to formats works similarly
finalfusion.compat.write_word2vec("to_word2vec.bin", fifu_embeds)

# embedding lookup
embedding = fifu_embeds["Test"]

# reading an embedding into a buffer
import numpy as np
buffer = np.zeros(fifu_embeds.storage.shape[1], dtype=np.float32)
fifu_embeds.embedding("Test", out=buffer)

# similarity and analogy query
sim_query = fifu_embeds.word_similarity("Test")
analogy_query = fifu_embeds.analogy("A", "B", "C")

# accessing the vocab and printing the first 10 words
vocab = fifu_embeds.vocab
print(vocab.words[:10])

# SubwordVocabs give access to the subword indexer:
subword_indexer = vocab.subword_indexer
print(subword_indexer.subword_indices("Test", with_ngrams=True))

# accessing the storage and calculating its dot product with an embedding;
# the storage has shape (n_words, dims), so it goes on the left-hand side
res = fifu_embeds.storage.dot(embedding)

# printing metadata
print(fifu_embeds.metadata) 
~~~
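
To make the similarity query and the storage dot product more concrete, here is
a minimal sketch that uses only `numpy`. The word list, the matrix values and
the `most_similar` helper are hypothetical stand-ins and not part of the
`finalfusion` API. It assumes that similarity is the cosine between
L2-normalized rows, which may differ in detail from what the package does.

~~~python
import numpy as np

# Toy stand-ins for a vocabulary and an embedding matrix (hypothetical data,
# not read from a finalfusion file).
words = ["berlin", "paris", "banana"]
matrix = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 2.0]], dtype=np.float32)

# L2-normalize each row so that a dot product equals cosine similarity.
norms = np.linalg.norm(matrix, axis=1, keepdims=True)
unit = matrix / norms


def most_similar(query, k=2):
    """Return the k words closest to `query` by cosine similarity."""
    idx = words.index(query)
    # (n_words, dims) @ (dims,) -> one similarity score per word
    scores = unit @ unit[idx]
    order = np.argsort(-scores)
    # Drop the query word itself before taking the top k.
    return [(words[i], float(scores[i])) for i in order if i != idx][:k]


neighbours = most_similar("berlin")
print(neighbours)
~~~

The matrix goes on the left of the product because its shape is
`(n_words, dims)` while a single embedding has shape `(dims,)`.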

### Beyond Embeddings

~~~python
# load only a vocab from a finalfusion file
from finalfusion import load_vocab
vocab = load_vocab("/path/to/finalfusion_file.fifu")

# serialize vocab to single file
vocab.write("/path/to/vocab_file.fifu.voc")

# more specific loading functions exist
from finalfusion.vocab import load_finalfusion_bucket_vocab
fifu_bucket_vocab = load_finalfusion_bucket_vocab("/path/to/vocab_file.fifu.voc")
~~~

The package supports loading and writing all `finalfusion` chunks this way.
Files containing a single chunk are only supported by the Python package;
reading them will fail with other implementations, e.g. `finalfusion-rust`.

## Scripts

`finalfusion` also includes a conversion script `ffp-convert` to convert
between the supported formats.
~~~shell
# convert from fastText format to finalfusion
$ ffp-convert -f fasttext fasttext.bin -t finalfusion embeddings.fifu
~~~

`ffp-bucket-to-explicit` can be used to convert bucket embeddings to embeddings
with an explicit ngram lookup.
~~~shell
# convert finalfusion bucket embeddings to explicit
$ ffp-bucket-to-explicit -f finalfusion embeddings.fifu explicit.fifu
~~~ 
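
The difference between bucket and explicit n-gram lookups can be sketched in
plain Python. This is an illustration only. The `ngrams` and `fnv1a` helpers,
the `<`/`>` boundary markers, the n-gram lengths and the bucket count are
assumptions made for the example and do not reproduce the indexers shipped
with `finalfusion`.

~~~python
def ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    bracketed = f"<{word}>"
    return [
        bracketed[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(bracketed) - n + 1)
    ]


def fnv1a(text):
    """32-bit FNV-1a hash; an illustrative stand-in, not the library's hasher."""
    h = 0x811C9DC5
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


n_buckets = 2 ** 4  # hypothetical and tiny on purpose, so collisions are likely
grams = ngrams("Test", min_n=3, max_n=4)

# Bucket lookup: hash each n-gram into a fixed number of slots.
# Different n-grams may end up sharing a slot.
bucket_index = {g: fnv1a(g) % n_buckets for g in grams}

# Explicit lookup: every distinct n-gram gets its own index.
explicit_index = {g: i for i, g in enumerate(dict.fromkeys(grams))}

print(bucket_index)
print(explicit_index)
~~~

A bucket lookup needs no n-gram table but lets unrelated n-grams share an
embedding. An explicit lookup stores the table and avoids collisions.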

Finally, the package comes with `ffp-similar` and `ffp-analogy` to run
similarity and analogy queries.
~~~shell
# get the 5 nearest neighbours of "Tübingen"
$ echo Tübingen | ffp-similar embeddings.fifu
# get the 5 top answers for "Tübingen" is to "Stuttgart" like "Heidelberg" to...
$ echo Tübingen Stuttgart Heidelberg | ffp-analogy embeddings.fifu
~~~
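
An analogy query of the form "a is to b as c is to ?" is commonly answered
with the vector offset `b - a + c`. The sketch below shows that arithmetic
with `numpy` on hypothetical toy vectors. The `analogy` function here is
written for this example and is not the one provided by the package, whose
exact scoring may differ.

~~~python
import numpy as np

# Hypothetical toy embeddings, chosen so the offset arithmetic is easy to follow.
words = ["man", "woman", "king", "queen", "apple"]
matrix = np.array(
    [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, -1]],
    dtype=np.float32,
)
# L2-normalize the rows so dot products are cosine similarities.
unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def analogy(a, b, c, k=1):
    """Answer "a is to b as c is to ?" with the vector offset b - a + c."""
    ia, ib, ic = (words.index(w) for w in (a, b, c))
    query = unit[ib] - unit[ia] + unit[ic]
    query /= np.linalg.norm(query)
    scores = unit @ query
    # Skip the three query words, which would otherwise dominate the ranking.
    ranked = [i for i in np.argsort(-scores) if i not in (ia, ib, ic)]
    return [words[i] for i in ranked[:k]]


answer = analogy("man", "woman", "king")
print(answer)
~~~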

## Where to go from here

  * [documentation](https://finalfusion-python.readthedocs.io/en/0.7.1)
  * [finalfrontier](https://finalfusion.github.io/finalfrontier)
  * [finalfusion](https://finalfusion.github.io/)
  * [pretrained embeddings](https://finalfusion.github.io/pretrained)


