Metadata-Version: 2.1
Name: embfile
Version: 0.1.0
Summary: A package for working with files containing pre-trained word embeddings (aka word vectors).
Home-page: https://github.com/janLuke/embfile
Author: Gianluca Gippetto
Author-email: gianluca.gippetto@gmail.com
License: MIT license
Keywords: embeddings,word vectors,word2vec,nlp,neural networks,deep learning,machine learning
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Utilities
Description-Content-Type: text/x-rst
Requires-Dist: numpy
Requires-Dist: tqdm
Requires-Dist: overrides
Requires-Dist: tabulate

========
Overview
========



A package for working with files containing word embeddings (aka word vectors).
It was written to:

#. provide a common interface for different file formats;
#. provide a flexible function for building "embedding matrices" that you can use
   to initialize the ``Embedding`` layer of your deep learning model;
#. use as little RAM as possible: there is no need to load 3M vectors, as with
   gensim's ``KeyedVectors.load_word2vec_format``, when you only need 20K;
#. satisfy my (inexplicable) urge to write a Python package.


Features
========
- Supports the textual format, Google's binary format and a convenient custom format (.vvm)
  that provides constant-time access to word vectors (by word).

- Makes it easy to implement, test and integrate new file formats.

- Supports virtually any text encoding and vector data type (though you should
  probably use only UTF-8 as encoding).

- Well-documented and type-annotated (meaning great IDE support).

- Extensively tested.

- Progress bars (by default) for every time-consuming operation.
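
The low-memory goal can be illustrated with a minimal sketch for the textual
format. It assumes a word2vec-style text file (a header line followed by one
``word v1 v2 ...`` line per word). The function ``find_vectors_sketch`` is a
hypothetical helper written for this example; it is not part of embfile and
does not reflect how embfile is implemented.

.. code-block:: python

    import io

    import numpy as np


    def find_vectors_sketch(lines, words):
        """Stream a word2vec-style text file and keep only the requested words.

        Hypothetical helper for illustration only (not embfile's API).
        ``lines`` must be an iterator over text lines, e.g. an open file.
        """
        wanted = set(words)
        found = {}
        next(lines)  # skip the assumed "<vocab_size> <vector_size>" header
        for line in lines:
            word, _, rest = line.rstrip("\n").partition(" ")
            if word in wanted:
                found[word] = np.array([float(x) for x in rest.split()],
                                       dtype=np.float32)
                wanted.discard(word)
                if not wanted:  # stop reading as soon as everything is found
                    break
        return found, wanted  # what is left in ``wanted`` was not in the file


    # Usage with a tiny in-memory "file"
    text = "3 2\nciao 0.1 0.2\nhello 0.3 0.4\nworld 0.5 0.6\n"
    found, missing = find_vectors_sketch(io.StringIO(text), ["hello", "nope"])

Only the requested vectors are ever held in memory, and reading stops early
when all of them have been found.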


Installation
============
::

    pip install embfile


Quick start
===========

.. code-block:: python

    import embfile

    with embfile.open("path/to/file.bin") as f:     # the file format is inferred from the file extension
        print(f.vocab_size, f.vector_size)

        # Load some word vectors into a dictionary (raises KeyError if any word is missing)
        word2vec = f.load(['ciao', 'hello'])

        # Like f.load(), but tolerates missing words (and returns them in a set)
        word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

        # Build a matrix for initializing the Embedding layer, either from
        # an iterable of words or from a dictionary {word: index}. Vectors for
        # any missing words are initialized for you (see the "oov_initializer" argument)
        words = ['ciao', 'hello', 'someMissingWord']
        matrix, word2index, missing_words = embfile.build_matrix(f, words)
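
To make the idea of an "embedding matrix" concrete, here is a minimal sketch
of the general technique using plain NumPy. The function
``build_matrix_sketch`` and its parameters are hypothetical names chosen for
this example; the sketch is not the implementation of
``embfile.build_matrix`` and its default initializer for missing words is
only an assumption.

.. code-block:: python

    import numpy as np


    def build_matrix_sketch(word2vec, words, vector_size,
                            oov_initializer=None, dtype=np.float32):
        """Build a (num_words, vector_size) matrix from a {word: vector} dict.

        Hypothetical helper for illustration only (not embfile's API).
        """
        if oov_initializer is None:
            # Assumed default: small random normal vectors for missing words
            rng = np.random.default_rng(0)

            def oov_initializer(shape):
                return rng.normal(0.0, 0.1, size=shape)

        unique_words = list(dict.fromkeys(words))  # drop duplicates, keep order
        word2index = {word: i for i, word in enumerate(unique_words)}
        matrix = np.empty((len(unique_words), vector_size), dtype=dtype)
        missing = set()
        for word, index in word2index.items():
            if word in word2vec:
                matrix[index] = word2vec[word]
            else:
                matrix[index] = oov_initializer((vector_size,))
                missing.add(word)
        return matrix, word2index, missing


    # Usage with toy 3-dimensional vectors
    vectors = {"ciao": np.ones(3), "hello": np.zeros(3)}
    matrix, word2index, missing = build_matrix_sketch(
        vectors, ["ciao", "hello", "someMissingWord"], vector_size=3)

Row ``word2index[word]`` of the matrix holds the vector for ``word``, so the
matrix can be passed as the initial weights of an embedding layer.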


.. if-doc-stop-here

Documentation
=============
Read the full documentation at https://embfile.readthedocs.io/.


Changelog
=========
v0.1.0 (2020-01-24)
-------------------
* First release on PyPI.



