Metadata-Version: 2.1
Name: embfile
Version: 0.1.1
Summary: A package for working with files containing pre-trained word embeddings (aka word vectors).
Home-page: https://github.com/janLuke/embfile
Author: Gianluca Gippetto
Author-email: gianluca.gippetto@gmail.com
License: MIT license
Keywords: embeddings,word vectors,word2vec,nlp,neural networks,deep learning,machine learning
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: Unix
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Utilities
Description-Content-Type: text/x-rst
Requires-Dist: numpy
Requires-Dist: tqdm
Requires-Dist: overrides
Requires-Dist: tabulate
Provides-Extra: dev
Requires-Dist: tox ; extra == 'dev'
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: coverage ; extra == 'dev'
Requires-Dist: flake8 ; extra == 'dev'
Requires-Dist: mypy ; extra == 'dev'
Requires-Dist: twine ; extra == 'dev'
Requires-Dist: bump2version ; extra == 'dev'

========
Overview
========



A package for working with files containing word embeddings (aka word vectors).
Written for:

#. providing a common interface for different file formats;
#. providing a flexible function for building "embedding matrices" that you can use
   for initializing the ``Embedding`` layer of your deep learning model;
#. using as little RAM as possible: no need to load 3M vectors, as with
   ``gensim.load_word2vec_format``, when you only need 20K;
#. satisfying my (inexplicable) urge to write a Python package.

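To make the second point concrete, here is a minimal sketch of what "building an
embedding matrix" means. It is not embfile's implementation: ``build_matrix_sketch``
and its parameters are hypothetical names, the input words are assumed to be unique,
and plain lists are used instead of NumPy arrays only to keep the example free of
dependencies. The random fallback for missing words is just one possible choice.

.. code-block:: python

    import random


    def build_matrix_sketch(word2vec, words, vector_size, seed=0):
        """Hypothetical helper (not embfile's API): one matrix row per word."""
        rng = random.Random(seed)
        word2index = {word: i for i, word in enumerate(words)}
        matrix = []
        missing = set()
        for word in words:
            vector = word2vec.get(word)
            if vector is None:
                # Out-of-vocabulary word: fall back to a small random vector.
                missing.add(word)
                vector = [rng.uniform(-0.1, 0.1) for _ in range(vector_size)]
            matrix.append(list(vector))
        return matrix, word2index, missing


    # Toy "pre-trained" vectors standing in for the content of an embedding file.
    pretrained = {"hello": [0.1, 0.2, 0.3], "ciao": [0.4, 0.5, 0.6]}
    matrix, word2index, missing = build_matrix_sketch(
        pretrained, ["ciao", "hello", "someMissingWord"], vector_size=3
    )

Row ``word2index[word]`` of ``matrix`` holds the vector of ``word``, so the matrix
can be used as the initial weights of an embedding layer.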

Features
========
- Supports the textual format and Google's binary format, plus a convenient custom
  format (.vvm) that provides constant-time access to word vectors (by word).

- Makes it easy to implement, test and integrate new file formats.

- Supports virtually any text encoding and vector data type (though you should
  probably use only UTF-8 as encoding).

- Well-documented and type-annotated (meaning great IDE support).

- Extensively tested.

- Progress bars (by default) for every time-consuming operation.

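As an illustration of the textual format and of why loading only the needed vectors
saves memory, the sketch below scans a word2vec-style text stream (a header line with
vocabulary size and vector size, then one word and its vector per line) and keeps only
the requested words. It is independent of embfile: ``find_vectors`` is a hypothetical
helper, and the sample data is made up for the example.

.. code-block:: python

    import io


    def find_vectors(lines, wanted):
        """Hypothetical helper (not embfile's API): keep only the `wanted` words."""
        wanted = set(wanted)
        # Header line: "<vocab_size> <vector_size>"
        _, vector_size = map(int, next(lines).split())
        found = {}
        for line in lines:
            word, *values = line.split()
            if word in wanted:
                found[word] = [float(v) for v in values[:vector_size]]
                if len(found) == len(wanted):
                    break  # every requested word was found: stop reading early
        return found, wanted.difference(found)


    # In-memory stand-in for a (tiny) textual embedding file.
    sample = io.StringIO("3 2\nhello 0.1 0.2\nciao 0.3 0.4\nworld 0.5 0.6\n")
    found, missing = find_vectors(sample, ["ciao", "nope"])

Only the vectors of the requested words are ever stored, so memory usage grows with
the number of words you ask for rather than with the size of the file.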

Installation
============
::

    pip install embfile


Quick start
===========

.. code-block:: python

    import embfile

    with embfile.open("path/to/file.bin") as f:     # infer file format from file extension

        print(f.vocab_size, f.vector_size)

        # Load some word vectors into a dictionary (raises KeyError if any word is missing)
        word2vec = f.load(['ciao', 'hello'])

        # Like f.load() but allows missing words (and returns them in a set)
        word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

        # Build a matrix for initializing an Embedding layer either from
        # a list of words or from a dictionary {word: index}. Handles the
        # initialization of any missing word vectors (see "oov_initializer")
        words = ['ciao', 'hello', 'someMissingWord']
        matrix, word2index, missing_words = embfile.build_matrix(f, words)

Examples
========
The examples show how to use embfile to initialize the ``Embedding`` layer of
a deep learning model. They are only illustrative, so don't skip the documentation.

- `Keras using Tokenizer <https://github.com/janLuke/embfile/blob/master/examples/keras_with_Tokenizer.py>`_
- `Keras using TextVectorization <https://github.com/janLuke/embfile/blob/master/examples/keras_with_TextVectorization.py>`_
  (tensorflow >= 2.1)

.. if-doc-stop-here

Documentation
=============
Read the full documentation at https://embfile.readthedocs.io/.


Changelog
=========

v0.1.1 (2021-02-15)
-------------------
* No changes in the code.
* Add support for Python 3.9.
* Migrate from TravisCI+AppVeyor to GitHub Actions.
* Add examples for Keras.
* Minor doc changes.

v0.1.0 (2020-01-24)
-------------------
* First release on PyPI.



