Metadata-Version: 2.1
Name: bof
Version: 0.3.4
Summary: Bag of Factors allows you to analyze a corpus from its factors.
Home-page: https://github.com/balouf/bof
Author: Fabien Mathieu
Author-email: loufab@gmail.com
License: GNU General Public License v3
Description: ==============
        Bag of Factors
        ==============
        
        
        .. image:: https://img.shields.io/pypi/v/bof.svg
                :target: https://pypi.python.org/pypi/bof
                :alt: PyPI Status
        
        .. image:: https://github.com/balouf/bof/workflows/build/badge.svg?branch=master
                :target: https://github.com/balouf/bof/actions?query=workflow%3Abuild
                :alt: Build Status
        
        .. image:: https://github.com/balouf/bof/workflows/docs/badge.svg?branch=master
                :target: https://github.com/balouf/bof/actions?query=workflow%3Adocs
                :alt: Documentation Status
        
        
        .. image:: https://codecov.io/gh/balouf/bof/branch/master/graphs/badge.svg
                :target: https://codecov.io/gh/balouf/bof/branch/master/graphs
                :alt: Code Coverage
        
        
        
        Bag of Factors allows you to analyze a corpus from its factors.
        
        
        * Free software: GNU General Public License v3
        * Documentation: https://balouf.github.io/bof/.
        
        
        --------
        Features
        --------
        
        
        Feature Extraction
        -------------------
        
        The `feature_extraction` module mimics the module https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text
        with a focus on character-based extraction.
        
        The main differences are:
        
        - it is slightly faster;
        - the features can be incrementally updated;
        - it is possible to fit only a random sample of factors to reduce space and computation time.
        
        The main entry point for this module is the `CountVectorizer` class, which mimics
        its *scikit-learn* counterpart (also named `CountVectorizer`).
        It is in fact very similar to sklearn's `CountVectorizer` used with the `char` or
        `char_wb` analyzer option.
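        
        To give an idea of what character-based extraction produces, here is a minimal pure-Python sketch.
        It is not the package's implementation: the helper name `count_factors` and its `n_max` parameter are
        hypothetical, and a factor is assumed here to be any contiguous substring of a text.
        
        .. code-block:: python
        
            from collections import Counter
        
        
            def count_factors(text, n_max=3):
                """Count every contiguous substring of length 1..n_max in text.
        
                Hypothetical helper for illustration only, not part of bof.
                """
                counts = Counter()
                for start in range(len(text)):
                    for length in range(1, n_max + 1):
                        if start + length > len(text):
                            break  # longer factors would run past the end of the text
                        counts[text[start:start + length]] += 1
                return counts
        
        
            bag = count_factors("abab", n_max=2)
            # bag maps each factor to its number of occurrences:
            # {"a": 2, "b": 2, "ab": 2, "ba": 1}
        
        A vectorizer then assigns one column per distinct factor and one row per document, which is what
        makes the result comparable to the output of a character analyzer in *scikit-learn*.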
        
        
        Fuzz
        --------
        
        The `fuzz` module mimics fuzzywuzzy-like packages such as
        
        - fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy)
        - rapidfuzz (https://github.com/maxbachmann/rapidfuzz)
        
        The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also
        slightly changed to enable new features:
        
        - The list of possible choices can be pre-trained (`fit`) to accelerate the computation in
          the case a stream of queries is sent against the same list of choices.
        - Instead of one single query, a list of queries can be used. Computations will be parallelized.
        
        The main `fuzz` entry point is the `Process` class.
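        
        The fit-once, query-many pattern described above can be sketched in pure Python as follows.
        This is an illustration under stated assumptions, not the `Process` API: the names `factors`,
        `precompute_choices` and `best_match` are hypothetical, and a Jaccard similarity over factor sets
        stands in for the Joint Complexity distance actually used by the package.
        
        .. code-block:: python
        
            def factors(text, n_max=3):
                """Return the set of contiguous substrings of length 1..n_max."""
                return {
                    text[i:i + k]
                    for i in range(len(text))
                    for k in range(1, n_max + 1)
                    if i + k <= len(text)
                }
        
        
            def precompute_choices(choices, n_max=3):
                """Compute the factor set of every choice once (the "fit" step)."""
                return [(choice, factors(choice, n_max)) for choice in choices]
        
        
            def best_match(query, fitted, n_max=3):
                """Return the choice sharing the largest fraction of factors with query."""
                query_factors = factors(query, n_max)
        
                def similarity(item):
                    # Jaccard similarity: a stand-in, NOT the Joint Complexity distance.
                    union = len(query_factors | item[1])
                    return len(query_factors & item[1]) / union if union else 0.0
        
                choice, _ = max(fitted, key=similarity)
                return choice
        
        
            fitted = precompute_choices(["riri", "fifi", "rififi"])
            # The precomputed factor sets are reused for every query in the list.
            matches = [best_match(query, fitted) for query in ["rir", "fif"]]
            # matches == ["riri", "fifi"]
        
        The point of the sketch is the cost structure: the factor sets of the choices are built once,
        so each additional query only pays for its own factors and the comparisons.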
        
        
        
        ----------------
        Getting Started
        ----------------
        
        Look at examples from the reference_ section.
        
        
        -------
        Credits
        -------
        
        This package was created with Cookiecutter_ and the `francois-durand/package_helper_2`_ project template.
        
        .. _Cookiecutter: https://github.com/audreyr/cookiecutter
        .. _`francois-durand/package_helper_2`: https://github.com/francois-durand/package_helper_2
        .. _reference: https://balouf.github.io/bof/reference/index.html
        
        
        =======
        History
        =======
        
        ---------------------------------------------------
        0.3.4 (2021-01-05): Cleaning
        ---------------------------------------------------
        
        * Renaming process.py to fuzz.py to emphasize that the module aims at being an alternative to the fuzzywuzzy package.
        * Removed modules FactorTree and JC. What they did is now essentially covered by the feature_extraction and fuzz
          modules.
        * General cleaning / rewriting of the documentation.
        
        
        ---------------------------------------------------
        0.3.3 (2021-01-01): Cython/Numba balanced
        ---------------------------------------------------
        
        * All core CountVectorizer methods ported to Cython. Roughly 2.5X faster than sklearn counterpart (mainly because some features like min_df/max_df are not implemented).
        * Process numba methods NOT converted to Cython as Numba seems to be 20% faster for csr manipulation.
        * Numba functions are cached to avoid compilation lag.
        
        
        ---------------------------------------------------
        0.3.2 (2020-12-30): Going Cython
        ---------------------------------------------------
        
        * First attempt to use Cython
        * Right now only the fit_transform method of CountVectorizer has been cythonized, for testing wheels.
        * If all goes well, numba will probably be abandoned and all the heavy-lifting will be in Cython.
        
        
        -----------------------------------------------------
        0.3.1 (2020-12-28): Simplification of core algorithm
        -----------------------------------------------------
        
        * Attributes of the CountVectorizer have been reduced to the minimum: one dict!
        * Now faster than sklearn counterpart! (The reason being that only one case is considered here, so we can ditch a lot of checks and attributes).
        
        
        ---------------------------------------------------
        0.3.0 (2020-12-15): CountVectorizer and Process
        ---------------------------------------------------
        
        * The core is now the CountVectorizer class. Lighter and faster. Only features are kept inside.
        * New process module inspired by fuzzywuzzy!
        
        
        ---------------------------------
        0.2.0 (2020-12-15): Fit/Transform
        ---------------------------------
        
        * Full refactoring to make the package fit/transform compliant.
        * Add a fit_sampling method that allows fitting only a (random) subset of factors.
        
        
        ---------------------------------
        0.1.1 (2020-12-12): Upgrades
        ---------------------------------
        
        * Docstrings added
        * Common module (feat. save/load capabilities)
        * Joint Complexity module
        
        ---------------------------------
        0.1.0 (2020-12-12): First release
        ---------------------------------
        
        * First release on PyPI.
        * Core FactorTree class added.
        
Keywords: bof
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
