Metadata-Version: 2.1
Name: gismo
Version: 0.4.3
Summary: GISMO is a NLP tool to rank and organize a corpus of documents according to a query.
Home-page: https://github.com/balouf/gismo
Author: Fabien Mathieu
Author-email: fabien.mathieu@normalesup.org
License: GNU General Public License v3
Keywords: gismo
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
License-File: LICENSE
License-File: AUTHORS.rst

.. image:: https://github.com/balouf/gismo/raw/master/docs/logo-line.png
    :alt: Gismo logo
    :target: https://balouf.github.io/gismo/

--------------------------------------------------------
A Generic Information Search... With a Mind of its Own!
--------------------------------------------------------

.. image:: https://img.shields.io/pypi/v/gismo.svg
        :target: https://pypi.python.org/pypi/gismo

.. image:: https://github.com/balouf/gismo/workflows/build/badge.svg?branch=master
        :target: https://github.com/balouf/gismo/actions?query=workflow%3Abuild
        :alt: Build Status

.. image:: https://github.com/balouf/gismo/workflows/docs/badge.svg?branch=master
        :target: https://github.com/balouf/gismo/actions?query=workflow%3Adocs
        :alt: Documentation Status

.. image:: https://codecov.io/gh/balouf/gismo/branch/master/graphs/badge.svg
        :target: https://codecov.io/gh/balouf/gismo/branch/master
        :alt: Code Coverage





GISMO is a NLP tool to rank and organize a corpus of documents according to a query.

Gismo stands for Generic Information Search... with a Mind of its Own.

* Free software: GNU General Public License v3
* Github: https://github.com/balouf/gismo/
* Documentation: https://balouf.github.io/gismo/


Features
--------

Gismo combines three main ideas:

* **TF-IDTF**: a symmetric version of the TF-IDF embedding.
* **DIteration**: a fast, push-based, variant of the PageRank algorithm.
* **Fuzzy dendrogram**: a variant of the Louvain clustering algorithm.

Quickstart
----------

Install gismo:

.. code-block:: console

    $ pip install gismo

Import gismo in a Python project::

    import gismo as gs


To get the hang of a typical Gismo workflow, you can check the `Toy Example`_ notebook. For more advanced uses,
look at the other tutorials_ or directly the reference_ section.



Credits
-------

Thomas Bonald, Anne Bouillard, Marc-Olivier Buob, Dohy Hong.

This package was created with Cookiecutter_ and the `francois-durand/package_helper`_ project template.

.. _reference: https://balouf.github.io/gismo/reference.html
.. _`Toy Example`: https://balouf.github.io/gismo/tutorials/tutorial_toy_example.html
.. _tutorials: https://balouf.github.io/gismo/tutorials/index.html#
.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`francois-durand/package_helper`: https://github.com/francois-durand/package_helper


=======
History
=======

X.X.X (TODO-List)
-----------------
* Rethink distortion on both vectors normalization and IDTF/query trade-off.
* Accelerate similarity computation (currently sklearn-based) in clustering.


0.4.X (2023-0X-XX) (tentative)
-------------------------------

* Context manager for FileSource (e.g. ``with FileSource(...) as source:``)

* 3.9 compatibility issues rechecked

* Wheels

* Minor change in test_dblp.py


0.4.3 (2022-12-26)
--------------------

* Refresh dependencies, compatibilities, and such.

* Gismo is tested up to Python 3.10.

* Patch sklearn change of API
  (`ngram_range` must be a tuple, `get_feature_names` has been renamed `get_feature_names_out`)

* Updates MixInIO logic: you now save with the `dump` method and load with the `load` **class** method.

* Package management now uses Github actions.

0.4.2 (2021-05-05)
-------------------------------

Minor patch

* Signature of the Gismo rank method changed to allow to enter directly a query vector instead of a string query
  (useful if one wants to craft a custom query vector).
* Original source of the Reuters 50/50 dataset was discontinued; changed to an alternate source.
* Fix change in spacy API

0.4.1 (2020-11-25)
------------------
Minor update.

* DBLP API modified to you can specify the set of fields you want to retrieve.
* Minor update in doctests.
* Python 3.9 compatibility added.

0.4.0 (2020-07-21)
------------------
0.4 is a big update. Lot of things added, lot of things changed.

* New API for Gismo runtime parameters (see new parameters module for details). Short version:
    * ``gismo = Gismo(corpus, embedding, alpha=0.85)``: create a gismo with damping factor set to 0.85 instead of default value.
    * ``gismo.parameters.alpha = 0.85``: set the damping factor of the gismo to 0.85.
    * ``gismo.rank(query, alpha=0.85)``: makes a query with damping factor temporarily set to 0.85.
* Landmarks! Half Corpus, half Gismo, the Landmarks class can simplify many analysis tasks.
    * Landmarks are (small) corpus where each entry is augmented with the computation of an associated gismo query;
    * Landmarks can be used to refine the analysis around a part of your data;
    * They can be used as soft and fast classifiers.
    * Landmarks' runtime parameters follow the same approach than for Gismo instances (cf above).
    * See the dedicated tutorial to learn more!
* Documentation summer cleaning.
* ``query_distortion`` parameter (reshape subspace for clustering) is renamed ``distortion`` and is now a float instead of a bool (e.g. you can apply distortion in a non-binary way).
* Full refactoring of get_*** and post_*** methods and objects.
    * The good news is that they are now more natural, self-describing, and unified.
    * The bad news is that there is no backward-compatibility with previous Gismo versions. Hopefully this refactoring
      will last for some time!
* Gismo logo added!

0.3.1 (2020-06-12)
------------------

* New dataset: Reuters C50
* New module: sentencizer


0.3.0 (2020-05-13)
------------------

* dblp module: url2source function added to directly load a small dblp source in memory instead of using a FileSource approach.
* Possibility to disable query distortion in gismo.
* XGismo class to cross analyze embeddings.
* Tutorials updated

0.2.5 (2020-05-11)
------------------

* auto_k feature: if not specified, a query-dependent, reasonable, number of results k is estimated.
* covering methods added to gismo. It is now possible to use get_covering_* instead of get_ranked_* to maximize coverage and/or eliminate redundancy.


0.2.4 (2020-05-07)
------------------

* Tutorials for ACM and DBLP added. After cleaning, there is currently 3 tutorials:
    * Toy model, to get the hang of Gismo on a tiny example,
    * ACM, to play with Gismo on a small example,
    * DBLP, to play with a large dataset.


0.2.3 (2020-05-04)
------------------

* ACM and DBLP dataset creation added.


0.2.2 (2020-05-04)
------------------

* Notebook tutorials added (early version)

0.2.1 (2020-05-03)
------------------

* Actual code
* Coverage badge

0.1.0 (2020-04-30)
------------------

* First release on PyPI.
