Metadata-Version: 2.4
Name: fuzzysearch
Version: 0.8.1
Summary: fuzzysearch is useful for finding approximate subsequence matches
Home-page: https://github.com/taleinat/fuzzysearch
Author: Tal Einat
Author-email: taleinat@gmail.com
License: MIT
Keywords: fuzzysearch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Software Development :: Libraries :: Python Modules
License-File: LICENSE
License-File: AUTHORS.rst
Requires-Dist: attrs>=19.3
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary


===========
fuzzysearch
===========

.. image:: https://img.shields.io/pypi/v/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Latest Version

.. image:: https://img.shields.io/coveralls/taleinat/fuzzysearch.svg?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master
    :alt: Test Coverage

.. image:: https://img.shields.io/pypi/wheel/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Wheels

.. image:: https://img.shields.io/pypi/pyversions/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/implementation/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python implementations

.. image:: https://img.shields.io/pypi/l/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch/
    :alt: License

Fuzzy search: Find parts of long text or data, allowing for some
changes/typos.

Highly optimized, simple to use, does one thing well.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

* Two simple functions to use: one for in-memory data and one for files

  * Fastest search algorithm is chosen automatically

* Levenshtein Distance metric with configurable parameters

  * Separately configure the max. allowed distance, substitutions, deletions
    and/or insertions

* Advanced algorithms with optional C and Cython optimizations

* Properly handles Unicode; special optimizations for binary data

* Simple installation:
   * ``pip install fuzzysearch`` just works
   * pure-Python fallbacks for compiled modules
   * only one dependency (``attrs``)

* Extensively tested

* Free software: `MIT license <LICENSE>`_

For more info, see the `documentation <http://fuzzysearch.rtfd.org>`_.


How is this different than FuzzyWuzzy or RapidFuzz?
---------------------------------------------------

The main difference is that fuzzysearch searches for fuzzy matches through
long texts or data. FuzzyWuzzy and RapidFuzz, on the other hand, are intended
for fuzzy comparison of pairs of strings, identifying how closely they match
according to some metric such as the Levenshtein distance.

These are very different use-cases, and the solutions are very different as
well.


How is this different than ElasticSearch and Lucene?
----------------------------------------------------

The main difference is that fuzzysearch does no indexing or other
preparations; it directly searches through the given text or data for a given
sub-string. Therefore, it is much simpler to use compared to systems based on
text indexing.


Installation
------------

``fuzzysearch`` supports Python versions 3.8+, as well as PyPy 3.9 and 3.10.

.. code::

    $ pip install fuzzysearch

This will work even if installing the C and Cython extensions fails, using
pure-Python fallbacks.


Usage
-----
Just call ``find_near_matches()`` with the sub-sequence you're looking for,
the sequence to search, and the matching parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    # search for 'PATTERN' with a maximum Levenshtein Distance of 1
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use ``find_near_matches_in_file()``:

.. code:: python

    >>> from fuzzysearch import find_near_matches_in_file
    >>> with open('data_file', 'rb') as f:
    ...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]


Examples
--------

*fuzzysearch* is great for ad-hoc searches of genetic data, such as DNA or
protein sequences, before reaching for more complex tools:

.. code:: python

    >>> sequence = '''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

.. code:: python

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> sequence = Seq('''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG''', IUPAC.unambiguous_dna)
    >>> subsequence = Seq('TGCACTGTAGGGATAACAAT', IUPAC.unambiguous_dna)
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]


Matching Criteria
-----------------
The search function supports four possible match criteria, which may be
supplied in any combination:

* maximum Levenshtein distance (``max_l_dist``)

* maximum # of subsitutions

* maximum # of deletions ("delete" = skip a character in the sub-sequence)

* maximum # of insertions ("insert" = skip a character in the sequence)

Not supplying a criterion means that there is no limit for it. For this reason,
one must always supply ``max_l_dist`` and/or all other criteria.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

    # this will not match since max-deletions is set to zero
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
    []

    # note that a deletion + insertion may be combined to match a substution
    >>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1

    # ... but deletion + insertion may also match other, non-substitution differences
    >>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=2, matched="PATERRN")]




History
-------

0.8.1 (2025-11-11)
++++++++++++++++++

* Fixed sometimes raising ValueError with empty sequences.
* Added support for Python 3.13 and 3.14.

0.8.0 (2025-05-24)
++++++++++++++++++

* Fixed a long-standing occasional crash, mostly happening on Windows.
* Overhauled CI.
* Dropped support for Python 2.7, 3.5, 3.6 and 3.7.
* Added support for Python 3.9, 3.10, 3.11 and 3.12.

0.7.3 (2020-06-27)
++++++++++++++++++

* Fixed segmentation faults due to wrong handling of inputs in bytes-like-only
  functions in C extensions.

0.7.2 (2020-05-07)
++++++++++++++++++
* Added PyPy support.
* Several minor bug fixes.

0.7.1 (2020-04-05)
++++++++++++++++++
* Dropped support for Python 3.4.
* Removed deprecation warning with Python 3.8.
* Fixed a couple of nasty bugs.

0.7.0 (2020-01-14)
++++++++++++++++++

* Added ``matched`` attribue to ``Match`` objects containing the matched part
  of the sequence.
* Added support for CPython 3.8. Now supporting CPython 2.7 and 3.4-3.8.

0.6.2 (2019-04-22)
++++++++++++++++++

* Fix calling ``search_exact()`` without passing ``end_index``.
* Fix edge case: max. dist >= sub-sequence length.

0.6.1 (2018-12-08)
++++++++++++++++++

* Fixed some C compiler warnings for the C and Cython modules

0.6.0 (2018-12-07)
++++++++++++++++++

* Dropped support for Python versions 2.6, 3.2 and 3.3
* Added support and testing for Python 3.7
* Optimized the n-grams Levenshtein search for long sub-sequences
* Further optimized the n-grams Levenshtein search
* Cython versions of the optimized parts of the n-grams Levenshtein search

0.5.0 (2017-09-05)
++++++++++++++++++

* Fixed ``search_exact_byteslike()`` to support supplying start and end indexes
* Added support for lists, tuples and other Sequence types to ``search_exact()``
* Fixed a bug where ``find_near_matches()`` could return a wrong ``Match.end``
  with ``max_l_dist=0``
* Added more tests and improved some existing ones.

0.4.0 (2017-07-06)
++++++++++++++++++

* Added support and testing for Python 3.5 and 3.6
* Many small improvements to README, setup.py and CI testing

0.3.0 (2015-02-12)
++++++++++++++++++

* Added C extensions for several search functions as well as internal functions
* Use C extensions if available, or pure-Python implementations otherwise
* setup.py attempts to build C extensions, but installs without if build fails
* Added ``--noexts`` setup.py option to avoid trying to build the C extensions
* Greatly improved testing and coverage

0.2.2 (2014-03-27)
++++++++++++++++++

* Added support for searching through BioPython Seq objects
* Added specialized search function allowing only subsitutions and insertions
* Fixed several bugs

0.2.1 (2014-03-14)
++++++++++++++++++

* Fixed major match grouping bug

0.2.0 (2013-03-13)
++++++++++++++++++

* New utility function ``find_near_matches()`` for easier use
* Additional documentation

0.1.0 (2013-11-12)
++++++++++++++++++

* Two working implementations
* Extensive test suite; all tests passing
* Full support for Python 2.6-2.7 and 3.1-3.3
* Bumped status from Pre-Alpha to Alpha

0.0.1 (2013-11-01)
++++++++++++++++++

* First release on PyPI.
