Metadata-Version: 2.3
Name: string_grouper
Version: 0.7.1
Summary: String grouper contains functions to do string matching using TF-IDF and cosine similarity.
License: MIT
Author: Chris van den Berg
Maintainer: Chris van den Berg
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: loguru (>0.7.0)
Requires-Dist: numpy (>=1.26.0,<2.0.0)
Requires-Dist: pandas (>=2.0,<3.0)
Requires-Dist: scikit-learn (>=1.4.0,<2.0.0)
Requires-Dist: scipy (>=1.4.1)
Requires-Dist: sparse_dot_topn (>=1.1.0)
Description-Content-Type: text/markdown

# String Grouper  
<!-- Some cool decorations -->
[![pypi](https://badgen.net/pypi/v/string-grouper)](https://pypi.org/project/string-grouper)
[![license](https://badgen.net/pypi/license/string_grouper)](https://github.com/Bergvca/string_grouper)
[![lastcommit](https://badgen.net/github/last-commit/Bergvca/string_grouper)](https://github.com/Bergvca/string_grouper)
[![codecov](https://codecov.io/gh/Bergvca/string_grouper/branch/master/graph/badge.svg?token=AGK441CQDT)](https://codecov.io/gh/Bergvca/string_grouper)
[![PyPI Downloads](https://static.pepy.tech/badge/string-grouper)](https://pepy.tech/projects/string-grouper)
<!-- [![github](https://shields.io/github/v/release/Bergvca/string_grouper)](https://github.com/Bergvca/string_grouper) -->

<details>
<summary>Click to see image</summary>
<br>
<center><img width="100%" src="https://raw.githubusercontent.com/Bergvca/string_grouper/master/tutorials/sec__edgar_company_info_group003c.svg"></center>

The image displayed above is a visualization of the graph-structure of one of the groups of strings found by `string_grouper`.  Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here `0.8`).  

The ***centroid*** of the group, as determined by `string_grouper` (see [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation), is the largest node, also with the most edges originating from it.  A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.

The power of `string_grouper` is discernible from this image: in large datasets, `string_grouper` can often resolve indirect associations between strings even when direct matches between those strings cannot be computed by conventional methods at a lower similarity threshold, for example because of memory limits.

<div style="text-align: center"> &mdash;&mdash;&mdash;</div>

<sup>This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by `string_grouper` operating on the [sec__edgar_company_info.csv](https://www.kaggle.com/dattapiy/sec-edgar-companies-list/version/1) sample data file.</sup>

---
</details>

**`string_grouper`** is a library that makes finding groups of similar strings within one or more lists of 
strings easy and _fast_. **`string_grouper`** uses **TF-IDF** to calculate [**cosine similarities**](https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a) 
within a single list or between two lists of strings. The full process is described in the blog post [Super Fast String Matching in Python](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html).
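
To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF over character n-grams followed by cosine similarity. It is an illustration only and not `string_grouper`'s implementation: the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`), the lowercasing step, and the trigram size are assumptions chosen for the example, and the library itself relies on sparse matrices rather than dictionaries.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build L2-normalised TF-IDF vectors (as dicts) for a list of strings."""
    docs = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram occur?
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        weights = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


# Hypothetical sample names, made up for this example.
names = ["Mega Enterprises Corp", "Mega Enterprises Corporation", "Tiny Widgets Ltd"]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # near-duplicates share most trigrams
sim_far = cosine(vecs[0], vecs[2])    # unrelated names share none
```

Because the vectors are normalised up front, the cosine similarity reduces to a sparse dot product, which is the operation that a top-n sparse matrix multiplication can perform for every pair of strings at once.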


## Installing

`pip install string-grouper`

## Speed

**`string_grouper`** leverages the blazingly fast [sparse_dot_topn](https://github.com/ing-bank/sparse_dot_topn) library
to calculate cosine similarities. 

```python
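import datetime

from string_grouper import match_strings

# `names` is assumed to be a DataFrame with a 'Company Name' column, loaded beforehand.
s = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes = 4)

e = datetime.datetime.now()
diff = (e - s)
str(diff)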
```
Results in: 

`00:05:34.65` on an Intel i7-6500U CPU @ 2.50GHz, where `len(names)` = 663 000

In other words, the library can fuzzy-match 663 000 names in about _five and a half minutes_
on a 2015 consumer CPU using 4 processes. 

## Simple Match

```python
import pandas as pd
from string_grouper import match_strings

company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
```

|     |   left_index | left_Company Name                                           |   similarity | right_Company Name                      |   right_index |
|----:|-------------:|:------------------------------------------------------------|-------------:|:----------------------------------------|--------------:|
|  15 |           14 | 0210, LLC                                                   |     0.870291 | 90210 LLC                               |          4211 |
| 167 |          165 | 1 800 MUTUALS ADVISOR SERIES                                |     0.931615 | 1 800 MUTUALS ADVISORS SERIES           |           166 |
| 168 |          166 | 1 800 MUTUALS ADVISORS SERIES                               |     0.931615 | 1 800 MUTUALS ADVISOR SERIES            |           165 |
| 172 |          168 | 1 800 RADIATOR FRANCHISE INC                                |     1        | 1-800-RADIATOR FRANCHISE INC.           |           201 |
| 178 |          173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC                  /BD |     0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC |           174 |


## Group Similar Strings and Find the Most Common

```python
from string_grouper import group_similar_strings

companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)
```
| name_deduped                                       |   Line Number |
|:---------------------------------------------------|--------------:|
| ADVISORS DISCIPLINED TRUST                         |          1747 |
| NUVEEN TAX EXEMPT UNIT TRUST SERIES 1              |           916 |
| GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200         |           652 |
| U S TECHNOLOGIES INC                               |           632 |
| CAPITAL MANAGEMENT LLC                             |           628 |
| CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200 |           611 |
| E ACQUISITION CORP                                 |           561 |
| CAPITAL PARTNERS LP                                |           561 |
| FIRST TRUST COMBINED SERIES 1                      |           560 |
| PRINCIPAL LIFE INCOME FUNDINGS TRUST 20            |           544 |
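
One way to picture how pairwise matches turn into groups is to treat each match as an edge in a graph and take the connected components. The sketch below does this with a small union-find in pure Python. It is an illustration under stated assumptions, not `string_grouper`'s implementation: `group_by_matches` is a hypothetical helper, the sample strings and match pairs are made up, and using the smallest index as the group id is a simplification (it is not the library's centroid-based representative).

```python
def group_by_matches(n_items, matches):
    """Assign a group id to each item from pairwise matches (connected components)."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for left, right in matches:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Keep the smaller index as the root so group ids are deterministic.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    return [find(i) for i in range(n_items)]


# Hypothetical strings and above-threshold match pairs (as index pairs).
strings = ["ACME INC", "ACME INC.", "ACME INCORPORATED", "ZENITH LLC"]
matches = [(0, 1), (1, 2)]
group_ids = group_by_matches(len(strings), matches)
print(group_ids)  # [0, 0, 0, 3]
```

Note that strings 0 and 2 end up in the same group even though no direct match between them is listed: the association is resolved indirectly through string 1.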

## Documentation

The documentation can be found [here](https://bergvca.github.io/string_grouper/).

