Metadata-Version: 2.1
Name: fuzzy_context_finder
Version: 0.1.2
Summary: search for keywords and their context
Home-page: https://github.com/sandeepmj/fuzzy_context_finder
Author: Sandeep Junnarkar
Author-email: sjnews@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE

# Keyword Context Finder
![Searching for context doesn't have to be a chore!](https://sandeepmj.github.io/image-host/keyword-logo.png)

*Searching for context doesn't have to be a chore!*


A Python utility for finding keywords and their surrounding context in MangoCR markdown files. This tool supports fuzzy matching and provides flexible context extraction around matched terms.

## Features

- Fuzzy string matching for approximate keyword finding
- Customizable context window sizes (before, after, and around matches)
- Page number tracking for MangoCR formatted documents
- Adjustable similarity threshold for matches
- Returns results in a pandas DataFrame for easy analysis

## Installation

```bash
pip install fuzzy_context_finder pandas rapidfuzz regex
```

## Dependencies

- Python 3.7+
- pandas
- rapidfuzz
- regex

## Usage

```python
from fuzzy_context_finder import keyword_context_finder

# Example usage
content = """
## document_page_1
This is the content of page one with some keywords.
## document_page_2
More content on page two with different keywords.
"""

search_terms = ["keyword", "content"]
file_name = "example_document.pdf"

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name=file_name,
    words_before=250,
    words_after=250,
    words_around=50,
    match_threshold=80
)
```

## Parameters

- `content` (str): Document content with pages separated by MangoCR markers (`## filename_page_number`)
- `terms` (list): List of search terms to find in the document
- `file_name` (str): Name of the file being processed
- `words_before` (int, default=250): Number of words to capture before the term
- `words_after` (int, default=250): Number of words to capture after the term
- `words_around` (int, default=50): Number of words of surrounding context shown with the term in the "Search Term with Context" column
- `match_threshold` (int, default=80): Minimum similarity score (0-100) for fuzzy matching
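
To illustrate how a 0-100 `match_threshold` filters candidate words, here is a minimal sketch. It uses the standard library's `difflib` as a stand-in for rapidfuzz, so the scores will not necessarily equal the ones this package reports. The helper names `similarity` and `fuzzy_hits` are hypothetical and are not part of the package's API.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stand-in, not rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_hits(words, term, match_threshold=80):
    """Yield (index, word, score) for words scoring at or above the threshold."""
    for i, word in enumerate(words):
        score = similarity(word, term)
        if score >= match_threshold:
            yield i, word, score


words = "Some keywords and a keywrd typo appear here".split()
hits = list(fuzzy_hits(words, "keyword"))
for index, word, score in hits:
    print(index, word, round(score, 1))
```

Raising `match_threshold` toward 100 keeps only near-exact matches; lowering it admits more typos and variants at the cost of more false positives.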

## Return Value

Returns a pandas DataFrame with the following columns:
- File Name
- Page Marker
- Page Number
- Matched Term
- Original Term
- Similarity Score
- Search Term with Context (configurable width)
- Previous Words Context
- Next Words Context

Returns `None` if no matches are found.
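
Because the no-match case is `None` rather than an empty DataFrame, calling code should use an identity check before touching the result. The sketch below shows one way to do that; `count_matches` is a hypothetical helper, not part of the package.

```python
def count_matches(results):
    """Return the number of result rows, treating None as zero matches."""
    # Use `is None`: a truthiness test such as `if not results`
    # raises ValueError when `results` is a pandas DataFrame.
    if results is None:
        return 0
    return len(results)  # len() of a DataFrame is its row count
```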

## Example Output

```python
>>> results.head()
              File Name      Page Marker  Page Number Matched Term Original Term  Similarity Score  ...
0  example_document.pdf  document_page_1            1      keyword       keyword               100  ...
```

## Document Format Requirements

The tool expects documents to follow [the MangoCR format](https://pypi.org/project/mangoCR/) with page markers:

```markdown
## filename_page_1
Content for page 1
## filename_page_2
Content for page 2
```
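
As a rough illustration of how such markers can be used to split a document into pages, here is a minimal sketch built on the standard `re` module. The regular expression and the `split_pages` helper are assumptions made for this example; the package's own parsing may differ.

```python
import re

# Assumed marker shape: "## <name>_page_<number>" on a line of its own.
PAGE_MARKER = re.compile(
    r"^## (?P<marker>.+_page_(?P<number>\d+))[ \t]*$", re.MULTILINE
)


def split_pages(content: str):
    """Return (marker, page_number, text) tuples, skipping empty pages."""
    matches = list(PAGE_MARKER.finditer(content))
    pages = []
    for i, match in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        text = content[match.end():end].strip()
        if text:
            pages.append((match.group("marker"), int(match.group("number")), text))
    return pages


sample = (
    "## report_page_1\nFirst page text.\n"
    "## report_page_2\n\n"
    "## report_page_3\nThird page text.\n"
)
pages = split_pages(sample)
```

In this sample the second page has no text, so only pages 1 and 3 are returned.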

## Error Handling

- Empty pages are automatically skipped
- Returns `None` if no matches are found
- Handles out-of-bounds context windows gracefully
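
One common way to keep a context window in bounds is to clamp its start and end to the available words. The sketch below shows that idea; `context_window` is a hypothetical helper written for this example and is not necessarily how the package implements it.

```python
def context_window(words, index, words_before=250, words_after=250):
    """Return (before, after) context strings around words[index]."""
    # Clamp both ends so a match near the start or end of a page
    # yields a shorter window instead of an IndexError.
    start = max(0, index - words_before)
    end = min(len(words), index + 1 + words_after)
    before = " ".join(words[start:index])
    after = " ".join(words[index + 1:end])
    return before, after


words = "one two three keyword four five".split()
before, after = context_window(words, 3, words_before=10, words_after=2)
```

Here `words_before=10` exceeds the three words available before the match, so `before` contains only those three.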

## Contributing

Feel free to open issues or submit pull requests with improvements.

## License

MIT License


