Metadata-Version: 2.1
Name: natex
Version: 1.0.5
Summary: Regular Expressions turbo-charged with notations for part-of-speech and dependency tree tags
Home-page: https://github.com/polygoat/natex-py
Author: Dan Borufka
Author-email: danborufka@gmail.com
License: MIT
Keywords: regex regular expression nlp pos dep
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Utilities
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: pydash
Requires-Dist: stanza

# Natural Language Expressions for Python
**NatEx**: Regular Expressions turbo-charged with notations for part-of-speech and dependency tree tags

## In a Nutshell
```python
from natex import natex

sentence = natex('Sloths eat steak in New York')

# check if string begins with noun:
sentence.match(r'@NOUN')
# returns <natex.Match object; span=(0, 6), match='Sloths'>

# find first occurrence of an adposition followed by a proper noun
sentence.search(r'@ADP <@PROPN>')  	
# returns <natex.Match object; span=(17, 28), match='in New York'>

# find all occurrences of nouns or proper nouns
sentence.findall(r'@(NOUN|PROPN)') 	
# returns ['Sloths', 'steak', 'New York']

# find all occurrences of nouns or proper nouns starting with an s (irregardless of casing)
sentence.findall(r's[^@]+@(NOUN|PROPN)', natex.I)
# returns ['Sloths', 'steak']

```

## Goals & Design
The goal of NatEx is quick and simple parsing of tokens using their literal representation, part-of-speech, and dependency tree tags.
Think of it as an extension of [regular expressions] for natural language processing. The generated part-of-speech and dependency tree tags are provided by [stanza] and merged into a string that can be searched through.

[regular expressions]: https://docs.python.org/3/library/re.html
[stanza]: https://stanfordnlp.github.io/stanza

### Why not [Tregex], [Semgrex], or [Tsurgeon]?
NatEx was designed primarily with simplicity in mind.
Libraries like [Tregex], [Semgrex], or [Tsurgeon] may be able to match more complex patterns, but they have a steep learning curve and the patterns are hard to read. Plus NatEx is written for Python. It wraps the built-in `re` package with an abstraction layer and thus provides almost the same performance as any normal regex.

[Tregex]: https://nlp.stanford.edu/software/tregex/The_Wonderful_World_of_Tregex.ppt/
[Semgrex]: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html
[Tsurgeon]: https://nlp.stanford.edu/software/tregex/Tsurgeon2.ppt/

## Examples
You can use it for simple tagging (**NLU**):

```python
from natex import natex

sentence = natex('book flights from Munich to Chicago')
origin_location, destination_location = sentence.findall('<@PROPN>')
# origin_location ='Munich', destination_location = 'Chicago'

sentence = natex('I am being called Dan Borufka')
firstname, lastname = sentence.findall('<@PROPN>')
# firstname = 'Dan', lastname = 'Borufka'

sentence = natex('I need to go to Italy')
clause = sentence.search('<@ADP> <@PROPN>').match
# clause = 'to Italy'
destination = clause.split(' ')[1]

```

Or for simple response template generation (**NLG**):

```python
from natex import natex

sentence = natex('Eat my shorts')

# look for token with imperative form
is_command = sentence.match(r'<!>')

if is_command:
	action_verb = sentence.search(r'<@VERB!>').lower()
	action_recipient = sentence.search(r'<#OBJ>')
	response = f'I will do my best to {action_verb} {action_recipient}!'

	# will contain 'I will do my best to eat shorts!'

```

Even more (random) examples:

```python
from natex import natex

sentence = natex('Sloths eat steak in New York')

# find first occurrence of character sequence "ea" in nouns only
sentence.search(r'ea@NOUN')			
# returns <natex.Match object; span=(11, 16), match='steak'>

# find first occurrence of character sequence "ea"
sentence.search(r'ea')
# returns <natex.Match object; span=(7, 9), match='ea'>

# find all occurrences of nouns or proper nouns starting with a lowercase s
sentence.findall(r's[^@]+@(NOUN|PROPN)') 
# returns ['steak']

sentence = natex('Ein Hund isst keinen Gurkensalat in New York.', 'de')

# replace the nominal subject with the literal 'Affe'
sentence.sub(r'#NSUBJ', 'Affe')
# returns 'Ein Affe isst keinen Gurkensalat in New York.'
```

Check out [`test.py`] for some more examples!

[`test.py`]: https://github.com/polygoat/natex-py/blob/main/test.py

## Installation
Run:

```bash
pip install natex
```

## Usage
NatEx provides the same API as the [`re` package], but adds the following special characters:

[`re` package]: https://docs.python.org/3/library/re.html

| Symbol | Meaning                  | Example pattern | Meaning 					    |
|:------:| ------------------------ |:---------------:| ------------------------------- |
| <      | token boundary (opening) | <New            | _Find tokens starting with "New"_ |
| :      | either @ or #  			| <:ADV 		  | _Find tokens with e.g. universal POS "ADV" or dep. tree tag "ADVMOD"_ |
| @      | part of speech tag       | @ADJ 		  	  | _Find tokens that are adjectives_ |
| #      | dependency tree tag      | #OBJ 			  | _Find the objects of the sentence_ |
| !      | imperative (mood)        | <!>			  | _Find any verbs that are in imperative form_ |
| >      | token boundary (closing) | York>			  | _Find all tokens ending in "York"_ |

If you combine features (e.g. seeking by part of speech and dependency tree simultaneously) make sure you provide them in the order of the table above.

✔ This will work:
```python
natex('There goes a test sentence').findall(r'<@NOUN#OBJ>')
```

✘ But this won't:
```python
natex('There goes a test sentence').findall(r'<#OBJ@NOUN>')
```

Calling the `natex()` function returns a `NatEx` instance. See [API] for a more detailed description.
Just as the `re.Match` returning methods provided by Python's built-in `re` package, NatEx' equivalents will return a `natex.Match` object containing a `span` and a `match` property referring to position and substring of the sentence respectively.

[API]: #api

### Configuration
You can set the **processing language** of NatEx using the second parameter `language_code` (defaults to 'en'). 
It accepts a two-letter language-code, supporting [all languages officially supported by stanza].

[all languages officially supported by stanza]: https://stanfordnlp.github.io/stanza/available_models.html

```python

sentence = natex('Das Faultier isst keinen Gurkensalat', 'de')

```

When you run NatEx for the first time, it will check for the existence of the corresponding language models and download them if necessary. All subsequent calls to `natex()` will exclude that step.

### API
The API is derived from Python's built-in `re` package:

**NatEx**

**.match(pattern, flags)**

Checks (from the beginning of the string) whether the sentence matches a _pattern_ and returns a `natex.Match` object or `None` otherwise.

**.search(pattern, flags)**
Returns a `natex.Match` object describing the first substring matching _pattern_.

**.findall(pattern, flags)**
Returns all found strings matching _pattern_ as a list.

**.split(pattern, flags)**
Splits the sentence by all occurrences of the found _pattern_ and returns a list of strings.

**.sub(pattern, replacement, flags)**
Replaces all occurrences of the found _pattern_ by _replacement_ and returns the changed string.

## Testing
You can use pytest in your terminal (simply type in `pytest`) to run the unit tests shipped with this package.
Install it by running `pip install pytest` in your terminal.

