Metadata-Version: 2.4
Name: pyrolysate
Version: 0.12.0
Summary: Parser made to convert lists of emails and urls into JSON and CSV formatted files
Project-URL: Homepage, https://github.com/dawnandrew100/pyrolysate
Project-URL: Issues, https://github.com/dawnandrew100/pyrolysate/issues
Author-email: Andrew Hennis <andrew.mr.hennis@gmail.com>
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Requires-Dist: requests
Description-Content-Type: text/markdown

# Pyrolysate

Pyrolysate is a Python library and CLI tool for parsing and validating URLs and email addresses.
It breaks down URLs and emails into their component parts, validates against IANA's official TLD list,
and outputs structured data in JSON, CSV, or text format.

The library offers both a programmer-friendly API and a command-line interface,
making it suitable for both development integration and quick data processing tasks.
It handles single entries or large datasets efficiently using Python's generator functionality,
and provides flexible input/output options including file processing with custom delimiters.

## Features

### URL Parsing
  - Extract scheme, subdomain, domain, TLD, port, path, query, and fragment components
  - Support for complex URL patterns including ports, queries, and fragments
  - Support for IP addresses in URLs
  - Support for both direct input and file processing via CLI or API
  - Output as JSON, CSV, or text format through CLI or API

### Email Parsing
  - Extract username, mail server, and domain components
  - Support for both direct input and file processing via CLI or API
  - Output as JSON, CSV, or text format through CLI or API

### Top Level Domain Validation
  - Automatic updates from IANA's official TLD list
  - Local TLD file caching for offline use
  - Fallback to common TLDs if both online and local sources fail

### Flexible Input/Output
  - Process single or multiple entries
  - Support for government domain emails (.gov.tld)
  - Custom delimiters for file input
  - Multiple output formats with .txt format as default (JSON, CSV, text)
  - Pretty-printed or minified JSON output
  - Console output or file saving options
  - Memory-efficient processing of large datasets using Python generators

### Developer Friendly
  - Type hints for better IDE support
  - Comprehensive docstrings
  - Modular design for easy integration
  - Command-line interface for quick testing

## 🚀 Installation
### From PyPI
```bash
pip install pyrolysate
```
### For Development
1. **Clone the repository**
```bash
git clone https://github.com/dawnandrew100/pyrolysate.git
cd pyrolysate
```
2. **Create and activate a virtual environment**
```bash
# Using hatch (recommended)
hatch env create

# Or using venv
python -m venv .venv
# Windows
.venv\Scripts\activate
# Unix/MacOS
source .venv/bin/activate
```
3. **Install in development mode**
```bash
# Using hatch
hatch run dev

# Or using pip
pip install -e .
```
### Verify Installation
```bash
# Using hatch (recommended)
hatch run pyro -u example.com

# Or using the CLI directly
pyro -u example.com
```
The CLI command `pyro` will be available after installation. If the command isn't found, ensure Python's Scripts directory is in your PATH.

## Usage

### Input File Parsing
```python
from pyrolysate import parse_input_file
```
#### Parse file with default newline delimiter
```python
urls = parse_input_file("urls.txt")
```
#### Parse file with custom delimiter
```python
emails = parse_input_file("emails.csv", delimiter=",")
```
### Supported Outputs
- JSON (prettified or minified)
- CSV
- Text (default)
- File output with custom naming
- Console output
### Email Parsing
```python
from pyrolysate import email
```
#### Parse single email
```python
result = email.parse_email("user@example.com")
```
#### Parse multiple emails
```python
emails = ["user1@example.com", "user2@agency.gov.uk"]
result = email.parse_email_array(emails)
```
#### Convert to JSON
```python
json_output = email.to_json("user@example.com")
json_output = email.to_json(["user1@example.com", "user2@example.com"])
```
#### Save to JSON file
```python
email.to_json_file("output", "user@example.com")
email.to_json_file("output", ["user1@example.com", "user2@test.org"])
```
#### Convert to CSV
```python
csv_output = email.to_csv("user@example.com")
csv_output = email.to_csv(["user1@example.com", "user2@test.org"])

```
#### Save to CSV file
```python
email.to_csv_file("output", "user@example.com")
email.to_csv_file("output", ["user1@example.com", "user2@test.org"])
```

### URL Parsing
```python
from pyrolysate import url
```
#### Parse single URL
```python
result = url.parse_url("https://www.example.com/path?q=test#fragment")
```
#### Parse multiple URLs
```python
urls = ["example.com", "https://www.test.org"]
result = url.parse_url_array(urls)
```
#### Convert to JSON
```python
json_output = url.to_json("example.com")
json_output = url.to_json(["example.com", "test.org"])
```
#### Save to JSON file
```python
url.to_json_file("output", "example.com")
url.to_json_file("output", ["example.com", "test.org"])
```
#### Convert to CSV
```python
csv_output = url.to_csv("example.com")
csv_output = url.to_csv(["example.com", "test.org"])

```
#### Save to CSV file
```python
url.to_csv_file("output", "example.com")
url.to_csv_file("output", ["example.com", "test.org"])
```
### Command Line Interface
#### CLI help
```bash
pyro -h
```
#### Parse single URL
```bash
pyro -u example.com
```
#### Parse multiple URLs
```bash
pyro -u example1.com example2.com
```
#### Parse URLs from file (one per line by default)
```bash
pyro -u -i urls.txt
```
#### Parse URLs from CSV file with comma delimiter
```bash
pyro -u -i urls.csv -d ","
```
#### Parse multiple emails and save as JSON
```bash
pyro -e user1@example.com user2@example.com -j -o output
```
#### Parse URLs from file and save as CSV
```bash
pyro -u -i urls.txt -c -o parsed_urls
```
#### Parse emails from file with comma delimiter
```bash
pyro -e -i emails.txt -d "," -o output
```
#### Parse emails with non-prettified JSON output
```bash
pyro -e user@example.com -j -np
```

## API Reference

### Email Class

| Method                                           | Parameters                                              | Description                    |
|---------------------                             |---------------------                                    |-----------------               |
| `parse_email(email_str)`                         | `email_str: str`                                        | Parses single email address    |
| `parse_email_array(emails)`                      | `emails: list[str]`                                     | Parses list of email addresses |
| `to_json(emails, prettify=True)`                 | `emails: str\|list[str]`, `prettify: bool`              | Converts to JSON format        |
| `to_json_file(file_name, emails, prettify=True)` | `file_name: str`, `emails: list[str]`, `prettify: bool` | Converts and saves JSON to file|
| `to_csv(emails)`                                 | `emails: str\|list[str]`                                | Converts to CSV format         |
| `to_csv_file(file_name, emails)`                 | `file_name: str`, `emails: list[str]`                   | Converts and saves CSV to file |

### URL Class

| Method                                         | Parameters                                            | Description                                               |
|------------------                              |----------------------                                 |-------------------                                        |
| `parse_url(url_str, tlds=[])`                  | `url_str: str`, `tlds: list[str]`                     | Parses single URL                                         |
| `parse_url_array(urls, tlds=[])`               | `urls: list[str]`, `tlds: list[str]`                  | Parses list of URLs                                       |
| `to_json(urls, prettify=True)`                 | `urls: str\|list[str]`, `prettify: bool`              | Converts to JSON format                                   |
| `to_json_file(file_name, urls, prettify=True)` | `file_name: str`, `urls: list[str]`, `prettify: bool` | Converts and saves JSON to file                           |
| `to_csv(urls)`                                 | `urls: str\|list[str]`                                | Converts to CSV format                                    |
| `to_csv_file(file_name, urls)`                 | `file_name: str`, `urls: list[str]`                   | Converts and saves CSV to file                            |
| `get_tld(path_to_tlds_file='tld.txt')`         | `path_to_tlds_file: str = 'tld.txt'`                  | Fetches current TLD list from IANA                        |
| `local_tld_file(file_name)`                    | `file_name: str`                                      | Fetches and stores `get_tld()` output as a local txt file |

### Miscellaneous

| Method                                              | Parameters                               | Description                                     |
|------------------                                   |----------------------                    |-------------------------                        |
| `parse_input_file(input_file_name, delimiter='\n')` | `input_file_name: str`, `delimiter: str` | Parses input file into python list by delimiter |

## CLI Reference

| Argument               | Type   | Value when argument is omitted| Description                        |
|------------------------|--------|--------------------           |------------------------------------|
| `target`               | `str`  | `None`                        | Email or URL string(s) to process  |
| `-u`, `--url`          | `flag` | `False`                       | Specify URL input                  |
| `-e`, `--email`        | `flag` | `False`                       | Specify Email input                |
| `-i`, `--input_file`   | `str`  | `None`                        | Input file name with extension     |
| `-o`, `--output_file`  | `str`  | `None`                        | Output file name without extension |
| `-c`, `--csv`          | `flag` | `False`                       | Save output as CSV format          |
| `-j`, `--json`         | `flag` | `False`                       | Save output as JSON format         |
| `-np`, `--no_prettify` | `flag` | `True`                        | Turn off prettified JSON output    |
| `-d`, `--delimiter`    | `str`  | `'\n'`                        | Delimiter for input file parsing   |

## Output Formats

### Email Parse Output

| Field       | Description       | Example |
|-------------|-------------------|---------|
| username    | Part before @     | user    |
| mail_server | Domain before TLD | gmail   |
| domain      | Top-level domain  | com     |

Example output:
```json
{"user@gmail.com": 
    {
    "username": "user",
    "mail_server": "gmail",
    "domain": "com"
    }
}
```

```csv
email,username,mail_server,domain
user@gmail.com,user,gmail,com
```


### URL Parse Output

| Field               | Description      | Example   |
|--------------       |---------------   |---------  |
| scheme              | Protocol         | https     |
| subdomain           | Domain prefix    | www       |
| second_level_domain | Main domain      | example   |
| top_level_domain    | Domain suffix    | com       |
| port                | Port number      | 443       |
| path                | URL path         | blog/post |
| query               | Query parameters | q=test    |
| fragment            | URL fragment     | section1  |

Example output:
```json
{"https://www.example.com:443/blog/post?q=test#section1": 
    {
    "scheme": "https",
    "subdomain": "www",
    "second_level_domain": "example",
    "top_level_domain": "com",
    "port": "443",
    "path": "blog/post",
    "query": "q=test",
    "fragment": "section1"
    }
}
```

```csv
url,scheme,subdomain,second_level_domain,top_level_domain,port,path,query,fragment
https://www.example.com:443/blog/post?q=test#section1,https,www,example,com,443,blog/post,q=test,section1
```

## Supported Formats

### Email Formats
- Standard: `example@mail.com`
- Government: `example@agency.gov.uk`

### URL Formats
- Basic: `example.com`
- With subdomain: `www.example.com`
- With scheme: `https://example.com`
- With path: `example.com/path/to/file.txt`
- With port: `example.com:8080`
- With query: `example.com/search?q=test`
- With fragment: `example.com#section1`
- IP addresses: `192.168.1.1:8080`
- Government domains: `agency.gov.uk`
- Full complex URLs: `https://www.example.gov.uk:8080/path?q=test#section1`

### Outputs
- Text file (default)
- JSON file (prettified or minified)
- CSV file
- Console output
