Metadata-Version: 2.1
Name: zf-perse
Version: 0.1.2
Summary: perse converts HTML content into structured JSON data
Author: Zeff Muks
Author-email: zeffmuks@gmail.com
License: MIT
Description-Content-Type: text/markdown
License-File: LICENSE

# Perse

[![PyPI version](https://badge.fury.io/py/zf-perse.svg)](https://badge.fury.io/py/zf-perse)

![Perse](https://zf-static.s3.us-west-1.amazonaws.com/perse-logo128.png)

Perse converts `HTML` to `JSON` using a mix of traditional HTML parsing and LLM-based data extraction. After fetching the HTML, it applies a few optimizations that shrink the markup without accidentally removing important data.

These optimizations include:

- Removing styling, scripting, and SVG tags
- Collapsing tags (e.g. divs) that have only one child
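
The two optimizations can be illustrated with a small standard-library sketch. This is not Perse's actual implementation: the names `simplify`, `collapse`, `TreeBuilder`, and `Node` are hypothetical, and the choice to collapse only `div` and `span` wrappers whose single child is an element is an assumption made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: which tags are dropped and which wrappers may be collapsed.
DROPPED_TAGS = {"script", "style", "svg"}
WRAPPER_TAGS = {"div", "span"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal element node; children are Node objects or text strings."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree while skipping everything inside DROPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in DROPPED_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.stack[-1].children.append(data.strip())


def collapse(node):
    """Replace a wrapper that holds exactly one element with that element."""
    node.children = [collapse(c) if isinstance(c, Node) else c
                     for c in node.children]
    only_child = node.children[0] if len(node.children) == 1 else None
    if node.tag in WRAPPER_TAGS and isinstance(only_child, Node):
        return only_child  # the wrapper's own attributes are discarded
    return node


def render(node):
    """Serialize the tree back to an HTML string."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(render(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(f" {k}" if v is None else f' {k}="{escape(v)}"'
                    for k, v in node.attrs.items())
    if node.tag in VOID_TAGS:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def simplify(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return render(collapse(builder.root))


sample = ('<div class="a"><div><p>Hello</p></div>'
          '<style>p{}</style><svg><path d="M1 1"></path></svg></div>')
print(simplify(sample))  # <p>Hello</p>
```

Collapsing discards the wrapper's attributes, which is usually harmless for layout-only classes like the ones in the example below, but a real implementation would need to decide which attributes carry data worth keeping.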

## Installation

```bash
pip install zf-perse
```

## Usage

Set your OpenAI API key before running Perse:

```bash
export PERSE_OPENAI_API_KEY="your-openai-api-key"
```
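
When calling Perse from a script, it can be useful to confirm that the key is visible to the process before doing any work. The helper below is a hypothetical sketch and not part of Perse; only the variable name `PERSE_OPENAI_API_KEY` comes from this README.

```python
import os

ENV_VAR = "PERSE_OPENAI_API_KEY"


def require_api_key(env=os.environ):
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for illustration; `env` is injectable for testing.
    """
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set; export it before running perse")
    return key
```

Calling `require_api_key()` at the top of a script turns a missing key into an immediate, readable error instead of a failure later in the run.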

### CLI

```bash
perse --url https://example.com
```

### Python

```python
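import requests  # third-party: pip install requests

from perse import perse

# Fetch the page, then pass the raw HTML to perse
url = "https://example.com"
html = requests.get(url).text
j = perse(html)
print(j)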
```

## Example

### Input

```html
<!-- taken from https://zeffmuks.com -->

 <html lang="en" data-theme="light" style="color-scheme: light;">

<head>
    <meta charset="utf-8">
    <link rel="icon" href="/favicon.ico">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <meta name="theme-color" content="#000000">
    <meta name="description" content="Antifragile Entropy Assassin 🥷">
    <link rel="apple-touch-icon" href="/images/logo192.png">
    <link rel="manifest" href="/manifest.json">
    <script async="" src="https://www.googletagmanager.com/gtag/js?id=G-GNB6LQMFW3"></script>
    <script>function gtag() { dataLayer.push(arguments) } window.dataLayer = window.dataLayer || [], gtag("js", new Date), gtag("config", "G-GNB6LQMFW3")</script>
    <title>Zeff Muks</title>
    <script defer="defer" src="/static/js/main.4de0eae9.js"></script>
    <link href="/static/css/main.f6a8a2d9.css" rel="stylesheet">
    <style data-emotion="css-global" data-s=""></style>
    <style data-emotion="css-global" data-s=""></style>
    <style data-emotion="css-global" data-s=""></style>
    <style data-emotion="css" data-s=""></style>
    <meta property="og:type" content="website" data-rh="true">
    <meta property="og:title" content="Zeff Muks" data-rh="true">
    <meta property="og:description" content="Antifragile Entropy Assassin 🥷" data-rh="true">
    <meta property="og:url" content="https://www.zeffmuks.com/" data-rh="true">
    <meta property="og:image" content="https://www.zeffmuks.com/images/ZeffMuks-1920.png" data-rh="true">
    <meta property="og:site_name" content="Zeff Muks" data-rh="true">
    <meta name="twitter:card" content="summary_large_image" data-rh="true">
    <meta name="twitter:site" content="@zeffmuks" data-rh="true">
    <meta name="twitter:title" content="Zeff Muks" data-rh="true">
    <meta name="twitter:description" content="Antifragile Entropy Assassin 🥷" data-rh="true">
    <meta name="twitter:image" content="https://www.zeffmuks.com/images/ZeffMuks-1920.png" data-rh="true">
</head>

<body class="chakra-ui-light" cz-shortcut-listen="true"><noscript>You need to enable JavaScript to run this
        app.</noscript>
<div id="root">
    <div class="css-0">
        <div class="css-lt6aye">
            <div class="chakra-stack css-sqtrbi"><img src="/images/ZeffMuks-6912.png" class="chakra-image css-0">
                <h1 class="chakra-heading css-1g6enkz">Antifragile Entropy Assassin 🥷🏻</h1>
                <h2 class="chakra-heading css-shu5if"><a class="chakra-link css-spn4bz"
                        href="https://x.com/zeffmuks">𝕏</a></h2>
            </div>
        </div>
        <div class="css-1hielw0">
            <div class="chakra-stack css-5kt1vw">
                <h1 class="chakra-heading css-eh1ywz">Builds</h1>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">08/30/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <h1
                                    class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
                                    <span class="text-2xl inline-flex items-center"><img
                                            src="https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png"
                                            alt="Cursor Git" class="h-8 w-8 mr-2">Cursor Git</span>
                                </h1>
                                <p class="chakra-text css-17vaxo2">Enhanced Git for Cursor AI Editor</p>
                            </div>
                        </div>
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <div class="flex flex-row gap-2">
                                    <div><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-images">
                                            <path d="M18 22H4a2 2 0 0 1-2-2V6"></path>
                                            <path d="m22 13-1.296-1.296a2.41 2.41 0 0 0-3.408 0L11 18"></path>
                                            <circle cx="12" cy="8" r="2"></circle>
                                            <rect width="16" height="16" x="6" y="2" rx="2"></rect>
                                        </svg></div><a target="_blank" rel="noopener" class="chakra-link css-4a6x12"
                                        href="https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix"><svg
                                            xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-external-link">
                                            <path d="M15 3h6v6"></path>
                                            <path d="M10 14 21 3"></path>
                                            <path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
                                            </path>
                                        </svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                        viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                        stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
                                        <path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
                                        <polyline points="16 6 12 2 8 6"></polyline>
                                        <line x1="12" x2="12" y1="2" y2="15"></line>
                                    </svg>
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">08/18/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <h1
                                    class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
                                    <span class="text-2xl inline-flex items-center"><img
                                            src="https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png"
                                            alt="PyZF" class="h-8 w-8 mr-2">PyZF</span>
                                </h1>
                                <p class="chakra-text css-17vaxo2">Enhancements for Python</p>
                            </div>
                        </div>
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <div class="flex flex-row gap-2"><a target="_blank" rel="noopener"
                                        class="chakra-link css-4a6x12" href="https://pypi.org/project/PyZF"><svg
                                            xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-external-link">
                                            <path d="M15 3h6v6"></path>
                                            <path d="M10 14 21 3"></path>
                                            <path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
                                            </path>
                                        </svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                        viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                        stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
                                        <path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
                                        <polyline points="16 6 12 2 8 6"></polyline>
                                        <line x1="12" x2="12" y1="2" y2="15"></line>
                                    </svg></div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">08/05/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <h1
                                    class="chakra-heading hover:scale-105 transition-all hover:translate-x-3 duration-300 ease-in-out css-18nnx4p">
                                    <span class="text-2xl inline-flex items-center"><img
                                            src="https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png"
                                            alt="Xanthus" class="h-8 w-8 mr-2">Xanthus</span>
                                </h1>
                                <p class="chakra-text css-17vaxo2">X (formerly Twitter) Assistant</p>
                            </div>
                        </div>
                        <div class="min-w-full h-auto">
                            <div class="css-0">
                                <div class="flex flex-row gap-2"><a target="_blank" rel="noopener"
                                        class="chakra-link css-4a6x12"
                                        href="https://pypi.org/project/zf-xanthus"><svg
                                            xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                            viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                            stroke-linecap="round" stroke-linejoin="round"
                                            class="lucide lucide-external-link">
                                            <path d="M15 3h6v6"></path>
                                            <path d="M10 14 21 3"></path>
                                            <path d="M18 13v6a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2V8a2 2 0 0 1 2-2h6">
                                            </path>
                                        </svg></a><svg xmlns="http://www.w3.org/2000/svg" width="24" height="24"
                                        viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"
                                        stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-share">
                                        <path d="M4 12v8a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2v-8"></path>
                                        <polyline points="16 6 12 2 8 6"></polyline>
                                        <line x1="12" x2="12" y1="2" y2="15"></line>
                                    </svg></div>
                            </div>
                        </div>
                    </div>
                </div>
                <div class="hover:scale-105 transition-all duration-300 ease-in-out css-rvaw7p">
                    <div class="css-10fvfu7">
                        <p class="chakra-text css-1wrsef2">07/24/2024</p>
                    </div>
                    <div class="chakra-stack css-399av8">
                        <div class="min-w-full h-...
```

### Output

```json
{
    "title": "Zeff Muks",
    "description": "Antifragile Entropy Assassin 🥷",
    "og": {
        "type": "website",
        "title": "Zeff Muks",
        "description": "Antifragile Entropy Assassin 🥷",
        "url": "https://www.zeffmuks.com/",
        "image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
        "site_name": "Zeff Muks",
    },
    "twitter": {
        "card": "summary_large_image",
        "site": "@zeffmuks",
        "title": "Zeff Muks",
        "description": "Antifragile Entropy Assassin 🥷",
        "image": "https://www.zeffmuks.com/images/ZeffMuks-1920.png",
    },
    "main_header": "Antifragile Entropy Assassin 🥷🏻",
    "header_link": "https://x.com/zeffmuks",
    "builds": [
        {
            "date": "08/30/2024",
            "project": {
                "name": "Cursor Git",
                "description": "Enhanced Git for Cursor AI Editor",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-logo128.png",
                "download_link": "https://zf-static.s3.us-west-1.amazonaws.com/cursor-git-0.1.12.vsix",
                "external_link": "",
            },
        },
        {
            "date": "08/18/2024",
            "project": {
                "name": "PyZF",
                "description": "Enhancements for Python",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/pyzf-logo128.png",
                "download_link": "",
                "external_link": "https://pypi.org/project/PyZF",
            },
        },
        {
            "date": "08/05/2024",
            "project": {
                "name": "Xanthus",
                "description": "X (formerly Twitter) Assistant",
                "logo_url": "https://zf-static.s3.us-west-1.amazonaws.com/xanthus-logo128.png",
                "download_link": "",
                "external_link": "https://pypi.org/project/zf-xanthus",
            },
        },
        {
            "date": "07/24/2024",
            "project": {
                "name": "Jenga",
                "description": "Fast JSON5 Python Library",
                "logo_url": "",
                "download_link": "https://pypi.org/project/zf-jenga",
                "external_link": "",
            },
        },
        ...
```

## License

[MIT License](./LICENSE)
