Metadata-Version: 2.1
Name: url-metadata
Version: 0.1.2
Summary: A cache which saves URL metadata and summarizes content
Home-page: https://github.com/seanbreckenridge/url_metadata
Author: Sean Breckenridge
Author-email: seanbrecke@gmail.com
License: http://www.apache.org/licenses/LICENSE-2.0
Keywords: url cache metadata youtube subtitles
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Description-Content-Type: text/markdown
Requires-Dist: appdirs (>=1.4.4)
Requires-Dist: lassie (>=0.11.7)
Requires-Dist: readability-lxml (>=0.8.1)
Requires-Dist: backoff (>=1.10.0)
Requires-Dist: beautifulsoup4 (>=4.5.3)
Requires-Dist: click (>=7.1.2)
Requires-Dist: logzero
Provides-Extra: testing
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: mypy ; extra == 'testing'
Requires-Dist: vcrpy ; extra == 'testing'

[![PyPi version](https://img.shields.io/pypi/v/url_metadata.svg)](https://pypi.python.org/pypi/url_metadata) [![Python3.7|Python 3.8](https://img.shields.io/pypi/pyversions/url_metadata.svg)](https://pypi.python.org/pypi/url_metadata) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)

A cache which saves URL metadata and summarizes content

This is meant to provide more context to any of my tools which use URLs. If I [watched some youtube video](https://github.com/seanbreckenridge/mpv-sockets/blob/master/DAEMON.md) and I have a URL, I'd like to have the subtitles for it, so I can do a text-search over all the videos I've watched. If I [read an article](https://github.com/seanbreckenridge/ffexport), I want the article text! This requests, parses and abstracts away that data for me locally, so I can just do:

```python
>>> from url_metadata import metadata
>>> m = metadata("https://pypi.org/project/beautifulsoup4/")
>>> len(m.info["images"])
46
>>> m.info["title"]
'beautifulsoup4'
>>> m.text_summary[:57]
"Beautiful Soup is a library that makes it easy to scrape"
```

If I ever request the same URL again, that info is grabbed from a local directory cache instead.

---

## Installation

Requires `python3.7+`

To install with pip, run:

    pip install url_metadata

---

This uses:

- [`lassie`](https://github.com/michaelhelmick/lassie) to get generic metadata; the title, description, opengraph information, links to images/videos on the page
- [`readability`](https://github.com/buriy/python-readability) to convert HTML to a summary of the HTML content.
- [`bs4`](https://pypi.org/project/beautifulsoup4/) to convert the parsed HTML to text (to allow for nicer text searching)
- [`youtube_subtitles_downloader`](https://github.com/seanbreckenridge/youtube_subtitles_downloader) to get manual/autogenerated captions (converted to a `.srt` file) from Youtube URLs.

---

### Usage:

In Python, this can be configured by using the `url_metadata.URLMetadataCache` class:

```python
   URLMetadataCache(loglevel: int = logging.WARNING,
                    subtitle_language: str = 'en',
                    sleep_time: int = 5,
                    cache_dir: Union[str, pathlib.Path, NoneType] = None)
       """
       Main interface to the library

       Supply 'cache_dir' to overwrite the default location.
       """

   get(self, url: str) -> url_metadata.model.Metadata
       """
       Gets metadata/summary for a URL.
       Save the parsed information in a local data directory
       If the URL already has cached data locally, returns that instead.
       """

   in_cache(self, url: str) -> bool
       """
       Returns True if the URL already has cached information
       """

   request_data(self, url: str) -> url_metadata.model.Metadata
       """
       Given a URL:

       If this is a youtube URL, this requests youtube subtitles
       Uses lassie to grab metadata
       Parses the HTML text with readablity
       uses bs4 to parse that text into a plaintext summary
       """
```

For example:

```python
import logging
from url_metadata import URLMetadataCache

# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLMetadataCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/mydata")
c = cache.get("https://github.com/seanbreckenridge/url_metadata")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")
```

---

The CLI interface lets you specify much the same; and provides some utility commands to get/list information from the cache.

```
$ url_metadata
Usage: url_metadata [OPTIONS] COMMAND [ARGS]...

Options:
  --cache-dir PATH          Override default directory cache location
  --debug / --no-debug      Increase log verbosity
  --sleep-time INTEGER      How long to sleep between requests
  --subtitle-language TEXT  Subtitle language for Youtube captions
  --help                    Show this message and exit.

Commands:
  cachedir  Prints the location of the local cache directory
  export    Print all cached information as JSON
  get       Get information for one or more URLs
  list      List all cached URLs
```
### CLI Examples

The `get` command emits `JSON`, so it could with other tools (e.g. [`jq`](https://stedolan.github.io/jq/)) used like:

```shell
$ url_metadata get "https://click.palletsprojects.com/en/7.x/arguments/" \
    | jq -r '.[] | .text_summary' | head -n5
Arguments
Arguments work similarly to options but are positional.
They also only support a subset of the features of options due to their
syntactical nature. Click will also not attempt to document arguments for
you and wants you to document them manually
```

```shell
$ url_metadata export | jq -r '.[] | .info | .title'
seanbreckenridge/youtube_subtitles_downloader
Arguments — Click Documentation (7.x)
```

```shell
$ url_metadata list --location
/home/sean/.local/share/url_metadata/data/b/a/a/c8e05501857a3c7d2d1a94071c68e/000
/home/sean/.local/share/url_metadata/data/9/4/4/1c380792a3d62302e1137850d177b/000
```

```shell
# to make a backup of the cache directory
$ tar -cvzf url_metadata.tar.gz "$(url_metadata cachedir)"
```

Accessible through the `url_metadata` script and `python3 -m url_metadata`

---

This stores all of this information as individual files in a cache directory (using [`appdirs`](https://github.com/ActiveState/appdirs)). In particular, it `MD5` hashes the URL and stores information like:

```
.
└── 7
    └── b
        └── 0
            └── d952fd7265e8e4cf10b351b6f8932
                └── 000
                    ├── epoch_timestamp.txt
                    ├── key
                    ├── metadata.json
                    ├── subtitles.srt
                    ├── summary.html
                    └── summary.txt
```

You're free to delete any of the directories in the cache if you want, this doesn't maintain a strict index, it uses a hash of the URL and then searches for a matching `key` file. See comments [here](https://github.com/seanbreckenridge/url_metadata/blob/master/src/url_metadata/cache.py) for implementation details.

By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I do that same loop, it doesn't have to make any requests and it just grabs all the info from local cache.

Originally created for [`HPI`](https://github.com/seanbreckenridge/HPI).

---

### Testing

    git clone 'https://github.com/seanbreckenridge/url_metadata'
    cd ./url_metadata
    git submodule update --init
    pip install '.[testing]'
    mypy ./src/url_metadata/
    pytest


