Metadata-Version: 2.1
Name: rollet
Version: 0.0.2a0
Summary: Collect data from various sources
Home-page: UNKNOWN
Author: Opscidia (Tech)
Author-email: tech@opscidia.com
Maintainer: Loïc Rakotoson
Maintainer-email: loic.rakotoson@opscidia.com
License: UNKNOWN
Keywords: fetch,pull,extract,scrap
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: beautifulsoup4 (>=4.9.0)
Requires-Dist: dask
Requires-Dist: pandas
Requires-Dist: tldextract (>=2.2)
Requires-Dist: tqdm
Requires-Dist: requests

# Rollet
`Rollet` collects, standardizes and completes data from various sources.

[![PyPI](https://img.shields.io/pypi/v/Rollet?logo=PyPI&style=for-the-badge&labelColor=%233775A9&logoColor=white)](https://pypi.org/project/rollet/)
![PyPI - Status](https://img.shields.io/pypi/status/rollet?style=for-the-badge)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/rollet?logo=python&logoColor=yellow&style=for-the-badge)](https://pypi.org/project/rollet/)



# Installation
## PyPI
The safest way to install `rollet` is through pip:
```bash
python -m pip install rollet
```

# How to use?
## Command script
```sh
usage: rollet {extract-txt,extract-csv,extract-json} path
              [-h] [-o [OUTFILE]] [-l [LINK]] [-f [FIELDS]] [--start [START]]
              [--size [SIZE]] [-t [TIMESLEEP]]

positional arguments:
  {extract-txt,extract-csv,extract-json} Choose file type option extraction
  path                                   file path

optional arguments:
  -h, --help                   show this help message and exit
  -o [OUTFILE], --outfile      output file path
  -l [LINK], --link            link field if csv or json
  -f [FIELDS], --fields        fields to keep separated by comma
  --start [START]              number of rows to skip
  --size  [SIZE]               max number of rows to keep
  -t [TIMESLEEP], --timesleep  sleep time in seconds between two pulling
```
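
To script the command from Python, one option is to assemble the argument list from the flags shown in the usage text. The sketch below is only an illustration: `build_rollet_command` is a hypothetical helper, not part of `rollet`, and the file names, the `url` link field and the `title`/`abstract` field names are placeholder assumptions.

```python
import shlex

# Subcommands listed in the usage text above.
SUBCOMMANDS = ("extract-txt", "extract-csv", "extract-json")


def build_rollet_command(subcommand, path, outfile=None, link=None,
                         fields=None, start=None, size=None, timesleep=None):
    """Return the argument list for one `rollet` call (hypothetical helper)."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    args = ["rollet", subcommand, str(path)]
    if outfile is not None:
        args += ["--outfile", str(outfile)]
    if link is not None:
        args += ["--link", link]
    if fields:
        # The help text asks for fields separated by commas.
        args += ["--fields", ",".join(fields)]
    if start is not None:
        args += ["--start", str(start)]
    if size is not None:
        args += ["--size", str(size)]
    if timesleep is not None:
        args += ["--timesleep", str(timesleep)]
    return args


command = build_rollet_command(
    "extract-csv", "articles.csv",   # placeholder input file
    outfile="articles_out.csv",      # placeholder output path
    link="url",                      # placeholder name of the link field
    fields=["title", "abstract"],    # assumed field names
    size=100,
    timesleep=1,
)
# Print the command as a shell-quoted string for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

With `rollet` installed, the resulting list can be passed to `subprocess.run(command, check=True)`.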

## Python
### Basic usage
```python
from rollet import get_content
from rollet.extractor import BaseExtractor

url = 'https://example.url.com/content-id'

content_dict = get_content(url)

content_object = BaseExtractor(url)
content_object.title            # Title
content_object.abstract         # Abstract
content_object.lang             # Language
content_object.content_type     # Type (pdf, json, html, ...)
content_object.to_dict()        # Same as get_content
```
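
When several URLs have to be processed, a small loop around the fetching function may be enough. The sketch below assumes that `get_content` returns a dictionary, as the `content_dict` name above suggests; `collect` and `fake_fetch` are hypothetical names used for illustration, and the stand-in fetcher only exists so the example runs without network access.

```python
import time


def collect(urls, fetch, timesleep=0.0):
    """Call fetch(url) for each URL and return (records, failures)."""
    records, failures = [], {}
    for position, url in enumerate(urls):
        if position and timesleep:
            time.sleep(timesleep)  # pause between two consecutive pulls
        try:
            record = dict(fetch(url))
        except Exception as error:
            # Broad on purpose: the exceptions rollet may raise are not documented here.
            failures[url] = repr(error)
            continue
        record.setdefault("url", url)  # keep track of where the record came from
        records.append(record)
    return records, failures


def fake_fetch(url):
    """Stand-in for rollet.get_content, used only to make the sketch runnable."""
    if url.endswith("/broken"):
        raise ValueError("could not parse page")
    return {"title": "Title of " + url, "lang": "en"}


urls = [
    "https://example.url.com/content-1",
    "https://example.url.com/broken",
    "https://example.url.com/content-2",
]
records, failures = collect(urls, fake_fetch)
print(len(records), "records,", len(failures), "failure(s)")
```

With `rollet` installed, `collect(urls, get_content, timesleep=1)` would replace the stand-in fetcher, provided the assumption about the return type holds.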

### Custom extractors
```python
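from rollet.extractor import BaseExtractor


class CustomExtractor(BaseExtractor):
    """Extractor that overrides how the title is read."""

    @property
    def title(self):
        # Override the default lookup: find the 'title' element in self._page
        return self._page.find('title')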

And more!

