Metadata-Version: 1.1
Name: scrapinghub-autoextract
Version: 0.1.1
Summary: Python interface to Scrapinghub Automatic Extraction API
Home-page: https://github.com/scrapinghub/scrapinghub-autoextract
Author: Mikhail Korobov
Author-email: kmike84@gmail.com
License: UNKNOWN
Description: =======================
        scrapinghub-autoextract
        =======================
        
        .. image:: https://img.shields.io/pypi/v/scrapinghub-autoextract.svg
           :target: https://pypi.python.org/pypi/scrapinghub-autoextract
           :alt: PyPI Version
        
        .. image:: https://img.shields.io/pypi/pyversions/scrapinghub-autoextract.svg
           :target: https://pypi.python.org/pypi/scrapinghub-autoextract
           :alt: Supported Python Versions
        
        .. image:: https://travis-ci.org/scrapinghub/scrapinghub-autoextract.svg?branch=master
           :target: https://travis-ci.org/scrapinghub/scrapinghub-autoextract
           :alt: Build Status
        
        .. image:: https://codecov.io/github/scrapinghub/scrapinghub-autoextract/coverage.svg?branch=master
           :target: https://codecov.io/gh/scrapinghub/scrapinghub-autoextract
           :alt: Coverage report
        
        
        Python client libraries for `Scrapinghub AutoExtract API`_.
        It lets you extract product and article information from any website.
        
        Both synchronous and asyncio wrappers are provided by this package.
        
        License is BSD 3-clause.
        
        .. _Scrapinghub AutoExtract API: https://scrapinghub.com/autoextract
        
        
        Installation
        ============
        
        ::
        
            pip install scrapinghub-autoextract
        
        scrapinghub-autoextract requires Python 3.6+ for the CLI tool and for
        the asyncio API; the basic, synchronous API works with Python 3.5.
        
        Usage
        =====
        
        First, make sure you have an API key. To avoid passing it in the ``api_key``
        argument with every call, you can set the ``SCRAPINGHUB_AUTOEXTRACT_KEY``
        environment variable to the key.
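        
        The fallback described above can be pictured with a minimal sketch: an
        explicit argument wins, and the environment variable is used otherwise.
        ``resolve_api_key`` is a hypothetical helper written for this example, not
        a function of the package, and the package's own lookup may differ::
        
            import os
        
            def resolve_api_key(api_key=None):
                """Return the explicit key if given, else the environment value."""
                if api_key is not None:
                    return api_key
                # Returns None when the variable is not set.
                return os.environ.get('SCRAPINGHUB_AUTOEXTRACT_KEY')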
        
        Command-line interface
        ----------------------
        
        The most basic way to use the client is from the command line.
        First, create a file with URLs, one URL per line (e.g. ``urls.txt``).
        Second, set the ``SCRAPINGHUB_AUTOEXTRACT_KEY`` environment variable to your
        AutoExtract API key (you can also pass the API key as the ``--api-key`` script
        argument).
        
        Then run the script to get the results::
        
            python -m autoextract urls.txt --page-type article > res.jl
        
        Run ``python -m autoextract --help`` to get a description of all supported
        options.
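        
        The ``.jl`` extension in the command above suggests JSON Lines output, one
        JSON document per line. Assuming that format, the results file could be
        loaded with a sketch like the following (``load_jsonlines`` is a
        hypothetical helper, not part of the package)::
        
            import json
        
            def load_jsonlines(path):
                """Load a JSON Lines file: one JSON document per non-empty line."""
                items = []
                with open(path, encoding='utf-8') as f:
                    for line in f:
                        line = line.strip()
                        if line:  # skip blank lines
                            items.append(json.loads(line))
                return items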
        
        Synchronous API
        ---------------
        
        The synchronous API provides an easy way to try autoextract in a script.
        For production usage, the asyncio API is strongly recommended.
        
        You can send requests as described in `API docs`_::
        
            from autoextract.sync import request_raw
            query = [{'url': 'http://example.com/foo', 'pageType': 'article'}]
            results = request_raw(query)
        
        Note that if there are several URLs in the query, results can be returned in
        arbitrary order.
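        
        If request order matters when using ``request_raw``, one option is to index
        the results by URL. The sketch below assumes each result carries its
        originating URL under ``result['query']['userQuery']['url']``; verify this
        against the API documentation before relying on it. ``order_results`` is a
        hypothetical helper, not part of the package::
        
            def _default_url(result):
                # Assumed location of the originating URL in a result.
                return result['query']['userQuery']['url']
        
            def order_results(urls, results, get_url=_default_url):
                """Return results reordered to match ``urls``; None where missing."""
                by_url = {get_url(res): res for res in results}
                return [by_url.get(url) for url in urls]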
        
        There is also an ``autoextract.sync.request_batch`` helper, which accepts URLs
        and a page type, and ensures results are in the same order as the requested URLs::
        
            from autoextract.sync import request_batch
            urls = ['http://example.com/foo', 'http://example.com/bar']
            results = request_batch(urls, page_type='article')
        
        .. note::
            Currently ``request_batch`` is limited to 100 URLs at a time.
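        
        To process a longer list with the synchronous API, one option is to split it
        into chunks of at most 100 URLs and call ``request_batch`` once per chunk.
        A minimal sketch of the splitting step (``chunked`` is a hypothetical
        helper, not part of the package)::
        
            def chunked(urls, size=100):
                """Yield consecutive slices of ``urls`` with at most ``size`` items."""
                for start in range(0, len(urls), size):
                    yield urls[start:start + size]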
        
        .. _API docs: https://doc.scrapinghub.com/autoextract.html
        
        
        asyncio API
        -----------
        
        Basic usage is similar to the sync API (``request_raw``),
        but an asyncio event loop is used::
        
            from autoextract.aio import request_raw
        
            async def foo():
                results1 = await request_raw(query)
                # ...
        
        There is also a ``request_parallel`` function, which lets you process
        many URLs in parallel, using both batching and multiple connections::
        
            import sys
            from autoextract.aio import request_parallel, create_session, ApiError
        
            async def foo():
                async with create_session() as session:
                    res_iter = request_parallel(urls, page_type='article',
                                                n_conn=10, batch_size=3,
                                                session=session)
                    for f in res_iter:
                        try:
                            batch_result = await f
                            for res in batch_result:
                                # do something with a result
                                print(res)
                        except ApiError as e:
                            print(e, file=sys.stderr)
                            raise
        
        The ``request_parallel`` and ``request_raw`` functions handle throttling
        (HTTP 429 errors) and network errors, retrying the request in these cases.
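        
        The retry policy used by the package is not described here. As a generic
        illustration of retrying on throttling, the sketch below uses exponential
        backoff; ``Throttled`` and ``retry_with_backoff`` are hypothetical names
        made up for this example, not part of the package::
        
            import asyncio
        
            class Throttled(Exception):
                """Hypothetical stand-in for an HTTP 429 response."""
        
            async def retry_with_backoff(make_request, max_attempts=5, base_delay=1.0):
                """Await ``make_request()``, retrying when it raises Throttled."""
                for attempt in range(max_attempts):
                    try:
                        return await make_request()
                    except Throttled:
                        if attempt == max_attempts - 1:
                            raise  # out of attempts, propagate the error
                        # Wait 1x, 2x, 4x, ... the base delay before retrying.
                        await asyncio.sleep(base_delay * 2 ** attempt)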
        
        The CLI implementation (``autoextract/__main__.py``) can serve
        as a usage example.
        
        Contributing
        ============
        
        * Source code: https://github.com/scrapinghub/scrapinghub-autoextract
        * Issue tracker: https://github.com/scrapinghub/scrapinghub-autoextract/issues
        
        Use tox_ to run tests with different Python versions::
        
            tox
        
        The command above also runs type checks; we use mypy.
        
        .. _tox: https://tox.readthedocs.io
        
        
        Changes
        =======
        
        0.1.1 (2020-03-12)
        ------------------
        
        * allow up to 100 elements in a batch, not up to 99
        * custom User-Agent header is added
        * Python 3.8 support is declared & tested
        
        0.1 (2019-10-09)
        ----------------
        
        Initial release.
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
