Metadata-Version: 2.0
Name: floscraper
Version: 0.1.15a1
Summary: Simple webscraper built on top of requests and beautifulsoup
Home-page: https://github.com/the01/python-floscraper
Author: the01
Author-email: jungflor@gmail.com
License: MIT License
Keywords: floscrapper scraping web cache requests beautifulsoup
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: beautifulsoup4 (>=4.4.1,<4.5)
Requires-Dist: chardet (<2.4,>=2.3.0)
Requires-Dist: flotils (>=0.3.0a0,<0.3.1)
Requires-Dist: html2text (>=2016.1.8)
Requires-Dist: portalocker (>=0.5.5,<0.6)
Requires-Dist: python-dateutil (>=2.5.0,<2.6)
Requires-Dist: requests (<2.10,>=2.9.1)
Requires-Dist: wheel (>=0.26.0)

FLOSCRAPER
##########

A basic web scraper I use in many projects.

.. image:: https://img.shields.io/pypi/v/floscraper.svg
    :target: https://pypi.python.org/pypi/floscraper

.. image:: https://img.shields.io/pypi/l/floscraper.svg
    :target: https://pypi.python.org/pypi/floscraper

.. image:: https://img.shields.io/pypi/dm/floscraper.svg
    :target: https://pypi.python.org/pypi/floscraper


webscraper
==========
Module that simplifies fetching and scraping web pages.

**Supports**

* Cached web requests (Wrapper around requests)
* Built-in parsing/scraping (Wrapper around beautifulsoup)


**Constructor parameters**

* url: Default URL, used if no other URL is specified
* scheme: Default scheme for scraping
* timeout
* cache_directory: Where to save cache files
* cache_time: How long a cached resource stays valid, in seconds (default: 7 minutes)
* cache_use_advanced
* auth_method: Authentication method (default: HTTPBasicAuth)
* auth_username: Authentication username. If set, enables authentication
* auth_password: Authentication password
* handle_redirect: Allow redirects (default: True)
* user_agent: User agent to use
* default_user_agents_browser: Browser to set in user agent (from ``default_user_agents`` dict)
* default_user_agents_os: Operating system to set in user agent (from ``default_user_agents`` dict)
* user_agents_browser: Browser to set in user agent (overrides ``default_user_agents_browser``)
* user_agents_os: Operating system to set in user agent (overrides ``default_user_agents_os``)
* html2text: HTML2text settings
* html_parser: What html parser to use (default: html.parser - built in)

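As a rough illustration of how several of these parameters might be combined, the sketch below builds a settings dictionary in the same style as the example in this README. The keys come from the parameter list; the values (URL, credentials, timeout, directory name) are placeholders, and the assumption that ``timeout`` is given in seconds is not confirmed by this document.

.. code-block:: python

    # Hypothetical settings: keys are taken from the parameter list,
    # values are placeholders chosen for illustration only.
    settings = {
        'url': "https://github.com/",
        'timeout': 10,               # assumed to be seconds
        'cache_directory': "cache",
        'cache_time': 7 * 60,        # seconds; equals the documented default
        'auth_username': "user",     # setting a username enables authentication
        'auth_password': "secret",
        'handle_redirect': True,
        'html_parser': "html.parser",
    }

    # The dictionary would then be passed to the constructor:
    # web = WebScraper(settings)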

**Example**

.. code-block:: python

    # Set up WebScraper with caching
    # (assumes WebScraper has already been imported from the floscraper package)
    web = WebScraper({
        'cache_directory': "cache",
        'cache_time': 5*60
    })

    # First call to GitHub -> hits the internet
    web.get("https://github.com/")

    # Second call to GitHub (within 5 minutes of the first) -> hits the cache
    web.get("https://github.com/")

Which results in the following output:

::

    2016-01-07 19:22:00 DEBUG   [WebScraper._getCached] From inet https://github.com
    2016-01-07 19:22:00 INFO    [requests.packages.urllib3.connectionpool] Starting new HTTPS connection (1): github.com
    2016-01-07 19:22:01 DEBUG   [requests.packages.urllib3.connectionpool] "GET / HTTP/1.1" 200 None
    2016-01-07 19:22:01 DEBUG   [WebScraper._getCached] From cache https://github.com

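The cache decision visible in this log can be pictured as a simple age check against ``cache_time``. The following is a minimal sketch of that idea using only the standard library. It is not floscraper's actual implementation, and ``is_cache_fresh`` is a hypothetical helper name introduced for illustration.

.. code-block:: python

    import os
    import time


    def is_cache_fresh(path, cache_time, now=None):
        """Return True if the file at `path` exists and was modified
        less than `cache_time` seconds ago."""
        if not os.path.isfile(path):
            # Nothing cached yet -> the resource has to be fetched
            return False
        if now is None:
            now = time.time()
        # Age of the cached file in seconds, based on its modification time
        age = now - os.path.getmtime(path)
        return age < cache_time

A caller would fetch from the network when this returns ``False`` and read the cached file otherwise.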

.. :changelog:

History
=======

0.1.15a0 (2016-03-08)
---------------------

* First release on PyPI.


