Metadata-Version: 2.1
Name: scrapd
Version: 1.2.0
Summary: Extract data from APD news site
Home-page: https://rgreinho.github.io/scrapd/
Author: rgreinho
Author-email: remy.greinhofer@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Classifier: Topic :: Utilities
Requires-Dist: aiohttp (==3.5.4)
Requires-Dist: anyconfig (==0.9.8)
Requires-Dist: click (==7.0)
Requires-Dist: dateparser (==0.7.0)
Requires-Dist: google-api-python-client (==1.7.8)
Requires-Dist: google-auth-httplib2 (==0.0.3)
Requires-Dist: google-auth-oauthlib (==0.2.0)
Requires-Dist: gspread (==3.1.0)
Requires-Dist: loguru (==0.2.4)
Requires-Dist: lxml (==4.3.0)
Requires-Dist: pbr (==5.1.1)
Requires-Dist: tabulate (==0.8.2)
Requires-Dist: oauth2client
Requires-Dist: PyOpenSSL

ScrAPD
======

.. image:: https://badge.fury.io/py/scrapd.svg
   :target: https://badge.fury.io/py/scrapd

.. image:: https://circleci.com/gh/rgreinho/scrapd.svg?style=svg
   :target: https://circleci.com/gh/rgreinho/scrapd

.. image:: https://coveralls.io/repos/github/rgreinho/scrapd/badge.svg?branch=master
   :target: https://coveralls.io/github/rgreinho/scrapd?branch=master


Extract data from the `APD news site <http://austintexas.gov/department/news/296>`_.

ScrAPD is a small utility designed to help organizations retrieve traffic fatality data in a user-friendly manner.

Installation
------------

ScrAPD requires Python 3.7+ to work.

::

  pip install scrapd

Quickstart
----------
Collect all the data as CSV::

  scrapd retrieve --format csv

By default, scrapd does not display anything until it is done collecting the data. To get feedback about the
process, enable logging by adding `-v` **BEFORE** the command you want to use. Multiple `-v` options increase the
verbosity, up to a maximum of 3 (`-vvv`)::

  scrapd -v retrieve --format csv

To save the results to a file, use shell redirection::

  scrapd -v retrieve --format csv > results.csv

.. note::

  The logs are written to `stderr` and will not appear in the result file generated by the redirection. To include
  them, append `2>&1` to the command.
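
The behavior described in the note can be reproduced with a minimal Python sketch. It does not call scrapd: a
stand-in child process (an assumption, so that the example runs anywhere) writes one line to each stream, and
`subprocess.STDOUT` plays the role of the shell's `2>&1`.

.. code-block:: python

  import subprocess
  import sys

  # Stand-in for a CLI such as scrapd: results go to stdout, logs to stderr.
  CHILD = "import sys; print('result row'); print('log line', file=sys.stderr)"


  def run(merge_streams: bool) -> str:
      """Return what a `> file` redirection would capture."""
      completed = subprocess.run(
          [sys.executable, "-c", CHILD],
          stdout=subprocess.PIPE,
          # subprocess.STDOUT is the equivalent of the shell's 2>&1.
          stderr=subprocess.STDOUT if merge_streams else subprocess.PIPE,
          text=True,
          check=True,
      )
      return completed.stdout


  results_only = run(merge_streams=False)
  results_and_logs = run(merge_streams=True)

Without merging, the captured text holds only the result row. With merging, it holds both lines, in an order
that depends on buffering.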

Examples
^^^^^^^^

Retrieve the traffic fatalities that happened between January 15th 2019 and January 18th 2019, and output the results
in `json`::

  scrapd retrieve --from "Jan 15 2019" --to "Jan 18 2019" --format json

  [
    {
      "Age": 31,
      "Case": "19-0150158",
      "DOB": "07/09/1987",
      "Date": "January 15, 2019",
      "Ethnicity": "White",
      "Fatal crashes this year": "1",
      "First Name": "Hilburn",
      "Gender": "male",
      "Last Name": "Sell",
      "Link": "http://austintexas.gov/news/traffic-fatality-1-4",
      "Location": "10500 block of N IH 35 SB",
      "Time": "6:20 a.m."
    },
    {
      "Age": 58,
      "Case": "19-0161105",
      "DOB": "02/15/1960",
      "Date": "January 16, 2019",
      "Ethnicity": "White",
      "Fatal crashes this year": "2",
      "First Name": "Ann",
      "Gender": "female",
      "Last Name": "Bottenfield-Seago",
      "Link": "http://austintexas.gov/news/traffic-fatality-2-3",
      "Location": "West William Cannon Drive and Ridge Oak Road",
      "Time": "3:42 p.m."
    }
  ]
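
The JSON output can be post-processed with the standard library alone. The sketch below is one possible approach,
not part of ScrAPD: it embeds two records from the sample above (trimmed to a few fields), and the `load_entries`
helper is a hypothetical name introduced for illustration.

.. code-block:: python

  import json
  from datetime import datetime

  # Two records copied from the sample output above, trimmed to a few fields.
  SAMPLE = """
  [
    {"Age": 31, "Case": "19-0150158", "Date": "January 15, 2019",
     "Fatal crashes this year": "1", "Time": "6:20 a.m."},
    {"Age": 58, "Case": "19-0161105", "Date": "January 16, 2019",
     "Fatal crashes this year": "2", "Time": "3:42 p.m."}
  ]
  """


  def load_entries(raw: str) -> list:
      """Parse the JSON text and convert string fields to richer types."""
      entries = json.loads(raw)
      for entry in entries:
          # "Date" is a string such as "January 15, 2019".
          entry["Date"] = datetime.strptime(entry["Date"], "%B %d, %Y").date()
          # The yearly counter is serialized as a string in the sample.
          entry["Fatal crashes this year"] = int(entry["Fatal crashes this year"])
      return entries


  entries = load_entries(SAMPLE)
  average_age = sum(e["Age"] for e in entries) / len(entries)

In practice the text would come from a file produced by the redirection shown in the Quickstart, for instance
`open("results.json").read()`, where the file name is only an example.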

Run the same search, but output the results as CSV::

    scrapd retrieve --from "Jan 15 2019" --to "Jan 18 2019" --format csv


    Fatal crashes this year,Case,Date,Time,Location,First Name,Last Name,Ethnicity,Gender,DOB,Age,Link
    1,19-0150158,"January 15, 2019",6:20 a.m.,10500 block of N IH 35 SB,Hilburn,Sell,White,male,07/09/1987,31,http://austintexas.gov/news/traffic-fatality-1-4
    2,19-0161105,"January 16, 2019",3:42 p.m.,West William Cannon Drive and Ridge Oak Road,Ann,Bottenfield-Seago,White,female,02/15/1960,58,http://austintexas.gov/news/traffic-fatality-2-3
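
The `Date` column contains a comma and is therefore quoted, so splitting lines on commas by hand would break the
rows. A minimal sketch with Python's `csv` module, using the header and the first row of the sample above:

.. code-block:: python

  import csv
  import io

  # Header and first row copied from the sample output above.
  SAMPLE = (
      "Fatal crashes this year,Case,Date,Time,Location,First Name,Last Name,"
      "Ethnicity,Gender,DOB,Age,Link\n"
      '1,19-0150158,"January 15, 2019",6:20 a.m.,10500 block of N IH 35 SB,'
      "Hilburn,Sell,White,male,07/09/1987,31,"
      "http://austintexas.gov/news/traffic-fatality-1-4\n"
  )

  # DictReader maps each row to the column names from the header line and
  # handles the quoted "Date" field, which contains a comma.
  rows = list(csv.DictReader(io.StringIO(SAMPLE)))
  first = rows[0]

Every value is read as a string, so numeric columns such as `Age` need an explicit `int()` conversion. To read a
saved file instead, pass `open("results.csv", newline="")` to `csv.DictReader`.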

Retrieve all the traffic fatalities from 2019 (*as of Jan 20th 2019*) in `json`, with logging enabled to follow
the progress::

  scrapd -v retrieve --from "1 1 2019" --format json

  Fetching page 1...
  Fetching page 2...
  Total: 2
  [
    {
      "Age": 31,
      "Case": "19-0150158",
      "DOB": "07/09/1987",
      "Date": "January 15, 2019",
      "Ethnicity": "White",
      "Fatal crashes this year": "1",
      "First Name": "Hilburn",
      "Gender": "male",
      "Last Name": "Sell",
      "Link": "http://austintexas.gov/news/traffic-fatality-1-4",
      "Location": "10500 block of N IH 35 SB",
      "Time": "6:20 a.m."
    },
    {
      "Age": 58,
      "Case": "19-0161105",
      "DOB": "02/15/1960",
      "Date": "January 16, 2019",
      "Ethnicity": "White",
      "Fatal crashes this year": "2",
      "First Name": "Ann",
      "Gender": "female",
      "Last Name": "Bottenfield-Seago",
      "Link": "http://austintexas.gov/news/traffic-fatality-2-3",
      "Location": "West William Cannon Drive and Ridge Oak Road",
      "Time": "3:42 p.m."
    }
  ]

Export the results to Google Sheets::

  scrapd -v retrieve \
    --from "Feb 1 2019" \
    --format gsheets \
    --gcredentials creds.json \
    --gcontributors "remy.greinhofer@gmail.com:user:writer"

Speed and accuracy
------------------

ScrAPD executes all its requests asynchronously, which keeps the total processing time short.
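
As a rough illustration of why this helps, the sketch below runs ten simulated requests concurrently with
`asyncio`. It is not ScrAPD's code: `fetch_page` and `fetch_all` are hypothetical names, and the "request" is
only a 0.1 second sleep so that the example needs no network access.

.. code-block:: python

  import asyncio
  import time


  async def fetch_page(number: int) -> str:
      """Hypothetical stand-in for an HTTP request; it only sleeps."""
      await asyncio.sleep(0.1)
      return f"page {number}"


  async def fetch_all(count: int) -> list:
      # gather() schedules all the coroutines at once and returns their
      # results in the order they were passed in.
      return await asyncio.gather(*(fetch_page(n) for n in range(1, count + 1)))


  start = time.perf_counter()
  pages = asyncio.run(fetch_all(10))
  elapsed = time.perf_counter() - start

Run one after another, the ten sleeps would take at least one second. Run concurrently, they overlap and the
total stays close to the duration of a single one.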

It parses the information using both the text of the report itself and the Twitter tweet stored in the page metadata.
Combining these two methods provides a high degree of confidence in the parsing and yields a **90% success
rate**.

Some statistics:

* 125 entries in total
* 112 entries correctly parsed (90%)

  * 105 entries fully parsed (85%)
  * 7 entries where the fatalities were unidentified or had no info (5%)

* 7 entries failed the parsing (bug or incorrect regex) (5%)
* 6 entries were using natural language instead of field-like organization (5%)

  * e.g. "54 years of age" or "42 years old" instead of "DOB: 01/02/1972"

* processing time: ~1m40s
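
The difference between field-like and natural-language entries can be sketched as follows. The regular
expressions and the `extract_age` helper are illustrative assumptions, not the patterns ScrAPD actually uses:
the function first looks for a `DOB:` field and falls back to phrases such as "54 years of age".

.. code-block:: python

  import re
  from datetime import date, datetime

  # Field-like form, e.g. "DOB: 01/02/1972".
  DOB_FIELD = re.compile(r"DOB:\s*(\d{1,2}/\d{1,2}/\d{4})")
  # Natural-language forms, e.g. "54 years of age" or "42 years old".
  AGE_PHRASE = re.compile(r"(\d{1,3})\s+years?\s+(?:of\s+age|old)", re.IGNORECASE)


  def extract_age(text: str, crash_date: date):
      """Return the age at crash_date, or None when nothing matches."""
      match = DOB_FIELD.search(text)
      if match:
          dob = datetime.strptime(match.group(1), "%m/%d/%Y").date()
          # Subtract one year if the birthday has not happened yet.
          before_birthday = (crash_date.month, crash_date.day) < (dob.month, dob.day)
          return crash_date.year - dob.year - before_birthday
      match = AGE_PHRASE.search(text)
      if match:
          return int(match.group(1))
      return None


  crash = date(2019, 1, 15)
  ages = [
      extract_age("DOB: 07/09/1987", crash),
      extract_age("The driver was 54 years of age.", crash),
      extract_age("a 42 years old pedestrian", crash),
      extract_age("no details released", crash),
  ]

The first call uses the date of birth and crash date of the first sample entry and yields 31, the age shown in
that entry. The last call returns `None`, which corresponds to an entry with no usable information.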

Who uses ScrAPD?
----------------

The Austin `Pedestrian Advisory Council <http://austintexas.gov/cityclerk/boards_commissions/meetings/121_1.htm>`_
used ScrAPD to compile a detailed presentation on the state of traffic deaths in Austin, TX:

* `2018 PAC retrospective presentation <http://www.austintexas.gov/edims/document.cfm?id=314367>`_



