Metadata-Version: 2.0
Name: scrape
Version: 0.1.1
Summary: a web scraping tool
Home-page: https://github.com/huntrar/scrape
Author: Hunter Hammond
Author-email: huntrar@gmail.com
License: MIT
Keywords: scrape webpage website pdf text keyword crawl save page filter regex lxml html
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Requires-Dist: lxml
Requires-Dist: pdfkit

# scrape

a web scraping tool

## Installation
* `pip install scrape`

## Usage
    usage: scrape.py [-h] [-c [CRAWL [CRAWL ...]]] [-ca]
                     [-f [FILTER [FILTER ...]]] [-l LIMIT] [-p] [-r] [-v] [-vb]
                     [urls [urls ...]]

    a web scraping tool

    positional arguments:
      urls                  urls to scrape

    optional arguments:
      -h, --help            show this help message and exit
      -c [CRAWL [CRAWL ...]], --crawl [CRAWL [CRAWL ...]]
                            keywords to crawl links by
      -ca, --crawl-all      crawl all links
      -f [FILTER [FILTER ...]], --filter [FILTER [FILTER ...]]
                            filter lines of text by keywords
      -l LIMIT, --limit LIMIT
                            set crawl page limit
      -p, --pdf             write to pdf instead of text
      -r, --restrict        restrict domain to that of the seed url
      -v, --version         display current version
      -vb, --verbose        print pdfkit log messages
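
The option list above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the tool's actual source: the function name `build_parser`, the `int` type for `--limit`, and the use of `store_true` for the boolean flags are all assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options in the help text above."""
    parser = argparse.ArgumentParser(description='a web scraping tool')
    parser.add_argument('urls', nargs='*', help='urls to scrape')
    parser.add_argument('-c', '--crawl', nargs='*',
                        help='keywords to crawl links by')
    parser.add_argument('-ca', '--crawl-all', action='store_true',
                        help='crawl all links')
    parser.add_argument('-f', '--filter', nargs='*',
                        help='filter lines of text by keywords')
    # Assumption: the limit is parsed as an integer.
    parser.add_argument('-l', '--limit', type=int,
                        help='set crawl page limit')
    parser.add_argument('-p', '--pdf', action='store_true',
                        help='write to pdf instead of text')
    parser.add_argument('-r', '--restrict', action='store_true',
                        help='restrict domain to that of the seed url')
    parser.add_argument('-v', '--version', action='store_true',
                        help='display current version')
    parser.add_argument('-vb', '--verbose', action='store_true',
                        help='print pdfkit log messages')
    return parser


# Example: urls come first so that --crawl does not swallow them.
args = build_parser().parse_args(
    ['http://example.com', '-c', 'docs', '-l', '5', '-r'])
```

Because `--crawl` and `--filter` accept any number of values, placing the urls before those options avoids having a url parsed as a keyword.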

## Author
* Hunter Hammond (huntrar@gmail.com)

## Notes
* Use --pdf to save web pages as PDFs; by default they are saved as text.

* Text can be filtered by passing one or more regexps to --filter.

* To crawl subsequent pages, pass --crawl followed by one or more regexps, or pass --crawl-all to follow every link.

* To restrict crawling to the seed url's domain, use --restrict; otherwise links to any domain may be followed.

* The number of pages crawled is unlimited unless a limit is set with --limit. To stop crawling early and begin processing, press Ctrl-C.
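
The filtering and crawl rules in the notes above can be illustrated with a short sketch. It is not taken from the tool's source: the helper names `filter_lines` and `select_links`, their parameters, and the choice to treat an empty pattern list as "follow every link" are assumptions for illustration only.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def filter_lines(lines, patterns):
    """Keep only the lines that match at least one regexp (cf. --filter)."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines
            if any(rx.search(line) for rx in compiled)]


def select_links(links, seed_url, crawl_patterns=None,
                 restrict=False, limit=None):
    """Choose which links to follow.

    crawl_patterns: regexps a link must match (cf. --crawl); an empty
        or missing list follows every link (assumed to act like --crawl-all).
    restrict: keep only links on the seed url's domain (cf. --restrict).
    limit: maximum number of links to return (cf. --limit).
    """
    seed_domain = urlparse(seed_url).netloc
    compiled = [re.compile(p) for p in (crawl_patterns or [])]
    selected = []
    for link in links:
        if limit is not None and len(selected) >= limit:
            break
        if restrict and urlparse(link).netloc != seed_domain:
            continue
        if compiled and not any(rx.search(link) for rx in compiled):
            continue
        selected.append(link)
    return selected


lines = ['Python is fun', 'nothing here', 'lxml parses html']
links = ['http://example.com/docs/a', 'http://other.org/docs/b',
         'http://example.com/blog', 'http://example.com/docs/c']

kept = filter_lines(lines, [r'[Pp]ython', r'lxml'])
followed = select_links(links, 'http://example.com', ['docs'], restrict=True)
```

Here `kept` holds the two lines that mention Python or lxml, and `followed` holds only the `/docs/` links on `example.com`.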



News
====

0.1.1
------
 - uncommented import __version__

0.1.0
------

 - reformatting to conform with PEP 8
 - added regexp support for matching crawl keywords and filter text keywords
 - improved url resolution by correcting domains and schemes
 - added --restrict option to restrict crawler links to only those with seed domain
 - made text the default write option rather than pdf, can now use --pdf to change that
 - removed page number being written to text, separator is now just a single blank line
 - improved construction of output file name

0.0.11
------

 - fixed missing comma in install_requires in setup.py
 - also labeled now as beta as there are still some kinks with crawling

0.0.10
------

 - now ignoring pdfkit load errors only when there is more than one link, to prevent an empty pdf being created in case of error

0.0.9
------

 - pdfkit now ignores load errors and writes as many pages as possible

0.0.8
------

 - better implementation of crawler, can now scrape entire websites
 - added OrderedSet class to utils.py

0.0.7
------

 - changed --keywords to --filter and positional arg url to urls

0.0.6
------

 - use --keywords flag for filtering text
 - can pass multiple links now
 - will not write empty files anymore

0.0.5
------

 - added --verbose argument for use with pdfkit
 - improved output file name processing

0.0.4
------

 - accepts 0 or 1 urls, allowing a call with just --version

0.0.3
------

 - Moved utils.py to scrape/

0.0.2
------

 - First entry




