Metadata-Version: 1.0
Name: spydey
Version: 0.5
Summary: A simple web spider with pluggable recursion strategies
Home-page: http://github.com/slinkp/spydey
Author: Paul M. Winkler
Author-email: slinkp@gmail.com
License: MIT
Description: Spydey
        =======
        
        A simple web spider with several recursion strategies.
        Home page is at http://github.com/slinkp/spydey.
        
        It doesn't do much except follow links and report status.  I mostly
        use it for quick and dirty smoke testing and link checking.
        
        The one distinctive feature is the ``--traversal=pattern`` option, which
        does recursive traversal in an unusual order: it tries to recognize
        patterns in URLs, and will follow URLs of novel patterns before those
        with patterns it has seen before.  When there are no novel patterns to
        follow, it follows random links to URLs of known patterns. If you use
        this for smoke-testing a typical modern web app that maps URL
        patterns to views/controllers, this will very quickly hit all your
        views/controllers at least once... usually.  But it's not very
        interesting when pointed at a website that has arbitrarily deep trees
        (static files, VCS repositories, and the like).
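        
        As a rough sketch of the idea (not Spydey's actual code: the
        ``url_pattern`` heuristic and the ``pick_next`` helper below are
        hypothetical, and the real pattern recognition may well differ), a
        "novel patterns first, then random" chooser could look like::
        
          import random
          import re
          from urllib.parse import urlparse
        
        
          def url_pattern(url):
              """Reduce a URL to a coarse pattern (assumed heuristic).
        
              Purely numeric path segments become a placeholder, so that
              /users/1/edit and /users/42/edit share one pattern.
              """
              path = urlparse(url).path
              segments = [re.sub(r"^\d+$", "<N>", seg) for seg in path.split("/")]
              return "/".join(segments)
        
        
          def pick_next(queue, seen_patterns, rng=random):
              """Pop the next URL: novel patterns first, otherwise a random one."""
              if not queue:
                  raise IndexError("queue is empty")
              for i, url in enumerate(queue):
                  if url_pattern(url) not in seen_patterns:
                      break  # first URL whose pattern has not been visited yet
              else:
                  # Every queued URL has a known pattern; fall back to random.
                  i = rng.randrange(len(queue))
              url = queue.pop(i)
              seen_patterns.add(url_pattern(url))
              return url
        
        Given a queue of ``/users/1/edit``, ``/users/2/edit`` and ``/about``
        on the same host, this visits ``/users/1/edit`` first, then skips
        ahead to ``/about`` because its pattern is new, and only then comes
        back to ``/users/2/edit``.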
        
        Also, it's designed so that adding a new recursion strategy is
        trivial. Spydey was originally written for the purpose of
        experimenting with different recursive crawling strategies. Read the
        source.
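        
        To illustrate why a new strategy can be small, here is a hedged
        sketch in which a strategy is just a rule for which queued URL comes
        out next. The class and method names are made up for illustration and
        are not Spydey's real API; check the source for the actual
        interface::
        
          import random
          from collections import deque
        
        
          class BreadthFirst:
              """Visit URLs in the order they were discovered (FIFO)."""
        
              def __init__(self):
                  self.queue = deque()
        
              def add(self, url):
                  self.queue.append(url)
        
              def next(self):
                  return self.queue.popleft()
        
        
          class DepthFirst(BreadthFirst):
              """Visit the most recently discovered URL first (LIFO)."""
        
              def next(self):
                  return self.queue.pop()
        
        
          class RandomOrder(BreadthFirst):
              """Visit queued URLs in random order."""
        
              def next(self):
                  i = random.randrange(len(self.queue))
                  url = self.queue[i]
                  del self.queue[i]
                  return url
        
        In this sketch each new strategy only overrides ``next()``; the
        crawl loop that fetches pages and calls ``add()`` for every link
        found stays the same.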
        
        Oh, and if you install Fabulous, console output is in color.
        
        For lazy, zero-configuration smoke testing, I typically run it like this::
        
          spydey -r --stop-on-error --max-requests=200 --traversal=pattern --profile --log-referrer URL
        
        There are a number of other command-line options, many stolen from
        wget. Use ``--help`` to see what they are.
        
        Usage
        =======
        
        ::
        
         Usage: spydey [options] URL
         
         Options:
           -h, --help            show this help message and exit
           -r, --recursive       Recur into subdirectories
           -p, --page-requisites
                                 Get all images, etc. needed to display HTML page.
           --no-parent           Don't ascend to the parent directory.
           -R REJECT, --reject=REJECT
                                 Regex for filenames to reject. May be given multiple
                                 times.
           -A ACCEPT, --accept=ACCEPT
                                 Regex for filenames to accept. May be given multiple
                                 times.
           -t TRAVERSAL, --traversal=TRAVERSAL, --traverse=TRAVERSAL
                                 Recursive traversal strategy. Choices are: breadth-
                                 first, depth-first, hybrid, pattern, random
           -H, --span-hosts      Go to foreign hosts when recursive.
           -w WAIT, --wait=WAIT  Wait SECONDS between retrievals.
           --random-wait=RANDOM_WAIT
                                 Wait from 0...2*WAIT secs between retrievals.
           --loglevel=LOGLEVEL   Log level.
           --log-referrer, --log-referer
                                 Log referrer URL for each request.
           --transient-log       Use Fabulous transient logging config.
           --max-redirect=MAX_REDIRECT
                                 Maximum number of redirections to follow for a
                                 resource.
           --max-requests=MAX_REQUESTS
                                 Maximum number of requests to make before exiting. (-1
                                 used with --traversal=pattern means exit when out of
                                 new patterns)
           --stop-on-error       Stop after the first HTTP error (response code 400 or
                                 greater).
           -T TIMEOUT, --timeout=TIMEOUT
                                 Set the network timeout in seconds. 0 means no
                                 timeout.
           -P, --profile         Print the time to download each resource, and a
                                 summary of the 20 slowest at the end.
           --stats               Print a summary of traversal patterns, if
                                 --traversal=pattern
           -v, --version         Print version information and exit.
         
        Changelog
        =========
        
        0.5
        ---
        
        * Remove useless pattern stats unless ``--stats`` is given
        * Fix to prevent spanning hosts when following redirects, unless ``-H`` is on.
        
        0.4
        ---
        
        * Add ``--stop-on-error`` option
        * Add ``--max-requests=-1`` to mean stop after all patterns are seen (when used with ``--traversal=pattern``)
        * Add usage text automatically to pkg info
        
        
        0.3
        ---
        
        * Better redirect handling: obeys ``-A``, ``-R``, ``--max-redirect``, and ``--max-requests`` options
        * Minor bugfixes and refactoring
        
Platform: UNKNOWN
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Site Management :: Link Checking
