Metadata-Version: 1.0
Name: transmogrify.webcrawler
Version: 1.2
Summary: Crawling and feeding html content into a transmogrifier pipeline
Home-page: http://github.com/collective/transmogrify.webcrawler
Author: Dylan Jay
Author-email: software@pretaweb.com
License: GPL
Description: Crawling - html to import
        =========================
        
        `transmogrify.webcrawler` will crawl html to extract pages and files as a source for your transmogrifier pipeline.
        `transmogrify.webcrawler.typerecognitor` aids in setting '_type' based on the crawled mimetype.
        `transmogrify.webcrawler.cache` helps speed up crawling and reduce memory usage by storing items locally.
        
        These blueprints are designed to work with the `funnelweb` pipeline but can be used independently.
        
        
        
        transmogrify.webcrawler
        =======================
        
        A source blueprint for crawling content from a site or local html files.
        
        Webcrawler imports HTML from a live website, from a folder on disk, or from a folder
        on disk containing html that was saved from a live website and may still have absolute
        links referring to that website.
        
        To crawl a live website, supply the crawler with a base http url to start crawling from.
        Every url you want crawled from the site must start with this url.
        
        For example ::
        
         [crawler]
         blueprint = transmogrify.webcrawler
         url  = http://www.whitehouse.gov
         max = 50
        
        will restrict the crawler to the first 50 pages.
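        
        As a rough illustration of the two rules above (urls must start with the base url, and
        ``max`` caps the number of pages), the following sketch filters a list of discovered links.
        The function names ``in_scope`` and ``limit_crawl`` are hypothetical and are not part of
        this package; the real crawler may resolve and compare urls differently.
        
        ```python
        from urllib.parse import urljoin, urldefrag
        
        
        def in_scope(base_url, link):
            """Return the absolute url if it starts with base_url, else None."""
            # Resolve relative links against the base and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(base_url, link))
            return absolute if absolute.startswith(base_url) else None
        
        
        def limit_crawl(base_url, links, max_pages):
            """Yield at most max_pages distinct in-scope urls, in discovery order."""
            seen = set()
            for link in links:
                url = in_scope(base_url, link)
                if url is None or url in seen:
                    continue
                seen.add(url)
                yield url
                if len(seen) >= max_pages:
                    break
        ```
        
        Note that a plain prefix test is only as strict as the base url you give it, so a base
        url ending in a path or a trailing slash narrows the crawl more than a bare host name.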
        
        You can also crawl a local directory of html with relative links by just using a file: style url ::
        
         [crawler]
         blueprint = transmogrify.webcrawler
         url = file:///mydirectory
        
        or if the local directory contains html saved from a website and might have absolute urls in it
        then you can set this as the cache. The crawler will always look up the cache first ::
        
         [crawler]
         blueprint = transmogrify.webcrawler
         url = http://therealsite.com
         cache = mydirectory
        
        The following will not crawl anything larger than 4Mb ::
        
          [crawler]
          blueprint = transmogrify.webcrawler
          url  = http://www.whitehouse.gov
          maxsize=400000
        
        To skip crawling links by regular expression ::
        
          [crawler]
          blueprint = transmogrify.webcrawler
          url=http://www.whitehouse.gov
          ignore = \.mp3
                   \.mp4
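        
        To make the effect of ``ignore`` concrete, here is a minimal sketch that treats each
        line of the option as one regular expression and skips any url that matches. The helper
        names are hypothetical, and using ``search`` (match anywhere in the url) is an assumption
        about how the patterns are applied, not a statement about the package's code.
        
        ```python
        import re
        
        
        def compile_ignore(option_value):
            """Compile one regex per non-empty line of the multi-line option."""
            lines = (line.strip() for line in option_value.splitlines())
            return [re.compile(line) for line in lines if line]
        
        
        def should_skip(url, ignore_patterns):
            """True if any ignore pattern matches somewhere in the url."""
            return any(pattern.search(url) for pattern in ignore_patterns)
        
        
        patterns = compile_ignore("\\.mp3\n         \\.mp4")
        print(should_skip("http://www.whitehouse.gov/audio/speech.mp3", patterns))
        print(should_skip("http://www.whitehouse.gov/index.html", patterns))
        ```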
        
        If webcrawler is having trouble parsing the html of some pages, you can preprocess
        the html before it is parsed, e.g. ::
        
          [crawler]
          blueprint = transmogrify.webcrawler
          patterns = (<script>)[^<]*(</script>)
          subs = \1\2
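        
        The pairing of ``patterns`` and ``subs`` can be pictured as a sequence of regular
        expression substitutions applied line by line. The sketch below is an approximation
        under that assumption; ``preprocess`` is a hypothetical name, and only the
        ``<EMPTYSTRING>`` marker and the same-number-of-lines rule come from the Options list.
        
        ```python
        import re
        
        EMPTY_MARKER = "<EMPTYSTRING>"  # documented stand-in for an empty replacement
        
        
        def preprocess(html, patterns, subs):
            """Apply each pattern/sub pair in order and return the rewritten html."""
            pattern_lines = [l.strip() for l in patterns.splitlines() if l.strip()]
            sub_lines = [l.strip() for l in subs.splitlines() if l.strip()]
            if len(pattern_lines) != len(sub_lines):
                raise ValueError("patterns and subs must have the same number of lines")
            for pattern, sub in zip(pattern_lines, sub_lines):
                replacement = "" if sub == EMPTY_MARKER else sub
                html = re.sub(pattern, replacement, html)
            return html
        
        
        # Empties the body of a script tag, as in the configuration example.
        print(preprocess("<p>a</p><script>var x = 1;</script>",
                         r"(<script>)[^<]*(</script>)", r"\1\2"))
        ```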
        
        If you'd like to skip processing links with certain mimetypes you can use the
        drop:condition. This TALES expression determines which items will be processed further.
        See http://pypi.python.org/pypi/collective.transmogrifier/#condition-section
        ::
        
         [drop]
         blueprint = collective.transmogrifier.sections.condition
         condition: python:item.get('_mimetype') not in ['application/x-javascript','text/css','text/plain','application/x-java-byte-code'] and item.get('_path','').split('.')[-1] not in ['class']
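        
        Because the condition is an ordinary Python expression over ``item``, it can be tried
        out on sample items before running a pipeline. The sketch below restates the same
        expression as a plain function; ``keep`` and the sample items are illustrative only.
        
        ```python
        SKIPPED_MIMETYPES = ['application/x-javascript', 'text/css', 'text/plain',
                             'application/x-java-byte-code']
        SKIPPED_EXTENSIONS = ['class']
        
        
        def keep(item):
            """Mirror of the condition above: True means the item is processed further."""
            return (item.get('_mimetype') not in SKIPPED_MIMETYPES
                    and item.get('_path', '').split('.')[-1] not in SKIPPED_EXTENSIONS)
        
        
        print(keep({'_mimetype': 'text/html', '_path': 'index.html'}))
        print(keep({'_mimetype': 'text/css', '_path': 'style.css'}))
        ```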
        
        
        Options:
        
        :url:
         - the top url to crawl
        
        :ignore:
         - list of regex for urls to not crawl
        
        :cache:
         - local directory to read crawled items from instead of accessing the site directly
        
        :patterns:
         - Regular expressions to substitute before html is parsed. Newline separated.
        
        :subs:
         - Text to replace each item in patterns. Must be the same number of lines as patterns. Due to the way buildout handles empty lines, to replace a pattern with nothing (e.g. to remove the pattern), use ``<EMPTYSTRING>`` as a substitution.
        
        :maxsize:
         - don't crawl anything larger than this
        
        :max:
         - Limit crawling to this number of pages
        
        :start-urls:
         - a list of urls to initially crawl
        
        :ignore-robots:
         - if set, will ignore the robots.txt directives and crawl everything
        
        WebCrawler will emit items like ::
        
         item = dict(_site_url = "Original site_url used",
                     _path = "The url crawled without _site_url",
                     _content = "The raw content returned by the url",
                     _content_info = "Headers returned with content",
                     _backlinks    = names,
                     _sortorder    = "An integer representing the order the url was found within the page/site",
                     )
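        
        A later section in the pipeline could use these keys roughly as follows. This is a
        sketch only: both helper names are hypothetical, and rebuilding the url by plain
        concatenation of ``_site_url`` and ``_path`` is an assumption based on the key
        descriptions above.
        
        ```python
        def in_discovery_order(items):
            """Return the items sorted by their integer _sortorder key."""
            return sorted(items, key=lambda item: item["_sortorder"])
        
        
        def crawled_url(item):
            """Rebuild the crawled url; assumes plain concatenation is sufficient."""
            return item["_site_url"] + item["_path"]
        
        
        sample = [
            {"_site_url": "http://example.com/", "_path": "b.html", "_sortorder": 2},
            {"_site_url": "http://example.com/", "_path": "a.html", "_sortorder": 1},
        ]
        for item in in_discovery_order(sample):
            print(crawled_url(item))
        ```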
        
        
        transmogrify.webcrawler.cache
        =============================
        
        A blueprint that saves crawled content into a directory structure
        
        Options:
        
        :path-key:
          Allows you to override the field the path is stored in. Defaults to '_path'.
        
        :output:
          Directory to store cached content in
        
        
        transmogrify.webcrawler.typerecognitor
        ======================================
        
        A blueprint for assigning content type based on the mime-type as given by the
        webcrawler
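        
        The idea can be sketched as a lookup from mimetype to content type. The mapping and the
        function name below are hypothetical and chosen only for illustration; the blueprint's
        real table and fallbacks may differ.
        
        ```python
        # Hypothetical mapping, for illustration only.
        TYPE_BY_MIMETYPE = {"text/html": "Document"}
        
        
        def recognise_type(item):
            """Set item['_type'] from item['_mimetype'] and return the item."""
            # Drop any parameters such as "; charset=utf-8" before the lookup.
            mimetype = (item.get("_mimetype") or "").split(";")[0].strip().lower()
            if mimetype in TYPE_BY_MIMETYPE:
                item["_type"] = TYPE_BY_MIMETYPE[mimetype]
            elif mimetype.startswith("image/"):
                item["_type"] = "Image"
            else:
                item["_type"] = "File"
            return item
        
        
        print(recognise_type({"_mimetype": "text/html; charset=utf-8"})["_type"])
        ```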
        
        Changelog
        =========
        
        1.2 (2012-12-28)
        ----------------
        - fix cache check to prevent overwriting cache [djay]
        - turn redirects into Link objects [djay]
        - summary stats of which mimetypes were crawled [djay]
        - fixed bug where redirected pages weren't getting uploaded [djay]
        - fixed bugs with storing default pages in cache [djay]
        - fixed bug with space chars in urls [ivanteoh]
        - better handling of charset detection [djay]
        
        
        1.1 (2012-04-17)
        ----------------
        
        - add start-urls option [djay]
        - add ignore_robots option [djay]
        - fixed bug in http-equiv refresh handling [djay]
        - fixes to disk caching [djay]
        - better logging [djay]
        - default maxsize is unlimited [djay]
        - Provide ability for the reformat function to substitute patterns with 
          empty strings (nothing).  Buildout does not support empty lines within
          configuration, so if a substitution is <EMPTYSTRING> this becomes an empty
          string. [davidjb]
        - Provide a logger in the LXMLPage class so the reformat function can 
          succeed [davidjb]
        - Reformat spacing in webcrawler reformat function [davidjb] 
        
        
        1.0 (2011-06-29)
        ----------------
        - many fixes for importing from local directory w/ many languages [simahawk]
        - fix UnicodeEncodeError when file name/language is not english [simahawk]
        - fix iterating over non-sequence [simahawk]
        - fix missing import for MyStringIO [simahawk]
        
        1.0b7 (2011-02-17)
        ------------------
        - fix bug in cache check [djay]
        
        1.0b6 (2011-02-12)
        ------------------
        - only open cache files when needed so we don't run out of file handles [djay]
        - follow http-equiv refresh links [djay]
        
        1.0b5 (2011-02-06)
        ------------------
        - files use file pointers to reduce memory usage [djay]
        - cache saves .metadata files to record and playback headers [djay]
        
        1.0b4 (2010-12-13)
        ------------------
        - improve logging [djay]
        - fix encoding bug caused by cache [djay]
        
        1.0b3 (2010-11-10)
        ------------------
        
        - Fixed bug in cache that caused many links to be ignored in some cases [djay]
        - Fix documentation up [djay]
        
        1.0b2 (2010-11-09)
        ------------------
        
        - Stopped localhost output when no output set [djay]
        
        1.0b1 (2010-11-08)
        ------------------
        
        - change site_url to just url. [djay]
        
        - rename maxpage to maxsize [djay]
        
        - fix file: style urls  [djay]
        
        - Added cache option to replace base_alias [djay]
        
        - fix _origin key set by webcrawler, instead of url now it is path as expected by further blue
          [Vitaliy Podoba]
        
        - add _orig_path to pipeline item to keep original path for any further purposes, we will need
          [Vitaliy Podoba]
        
        - make all url absolute taking into account base tags inside webcrawler blueprint
           [Vitaliy Podoba] 
        
        
        0.1 (2008-09-25)
        ----------------
        
        - renamed package from pretaweb.blueprints to transmogrify.webcrawler.
              [djay]
        
        - enhanced import view [djay]
        
        
        
Keywords: transmogrifier blueprint funnelweb source plone import conversion microsoft office
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
