Metadata-Version: 1.0
Name: importio-gsei
Version: 0.4.0
Summary: Command line to feed URLs from a Google Sheet into an Import.io Extractor
Home-page: http://github.io/import.io/google-sheets-extractor-integration
Author: David Gwartney
Author-email: david.gwartney@import.io
License: LICENSE
Description: 
        
        IMPORT.IO GOOGLE SHEETS EXTRACTOR INTEGRATION
        
        
        This repository includes a script solution that uses URLs taken from a
        Google sheet and copies to an existing Import.io Extractor, then lastly
        executes the Extractor to collect data from the supplied URLs.
        
        The command script is implemented using Python, which can be executed on
        any operating system that supports a Python version 3.5 or later.
        
        
        Installation
        
        The command requires the installation of Python 3.5 or later with some
        additional third-party packages that can be installed using pip.
        
            $ pip install importio_gsei
        
        
        Configuration
        
        Before the integration can be used you need to setup the Google Sheets
        API and Import.io key.
        
        Import.io Authentication
        
        The Import.io API needs to be made available via an environment variable
        that can be configured as follows:
        
            $ export IMPORT_IO_API_KEY=<api key>
        
        Google Sheets API Setup
        
        Generate Credentials
        
        NOTE: You must have Google Account to use the Google Sheets API, you can
        sign up here.
        
        1.  Use this wizard to create or select a project in the Google
            Developers Console and automatically turn on the API. Click
            CONTINUE, then GO TO CREDENTIALS.
        
        2.  On the ADD CREDENTIALS TO YOUR PROJECT page, click the
            Cancel button.
        
        3.  At the top of the page, select the OAuth consent screen tab.
        
        4.  Select an Email address, enter a Product name if not already set,
            and click the Save button.
        
        5.  Select the CREDENTIALS tab, click the CREATE CREDENTIALS button and
            select OAUTH CLIENT ID.
        
        6.  Select the application type OTHER, enter the name "Google Sheets
            Extractor Integration", and click the Create button.
        
        7.  Click OK to dismiss the resulting dialog.
        
        8.  Click the file_download (Download JSON) button to the right of the
            client ID.
        
        9.  Move this file to a known directory on your computer and rename to
            client-secret.json
        
        Authorize Access
        
        1.  Authorize access to the Google Sheets API by running the following
            command:
        
                $ gextractor authorize -f client-secret.json -a 'Google Sheets Extractor Integration'
        
        2.  This will launch a browser and display an authorization page
            (see below).
        
        3.  Click on the ALLOW button.
        
        4.  Close the browser window.
        
        [Google Authorization Page]
        
        
        Operation
        
        The command has 7 different sub-commands that are invoked similiarly as
        follows:
        
            $ gsextractor <sub-command>
        
        where sub-command is one of the following:
        
        -   _authorize_ - Performs the Oauth handshake to allow permit access to
            Google Sheets API
        -   _copy-urls_ - Copies URLs from a specified Google sheet to
            designated Extractor.
        -   _extractor-start_ - Initiates a crawl-run of an Extractor.
        -   _extractor-status_ - Provides a status of the crawl-run(s) of
            an Extractor.
        -   _extractor-urls_ - Displays the list of URLs associated with
            an Extractor.
        -   _extract_ - Performs the complete extraction operation from copying
            the URLs from the Google sheet to running the Extractor to extract
            data from web pages.
        -   _sheet-urls_ - Displays a list of the URLs in a Google sheet given
            the sheet id and range.
        
        Help can be displayed which shows the available sub-commands as follows:
        
             gsextractor -h
            usage: gsextractor [-h] [-v]
                               {authorize,copy-urls,extract,extractor-start,extractor-status,extractor-urls,sheet-urls}
                               ...
        
            Import.io Google Sheets Extractor Integration
        
            positional arguments:
              {authorize,copy-urls,extract,extractor-start,extractor-status,extractor-urls,sheet-urls}
                                    commands
                authorize           Configures authentication Google Sheets API
                copy-urls           Copies URLs from google sheet to an extractor
                extract             Runs the full extraction process
                extractor-start     Starts an extractor
                extractor-status    Displays the status of recent craw runs
                extractor-urls      Displays the URLs from an extractor
                sheet-urls          Displays the URLs from a google sheet
        
            optional arguments:
              -h, --help            show this help message and exit
              -v, --version         show program's version number and exit
        
        Detail help on each of the commands is provided by the -h with the
        corresponding sub-command:
        
            $ gsextractor <sub-command> -h
        
        The version of the program can be displayed by running the command with
        the -v option:
        
            $ gsextractor <sub-command> -v
            0.4.0
        
        auth
        
        Configures OAuth
        
            $ gsextractor auth -c <path to credentials file>
        
        copy-urls
        
        Copy the URLs in the Google Sheet to the Extractor URLs
        
            $ gsextractor copy-urls -i <spreasdheet_id> -r <spreadsheet_range> -e <extractor_id>
        
        NOTE: _spreadsheet_range_ needs to be surrounded by single quotes to
        prevent a range of the
        
        extractor-start
        
        Initiate a crawl-run on a specific Extractor
        
            $ gsextractor extractor-start -e <extractor_id>
        
        extractor-status
        
        Display the status of a crawl-run(s) on a specific Extractor
        
            $ gsextractor extractor-status  -e <extractor id>
        
        extractor-urls
        
        Displays the URLs associated with a specific Extractor
        
            $ gsextractor extractor-urls  -e <extractor id>
        
        extract
        
        Runs the complete operation of copying URLs from a Google sheet to an
        Extractor, and starting the Extractor
        
            $ gsextractor extract -i <spreasdheet_id> -r <spreadsheet_range> -e <extractor_id>
        
        sheet-urls
        
        Displays the URLs from a specifice Google sheet and range
        
            $ gsextractor sheeturls -i <spreasdheet_id> -r <spreadsheet_range>
        
        
        Programmatic Execution
        
        The same operations as listed above can performed programmatically from
        Python similar to the following example:
        
            from importio_gsei import GsExtractorUrls
        
            g = GsExtractorUrls()
            g.extractor_start(extractor_id)
        
        copy_urls()
        
            from importio_gsei import GsExtractorUrls
        
            g = GsExtractorUrls()
            g.copy_urls(spread_sheet_id, spread_sheet_range, extractor_id)
        
        extractor_start()
        
            from importio_gsei GsExtractorUrls
        
            g = GsExtractorUrls()
            g.extractor_start(extractor_id)
        
        extractor_status()
        
            from importio_gsei import GsExtractorUrls
        
            g = GsExtractorUrls()
            crawl_runs = g.extractor_status(extractor_id)
            for crawl_run in crawl_runs:
                print(crawl_run)
        
        extractor_urls()
        
            from importio_gsei import GsExtractorUrls
        
            g = GsExtractorUrls()
            urls = g.extractor_urls(extractor_id)
            for url in urls:
                print(url)
        
        extract
        
            from importio_gsei import GsExtractorUrls
        
            g = GsExtractorUrls()
            g.extract(spread_sheet_id, spread_sheet_range, extractor_id)
        
        sheet_urls
        
            from importio_gsei import GsExtractorUrls
        
            g = GsExtractorUrls()
            urls = g.extract(spread_sheet_id, spread_sheet_range)
            for url in urls:
                print(url)
        
Platform: UNKNOWN
