Metadata-Version: 1.1
Name: spark_emr
Version: 0.1.2
Summary: Run python packages on AWS EMR
Home-page: https://www.github.com/delijati/spark-emr
Author: Josip Delic
Author-email: delijati@gmx.net
License: MIT
Description: # Spark-EMR
        
        [![Build Status](https://api.travis-ci.org/delijati/spark-emr.svg?branch=master)](https://travis-ci.org/delijati/spark-emr)
        
        Run an python package on AWS EMR
        
        ## Install
        
        Develop install:
        
            $ pip install -e .
        
        Testing:
        
            $ pip install tox
            $ tox
        
        ## Setup
        
        The easiest way to get EMR up and running is to go through the Web-Interface
        and create a ssh key, and start a cluster by hand. This will then create the
        needed subnet_key and EMR roles.
        
        ## Config yaml file
        
        Create a ``config.yaml`` per project or as a default into
        `~/.config/spark-emr.yaml`
        
            bootstrap_uri: s3://foo/bar
            master: 
              instance_type: m4.large
              size_in_gb: 100
            core: 
              instance_type: m4.large
              instance_count: 2
              size_in_gb: 100
            ssh_key: XXXXX
            subnet_id: subnet-XXXXXX
            python_version: python36
            emr_version: emr-5.20.0
            consistent: false
            optimization: false
            region: eu-central-1
            job_flow_role: EMR_EC2_DefaultRole 
            service_role: EMR_DefaultRole
        
        ## CLI-Interface
        
        ### Start
        
        To run a python code on EMR you need to build a proper python package aka
        `setup.py` with `console_scripts` the script needs to end on `.py` or yarn
        won't be able to execute it |-(
        
        Bootstrap a cluster, install the pypackage, execute the task in cmdline, poll
        cluster until finished, stop cluster:
        
            $ spark-emr start \
            [--config config.yaml] \
            --name "Spark-ETL" \
            --bid-master 0.04 \
            --bid-core 0.04 \
            --cmdline "etl.py --input s3://in/in.csv --output s3://out/out.csv" \
            --tags foo 2 bar 4 \
            --poll \
            --yarn-log \
            --package "../"
        
        Running with a released pypackage version (pip):
        
            $ spark-emr start \
            ... \
            --package pip+etl_pypackage
        
        ### Status
        
        Returns the status of a cluster (also terminated ones):
        
            $ spark-emr status --cluster-id j-XXXXX
        
        ### List
        
        List all cluster and filter optionally by tag:
        
            $ spark-emr list [--config config.yaml] [--filter somekey somevalue]
        
        ### Stop
        
        Stop a running cluster:
        
            $ spark-emr stop --cluster-id j-XXXXX
        
        ### Spot price check
        
        This call returns for all regions and configured instances the spot price:
        
            $ spark-emr spot
        
        # Appendix
        
        ### Running commands on EMR
        
        The created command can also be run directly from the master:
        
            $ /usr/bin/spark-submit \
            --deploy-mode cluster \
            --master yarn \
            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python35 \
            --conf spark.executorEnv.PYSPARK_PYTHON=python35 \
            /usr/local/bin/etl.py --input s3://in/in.csv --output s3://out/out.csv
        
        ### Running commands on docker
        
        To test if our spark is running as expected we can run it locally in docker.
        
            $ git clone https://github.com/delijati/spark-docker
            $ cd spark-docker
            $ docker build . --pull -t spark
        
        Now we can run our spark job locally.
        
            $ docker run --rm -ti -v `pwd`/test/dummy:/app/work spark \
            bash -c "cd /app/work && pip3 install -e . && spark_emr_dummy.py 10"
        
        
        # CHANGES
        
        0.1.2 (2019-03-10)
        ------------------
        
        - Add spot price check cli.
        - Add spot BidPrice.
        - Show estimated cost.
        - Filter by tag for list cli.
        
        
        0.1.1 (2019-02-21)
        ------------------
        
        - Fixed url in setup.py.
        
        
        0.1.0 (2019-02-21)
        ------------------
        
        - Initial release.
        
Keywords: AWS EMR SPARK PYSPARK
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
