Metadata-Version: 2.1
Name: s3select
Version: 0.0.5
Summary: S3 select utility package
Home-page: https://github.com/marko-bast/s3select
Author: Marko Baštovanović
Author-email: marko.bast@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.7
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
Requires-Dist: requests[security] (>=2.18.3)
Requires-Dist: pyasn1 (>=0.4.2)
Requires-Dist: boto3 (>=1.7.79)

## S3 select

![alt text](https://github.com/marko-bast/s3select/raw/master/s3select_example_run.gif "Example run")
Example query run on 10GB of GZIP compressed JSON data (>60GB uncompressed)

### Motivation
[Amazon S3 select](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html) is one of the coolest features AWS released in 2018. Its benefits are:
1) It is very fast and light on network usage, because a limited SQL select query lets you return only a subset of a file's contents from S3. Since the data is filtered on the AWS machine where the S3 file resides, network transfer can be significantly reduced, depending on the query issued.
2) It is lightweight on the client side, because all filtering is done on the machine where the S3 data is located.
3) It's [cheap](https://aws.amazon.com/s3/pricing/#Request_pricing_.28varies_by_region.29) at $0.002 per GB scanned and $0.0007 per GB returned.

For more details about S3 select, see this [presentation](https://www.youtube.com/watch?v=uxcyoc6uaLM).

Unfortunately, an S3 select API query call is limited to a single file on S3 and its syntax is quite cumbersome, making it impractical for daily usage. The s3select command is intended to fix these and other flaws.
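
To give a sense of what a single-file call involves, here is a minimal sketch using boto3's `select_object_content`. It is not taken from the s3select source. The function name `select_from_object` is hypothetical, and the input is assumed to be GZIP-compressed, newline-delimited JSON.

```python
def select_from_object(client, bucket, key, expression):
    """Run one S3 Select query against a single object.

    `client` is expected to behave like boto3.client("s3").
    Returns (records_as_text, stats_details).
    """
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: newline-delimited JSON, GZIP-compressed.
        InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "GZIP"},
        OutputSerialization={"JSON": {}},
    )

    chunks = []
    stats = {}
    # The response payload is a stream of events; record bytes and
    # scan statistics arrive as separate event types.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]["Details"]

    return b"".join(chunks).decode("utf-8"), stats
```

With a real client this could be called as `select_from_object(boto3.client("s3"), "my-bucket", "logs/part-0.json.gz", "select * from s3object s limit 10")`, where the bucket and key are placeholders. Every additional file needs its own call like this one, which is the repetition s3select is meant to hide.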

### Features at a glance
Most important features:
 1) Queries all files beneath a given S3 prefix
 2) The whole process is multi-threaded and fast: a scan of 1.1TB of data stored in 20,000 files takes 5 minutes. Threads don't slow down the client much, as the heavy lifting is done on AWS.
 3) The file format is automatically inferred for you, picking GZIP or plain text depending on the file extension
 4) Real time progress
 5) The exact cost of the query is returned for each run
 6) Ability to only count records matching the filter in a fast and efficient manner
 7) You can easily limit the number of results returned while still keeping multi-threaded execution
 8) Failed requests are properly handled and retried if they are retriable (e.g. throttled calls)
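
A few of these features can be sketched in plain Python 3. This is an illustration under stated assumptions, not the actual s3select implementation. The helper names (`infer_compression`, `estimate_cost`, `query_prefix`) are hypothetical, the prices are the ones quoted in the Motivation section, and a GB is assumed to be 1024³ bytes.

```python
from concurrent.futures import ThreadPoolExecutor

PRICE_PER_GB_SCANNED = 0.002    # USD, as quoted in the Motivation section
PRICE_PER_GB_RETURNED = 0.0007  # USD, as quoted in the Motivation section
GB = 1024 ** 3                  # assumption: binary gigabytes


def infer_compression(key):
    """Pick the S3 Select CompressionType from the file extension."""
    return "GZIP" if key.lower().endswith(".gz") else "NONE"


def estimate_cost(bytes_scanned, bytes_returned):
    """Estimate the query cost in USD from the byte counters."""
    return (bytes_scanned / GB) * PRICE_PER_GB_SCANNED \
        + (bytes_returned / GB) * PRICE_PER_GB_RETURNED


def query_prefix(keys, query_one, max_workers=8):
    """Run query_one(key, compression) for every key on a thread pool.

    query_one is a caller-supplied function returning (text, stats), where
    stats holds "BytesScanned" and "BytesReturned". Returns the per-key
    results in input order plus the estimated total cost.
    """
    def run(key):
        return query_one(key, infer_compression(key))

    scanned = returned = 0
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `keys`.
        for text, stats in pool.map(run, keys):
            results.append(text)
            scanned += stats.get("BytesScanned", 0)
            returned += stats.get("BytesReturned", 0)

    return results, estimate_cost(scanned, returned)
```

Threads suit this workload because each worker spends most of its time waiting on the network while S3 does the filtering. The list of keys would typically come from listing the objects under the prefix. That step, along with retries and progress reporting, is left out of the sketch.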

### Installation
s3select is written in Python and is installed with [pip](http://www.pip-installer.org/en/latest/). Here is how to install and update it:
<pre>
$ pip install -U s3select
</pre>

### Authentication

s3select uses the same authentication and endpoint configuration as [aws-cli](https://github.com/aws/aws-cli#getting-started). If the `aws` command works on your machine, no additional configuration is needed.

### Example usage


### License

Distributed under the MIT license. See `LICENSE` for more information.


