Metadata-Version: 2.1
Name: de1
Version: 0.0.7
Summary: DE1's curated collection of kedro tools.
Home-page: https://github.com/dataengineerone/de1-python
Author: DataEngineerOne
Author-email: dataengineerone@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
Requires-Dist: kedro (>=0.16.0)

# de1
Curated collection of DE1's favorite kedro utilities.


## EmptyPartitionedDataSet

For when data is not yet available in a particular folder, or when having no data at all is a valid state.

Particularly useful when doing sub-node parallelization.

```
empty_json_collection:
    type: de1.empty.EmptyPartitionedDataSet
    path: data/02_intermediate/json_collection
    dataset: json.JSONDataSet
```
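A node that consumes this entry can treat an empty folder as an ordinary case. The sketch below assumes the dataset follows Kedro's `PartitionedDataSet` loading convention (a dict mapping partition IDs to load callables) and hands the node an empty dict when no partitions exist. `count_records` is a hypothetical node function and the two dicts are stand-ins for what the catalog would provide.

```python
from typing import Any, Callable, Dict


def count_records(partitions: Dict[str, Callable[[], Any]]) -> Dict[str, int]:
    """Count records per partition; an empty input yields an empty result."""
    counts = {}
    for partition_id, load_partition in sorted(partitions.items()):
        data = load_partition()  # a partition is only read when its callable runs
        counts[partition_id] = len(data)
    return counts


# Stand-ins for the node input (assumed shape, not taken from de1 itself).
no_data: Dict[str, Callable[[], Any]] = {}
some_data = {"a": lambda: [1, 2, 3], "b": lambda: []}

empty_result = count_records(no_data)
filled_result = count_records(some_data)
print(empty_result, filled_result)
```

Because the loop simply does nothing for an empty dict, the node needs no special branch for the "no data yet" case.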


## LazyPartitionedDataSet

For when the data is too large to compute all at once and some clean-up is needed along the way.

```
lazy_json_collection:
    type: de1.lazy.LazyPartitionedDataSet
    path: data/02_intermediate/json_collection
    dataset: json.JSONDataSet
```
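One way to keep memory bounded is for a node to return one callable per partition rather than the computed data, so that each partition is produced only when it is about to be written. The sketch below assumes `LazyPartitionedDataSet` accepts such a dict of callables on save and invokes them one at a time; that behaviour is an assumption here, not something this README confirms. `make_lazy_partitions` and `fake_loader` are hypothetical names.

```python
from typing import Any, Callable, Dict, List

loaded: List[str] = []  # records which partitions have actually been read


def fake_loader(name: str, records: List[Any]) -> Callable[[], List[Any]]:
    """Stand-in for a partition load callable."""
    def load() -> List[Any]:
        loaded.append(name)
        return records
    return load


def make_lazy_partitions(
    partitions: Dict[str, Callable[[], List[Any]]]
) -> Dict[str, Callable[[], List[Any]]]:
    """Wrap each partition in a callable that loads and cleans it on demand."""
    def make_task(load_partition: Callable[[], List[Any]]) -> Callable[[], List[Any]]:
        def task() -> List[Any]:
            records = load_partition()
            # Example clean-up step: drop missing values.
            return [r for r in records if r is not None]
        return task

    # make_task binds each loader separately, avoiding the late-binding
    # closure pitfall of defining the inner function directly in the loop.
    return {pid: make_task(load) for pid, load in partitions.items()}


partitions = {"a": fake_loader("a", [1, None, 2]), "b": fake_loader("b", [None])}
lazy = make_lazy_partitions(partitions)
loaded_before = list(loaded)  # nothing has been read yet

result_a = lazy["a"]()        # only partition "a" is read and cleaned
print(loaded_before, loaded, result_a)
```

Only the partition whose callable is invoked gets loaded, so the full collection never has to sit in memory at once.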


## PDFDataSet

A dataset that uses `pdfplumber` to extract text and tables from PDF files.

Data is returned as a `PDFPage` object.

```
invoice_pdf:
    type: de1.pdf.PDFDataSet
    filepath: data/01_raw/invoice.pdf
```
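A downstream node might reduce a loaded page to a few summary fields. `pdfplumber`'s own page objects provide `extract_text()` and `extract_tables()`; whether `PDFPage` exposes the same two methods is an assumption in this sketch. `summarize_page` is a hypothetical node function, and `FakePage` is a stand-in so the example runs without a PDF file.

```python
from typing import Any, Dict, List


def summarize_page(page: Any) -> Dict[str, Any]:
    """Summarize a page-like object offering extract_text() and extract_tables()."""
    text = page.extract_text() or ""  # guard against a page with no text
    lines = text.splitlines()
    tables = page.extract_tables()
    return {
        "first_line": lines[0] if lines else "",
        "n_lines": len(lines),
        "n_tables": len(tables),
    }


class FakePage:
    """Stand-in with the two methods assumed above (not part of de1)."""

    def extract_text(self) -> str:
        return "Invoice 001\nTotal: 42.00"

    def extract_tables(self) -> List[List[List[str]]]:
        # One table, given as a list of rows, each row a list of cell strings.
        return [[["item", "price"], ["widget", "42.00"]]]


summary = summarize_page(FakePage())
print(summary)
```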


## ZipFileDataSet

A dataset that extracts a single file from a zip archive.
By default it returns the file's contents as bytes, but a dataset can be passed in to change how the extracted file is loaded.

```
invoice_pdf:
    type: de1.zip.ZipFileDataSet
    filepath: data/01_raw/invoice.zip
    filename: invoice.pdf
```
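For reference, the default behaviour described above (one named member of an archive, returned as bytes) can be reproduced with the standard library's `zipfile` module. This is a minimal sketch of the idea, not the dataset's actual implementation; `read_member` is a hypothetical helper and the archive is built in memory so the example is self-contained.

```python
import io
import zipfile


def read_member(zip_bytes: bytes, filename: str) -> bytes:
    """Return the raw bytes of one named file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(filename)


# Build a small archive in memory with two members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("invoice.pdf", b"%PDF-1.4 example")
    archive.writestr("notes.txt", b"not requested")

content = read_member(buffer.getvalue(), "invoice.pdf")
print(content)
```

Only the requested member is read; the other files in the archive are left untouched.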



