Metadata-Version: 2.4
Name: iterabledata
Version: 1.0.6
Summary: Iterable data processing Python library
Home-page: https://github.com/apicrafter/pyiterable/
Download-URL: https://github.com/apicrafter/pyiterable/
Author: Ivan Begtin
Author-email: ivan@begtin.tech
License: MIT
Keywords: json jsonl csv bson parquet orc xml xls xlsx dataset etl data-pipelines
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: BSD License
Classifier: Topic :: Software Development
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: xlrd
Requires-Dist: pyorc
Requires-Dist: parquet
Requires-Dist: openpyxl
Requires-Dist: jsonlines
Requires-Dist: orjson
Requires-Dist: lz4
Requires-Dist: chardet
Requires-Dist: avro
Requires-Dist: lxml
Requires-Dist: pyarrow
Requires-Dist: pymongo
Requires-Dist: python-snappy
Requires-Dist: brotli
Requires-Dist: brotli_file
Requires-Dist: zstandard
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: download-url
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

Iterable Data
=============

*Work in progress. Documentation in progress*

Iterable Data is a Python library for reading data files row by row and
for writing data files. Its iterable classes work like file objects or
``csv.DictReader``: each one yields one record at a time, much like
reading a Parquet file row by row.

This library was written to simplify data processing and conversion
between formats.

Supported file types:

* BSON
* JSON
* NDJSON (JSON lines)
* XML
* XLS
* XLSX
* Parquet
* ORC
* Avro
* Pickle

Supported file compression: GZip, BZip2, LZMA (.xz), LZ4, ZIP, Brotli,
ZStandard

Why write this library?
-----------------------

Python has many high-quality data processing tools and libraries,
notably pandas and other data frame libraries. The main issue with most
of them is that they expect flat data. Data frames don't support complex
data types, so you must *flatten* the data each time.

pyiterable lets you read each record as a Python dictionary instead of
flattening it. This makes it much easier to work with data sources such
as JSON, NDJSON, or BSON files.
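
To illustrate the difference, here is a minimal sketch that uses only the
standard library and does not call pyiterable itself. The sample record
and the ``flatten`` helper are hypothetical and exist only for this
comparison.

.. code-block:: python

   import json

   # Hypothetical nested record, as it might appear on one line of an NDJSON file.
   line = '{"name": "Alice", "address": {"city": "Paris", "zip": "75001"}, "tags": ["a", "b"]}'


   def flatten(record, prefix=""):
       """Flatten nested dictionaries into dotted keys; lists are left as they are."""
       flat = {}
       for key, value in record.items():
           name = prefix + key
           if isinstance(value, dict):
               flat.update(flatten(value, prefix=name + "."))
           else:
               flat[name] = value
       return flat


   record = json.loads(line)

   # Dictionary form: nested values are reachable directly.
   city = record["address"]["city"]

   # Flattened form: nested keys become invented column names.
   flat = flatten(record)

With the dictionary form, ``record["address"]["city"]`` works directly.
The flattened form needs invented column names such as ``address.city``
and still leaves the question of how to represent lists.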

This code is used in several tools written by its author: the
command-line tool `undatum <https://github.com/datacoon/undatum>`__ and
the data processing ETL engine
`datacrafter <https://github.com/apicrafter/datacrafter>`__.

Requirements
------------

Python 3.10+

Installation
------------

Run ``pip install iterabledata``, or install from this repository.

Documentation
-------------

In progress. Please see usage and examples.

Usage and examples
------------------

Read compressed CSV file
~~~~~~~~~~~~~~~~~~~~~~~~

Read a compressed ``csv.xz`` file:

.. code-block:: python

   from iterable.helpers.detect import open_iterable

   source = open_iterable('data.csv.xz')
   n = 0
   for row in source:
       n += 1
       # Add data processing code here
       if n % 1000 == 0:
           print('Processing %d' % n)

Detect encoding and file delimiter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Detect the encoding and delimiter of a CSV file and use them to open it
as an iterable:

.. code-block:: python

   from iterable.helpers.detect import open_iterable
   from iterable.helpers.utils import detect_encoding, detect_delimiter

   delimiter = detect_delimiter('data.csv')
   encoding = detect_encoding('data.csv')

   source = open_iterable(
       'data.csv',
       iterableargs={'encoding': encoding['encoding'], 'delimiter': delimiter},
   )
   n = 0
   for row in source:
       n += 1
       # Add data processing code here
       if n % 1000 == 0:
           print('Processing %d' % n)

Convert Parquet file to JSON lines compressed with LZMA using pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Uses the pipeline class to iterate through a Parquet file and write its
selected fields to an LZMA-compressed JSON lines (NDJSON) file:

.. code-block:: python

   from iterable.helpers.detect import open_iterable
   from iterable.pipeline import pipeline

   source = open_iterable('data/data.parquet')
   destination = open_iterable('data/data.jsonl.xz', mode='w')

   def extract_fields(record, state):
       # Keep only the selected fields of each record
       out = {}
       record = dict(record)
       print(record)
       for k in ['name']:
           out[k] = record[k]
       return out

   def print_process(stats, state):
       # Print the processing statistics
       print(stats)

   pipeline(source, destination=destination, process_func=extract_fields,
            trigger_on=2, trigger_func=print_process, final_func=print_process,
            start_state={})

Convert gzipped JSON lines (NDJSON) file to BSON compressed with LZMA
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reads each row from a JSON lines file using the Gzip codec and writes
BSON data using the LZMA codec:

.. code-block:: python

   from iterable.datatypes import JSONLinesIterable, BSONIterable
   from iterable.codecs import GZIPCodec, LZMACodec

   codecobj = GZIPCodec('data.jsonl.gz', mode='r', open_it=True)
   iterable = JSONLinesIterable(codec=codecobj)
   codecobj = LZMACodec('data.bson.xz', mode='wb', open_it=False)
   write_iterable = BSONIterable(codec=codecobj, mode='w')
   n = 0
   for row in iterable:
       n += 1
       if n % 10000 == 0:
           print('Processing %d' % n)
       write_iterable.write(row)

More examples and tests
-----------------------

See `tests <tests/>`__ for more usage examples.
