Metadata-Version: 2.4
Name: iterabledata
Version: 1.0.9
Summary: Iterable data processing Python library
Author: Ivan Begtin
License: MIT
Project-URL: Homepage, https://github.com/datenoio/iterabledata/
Project-URL: Repository, https://github.com/datenoio/iterabledata/
Keywords: json,jsonl,csv,bson,parquet,orc,xml,xls,xlsx,dbf,dataset,etl,data-pipelines
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Software Development
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: chardet>=5.0.0
Requires-Dist: tqdm>=4.64.0
Provides-Extra: compression
Requires-Dist: lz4>=4.0.0; extra == "compression"
Requires-Dist: python-snappy>=0.6.0; extra == "compression"
Requires-Dist: python-lzo>=1.12.0; extra == "compression"
Requires-Dist: brotli>=1.0.0; extra == "compression"
Requires-Dist: brotli_file>=1.0.0; extra == "compression"
Requires-Dist: zstandard>=0.19.0; extra == "compression"
Provides-Extra: parquet
Requires-Dist: pyarrow>=10.0.0; extra == "parquet"
Provides-Extra: orc
Requires-Dist: pyorc>=1.6.0; extra == "orc"
Provides-Extra: excel
Requires-Dist: xlrd>=2.0.0; extra == "excel"
Requires-Dist: openpyxl>=3.0.0; extra == "excel"
Provides-Extra: xml
Requires-Dist: lxml>=4.9.0; extra == "xml"
Provides-Extra: bson
Requires-Dist: pymongo>=4.0.0; extra == "bson"
Provides-Extra: dbf
Requires-Dist: dbfread>=2.0.0; extra == "dbf"
Provides-Extra: warc
Requires-Dist: warcio>=1.7.0; extra == "warc"
Provides-Extra: duckdb
Requires-Dist: duckdb>=0.9.0; extra == "duckdb"
Provides-Extra: stats
Requires-Dist: pyreadstat>=1.2.0; extra == "stats"
Provides-Extra: protobuf
Requires-Dist: protobuf>=4.0.0; extra == "protobuf"
Provides-Extra: ion
Requires-Dist: ion-python>=1.0.0; extra == "ion"
Provides-Extra: hdf5
Requires-Dist: h5py>=3.0.0; extra == "hdf5"
Provides-Extra: geospatial
Requires-Dist: geojson>=3.0.0; extra == "geospatial"
Requires-Dist: pyshp>=2.3.0; extra == "geospatial"
Requires-Dist: fiona>=1.9.0; extra == "geospatial"
Requires-Dist: mapbox-vector-tile>=1.2.0; extra == "geospatial"
Requires-Dist: topojson>=1.5.0; extra == "geospatial"
Provides-Extra: toml
Requires-Dist: tomli>=2.0.0; extra == "toml"
Requires-Dist: tomli-w>=1.0.0; extra == "toml"
Requires-Dist: toml>=0.10.0; extra == "toml"
Provides-Extra: msgpack
Requires-Dist: msgpack>=1.0.0; extra == "msgpack"
Provides-Extra: yaml
Requires-Dist: pyyaml>=6.0.0; extra == "yaml"
Provides-Extra: pcap
Requires-Dist: dpkt>=1.9.0; extra == "pcap"
Provides-Extra: netcdf
Requires-Dist: netCDF4>=1.6.0; extra == "netcdf"
Provides-Extra: mvt
Requires-Dist: mapbox-vector-tile>=1.2.0; extra == "mvt"
Provides-Extra: topojson
Requires-Dist: topojson>=1.5.0; extra == "topojson"
Provides-Extra: feed
Requires-Dist: feedparser>=6.0.0; extra == "feed"
Provides-Extra: dxf
Requires-Dist: ezdxf>=1.0.0; extra == "dxf"
Provides-Extra: all
Requires-Dist: lz4>=4.0.0; extra == "all"
Requires-Dist: python-snappy>=0.6.0; extra == "all"
Requires-Dist: python-lzo>=1.12.0; extra == "all"
Requires-Dist: brotli>=1.0.0; extra == "all"
Requires-Dist: brotli_file>=1.0.0; extra == "all"
Requires-Dist: zstandard>=0.19.0; extra == "all"
Requires-Dist: pyarrow>=10.0.0; extra == "all"
Requires-Dist: pyorc>=1.6.0; extra == "all"
Requires-Dist: xlrd>=2.0.0; extra == "all"
Requires-Dist: openpyxl>=3.0.0; extra == "all"
Requires-Dist: lxml>=4.9.0; extra == "all"
Requires-Dist: pymongo>=4.0.0; extra == "all"
Requires-Dist: dbfread>=2.0.0; extra == "all"
Requires-Dist: warcio>=1.7.0; extra == "all"
Requires-Dist: duckdb>=0.9.0; extra == "all"
Requires-Dist: pyreadstat>=1.2.0; extra == "all"
Requires-Dist: protobuf>=4.0.0; extra == "all"
Requires-Dist: ion-python>=1.0.0; extra == "all"
Requires-Dist: h5py>=3.0.0; extra == "all"
Requires-Dist: geojson>=3.0.0; extra == "all"
Requires-Dist: pyshp>=2.3.0; extra == "all"
Requires-Dist: fiona>=1.9.0; extra == "all"
Requires-Dist: tomli>=2.0.0; extra == "all"
Requires-Dist: tomli-w>=1.0.0; extra == "all"
Requires-Dist: toml>=0.10.0; extra == "all"
Requires-Dist: msgpack>=1.0.0; extra == "all"
Requires-Dist: pyyaml>=6.0.0; extra == "all"
Requires-Dist: dpkt>=1.9.0; extra == "all"
Requires-Dist: netCDF4>=1.6.0; extra == "all"
Requires-Dist: mapbox-vector-tile>=1.2.0; extra == "all"
Requires-Dist: topojson>=1.5.0; extra == "all"
Requires-Dist: feedparser>=6.0.0; extra == "all"
Requires-Dist: ezdxf>=1.0.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.3.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.2.0; extra == "dev"
Requires-Dist: mock>=4.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: types-chardet>=5.0.0; extra == "dev"
Requires-Dist: types-openpyxl>=3.0.0; extra == "dev"
Requires-Dist: hypothesis>=6.0.0; extra == "dev"
Requires-Dist: pandas>=1.5.0; extra == "dev"
Requires-Dist: bandit[toml]>=1.7.5; extra == "dev"
Requires-Dist: pip-audit>=2.6.1; extra == "dev"
Requires-Dist: safety>=2.3.0; extra == "dev"
Requires-Dist: vulture>=2.10; extra == "dev"
Requires-Dist: radon>=6.0.0; extra == "dev"
Requires-Dist: pydocstyle[toml]>=6.3.0; extra == "dev"
Requires-Dist: coverage[toml]>=7.3.0; extra == "dev"
Requires-Dist: pip-tools>=7.3.0; extra == "dev"
Requires-Dist: pipdeptree>=2.9.0; extra == "dev"
Requires-Dist: pip-licenses>=4.3.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: check-wheel-contents>=0.5.0; extra == "dev"
Dynamic: license-file

# Iterable Data

Iterable Data is a Python library for reading and writing data files row by row through a consistent, iterator-based interface. It provides a unified API for working with various data formats (CSV, JSON, Parquet, XML, etc.), similar to `csv.DictReader` but supporting many more formats.

This library simplifies data processing and conversion between formats while preserving complex nested data structures (unlike pandas DataFrames, which require flattening).
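
To make the row-as-dictionary model concrete, here is a minimal sketch that uses only the standard library rather than Iterable Data itself. It shows the `csv.DictReader` pattern that the comparison refers to, next to a JSON Lines record whose nested values stay intact. The sample data is made up for illustration.

```python
import csv
import io
import json

# Flat, tabular source: csv.DictReader yields one dict per row.
csv_text = "name,age\nAlice,30\nBob,25\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Nested source: each JSON Lines record is already a dict and may
# contain lists or sub-dicts that a flat table cannot hold directly.
jsonl_text = '{"name": "Alice", "tags": ["a", "b"], "address": {"city": "Oslo"}}\n'
jsonl_rows = [json.loads(line) for line in io.StringIO(jsonl_text) if line.strip()]

for row in csv_rows + jsonl_rows:
    print(row)
```

In both cases the consumer sees plain Python dictionaries, which is the shape the examples in this README iterate over.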

## Features

- **Unified API**: Single interface for reading/writing multiple data formats
- **Automatic Format Detection**: Detects file type and compression from filename
- **Support for Compression**: Works seamlessly with compressed files
- **Preserves Nested Data**: Handles complex nested structures as Python dictionaries
- **DuckDB Integration**: Optional DuckDB engine for high-performance queries
- **Pipeline Processing**: Built-in pipeline support for data transformation
- **Encoding Detection**: Automatic encoding and delimiter detection for text files
- **Bulk Operations**: Efficient batch reading and writing
- **Context Manager Support**: Use `with` statements for automatic resource cleanup

## Supported File Types

### Core Formats
- **JSON** - Standard JSON files
- **JSONL/NDJSON** - JSON Lines format (one JSON object per line)
- **JSON-LD** - JSON for Linked Data (RDF format)
- **CSV/TSV** - Comma and tab-separated values
- **Annotated CSV** - CSV with type annotations and metadata
- **CSVW** - CSV on the Web (with metadata)
- **PSV/SSV** - Pipe and semicolon-separated values
- **LTSV** - Labeled Tab-Separated Values
- **FWF** - Fixed Width Format
- **XML** - XML files with configurable tag parsing
- **ZIP XML** - XML files within ZIP archives

### Binary Formats
- **BSON** - Binary JSON format
- **MessagePack** - Efficient binary serialization
- **CBOR** - Concise Binary Object Representation
- **UBJSON** - Universal Binary JSON
- **SMILE** - Binary JSON variant
- **Bencode** - BitTorrent encoding format
- **Avro** - Apache Avro binary format
- **Pickle** - Python pickle format

### Columnar & Analytics Formats
- **Parquet** - Apache Parquet columnar format
- **ORC** - Optimized Row Columnar format
- **Arrow/Feather** - Apache Arrow columnar format
- **Lance** - Modern columnar format optimized for ML and vector search
- **Delta Lake** - Delta Lake format
- **Iceberg** - Apache Iceberg format
- **Hudi** - Apache Hudi format

### Database Formats
- **SQLite** - SQLite database files
- **DBF** - dBase/FoxPro database files
- **MySQL Dump** - MySQL dump files
- **PostgreSQL Copy** - PostgreSQL COPY format
- **DuckDB** - DuckDB database files

### Statistical Formats
- **SAS** - SAS data files
- **Stata** - Stata data files
- **SPSS** - SPSS data files
- **R Data** - R RDS and RData files
- **PX** - PC-Axis format

### Scientific Formats
- **NetCDF** - Network Common Data Form for scientific data
- **HDF5** - Hierarchical Data Format

### Geospatial Formats
- **GeoJSON** - Geographic JSON format
- **GeoPackage** - OGC GeoPackage format
- **GML** - Geography Markup Language
- **KML** - Keyhole Markup Language
- **Shapefile** - ESRI Shapefile format
- **MVT/PBF** - Mapbox Vector Tiles
- **TopoJSON** - Topology-preserving GeoJSON extension

### RDF & Semantic Formats
- **JSON-LD** - JSON for Linked Data
- **RDF/XML** - RDF in XML format
- **Turtle** - Terse RDF Triple Language
- **N-Triples** - Line-based RDF format
- **N-Quads** - N-Triples with context

### Feed Formats
- **Atom** - Atom Syndication Format
- **RSS** - Rich Site Summary feed format

### Network Formats
- **PCAP** - Packet Capture format
- **PCAPNG** - PCAP Next Generation format

### Log & Event Formats
- **Apache Log** - Apache access/error logs
- **CEF** - Common Event Format
- **GELF** - Graylog Extended Log Format
- **WARC** - Web ARChive format
- **CDX** - Web archive index format
- **ILP** - InfluxDB Line Protocol

### Email Formats
- **EML** - Email message format
- **MBOX** - Mailbox format
- **MHTML** - MIME HTML format

### Configuration Formats
- **INI** - INI configuration files
- **TOML** - Tom's Obvious Minimal Language
- **YAML** - YAML Ain't Markup Language
- **HOCON** - Human-Optimized Config Object Notation
- **EDN** - Extensible Data Notation

### Office Formats
- **XLS/XLSX** - Microsoft Excel files
- **ODS** - OpenDocument Spreadsheet

### CAD Formats
- **DXF** - AutoCAD Drawing Exchange Format

### Streaming & Big Data Formats
- **Kafka** - Apache Kafka format
- **Pulsar** - Apache Pulsar format
- **Flink** - Apache Flink format
- **Beam** - Apache Beam format
- **RecordIO** - RecordIO format
- **SequenceFile** - Hadoop SequenceFile
- **TFRecord** - TensorFlow Record format

### Protocol & Serialization Formats
- **Protocol Buffers** - Google Protocol Buffers
- **Cap'n Proto** - Cap'n Proto serialization
- **FlatBuffers** - FlatBuffers serialization
- **FlexBuffers** - FlexBuffers format
- **Thrift** - Apache Thrift format
- **ASN.1** - ASN.1 encoding format
- **Ion** - Amazon Ion format

### Other Formats
- **VCF** - Variant Call Format (genomics)
- **iCal** - iCalendar format
- **LDIF** - LDAP Data Interchange Format
- **TXT** - Plain text files

## Supported Compression Codecs

- **GZip** (.gz)
- **BZip2** (.bz2)
- **LZMA** (.xz, .lzma)
- **LZ4** (.lz4)
- **ZIP** (.zip)
- **Brotli** (.br)
- **ZStandard** (.zst, .zstd)
- **Snappy** (.snappy, .sz)
- **LZO** (.lzo, .lzop)
- **SZIP** (.sz)
- **7z** (.7z)
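
As a rough illustration of suffix-based codec selection, the sketch below maps a filename suffix to one of the compression modules that ship with Python. It is not the library's codec layer: `STDLIB_OPENERS` and `open_text` are hypothetical names, and only the gzip, bzip2, and xz codecs are covered.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical mapping from filename suffix to a stdlib opener.
# Only the codecs shipped with Python are covered here.
STDLIB_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}

def open_text(path, mode="rt", encoding="utf-8"):
    """Open a possibly compressed text file based on its last suffix."""
    # Unknown suffixes fall back to the built-in open().
    opener = STDLIB_OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode, encoding=encoding)
```

With this helper, `open_text('data.csv.gz')` and `open_text('data.csv')` return text streams that can be read the same way, which is the idea behind transparent compression support.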

## Requirements

Python 3.10+

## Installation

```bash
pip install iterabledata
```

Or install from source:

```bash
git clone https://github.com/datenoio/iterabledata.git
cd iterabledata
pip install .
```

## Quick Start

### Basic Reading

```python
from iterable.helpers.detect import open_iterable

# Automatically detects format and compression
# Using context manager (recommended)
with open_iterable('data.csv.gz') as source:
    for row in source:
        print(row)
        # Process your data here
# File is automatically closed

# Or manually (still supported)
source = open_iterable('data.csv.gz')
for row in source:
    print(row)
source.close()
```

### Writing Data

```python
from iterable.helpers.detect import open_iterable

# Write compressed JSONL file
# Using context manager (recommended)
with open_iterable('output.jsonl.zst', mode='w') as dest:
    for item in my_data:
        dest.write(item)
# File is automatically closed

# Or manually (still supported)
dest = open_iterable('output.jsonl.zst', mode='w')
for item in my_data:
    dest.write(item)
dest.close()
```

## Usage Examples

### Reading Compressed CSV Files

```python
from iterable.helpers.detect import open_iterable

# Read compressed CSV file (supports .gz, .bz2, .xz, .zst, .lz4, .br, .snappy, .lzo)
source = open_iterable('data.csv.xz')
n = 0
for row in source:
    n += 1
    # Process row data
    if n % 1000 == 0:
        print(f'Processed {n} rows')
source.close()
```

### Reading Different Formats

```python
from iterable.helpers.detect import open_iterable

# Read JSONL file
jsonl_file = open_iterable('data.jsonl')
for row in jsonl_file:
    print(row)
jsonl_file.close()

# Read Parquet file
parquet_file = open_iterable('data.parquet')
for row in parquet_file:
    print(row)
parquet_file.close()

# Read XML file (specify tag name)
xml_file = open_iterable('data.xml', iterableargs={'tagname': 'item'})
for row in xml_file:
    print(row)
xml_file.close()

# Read Excel file
xlsx_file = open_iterable('data.xlsx')
for row in xlsx_file:
    print(row)
xlsx_file.close()
```

### Format Detection and Encoding

```python
from iterable.helpers.detect import open_iterable, detect_file_type
from iterable.helpers.utils import detect_encoding, detect_delimiter

# Detect file type and compression
result = detect_file_type('data.csv.gz')
print(f"Type: {result['datatype']}, Codec: {result['codec']}")

# Detect encoding for CSV files
encoding_info = detect_encoding('data.csv')
print(f"Encoding: {encoding_info['encoding']}, Confidence: {encoding_info['confidence']}")

# Detect delimiter for CSV files
delimiter = detect_delimiter('data.csv', encoding=encoding_info['encoding'])

# Open with detected settings
source = open_iterable('data.csv', iterableargs={
    'encoding': encoding_info['encoding'],
    'delimiter': delimiter
})
source.close()
```
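
For readers curious how delimiter detection can work in principle, the following sketch uses `csv.Sniffer` from the standard library on a text sample. It is an alternative technique shown for illustration, not the implementation behind `detect_delimiter`. The helper name `sniff_delimiter` and the candidate set are assumptions.

```python
import csv

def sniff_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of a CSV text sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer could not decide; a comma is used as a hypothetical default.
        return ","

sample = "id;name;city\n1;Alice;Oslo\n2;Bob;Rome\n"
print(repr(sniff_delimiter(sample)))
```

Note that this helper takes text that has already been decoded, whereas `detect_delimiter` in the example above takes a filename and an encoding.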

### Format Conversion

```python
from iterable.helpers.detect import open_iterable
from iterable.convert.core import convert

# Simple format conversion
convert('input.jsonl.gz', 'output.parquet')

# Convert with options
convert(
    'input.csv.xz',
    'output.jsonl.zst',
    iterableargs={'delimiter': ';', 'encoding': 'utf-8'},
    batch_size=10000
)

# Convert and flatten nested structures
convert(
    'input.jsonl',
    'output.csv',
    is_flatten=True,
    batch_size=50000
)
```

### Using Pipeline for Data Processing

```python
from iterable.helpers.detect import open_iterable
from iterable.pipeline.core import pipeline

source = open_iterable('input.parquet')
destination = open_iterable('output.jsonl.xz', mode='w')

def transform_record(record, state):
    """Transform each record"""
    # Add processing logic
    out = {}
    for key in ['name', 'email', 'age']:
        if key in record:
            out[key] = record[key]
    return out

def progress_callback(stats, state):
    """Called every trigger_on records"""
    print(f"Processed {stats['rec_count']} records, "
          f"Duration: {stats.get('duration', 0):.2f}s")

def final_callback(stats, state):
    """Called when processing completes"""
    print(f"Total records: {stats['rec_count']}")
    print(f"Total time: {stats['duration']:.2f}s")

pipeline(
    source=source,
    destination=destination,
    process_func=transform_record,
    trigger_func=progress_callback,
    trigger_on=1000,
    final_func=final_callback,
    start_state={}
)

source.close()
destination.close()
```
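
The callback flow above can be pictured with a small stand-in loop. This is a sketch under stated assumptions, not the library's `pipeline` implementation: `run_pipeline` is a hypothetical function, the destination is a plain list, and skipping `None` results is an assumed behavior. The `rec_count` and `duration` keys mirror the ones used in the callbacks above.

```python
import time

def run_pipeline(source, destination, process_func, trigger_func=None,
                 trigger_on=1000, final_func=None, start_state=None):
    """Hypothetical stand-in for a pipeline loop; not the library's code."""
    state = {} if start_state is None else start_state
    stats = {"rec_count": 0, "duration": 0.0}
    started = time.monotonic()
    for record in source:
        result = process_func(record, state)
        stats["rec_count"] += 1
        if result is not None:  # assumption: None means "drop this record"
            destination.append(result)  # a real destination would call write()
        if trigger_func and stats["rec_count"] % trigger_on == 0:
            stats["duration"] = time.monotonic() - started
            trigger_func(stats, state)
    stats["duration"] = time.monotonic() - started
    if final_func:
        final_func(stats, state)
    return stats

def keep_name(record, state):
    """Keep only the 'name' field and count records in the shared state."""
    state["seen"] = state.get("seen", 0) + 1
    return {"name": record["name"]}

rows = [{"name": f"user{i}", "age": i} for i in range(5)]
out = []
progress = []
stats = run_pipeline(rows, out, keep_name,
                     trigger_func=lambda s, st: progress.append(s["rec_count"]),
                     trigger_on=2)
```

The `state` dictionary is passed to every callback, so it is a convenient place for counters or lookup tables that need to survive across records.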

### Manual Format and Codec Usage

```python
from iterable.datatypes.jsonl import JSONLinesIterable
from iterable.datatypes.bsonf import BSONIterable
from iterable.codecs.gzipcodec import GZIPCodec
from iterable.codecs.lzmacodec import LZMACodec

# Read gzipped JSONL
read_codec = GZIPCodec('input.jsonl.gz', mode='r', open_it=True)
reader = JSONLinesIterable(codec=read_codec)

# Write LZMA compressed BSON
write_codec = LZMACodec('output.bson.xz', mode='wb', open_it=False)
writer = BSONIterable(codec=write_codec, mode='w')

for row in reader:
    writer.write(row)

reader.close()
writer.close()
```

### Using DuckDB Engine

```python
from iterable.helpers.detect import open_iterable

# Use DuckDB engine for CSV, JSON, JSONL files
# Supported formats: csv, jsonl, ndjson, json
# Supported codecs: gz, zstd, zst
source = open_iterable(
    'data.csv.gz',
    engine='duckdb'
)

# DuckDB engine supports totals
total = source.totals()
print(f"Total records: {total}")

for row in source:
    print(row)
source.close()
```

### Bulk Operations

```python
from iterable.helpers.detect import open_iterable

source = open_iterable('input.jsonl')
destination = open_iterable('output.parquet', mode='w')

# Read and write in batches for better performance
batch = []
for row in source:
    batch.append(row)
    if len(batch) >= 10000:
        destination.write_bulk(batch)
        batch = []

# Write remaining records
if batch:
    destination.write_bulk(batch)

source.close()
destination.close()
```
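
The manual batching loop above can also be factored into a small generator. The sketch below is a generic helper built on `itertools.islice`. The name `batched` is chosen here for illustration and is not part of the library.

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    # islice pulls at most `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch

chunks = list(batched(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With such a helper, the loop above would reduce to `for batch in batched(source, 10000): destination.write_bulk(batch)`, and the final partial batch is handled without a separate check.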

### Working with Excel Files

```python
from iterable.helpers.detect import open_iterable

# Read Excel file (select a sheet by index)
xls_file = open_iterable('data.xlsx', iterableargs={'page': 0})

for row in xls_file:
    print(row)
xls_file.close()

# Read a specific sheet in XLSX by name
xlsx_file = open_iterable('data.xlsx', iterableargs={'page': 'Sheet2'})
for row in xlsx_file:
    print(row)
xlsx_file.close()
```

### XML Processing

```python
from iterable.helpers.detect import open_iterable

# Parse XML with specific tag name
xml_file = open_iterable(
    'data.xml',
    iterableargs={
        'tagname': 'book',
        'prefix_strip': True  # Strip XML namespace prefixes
    }
)

for item in xml_file:
    print(item)
xml_file.close()
```
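
To show what tag-based XML iteration with namespace stripping can look like, here is a standard-library sketch built on `xml.etree.ElementTree.iterparse`. It is not the library's XML reader: `local_name` and `iter_elements` are hypothetical helpers, and mapping each child element to its text is an assumed record layout.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]

def iter_elements(stream, tagname):
    """Yield one dict per element whose local name equals `tagname`."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == tagname:
            yield {local_name(child.tag): child.text for child in elem}
            elem.clear()  # release the subtree that was just consumed

xml_bytes = b"""<catalog xmlns="urn:example">
  <book><title>Dune</title><year>1965</year></book>
  <book><title>Emma</title><year>1815</year></book>
</catalog>"""

books = list(iter_elements(io.BytesIO(xml_bytes), "book"))
print(books)
```

Because `iterparse` emits elements as they are completed, records can be produced without loading the whole document first, which matters for large dumps.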

### Advanced: Converting Compressed XML to Parquet

```python
from iterable.datatypes.xml import XMLIterable
from iterable.datatypes.parquet import ParquetIterable
from iterable.codecs.bz2codec import BZIP2Codec

# Read compressed XML
read_codec = BZIP2Codec('data.xml.bz2', mode='r')
reader = XMLIterable(codec=read_codec, tagname='page')

# Write to Parquet with schema adaptation
writer = ParquetIterable(
    'output.parquet',
    mode='w',
    use_pandas=False,
    adapt_schema=True,
    batch_size=10000
)

batch = []
for row in reader:
    batch.append(row)
    if len(batch) >= 10000:
        writer.write_bulk(batch)
        batch = []

if batch:
    writer.write_bulk(batch)

reader.close()
writer.close()
```

## API Reference

### Main Functions

#### `open_iterable(filename, mode='r', engine='internal', codecargs={}, iterableargs={})`

Opens a file and returns an iterable object.

**Parameters:**
- `filename` (str): Path to the file
- `mode` (str): File mode ('r' for read, 'w' for write)
- `engine` (str): Processing engine ('internal' or 'duckdb')
- `codecargs` (dict): Arguments for codec initialization
- `iterableargs` (dict): Arguments for iterable initialization

**Returns:** Iterable object for the detected file type

#### `detect_file_type(filename)`

Detects file type and compression codec from filename.

**Returns:** Dictionary with `success`, `datatype`, and `codec` keys
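
As a rough picture of filename-based detection, the sketch below splits a filename into suffixes and looks them up in two small tables. It only mimics the documented result keys. `guess_file_type` and both lookup tables are hypothetical, and the real function may return richer values than plain strings.

```python
from pathlib import PurePath

# Hypothetical lookup tables; a real implementation would cover many more.
CODEC_SUFFIXES = {"gz", "bz2", "xz", "zst", "lz4", "br"}
DATATYPE_SUFFIXES = {"csv", "tsv", "json", "jsonl", "ndjson", "parquet", "xml"}

def guess_file_type(filename):
    """Guess data type and codec from the suffixes of `filename`."""
    suffixes = [s.lstrip(".").lower() for s in PurePath(filename).suffixes]
    codec = None
    # A trailing compression suffix is peeled off first.
    if suffixes and suffixes[-1] in CODEC_SUFFIXES:
        codec = suffixes.pop()
    datatype = None
    if suffixes and suffixes[-1] in DATATYPE_SUFFIXES:
        datatype = suffixes[-1]
    return {"success": datatype is not None, "datatype": datatype, "codec": codec}

print(guess_file_type("data.csv.gz"))
```

Checking `success` before using `datatype` is a sensible pattern whenever a filename might not carry a recognizable extension.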

#### `convert(fromfile, tofile, iterableargs={}, scan_limit=1000, batch_size=50000, silent=True, is_flatten=False)`

Converts data between formats.

**Parameters:**
- `fromfile` (str): Source file path
- `tofile` (str): Destination file path
- `iterableargs` (dict): Options for iterable
- `scan_limit` (int): Number of records to scan for schema detection
- `batch_size` (int): Batch size for bulk operations
- `silent` (bool): Suppress progress output
- `is_flatten` (bool): Flatten nested structures
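
To illustrate what flattening means for a nested record, here is a minimal sketch that joins nested keys with a dot. The key naming scheme and the `flatten` helper are assumptions for illustration. The output produced by `is_flatten=True` may be shaped differently.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level using `sep`-joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dicts and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            # Lists and scalars are kept as-is in this sketch.
            flat[full_key] = value
    return flat

row = {"id": 1, "user": {"name": "Alice", "geo": {"city": "Oslo"}}, "tags": ["a", "b"]}
print(flatten(row))
```

Flattening of this kind is what makes a nested JSON Lines record fit into a tabular target such as CSV.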

### Iterable Methods

All iterable objects support:

- `read()` - Read single record
- `read_bulk(num)` - Read multiple records
- `write(record)` - Write single record
- `write_bulk(records)` - Write multiple records
- `reset()` - Reset iterator to beginning
- `close()` - Close file handles
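
The method list above can be pictured with a tiny in-memory class. This is only a sketch of the interface shape: `ListIterable` is a hypothetical name, it is backed by a Python list rather than a file, and signalling the end of data by raising `StopIteration` from `read()` is an assumption.

```python
class ListIterable:
    """In-memory stand-in exposing the method names listed above."""

    def __init__(self, records=None):
        self._records = list(records or [])
        self._pos = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()

    def __iter__(self):
        return self

    def __next__(self):
        return self.read()

    def read(self):
        """Return the next record; raise StopIteration at the end (assumed)."""
        if self._pos >= len(self._records):
            raise StopIteration
        record = self._records[self._pos]
        self._pos += 1
        return record

    def read_bulk(self, num):
        """Return up to `num` records as a list."""
        chunk = self._records[self._pos:self._pos + num]
        self._pos += len(chunk)
        return chunk

    def write(self, record):
        self._records.append(record)

    def write_bulk(self, records):
        self._records.extend(records)

    def reset(self):
        """Rewind so that iteration starts from the first record again."""
        self._pos = 0

    def close(self):
        # Nothing to release in memory; just mark the data as consumed.
        self._pos = len(self._records)
```

A class like this can be handy in unit tests as a replacement for a real file-backed iterable, since code written against these method names does not need to know where the records come from.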

## Engines

### Internal Engine (Default)

The internal engine uses pure Python implementations for all formats. It supports all file types and compression codecs.

### DuckDB Engine

The DuckDB engine provides high-performance querying capabilities for supported formats:
- **Formats**: CSV, JSONL, NDJSON, JSON
- **Codecs**: GZIP (.gz), ZStandard (.zst, .zstd)
- **Features**: Fast querying, totals counting, SQL-like operations

Use `engine='duckdb'` when opening files:

```python
source = open_iterable('data.csv.gz', engine='duckdb')
```

## Examples Directory

See the [examples](examples/) directory for more complete examples:

- `simplewiki/` - Processing Wikipedia XML dumps

## More Examples and Tests

See the [tests](tests/) directory for comprehensive usage examples and test cases.

## AI Integration Guides

IterableData can be integrated with AI platforms and frameworks for intelligent data processing:

- **[AI Frameworks](docs/integrations/AI_FRAMEWORKS.md)** - Integration with LangChain, CrewAI, and AutoGen
  - Tool creation for data reading and format conversion
  - Schema inference and data quality analysis
  - Multi-agent workflows for data processing
  
- **[OpenAI](docs/integrations/OPENAI.md)** - Direct OpenAI API integration (GPT-4, GPT-3.5, etc.)
  - Function calling and Assistants API
  - Structured outputs for consistent results
  - Natural language data analysis and transformation
  
- **[Claude](docs/integrations/CLAUDE.md)** - Anthropic Claude AI integration
  - Claude API integration with tools support
  - Intelligent data analysis and schema inference
  - Format conversion with AI guidance
  - Data quality assessment and documentation
  
- **[Gemini](docs/integrations/GEMINI.md)** - Google Gemini AI integration
  - Natural language data analysis
  - Intelligent format conversion with AI guidance
  - Schema documentation and data quality assessment
  - Function calling integration

These guides provide patterns, examples, and best practices for combining IterableData's unified data interface with AI capabilities.

## Related Projects

This library is used in:
- [undatum](https://github.com/datacoon/undatum) - Command line data processing tool
- [datacrafter](https://github.com/apicrafter/datacrafter) - Data processing ETL engine

## License

MIT License

## Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues.

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for detailed version history.

### Version 1.0.8 (2026-01-05)
- **AI Integration Guides**: Added comprehensive guides for LangChain, CrewAI, AutoGen, and Google Gemini AI
- **Documentation**: Added capability matrix and enhanced API documentation
- **Development Tools**: Added benchmarking and utility scripts
- **Code Improvements**: Enhanced format detection, codecs, and data type handlers
- **Examples**: Added ZIP XML processing example

### Version 1.0.7 (2024-12-15)
- **Major Format Expansion**: Added support for 50+ new data formats across multiple categories
- **Enhanced Compression**: Added LZO, Snappy, and SZIP codec support
- **CI/CD**: Added GitHub Actions workflows for automated testing and deployment
- **Documentation**: Complete documentation site with Docusaurus
- **Testing**: Comprehensive test suite for all formats

### Version 1.0.6
- Comprehensive documentation enhancements
- GitHub Actions release workflow
- Improved examples and use cases

### Version 1.0.5
- DuckDB engine support
- Enhanced format detection
- Pipeline processing framework
- Bulk operations support
