Metadata-Version: 2.1
Name: df_to_rs
Version: 0.1.29
Summary: A package to upload Pandas DataFrame to Redshift
Home-page: https://github.com/ankitgoel888/df_to_rs
Author: Ankit Goel
Author-email: ankitgoel888@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6, <4.1
Description-Content-Type: text/markdown
License-File: LICENSE.txt

# df_to_rs

`df_to_rs` is a Python package that provides efficient methods to upload, upsert and manage Pandas DataFrames in Amazon Redshift using S3 as an intermediary.

## Key Features

- Direct DataFrame to Redshift upload
- Upsert functionality (update + insert)
- Delete and insert operations
- Large dataset handling with chunking
- Support for JSON/dict/list columns (Redshift SUPER)
- AWS IAM Role support for secure authentication
- Automatic cleanup of temporary S3 files
- Optimized NULL handling in upsert operations
- Proper NULL value preservation across all data types

## Installation

```bash
pip install df_to_rs
```

## Usage

### 1. Initialize with AWS Credentials

```python
from df_to_rs import df_to_rs
import psycopg2

# Connect to Redshift
redshift_conn = psycopg2.connect(
    dbname='your_db',
    host='your-cluster.region.redshift.amazonaws.com',
    port=1433,
    user='your_user',
    password='your_password'
)
redshift_conn.set_session(autocommit=True)

# Initialize with explicit credentials
uploader = df_to_rs(
    region_name='ap-south-1',
    s3_bucket='your-s3-bucket',
    aws_access_key_id='your-access-key-id',
    aws_secret_access_key='your-secret-access-key',
    redshift_c=redshift_conn
)
```

### 2. Initialize using EC2 Instance Role (Recommended)

```python
# No AWS credentials needed when using instance role
uploader = df_to_rs(
    region_name='ap-south-1',
    s3_bucket='your-s3-bucket',
    redshift_c=redshift_conn
)
```

### 3. Basic Upload

Upload a DataFrame to a Redshift table:

```python
# Simple upload
uploader.upload_to_redshift(
    df=your_dataframe,
    dest='schema.table_name'
)
```

### 4. Upsert Operation

Update existing records and insert new ones based on key columns:

```python
# Upsert based on specific columns
uploader.upsert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    upsert_columns=['id', 'unique_key'],  # Columns to match existing records
    clear_dest_table=False  # Set True to truncate table before insert
)
```

#### Optimized NULL Handling in Upserts

The package includes optimized handling for NULL values in upsert key columns:

- Automatically splits processing for records with and without NULL values in key columns
- Uses simplified SQL for non-NULL records (better performance)
- Correctly matches records where keys contain NULL values
- Handles compound keys with a mix of NULL and non-NULL values
- Preserves NULL values in all data types (including numeric columns) during transfer

```python
# Example with NULL values in key columns
df = pd.DataFrame({
    'id': [1, 2, 3, None],
    'code': ['A', 'B', None, 'D'],
    'value': [100, 200, 300, 400]
})

# Correctly handles NULL matching in any key column
# and preserves NULLs in all column types
uploader.upsert_to_redshift(
    df=df,
    dest_table='schema.table_name',
    upsert_columns=['id', 'code']
)
```

### 5. Delete and Insert

Delete records matching a condition and insert new data:

```python
# Delete and insert with condition
uploader.delete_and_insert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    filter_cond="date >= CURRENT_DATE - 7"  # SQL condition for deletion
)

# Delete and insert with timestamp precision
uploader.delete_and_insert_to_redshift(
    df=your_dataframe,
    dest_table='schema.table_name',
    filter_cond="date >= CURRENT_DATE - 7",  # SQL condition for broad deletion
    min_timestamp='2024-06-01 00:00:00',     # Minimum timestamp value from DataFrame
    timestamp_col='created_at'               # Column to use for timestamp filtering
)
```

## Special Data Types

### NULL Value Handling

The package properly preserves NULL values in all data types:

```python
# DataFrame with NULL values in different data types
df = pd.DataFrame({
    'id': [1, 2, None, 4],                       # Integer with NULL
    'value': [10.5, None, 30.75, 40.25],         # Float with NULL
    'code': ['A', 'B', None, 'D'],               # String with NULL
    'date': [date(2025,1,1), None, date(2025,3,1), date(2025,4,1)]  # Date with NULL
})

# All NULL values will be properly preserved in Redshift
uploader.upload_to_redshift(df, 'schema.table_name')
```

### JSON/Dictionary Columns

The package automatically handles JSON/dict/list columns for Redshift SUPER type:

```python
# DataFrame with JSON column
df = pd.DataFrame({
    'id': [1, 2],
    'json_data': [{'key': 'value'}, {'other': 'data'}]
})

# Will be automatically converted for Redshift SUPER column
uploader.upload_to_redshift(df, 'schema.table_name')
```

## Large Dataset Handling

The package automatically handles large datasets by:

- Chunking data into 1 million row segments
- Streaming to S3 in memory
- Automatic cleanup of temporary files
- Progress tracking with timestamps

## Error Handling

- Automatic transaction rollback on errors
- S3 temporary file cleanup
- Detailed error messages and timestamps
- Safe staging table management for upserts

## Logging and Diagnostics

All operations include detailed logging with timestamps and table names in the format:
```
[YYYY-MM-DD HH:MM:SS] [table_name] Operation message
```

This helps in debugging and monitoring data transfer operations across multiple tables.

## AWS IAM Role Requirements

When using instance roles, ensure your role has these permissions:

- S3: PutObject, GetObject, DeleteObject on the specified bucket
- Redshift: COPY command permissions
- IAM: AssumeRole permissions if needed

## Best Practices

1. Use instance roles instead of access keys when possible
2. Set appropriate column types in Redshift, especially for SUPER columns
3. Create tables with appropriate sort and dist keys before uploading
4. Monitor the Redshift query logs for performance optimization

## License

This project is licensed under the MIT License - see the LICENSE file for details

# Changelog

All notable changes to df_to_rs will be documented in this file.

## [0.1.28] - 2025-01-16

### Fixed

- Fixed FutureWarning by replacing deprecated DataFrame.applymap() with DataFrame.map() for pandas 2.x compatibility
- Enhanced all print statements with timestamps and table names for better diagnostics
- Added datetime module import for timestamp formatting

### Changed

- All logging output now includes [timestamp] [table_name] prefix format for easier debugging
- Improved diagnostic capability with consistent timestamp format '%Y-%m-%d %H:%M:%S'

## [0.1.27] - 2025-05-05

### Fixed

- Improved NULL value handling in DataFrames to ensure proper conversion to Redshift NULL values
- Fixed issue where NULL values were being converted to empty strings in Redshift
- Added proper NULL handling for numeric columns during data transfer
- Standardized NULL representation using 'NULL' in CSV files and COPY commands

## [0.1.26] - 2025-05-04

### Added

- Optimized NULL handling in upsert operations
- Split processing for records with and without NULL values in key columns 
- Improved SQL generation for better performance with NULL key values
- Added explicit handling of NULL columns in upsert operations

### Changed

- Enhanced documentation for NULL handling features
- Improved logging and timing information for NULL vs non-NULL operations

## [0.1.25] - 2025-04-15

### Added

- Enhanced `delete_and_insert_to_redshift` with additional timestamp-based deletion to ensure no duplicates
- Added support for min_timestamp and timestamp_col parameters for more precise data control

## [0.1.24] - 2025-01-26

### Added

- Documentation Improved

## [0.1.23] - 2025-01-26

### Added

- Support for instance role-based authentication in AWS
- Handling of JSON/dict/list objects for Redshift SUPER columns
- Proper cleanup of S3 temporary files

### Changed

- Made AWS credentials optional in constructor
- Optimized DataFrame processing with unified applymap operations
- Improved string column handling for better type safety

### Fixed

- S3 resource cleanup in error scenarios
- Transaction handling in delete_and_insert_to_redshift
