Metadata-Version: 2.4
Name: parquet-path-rewriter
Version: 0.1.2
Summary: A library to rewrite relative Parquet file paths in Python code using AST manipulation.
Author-email: Rafael Sales <rafael.sales@gmail.com>
Project-URL: Homepage, https://github.com/dmux/parquet-path-rewriter
Project-URL: Bug Tracker, https://github.com/dmux/parquet-path-rewriter/issues
Keywords: ast,parquet,rewrite,refactor,pyspark,pandas
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Code Generators
Classifier: Intended Audience :: Developers
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Provides-Extra: dev
Requires-Dist: pytest>=8.3.5; extra == "dev"
Requires-Dist: pytest-cov>=6.1.1; extra == "dev"
Requires-Dist: build>=0.10; extra == "dev"

# Parquet Path Rewriter

[![PyPI version](https://badge.fury.io/py/parquet-path-rewriter.svg)](https://badge.fury.io/py/parquet-path-rewriter)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A Python library to automatically rewrite relative Parquet file paths within Python code strings. It uses Abstract Syntax Tree (AST) manipulation to find calls like `spark.read.parquet('relative/path')` or `df.write.parquet(path='other/path')` and prepends a specified base directory path, making them absolute _within that base context_.

This is useful when the same code must run in environments whose data root directories differ (e.g., local vs. cluster). The paths in a source code string are adapted _before_ execution or analysis, so the original source keeps its relative paths.

## Features

- Identifies `.parquet()` method calls (heuristic based on common Spark/Pandas patterns like `read.parquet` and `write.parquet`).
- Rewrites relative string literal paths passed as the first positional argument or using the `path=` keyword argument.
- Prepends a specified `base_path` (string or `pathlib.Path`).
- Leaves absolute paths untouched; `s3://` URIs are rewritten only when an `s3_rewrite_prefix` is provided.
- Ignores paths that are not string literals (e.g., variables, f-strings).
- Keeps track of which paths were rewritten (original -> new mapping).
- Identifies original paths used in likely _read_ operations.
- Uses Python's `ast` module for safe code transformation.

## Installation

```bash
pip install parquet-path-rewriter
```

## Usage

The primary way to use the library is through the `rewrite_parquet_paths_in_code` function.

```python
from pathlib import Path

# If running from a source checkout without installing the package,
# make sure `src` is on sys.path first:
# import sys
# sys.path.insert(0, str(Path(__file__).resolve().parent.parent / 'src'))

from parquet_path_rewriter import rewrite_parquet_paths_in_code

# --- Example Code ---
# Simulate a Python script that uses Spark or Pandas to read/write Parquet
original_python_code = """
import pyspark.sql

# Assume spark session is created elsewhere
# spark = SparkSession.builder.appName("ETLExample").getOrCreate()

print("Starting ETL process...")

# Read input data
customers_df = spark.read.parquet("raw_data/customers")
orders_df = spark.read.parquet(path="raw_data/orders_2023")

# Some transformations (placeholder)
processed_df = customers_df.join(orders_df, "customer_id")

# Write intermediate results
processed_df.write.mode("overwrite").parquet("staging/customer_orders")

# Read another input for final step
products_df = spark.read.parquet('reference_data/products.parquet')

# Final join and write output
final_df = processed_df.join(products_df, "product_id")
output_path = "final_output/report_data"  # Plain assignment, not a .parquet() call: left unchanged
final_df.write.mode("overwrite").parquet(path="final_output/report_data")  # Literal passed via the path= keyword: rewritten

# Example with an absolute path (should not be changed)
logs_df = spark.read.parquet("/mnt/shared/logs/app_logs.parquet")

# S3 example (should be rewritten)
s3_df = spark.read.parquet("s3://mybucket/data/2023/spark_logs")

# Write to S3 (should be rewritten)
s3_df.write.mode("overwrite").parquet("s3://mybucket/output/processed_logs")

print("ETL process finished.")
"""

# --- Library Usage ---

# Define the base directory where the relative paths should point
# This would typically be determined by your execution environment or configuration
# Use absolute paths for clarity
data_root_directory = Path("/user/project/data").resolve()

s3_rewrite_prefix = "s3://newbucket/data/2023"

print("-" * 30)
print(f"Base Path: {data_root_directory}")
print("-" * 30)
print("Original Code:")
print(original_python_code)
print("-" * 30)

try:
    # Call the library function to rewrite the code
    modified_code, rewritten_map, identified_inputs = rewrite_parquet_paths_in_code(
        code_string=original_python_code,
        base_path=data_root_directory,
        s3_rewrite_prefix=s3_rewrite_prefix,
    )

    print("Modified Code:")
    print(modified_code)
    print("-" * 30)

    print("Rewritten Paths (Original -> New):")
    if rewritten_map:
        for original, new in rewritten_map.items():
            print(f"  '{original}' -> '{new}'")
    else:
        print("  No paths were rewritten.")
    print("-" * 30)

    print("Identified Input Paths (Original):")
    if identified_inputs:
        for path in identified_inputs:
            print(f"  '{path}'")
    else:
        print("  No input paths were identified.")
    print("-" * 30)

except SyntaxError as e:
    print(f"\nError: Invalid Python syntax in the input code.\n{e}")
except TypeError as e:
    print(f"\nError: Invalid base_path provided.\n{e}")
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")

```

## How it Works

The library parses the input Python code string into an Abstract Syntax Tree (AST) using Python's built-in `ast` module. It then walks through this tree using a custom `ast.NodeTransformer`. When it encounters a function call node:

1. It checks if the called attribute is named `parquet`.

2. It analyzes the call chain (e.g., `spark.read.parquet`) to heuristically determine whether it's a **read** or **write** operation.

3. It searches for a string literal path in the arguments (either as the first positional argument or as a keyword argument like `path='...'`).

4. If a valid path string is found, the path is transformed based on the configuration:

   - If the path is **relative**, it is rewritten to:  
     `base_path / <filename>.parquet`
   - If the path is an **S3 URI** and `s3_rewrite_prefix` is provided, it is rewritten to:  
     `<s3_rewrite_prefix>/<filename>.parquet`
   - If the path is **absolute** (e.g., `/data/file.parquet`, or an `s3://` URI when no `s3_rewrite_prefix` is given) and therefore does not match the rewrite criteria, it is left untouched.

5. It replaces the original path node in the AST with a new node containing the modified path string.

6. Finally, the modified AST is converted back into a Python code string using `ast.unparse()` (Python 3.9+).
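
The steps above can be sketched with the standard `ast` module alone. This is a minimal illustration under stated assumptions, not the library's actual implementation: the class name `ParquetPathSketch` and the helper `call_chain` are hypothetical, the read/write check is a plain name lookup, S3 handling is omitted, and a relative path is joined onto the base path as a whole, which may differ from the library's exact naming rule.

```python
import ast
from pathlib import PurePosixPath


def call_chain(node: ast.AST) -> list[str]:
    """Collect the names along a chain such as spark.read.parquet (hypothetical helper)."""
    names = []
    while True:
        if isinstance(node, ast.Attribute):
            names.append(node.attr)
            node = node.value
        elif isinstance(node, ast.Call):
            node = node.func  # step through intermediate calls like .mode("overwrite")
        elif isinstance(node, ast.Name):
            names.append(node.id)
            break
        else:
            break
    return names[::-1]


class ParquetPathSketch(ast.NodeTransformer):
    """Illustrative transformer; not the library's real class."""

    def __init__(self, base_path: str) -> None:
        self.base_path = PurePosixPath(base_path)
        self.rewritten: dict[str, str] = {}
        self.inputs: list[str] = []

    def _rewrite(self, arg: ast.expr, is_read: bool) -> ast.expr:
        # Only plain string literals are candidates (step 3).
        if not (isinstance(arg, ast.Constant) and isinstance(arg.value, str)):
            return arg
        original = arg.value
        if is_read:
            self.inputs.append(original)
        # Assumption: absolute paths and URIs are left alone in this sketch.
        if original.startswith("/") or "://" in original:
            return arg
        new_path = str(self.base_path / original)
        self.rewritten[original] = new_path
        # Step 5: swap in a new constant node at the same source location.
        return ast.copy_location(ast.Constant(value=new_path), arg)

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested calls first
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == "parquet":  # step 1
            chain = call_chain(func)
            is_read = "read" in chain and "write" not in chain  # step 2
            if node.args:
                node.args[0] = self._rewrite(node.args[0], is_read)
            else:
                for kw in node.keywords:
                    if kw.arg == "path":
                        kw.value = self._rewrite(kw.value, is_read)
        return node


source = (
    'df = spark.read.parquet("raw/customers")\n'
    'df.write.mode("overwrite").parquet(path="out/report")\n'
    'logs = spark.read.parquet("/mnt/logs")\n'
)
sketch = ParquetPathSketch("/data")
tree = ast.fix_missing_locations(sketch.visit(ast.parse(source)))
result = ast.unparse(tree)  # step 6
print(result)
print(sketch.rewritten)
```

Subclassing `ast.NodeTransformer` keeps the matching logic in a single `visit_Call` method, and calling `generic_visit` first ensures that calls nested inside other calls are also examined.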

## Limitations

- **Call Pattern Specificity:** Only identifies calls where the method name is directly `.parquet(...)`. It does **not** currently support more dynamic usage like `spark.read.format("parquet").load("...")`. Extending this requires deeper AST pattern matching.

- **String Literals Only:** Only rewrites paths passed as **direct string literals** (e.g., `'path/to/file'`, `"data/file"`). It **ignores** paths built via variables, f-strings, or function returns.

- **Heuristic Read/Write Detection:** Read vs. write detection is **heuristic**: it checks whether `read` or `write` appears in the call chain. This works for typical Spark/Pandas patterns, but calls whose chains use other names may be misclassified.

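- **AST Unparsing:** Relies on `ast.unparse` (Python 3.9+) to reconstruct the modified code. Because the package itself requires Python 3.9 or later, a backport such as [`astunparse`](https://pypi.org/project/astunparse/) is only relevant if you adapt the approach to older interpreters. The regenerated code may differ in formatting from the input; for example, comments are dropped and string quotes are normalized.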
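
To make these limitations concrete, the snippet below uses only the standard `ast` module to show how different argument forms and call patterns look in the tree, and what `ast.unparse` does to comments and quotes. The `classify` helper is hypothetical and only mirrors the rules described above; it is not part of the library.

```python
import ast

snippet = '''
df1 = spark.read.parquet("data/a")                  # string literal
df2 = spark.read.parquet(input_path)                # variable
df3 = spark.read.parquet(f"data/{year}")            # f-string
df4 = spark.read.format("parquet").load("data/b")   # different call pattern
'''


def classify(call: ast.Call) -> str:
    """Describe how a literal-only `.parquet()` matcher would treat this call."""
    func = call.func
    if not (isinstance(func, ast.Attribute) and func.attr == "parquet"):
        return "not a .parquet() call"
    if not call.args:
        return "no positional argument"
    arg = call.args[0]
    if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
        return "string literal"
    # Variables are ast.Name nodes, f-strings are ast.JoinedStr nodes.
    return f"skipped ({type(arg).__name__})"


report = {}
for stmt in ast.parse(snippet).body:
    # Every statement in the snippet has the form `name = <call>`.
    report[stmt.targets[0].id] = classify(stmt.value)

for name, verdict in report.items():
    print(f"{name}: {verdict}")

# ast.unparse regenerates code from the tree, so comments and the
# original quote style are not preserved.
roundtrip = ast.unparse(ast.parse('x = "a"  # note'))
print(roundtrip)
```

If preserving comments and formatting matters, the rewritten code is better treated as an execution artifact than as a replacement for the original source file.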

## Contributing

Contributions are welcome! If you encounter a bug or have an enhancement idea, feel free to [open an issue](https://github.com/dmux/parquet-path-rewriter/issues) or submit a pull request.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
