Metadata-Version: 2.1
Name: pysparkformat
Version: 0.0.3
Summary: Collection of Apache Spark Custom Data Formats
Author-email: Ilya Aniskovets <ilya@aniskovets.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/aig/pysparkformat
Project-URL: Issues, https://github.com/aig/pysparkformat/issues
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: pyarrow
Requires-Dist: pandas
Requires-Dist: grpcio
Requires-Dist: grpcio-status

# pysparkformat

Apache Spark 4.0 introduces the Python Data Source API, which lets us write custom data sources in Python.
Data sources written this way can be registered and used in any PySpark project.
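
As a rough illustration of what such a data source looks like, here is a minimal sketch built on the `pyspark.sql.datasource` base classes from PySpark 4.0. The `CounterDataSource` and `CounterReader` classes, the `counter` format name, and the `count` option are hypothetical and are not part of this package.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Assumption: without PySpark 4.0 installed, plain base classes are
    # enough to try the reader logic on its own.
    DataSource = DataSourceReader = object


class CounterReader(DataSourceReader):
    """Hypothetical reader that yields (id, label) rows."""

    def __init__(self, options):
        # Spark passes option values as strings, so convert explicitly.
        self.count = int(options.get("count", 3))

    def read(self, partition):
        # Each yielded tuple becomes one row matching the declared schema.
        for i in range(self.count):
            yield (i, f"row-{i}")


class CounterDataSource(DataSource):
    """Hypothetical data source exposed under the format name 'counter'."""

    def __init__(self, options):
        self.options = options

    @classmethod
    def name(cls):
        return "counter"

    def schema(self):
        return "id INT, label STRING"

    def reader(self, schema):
        return CounterReader(self.options)
```

With a Spark 4.0 session, a class like this could be registered with `spark.dataSource.register(CounterDataSource)` and read with `spark.read.format("counter").option("count", 5).load()`.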

This project collects the custom PySpark data formats that I have created for my own projects.

Here is what we have so far:
 * http-csv : A custom data source that reads CSV files from HTTP.
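
To make the idea concrete, the sketch below shows what reading CSV over HTTP involves, using only the Python standard library. It is not the implementation used by `http-csv`; the `fetch_text` and `parse_csv` helpers are hypothetical names chosen for illustration.

```python
import csv
import io
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download the body of `url` and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


def parse_csv(text, header=True):
    """Split CSV text into (column_names, rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    if header:
        # The first line carries the column names.
        return rows[0], rows[1:]
    # Without a header, generate positional names like Spark's _c0, _c1, ...
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows
```

A call such as `columns, rows = parse_csv(fetch_text(url))` would then return the header and the data rows as lists of strings.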

You are welcome to contribute new formats or improvements to the existing ones.

Usage:
```bash
pip install pyspark==4.0.0.dev2
pip install pysparkformat
```

You can also use this package in Databricks notebooks (tested with Databricks Runtime 15.4 LTS).
Install it on a general-purpose cluster with the following command:
```bash
%pip install pysparkformat
```

```python
from pyspark.sql import SparkSession
from pysparkformat.http.csv import HTTPCSVDataSource

# you can comment out the following line if you are running this code in Databricks
spark = SparkSession.builder.appName("custom-datasource-example").getOrCreate()

# uncomment to disable format check for Databricks Runtime
# spark.conf.set("spark.databricks.delta.formatCheck.enabled", False)

spark.dataSource.register(HTTPCSVDataSource)

url = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
df = spark.read.format("http-csv").option("header", True).load(url)
df.show()  # or use display(df) in Databricks
```
