Metadata-Version: 2.3
Name: pyspark-data-sources
Version: 0.1.6
Summary: Custom Spark data sources for reading and writing data in Apache Spark, using the Python Data Source API
License: Apache-2.0
Author: allisonwang-db
Author-email: allison.wang@databricks.com
Requires-Python: >=3.9,<=3.12
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Provides-Extra: all
Provides-Extra: databricks
Provides-Extra: datasets
Provides-Extra: faker
Provides-Extra: kaggle
Provides-Extra: lance
Requires-Dist: databricks-sdk (>=0.28.0,<0.29.0) ; extra == "databricks" or extra == "all"
Requires-Dist: datasets (>=2.17.0,<3.0.0) ; extra == "datasets" or extra == "all"
Requires-Dist: faker (>=23.1.0,<24.0.0) ; extra == "faker" or extra == "all"
Requires-Dist: kagglehub[pandas-datasets] (>=0.3.10,<0.4.0) ; extra == "kaggle" or extra == "all"
Requires-Dist: mkdocstrings[python] (>=0.24.0,<0.25.0)
Requires-Dist: pyarrow (>=11.0.0)
Requires-Dist: requests (>=2.31.0,<3.0.0)
Description-Content-Type: text/markdown

# pyspark-data-sources

[![pypi](https://img.shields.io/pypi/v/pyspark-data-sources.svg?color=blue)](https://pypi.org/project/pyspark-data-sources/)

This repository showcases custom Spark data sources built using the new [**Python Data Source API**](https://spark.apache.org/docs/4.0.0/api/python/tutorial/sql/python_data_source.html) introduced in Apache Spark 4.0.
For an in-depth understanding of the API, please refer to the [API source code](https://github.com/apache/spark/blob/master/python/pyspark/sql/datasource.py).
Note that this repo is **demo only** and is not intended for production use.
Contributions and feedback are welcome to help improve the examples.


## Installation
```
pip install pyspark-data-sources
```

Some data sources need optional dependencies, which are published as extras (`databricks`, `datasets`, `faker`, `kaggle`, `lance`, or `all`). For example, `pip install "pyspark-data-sources[faker]"` installs the `faker` extra.

## Usage
Make sure PySpark 4.0.0 or later is installed:

```
pip install pyspark
```

Alternatively, use [Databricks Runtime 15.4 LTS](https://docs.databricks.com/aws/en/release-notes/runtime/15.4lts) or later, or [Databricks Serverless](https://docs.databricks.com/aws/en/compute/serverless/).


```python
from pyspark_datasources.fake import FakeDataSource

# Register the data source
spark.dataSource.register(FakeDataSource)

spark.read.format("fake").load().show()
```
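
To give a feel for what sits behind a registered source such as `fake`, here is a minimal sketch of a custom data source written against the Python Data Source API. The names `CounterDataSource`, `CounterReader`, the short name `counter`, and the `count` option are hypothetical and are not part of this package. The fallback base classes are also an assumption of this sketch; they exist only so the reader logic can be tried without Spark installed.

```python
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Assumption for this sketch only: minimal stand-ins so the reader logic
    # can be exercised without Spark. Real use requires pyspark >= 4.0.0.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


class CounterReader(DataSourceReader):
    """Hypothetical reader that yields a fixed number of rows."""

    def __init__(self, options):
        # Options arrive as strings, so convert explicitly.
        self.count = int(options.get("count", 3))

    def read(self, partition):
        # Each yielded tuple must match the schema declared by the data source.
        for i in range(self.count):
            yield (i, f"row-{i}")


class CounterDataSource(DataSource):
    """Hypothetical data source exposed under the short name 'counter'."""

    @classmethod
    def name(cls):
        return "counter"

    def schema(self):
        # Schema given as a DDL string.
        return "id int, label string"

    def reader(self, schema):
        return CounterReader(self.options)
```

With a Spark 4.0 session available, such a class would be registered and read the same way as above: `spark.dataSource.register(CounterDataSource)` followed by `spark.read.format("counter").option("count", 2).load().show()`.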

## Example Data Sources

| Data Source | Short Name | Description | Dependencies |
|-------------|------------|-------------|--------------|
| [GithubDataSource](pyspark_datasources/github.py) | `github` | Read pull requests from a GitHub repository | None |
| [FakeDataSource](pyspark_datasources/fake.py) | `fake` | Generate fake data using the `Faker` library | `faker` |
| [StockDataSource](pyspark_datasources/stock.py) | `stock` | Read stock data from Alpha Vantage | None |
| [SimpleJsonDataSource](pyspark_datasources/simplejson.py) | `simplejson` | Read JSON data from a file | `databricks-sdk` |

See the [documentation](https://allisonwang-db.github.io/pyspark-data-sources/) for more data sources.

## Official Data Sources

For production use, consider these official data source implementations built with the Python Data Source API:

| Data Source | Repository | Description | Features |
|-------------|------------|-------------|----------|
| **HuggingFace Datasets** | [@huggingface/pyspark_huggingface](https://github.com/huggingface/pyspark_huggingface) | Production-ready Spark Data Source for 🤗 Hugging Face Datasets | • Stream datasets as Spark DataFrames<br>• Select subsets/splits with filters<br>• Authentication support<br>• Save DataFrames to Hugging Face |

## Contributing
We welcome and appreciate contributions that enhance and expand the custom data sources:

- **Add New Data Sources**: Want to add a new data source using the Python Data Source API? Submit a pull request or open an issue.
- **Suggest Enhancements**: If you have ideas to improve a data source or the API, we'd love to hear them!
- **Report Bugs**: Found something that doesn't work as expected? Let us know by opening an issue.


## Development
### Environment Setup
```
poetry install --all-extras
poetry shell
```

### Build Docs
```
mkdocs serve
```

