Metadata-Version: 2.4
Name: django-bigquery-exporter
Version: 0.2.3
Summary: A Django plugin for exporting CMS data to Google BigQuery.
Home-page: https://www.industrydive.com/
Author: Dave Frame
Author-email: dframe22@gmail.com
License: MIT License
Classifier: Environment :: Web Environment
Classifier: Framework :: Django
Classifier: Framework :: Django :: 3.2
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Requires-Python: >=3.8
License-File: LICENSE
Requires-Dist: Django>=3.2.4
Requires-Dist: google-auth>=1.32.1
Requires-Dist: google-cloud-bigquery>=2.30.1
Requires-Dist: pytz>=2019.1
Dynamic: license-file
Dynamic: requires-dist

===============================
Django BigQuery Exporter
===============================

A Django application that provides a convenient way to export data from your Django models to Google BigQuery.

Features
-----------

* Exports Django model data to BigQuery tables
* Processes data in configurable batch sizes to manage memory usage
* Handles date/time formats and UUID fields automatically
* Allows custom field transformations with a simple decorator
* Validates that model fields match BigQuery table schema
* Provides retry mechanisms for resilient exports
* Supports incremental exports with date filtering
* Handles potential exceptions during data export with detailed error reporting

Installation
------------

::

    pip install django-bigquery-exporter

Requirements
------------

* Python 3.8+
* Django
* google-cloud-bigquery
* google-api-python-client

Authentication
--------------

You need to authenticate with Google Cloud to use BigQuery. There are two main ways:

1. Using environment variables (recommended for production)::

    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"

2. Providing credentials directly in code (useful for development)::

    exporter = MyExporter(
        project="your-google-cloud-project-id",
        credentials="/path/to/your/credentials.json"
    )

Basic Usage
-----------

Create a subclass of ``BigQueryExporter`` and define the necessary attributes::

    from bigquery_exporter.base import BigQueryExporter, custom_field

    class BookExporter(BigQueryExporter):
        model = Book
        fields = ['id', 'title', 'author', 'publication_date', 'genre', 'rating']
        batch = 1000
        table_name = 'your_project.your_dataset.books'
        replace_nulls_with_empty = False

        @custom_field
        def genre(self, instance):
            """Custom field to transform the genre into a structured format"""
            return {
                'code': instance.genre,
                'name': instance.get_genre_display()
            }

Then, export the data::

    exporter = BookExporter()
    exporter.export()

Available Properties
--------------------

:model: Django model to export (required), default: ``None``
:fields: List of field names to export (required), default: ``[]``
:batch: Number of records to process in each batch, default: ``1000``
:table_name: Full BigQuery table name (required), default: ``''``
:replace_nulls_with_empty: Whether to replace ``None`` values with empty strings, default: ``False``
:include_pull_date: Whether to include pull date in the export, default: ``False``
:pull_date_field_name: Name of the field to store the export timestamp, default: ``'pull_date'``

Available Methods
-----------------

define_queryset()
~~~~~~~~~~~~~~~~~

Define the queryset to export. Override this method to filter or order your data::

    def define_queryset(self):
        # Only export books published in the last year
        one_year_ago = datetime.date.today() - datetime.timedelta(days=365)
        return self.model.objects.filter(publication_date__gte=one_year_ago).order_by('id')

.. important::
   When using batching (batch ≠ None), you **MUST** define proper ordering in your queryset
   to ensure consistent pagination results.

export(pull_date=None, queryset=None)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Export data to BigQuery.

* ``pull_date``: Optional timestamp to record when the data was exported (only included if ``include_pull_date=True``)
* ``queryset``: Optional queryset to override the default. Useful for backfilling specific data.

::

    # Standard export
    exporter = BookExporter()
    errors = exporter.export()

    # Export with specified pull_date
    from datetime import datetime
    exporter.export(pull_date=datetime.now())

    # Backfilling specific data
    historical_queryset = Book.objects.filter(
        publication_date__year=2020
    ).order_by('id')
    exporter.export(queryset=historical_queryset)

    if errors:
        print(f"Encountered {len(errors)} errors during export")

table_has_data(pull_date=None)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check if the BigQuery table has data. When both ``pull_date`` is provided AND ``include_pull_date`` is True, it checks for data with that specific pull date. Otherwise, it just checks if the table has any data at all.

::

    exporter = BookExporter()

    # Check with explicit pull date (only works if include_pull_date=True)
    pull_date = datetime.datetime.now()
    if not exporter.table_has_data(pull_date):
        exporter.export(pull_date=pull_date)
    else:
        print("Data already exported for today")

    # Check for any data
    if not exporter.table_has_data():
        exporter.export()
    else:
        print("Table already has data")

Dependency Injection
--------------------

Django BigQuery Exporter supports injection of the BigQuery client for better testability and flexibility::

    # Injecting a custom BigQuery client
    from google.cloud import bigquery
    custom_client = bigquery.Client(project='my-project')

    exporter = BookExporter(
        client=custom_client
    )

Custom Fields
--------------

Use the ``@custom_field`` decorator to create methods that transform data during export::

    @custom_field
    def full_name(self, instance):
        return f"{instance.first_name} {instance.last_name}"

    @custom_field
    def category_details(self, instance):
        # Return complex nested data
        return {
            'id': instance.category_id,
            'name': instance.category.name,
            'parent': instance.category.parent.name if instance.category.parent else None
        }

Complete Example
----------------

Here's a complete example with a Book model::

    import datetime
    from bigquery_exporter.base import BigQueryExporter, custom_field
    from myapp.models import Book

    class BookExporter(BigQueryExporter):
        model = Book
        batch = 1000
        table_name = 'my_project.bookstore.books'
        fields = [
            'id', 'title', 'author', 'publication_date', 'is_bestseller',
            'genre', 'page_count', 'created_at', 'updated_at', 'rating'
        ]
        # Pull date configuration
        include_pull_date = True             # Include pull date in the export
        pull_date_field_name = 'export_date' # Custom field name

        def define_queryset(self):
            # Only export books updated in the last 30 days
            thirty_days_ago = datetime.date.today() - datetime.timedelta(days=30)
            return Book.objects.filter(updated_at__gte=thirty_days_ago).order_by('id')

        @custom_field
        def genre(self, instance):
            """Return both the code and display name for the genre"""
            GENRES = {
                'SFF': 'Science Fiction & Fantasy',
                'MYS': 'Mystery',
                'ROM': 'Romance',
                # ... other genres
            }
            return {
                'code': instance.genre,
                'name': GENRES.get(instance.genre, 'Unknown')
            }

        @custom_field
        def rating(self, instance):
            """Calculate and return the average rating"""
            avg_rating = instance.reviews.aggregate(avg=Avg('rating'))['avg'] or 0
            return round(avg_rating, 1)

    # In a task or management command
    def export_books_to_bigquery():
        pull_date = datetime.datetime.now()

        exporter = BookExporter(
            project='my-gcp-project',
            credentials='/path/to/credentials.json'
        )

        # Check if data already exists for today
        if exporter.table_has_data(pull_date) and not force_export:
            print(f"Data already exists for {pull_date.date()}, skipping export")
            return

        # Perform the export
        errors = exporter.export(pull_date=pull_date)

        if errors:
            print(f"Export completed with {len(errors)} errors")
        else:
            print(f"Successfully exported books to BigQuery")

Error Handling
--------------

The ``export()`` method returns a list of error objects for any failed row insertions. Each error includes:

* The row index
* The error message
* The affected data

You can use this information to log errors or retry specific records.

Best Practices
--------------

1. **ALWAYS** define an ordering in ``define_queryset()`` when using batching - this is critical for consistent results
2. Set appropriate batch sizes based on your model's complexity
3. Use custom fields to preprocess data before export
4. Implement idempotency checks with ``table_has_data()``
5. Use the ``queryset`` parameter for backfilling historical data rather than modifying your exporter class
6. Consider using dependency injection for the BigQuery client for better testability
7. Catch and handle ``GoogleAPICallError`` and ``BigQueryExporterError`` exceptions

License
-------

This project is licensed under the MIT License - see the LICENSE file for details.
