Metadata-Version: 2.1
Name: vinum
Version: 0.2.0
Summary: Vinum is a SQL processor written in pure Python, designed for data analysis workflows and in-memory analytics. 
Home-page: https://github.com/dmitrykoval/vinum
Author: Dmitry Koval
Author-email: dima@koval.space
License: UNKNOWN
Project-URL: Bug Tracker, https://github.com/dmitrykoval/vinum/issues
Project-URL: Documentation, https://vinum.readthedocs.io/en/latest/index.html
Project-URL: Source Code, https://github.com/dmitrykoval/vinum
Keywords: sql,python,numpy,data analysis,olap
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: SQL
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.7
Description-Content-Type: text/x-rst
Requires-Dist: pyarrow (>=3.0.0)
Requires-Dist: numpy (>=1.19.0)
Requires-Dist: moz-sql-parser (==3.32.20026)

*****
Vinum
*****

|PyPi|_ |CI|_ |Grade_Python|_ |Codecov|_


.. |PyPi| image:: https://img.shields.io/pypi/v/vinum.svg
.. _PyPi: https://pypi.org/project/vinum/

.. |CI| image:: https://github.com/dmitrykoval/vinum/actions/workflows/ci.yml/badge.svg
.. _CI: https://github.com/dmitrykoval/vinum/actions/workflows/ci.yml

.. |Grade_Python| image:: https://img.shields.io/lgtm/grade/python/g/dmitrykoval/vinum.svg?logo=lgtm&logoWidth=18
.. _Grade_Python: https://lgtm.com/projects/g/dmitrykoval/vinum/context:python

.. |Codecov| image:: https://codecov.io/gh/dmitrykoval/vinum/branch/main/graphs/badge.svg?branch=main&service=github
.. _Codecov: https://codecov.io/gh/dmitrykoval/vinum?branch=main



**Vinum** is a SQL processor for Python, designed for
data analysis workflows and in-memory analytics. 
Conceptually, Vinum's design goal is to provide deeper integration of 
Python data analysis tools such as `Numpy <https://numpy.org/>`_,
`Pandas <https://pandas.pydata.org/>`_ or in general any Python code with
the SQL language. Key features include native support of
vectorized Numpy and pure Python functions as UDFs in SQL queries.


Key Features:
=============

* Natively supports vectorized Numpy and Python functions inside of
  SELECT, WHERE, GROUP BY, HAVING and ORDER BY clauses.
  All the numpy functions are available by default via the 'np.*' namespace.

* Written in pure Python and built from ground up on top of
  `Apache Arrow <https://arrow.apache.org/>`_ and
  `Numpy <https://numpy.org/>`_.

* Apache Arrow provides the foundation for "moving" data and enables minimal
  overhead for transferring data to and from Numpy and Pandas.

* Designed for in-memory analytics workflows, based on columnar memory layout.


Design
======
Vinum's query planner compiles SQL SELECT query into a tree of
vectorized Arrow and Numpy based operators and therefore integration
with Numpy, Arrow or native Python functions comes naturally.
In Vinum, all Numpy functions are first class citizens and can be used inside
of SELECT, WHERE, GROUP BY, HAVING and ORDER BY clauses.

Execution model is a vectorized version of the class "Volcano" style
iterator model, which enables intra-operator parallelism.

Below is an example of a possible simplified query plan.

.. image:: https://github.com/dmitrykoval/vinum/raw/main/doc/source/_static/dag_ex.png


Install
=======

``pip install vinum``


Examples
========

Query python dict
-----------------

Create a Table from a python dict and return result of the query
as a Pandas DataFrame.

    >>> import vinum as vn
    >>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']}
    >>> tbl = vn.Table.from_pydict(data)
    >>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'")
       value    np.log
    0  300.1  5.704116
    1  880.0  6.779922


Query pandas dataframe
----------------------

    >>> import pandas as pd
    >>> import vinum as vn
    >>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
    >>> pdf = pd.DataFrame(data=data)
    >>> tbl = vn.Table.from_pandas(pdf)
    >>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC')
       col1  col2
    0     3    17
    1     2    13


Run query on streamed csv
-------------------------
when dataset doesn't fit into memory - use stream_csv...


Query csv
---------
    >>> import vinum as vn
    >>> tbl = vn.read_csv('test.csv')
    >>> res_tbl = tbl.sql('SELECT * FROM t WHERE fare > 5 LIMIT 3')
    >>> res_tbl.to_pandas()
       id                            ts        lat        lng  fare
    0   1   2010-01-05 16:52:16.0000002  40.711303 -74.016048  16.9
    1   2  2011-08-18 00:35:00.00000049  40.761270 -73.982738   5.7
    2   3   2012-04-21 04:30:42.0000001  40.733143 -73.987130   7.7


Compute Euclidean distance with numpy functions
-----------------------------------------------

Use any numpy functions via the 'np.*' namespace.

    >>> import vinum as vn
    >>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
    >>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist '
    ...            'FROM t ORDER BY dist DESC')
       x   y       dist
    0  3  17  17.262677
    1  2  13  13.152946
    2  1   7   7.071068


Compute Euclidean distance with vectorized UDF
----------------------------------------------

Register UDF performing vectorized operations on Numpy arrays.

    >>> import vinum as vn
    >>> vn.register_numpy('distance',
    ...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
    >>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
    >>> tbl.sql_pd('SELECT *, distance(x, y) AS dist '
    ...            'FROM t ORDER BY dist DESC')
       x   y       dist
    0  3  17  17.262677
    1  2  13  13.152946
    2  1   7   7.071068


Compute Euclidean distance with python UDF
------------------------------------------

Register Python lambda function as UDF.

    >>> import math
    >>> import vinum as vn
    >>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
    >>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
    >>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t')
       x   y       dist
    0  1   7   7.071068
    1  2  13  13.152946
    2  3  17  17.262677


Group by z-score
----------------

    >>> import numpy as np
    >>> import vinum as vn
    >>> def z_score(x: np.ndarray):
    ...     """Compute Standard Score"""
    ...     mean = np.mean(x)
    ...     std = np.std(x)
    ...     return (x - mean) / std
    ...
    >>> vn.register_numpy('score', z_score)
    >>> tbl = vn.read_csv('test.csv')
    >>> tbl.sql_pd('select int(score(fare)) AS bucket, avg(fare), count(*) '
    ...            'FROM t GROUP BY bucket ORDER BY bucket')
       bucket        avg  count
    0       0   8.111630     92
    1       1  19.380000      3
    2       2  27.433333      3
    3       3  34.670000      1
    4       6  58.000000      1



Documentation
=============
* `Vinum documentation <https://vinum.readthedocs.io/en/latest/>`_
* `Getting started <https://vinum.readthedocs.io/en/latest/getting_started.html>`_


What Vinum is not
=================
Vinum is not a Database Management System, there are no plans to support
data mutation, INSERT or UPDATE statements, as well as transactions.

Dependencies
============
* `Pyarrow <https://arrow.apache.org/docs/python/>`_
* `NumPy <https://numpy.org/>`_
* `Moz SQL Parser <https://github.com/mozilla/moz-sql-parser>`_

Future plans
============
* Performance improvements.
* Support sub-queries and JOINs.
* Parallel execution (intra-operator parallelism).


