.. module:: petl
.. moduleauthor:: Alistair Miles <alimanfoo@googlemail.com>

petl - Extract, Transform and Load (Tables of Data)
===================================================

:mod:`petl` is a Python package for extracting, transforming and
loading tables of data.

- Documentation: http://petl.readthedocs.org/
- Source Code: https://github.com/alimanfoo/petl
- Download: http://pypi.python.org/pypi/petl
- Mailing List: http://groups.google.com/group/python-etl 

For examples of :mod:`petl` in use, see the case studies below:

* `Comparing Tables <case_study_1.html>`_

Overview
--------

The tables below gives an overview of the main functions in the
:mod:`petl` module.

See also the alphabetic :ref:`genindex` of all functions in the package.

Introduction
------------

Installation
~~~~~~~~~~~~

This module is available from the `Python Package Index
<http://pypi.python.org/pypi/petl>`_. On Linux distributions you
should be able to do ``easy_install petl`` or ``pip install petl``. On
other platforms you can download manually, extract and run ``python
setup.py install``.


Dependencies and extensions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

This package has been written with no dependencies other than the
Python core modules, for ease of installation and
maintenance. However, there are many third party packages which could
usefuly be used with :mod:`petl`, e.g., providing access to data from
Excel or other file types. Some extensions with these additional
dependencies are provided by the `petlx
<http://pypi.python.org/pypi/petlx>`_ package, a companion package to
:mod:`petl`.


Conventions - row containers and row iterators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This package defines the following convention for objects acting as
containers of tabular data and supporting row-oriented iteration over
the data.

A *row container* (also referred to here informally as a *table*) is
any object which satisfies the following:

1. implements the `__iter__` method

2. `__iter__` returns a *row iterator* (see below)

3. all row iterators returned by `__iter__` are independent, i.e., consuming items from one iterator will not affect any other iterators

A *row iterator* is an iterator which satisfies the following:

4. each item returned by the iterator is either a list or a tuple

5. the first item returned by the iterator is a *header row* comprising a list or tuple of *fields*

6. each subsequent item returned by the iterator is a *data row* comprising a list or tuple of *data values*

7. a *field* is typically a string (`str` or `unicode`) but may be an object of any type as long as it implements `__str__`, is immutable (can be used as a dictionary key) and is pickleable

8. a *data value* is any pickleable object

So, for example, the list of lists shown below is a row container::

>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]

Note that, under this convention, an object returned by the
:func:`csv.reader` function from the standard Python :mod:`csv` module
is a row iterator and *not* a row container, because it can only be
iterated over once, e.g.::

 >>> from StringIO import StringIO
 >>> import csv
 >>> csvdata = """foo,bar
 ... a,1
 ... b,2
 ... """
 >>> rowiterator = csv.reader(StringIO(csvdata))
 >>> for row in rowiterator:
 ...     print row
 ... 
 ['foo', 'bar']
 ['a', '1']
 ['b', '2']
 >>> for row in rowiterator:
 ...     print row
 ... 
 >>> # can only iterate once

However, it is straightforward to define functions that support the
above convention for row containers and provide access to data from
CSV or other types of file or data source, see e.g. the
:func:`fromcsv` function in this package.

The main reason for requiring that row containers support independent
row iterators (point 3) is that data from a table may need to be
iterated over several times within the same program or interactive
session. E.g., when using `petl` in an interactive session to build up
a sequence of data transformation steps, the user might want to
examine outputs from several intermediate steps, before all of the
steps are defined and the transformation is executed in full.

Note that this convention does not place any restrictions on the
lengths of header and data rows. A table may return a header row
and/or data rows of varying lengths. 


Transformation pipelines
~~~~~~~~~~~~~~~~~~~~~~~~

Note that this package makes extensive use of lazy evaluation and
iterators. This means, generally, that a transformation will not
actually be executed until data is requested.

E.g., given the following data in a file at 'example1.csv' in the
current working directory::

	foo,bar,baz
	a,1,3.4
	b,2,7.4
	c,6,2.2
	d,9,8.1

...the following code does not actually read the file, nor does it
load any of its contents into memory::

	>>> from petl import *
	>>> table1 = fromcsv('example1.csv')

Rather, `table1` is a row container object, which can be iterated over. 

Similarly, if one or more transformation functions are applied, e.g.:::

	>>> table2 = convert(table1, 'foo', 'upper')
	>>> table3 = convert(table2, 'bar', int)
	>>> table4 = convert(table3, 'baz', float)
	>>> table5 = addfield(table4, 'quux', expr('{bar} * {baz}'))

...no actual transformation work will be done, until data are
requested from `table5` or any of the other row containers returned by
the intermediate steps. 

So in effect, a 5 step transformation pipeline has been set up, and
rows will pass through the pipeline on demand, as they are pulled from
the end of the pipeline via iteration.

A call to a function like :func:`look`, or any of the functions which
write data to a file or database (e.g., :func:`tocsv`, :func:`totext`,
:func:`tosqlite3`, :func:`todb`), will pull data through the pipeline
and cause all of the transformation steps to be executed on the
requested rows, e.g.::

	>>> look(table5)
	+-------+-------+-------+--------------------+
	| 'foo' | 'bar' | 'baz' | 'quux'             |
	+=======+=======+=======+====================+
	| 'A'   | 1     | 3.4   | 3.4                |
	+-------+-------+-------+--------------------+
	| 'B'   | 2     | 7.4   | 14.8               |
	+-------+-------+-------+--------------------+
	| 'C'   | 6     | 2.2   | 13.200000000000001 |
	+-------+-------+-------+--------------------+
	| 'D'   | 9     | 8.1   | 72.89999999999999  |
	+-------+-------+-------+--------------------+

...although note that :func:`look` will by default only request the
first 10 rows, and so at most only 10 rows will be processed. Calling
:func:`look` to inspect the first few rows of a table is often an
efficient way to examine the output of a transformation pipeline,
without having to execute the transformation over all of the input
data.


Caching
~~~~~~~

This package tries to make efficient use of memory by using iterators
and lazy evaluation where possible. However, some transformations
cannot be done without building data structures, either in memory or
on disk.

An example is the :func:`sort` function, which will either sort a
table entirely in memory, or will sort the table in memory in chunks,
writing chunks to disk and performing a final merge sort on the
chunks. (Which strategy is used will depend on the arguments passed
into the :func:`sort` function when it is called.)

In either case, the sorting can take some time, and if the sorted data
will be used more than once, it is obviously undesirable to throw away
the sorted data and start again from scratch each time. It is better
to cache the sorted data, if possible, so it can be re-used.

The :func:`sort` function and all functions which use :func:`sort`
internally provide a `cache` keyword argument, which can be used to
turn on or off the caching of sorted data.

There is also a :func:`cache` function, which can be used to cache in
memory up to a configurable number of rows from a table.

.. versionchanged:: 0.16

Use of the cachetag() method is now deprecated. 


Extract - reading tables from files, databases and other sources
----------------------------------------------------------------

The following functions extract a table from a file-like source or database. For everything except :func:`fromdb` the
``source`` argument provides information about where to read the underlying data from. If the ``source`` argument
is ``None`` or a string it is interpreted as follows:

* ``None`` - read from stdin
* string starting with 'http://', 'https://' or 'ftp://' - read from URL
* string ending with '.gz' or '.bgz' - read from file via gzip decompression
* string ending with '.bz2' - read from file via bz2 decompression
* any other string - read directly from file

Some helper classes are also available for reading from other types of file-like sources, e.g., reading data from a Zip
file, a string or a subprocess, see the section on I/O helper classes below for more information.

.. autofunction:: petl.fromcsv
.. autofunction:: petl.fromtsv
.. autofunction:: petl.fromucsv
.. autofunction:: petl.fromutsv
.. autofunction:: petl.frompickle
.. autofunction:: petl.fromsqlite3
.. autofunction:: petl.fromdb
.. autofunction:: petl.fromtext
.. autofunction:: petl.fromutext
.. autofunction:: petl.fromxml
.. autofunction:: petl.fromjson
.. autofunction:: petl.fromdicts


Transform - transforming tables
-------------------------------

.. autofunction:: petl.rename
.. autofunction:: petl.setheader
.. autofunction:: petl.extendheader
.. autofunction:: petl.pushheader
.. autofunction:: petl.skip
.. autofunction:: petl.skipcomments
.. autofunction:: petl.rowslice
.. autofunction:: petl.head
.. autofunction:: petl.tail
.. autofunction:: petl.cut
.. autofunction:: petl.cutout
.. autofunction:: petl.select
.. autofunction:: petl.selectop
.. autofunction:: petl.selecteq
.. autofunction:: petl.selectne
.. autofunction:: petl.selectlt
.. autofunction:: petl.selectle
.. autofunction:: petl.selectgt
.. autofunction:: petl.selectge
.. autofunction:: petl.selectrangeopen
.. autofunction:: petl.selectrangeopenleft
.. autofunction:: petl.selectrangeopenright
.. autofunction:: petl.selectrangeclosed
.. autofunction:: petl.selectcontains
.. autofunction:: petl.selectin
.. autofunction:: petl.selectnotin
.. autofunction:: petl.selectis
.. autofunction:: petl.selectisnot
.. autofunction:: petl.selectisinstance
.. autofunction:: petl.selectre
.. autofunction:: petl.rowselect
.. autofunction:: petl.recordselect
.. autofunction:: petl.rowlenselect
.. autofunction:: petl.fieldselect
.. autofunction:: petl.facet
.. autofunction:: petl.rangefacet
.. autofunction:: petl.replace
.. autofunction:: petl.replaceall
.. autofunction:: petl.convert
.. autofunction:: petl.convertall
.. autofunction:: petl.fieldconvert
.. autofunction:: petl.convertnumbers
.. autofunction:: petl.search
.. autofunction:: petl.sub
.. autofunction:: petl.split
.. autofunction:: petl.capture
.. autofunction:: petl.unpack
.. autofunction:: petl.unpackdict
.. autofunction:: petl.fieldmap
.. autofunction:: petl.rowmap
.. autofunction:: petl.recordmap
.. autofunction:: petl.rowmapmany
.. autofunction:: petl.recordmapmany
.. autofunction:: petl.cat
.. autofunction:: petl.duplicates
.. autofunction:: petl.unique
.. autofunction:: petl.conflicts
.. autofunction:: petl.sort
.. autofunction:: petl.join
.. autofunction:: petl.leftjoin
.. autofunction:: petl.lookupjoin
.. autofunction:: petl.rightjoin
.. autofunction:: petl.outerjoin
.. autofunction:: petl.crossjoin
.. autofunction:: petl.antijoin
.. autofunction:: petl.complement
.. autofunction:: petl.diff
.. autofunction:: petl.recordcomplement
.. autofunction:: petl.recorddiff
.. autofunction:: petl.intersection
.. autofunction:: petl.aggregate
.. autofunction:: petl.rangeaggregate
.. autofunction:: petl.rangecounts
.. autofunction:: petl.rowreduce
.. autofunction:: petl.recordreduce
.. autofunction:: petl.rangerowreduce
.. autofunction:: petl.rangerecordreduce
.. autofunction:: petl.mergeduplicates
.. autofunction:: petl.mergesort
.. autofunction:: petl.merge
.. autofunction:: petl.melt
.. autofunction:: petl.recast
.. autofunction:: petl.transpose
.. autofunction:: petl.pivot
.. autofunction:: petl.hashjoin
.. autofunction:: petl.hashleftjoin
.. autofunction:: petl.hashlookupjoin
.. autofunction:: petl.hashrightjoin
.. autofunction:: petl.hashantijoin
.. autofunction:: petl.hashcomplement
.. autofunction:: petl.hashintersection
.. autofunction:: petl.flatten
.. autofunction:: petl.unflatten
.. autofunction:: petl.annex
.. autofunction:: petl.fold
.. autofunction:: petl.addrownumbers
.. autofunction:: petl.addcolumn
.. autofunction:: petl.addfield
.. autofunction:: petl.filldown
.. autofunction:: petl.fillright
.. autofunction:: petl.fillleft
.. autofunction:: petl.multirangeaggregate
.. autofunction:: petl.unjoin
.. autofunction:: petl.distinct
.. autofunction:: petl.rowgroupmap
.. autofunction:: petl.groupcountdistinctvalues
.. autofunction:: petl.groupselectfirst
.. autofunction:: petl.groupselectmin
.. autofunction:: petl.groupselectmax


Load - writing tables to files and databases
--------------------------------------------

The following functions write data from a table to a file-like source or database. For functions that accept a ``source``
argument, if the ``source`` argument is ``None`` or a string it is interpreted as follows:

* ``None`` - write to stdout
* string ending with '.gz' or '.bgz' - write to file via gzip decompression
* string ending with '.bz2' - write to file via bz2 decompression
* any other string - write directly to file

Some helper classes are also available for writing to other types of file-like sources, e.g., writing to a Zip
file or string buffer, see the section on I/O helper classes below for more information.

.. autofunction:: petl.tocsv
.. autofunction:: petl.appendcsv
.. autofunction:: petl.totsv
.. autofunction:: petl.appendtsv
.. autofunction:: petl.toucsv
.. autofunction:: petl.appenducsv
.. autofunction:: petl.toutsv
.. autofunction:: petl.appendutsv
.. autofunction:: petl.topickle
.. autofunction:: petl.appendpickle
.. autofunction:: petl.tosqlite3
.. autofunction:: petl.appendsqlite3
.. autofunction:: petl.todb
.. autofunction:: petl.appenddb
.. autofunction:: petl.totext
.. autofunction:: petl.appendtext
.. autofunction:: petl.toutext
.. autofunction:: petl.appendutext
.. autofunction:: petl.tojson
.. autofunction:: petl.tojsonarrays


Utility functions
-----------------

.. autofunction:: petl.header
.. autofunction:: petl.data
.. autofunction:: petl.iterdata
.. autofunction:: petl.dataslice
.. autofunction:: petl.fieldnames
.. autofunction:: petl.nrows
.. autofunction:: petl.look
.. autofunction:: petl.lookall
.. autofunction:: petl.see
.. autofunction:: petl.values
.. autofunction:: petl.itervalues
.. autofunction:: petl.valueset
.. autofunction:: petl.valuecount
.. autofunction:: petl.valuecounts
.. autofunction:: petl.valuecounter
.. autofunction:: petl.dicts
.. autofunction:: petl.iterdicts
.. autofunction:: petl.namedtuples
.. autofunction:: petl.iternamedtuples
.. autofunction:: petl.records
.. autofunction:: petl.iterrecords
.. autofunction:: petl.columns
.. autofunction:: petl.facetcolumns
.. autofunction:: petl.isunique
.. autofunction:: petl.isordered
.. autofunction:: petl.limits
.. autofunction:: petl.stats
.. autofunction:: petl.lenstats
.. autofunction:: petl.stringpatterns
.. autofunction:: petl.stringpatterncounter
.. autofunction:: petl.rowlengths
.. autofunction:: petl.typecounts
.. autofunction:: petl.typecounter
.. autofunction:: petl.typeset
.. autofunction:: petl.parsecounts
.. autofunction:: petl.parsecounter
.. autofunction:: petl.dateparser
.. autofunction:: petl.timeparser
.. autofunction:: petl.datetimeparser
.. autofunction:: petl.boolparser
.. autofunction:: petl.parsenumber
.. autofunction:: petl.lookup
.. autofunction:: petl.lookupone
.. autofunction:: petl.dictlookup
.. autofunction:: petl.dictlookupone
.. autofunction:: petl.expr
.. autofunction:: petl.strjoin
.. autofunction:: petl.randomtable
.. autofunction:: petl.dummytable
.. autofunction:: petl.diffheaders
.. autofunction:: petl.diffvalues
.. autofunction:: petl.heapqmergesorted
.. autofunction:: petl.shortlistmergesorted
.. autofunction:: petl.progress
.. autofunction:: petl.clock
.. autofunction:: petl.rowgroupby
.. autofunction:: petl.nthword
.. autofunction:: petl.cache


I/O helper classes
------------------

The following classes are helpers for extract (``from...()``) and load (``to...()``) functions that use a file-like
data source. An instance of any
of the following classes can be used as the ``source`` argument to data extraction functions like :func:`fromcsv` etc.,
with the exception of :class:`StdoutSource` which is write-only. An instance of any of the following classes can also be
used as the ``source`` argument to data loading functions like :func:`tocsv` etc., with the exception of
:class:`StdinSource`, :class:`URLSource` and :class:`PopenSource` which are read-only. The behaviour of each source can
usually be configured by passing arguments to the constructor, see the source code of the :mod:`petl.io` module for
full details.

.. autoclass:: petl.FileSource
.. autoclass:: petl.GzipSource
.. autoclass:: petl.BZ2Source
.. autoclass:: petl.ZipSource
.. autoclass:: petl.StdinSource
.. autoclass:: petl.StdoutSource
.. autoclass:: petl.URLSource
.. autoclass:: petl.StringSource
.. autoclass:: petl.PopenSource


Further Reading
---------------

.. toctree::
   :maxdepth: 2

   fluent
   interactive
   push
   case_study_1
   related_work


Indices and tables
------------------

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
