Metadata-Version: 2.1
Name: pod5
Version: 0.0.43
Summary: Oxford Nanopore Technologies Pod5 File Format Python API and Tools
Author-email: support@nanoporetech.com
Keywords: nanopore
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: ~=3.7
Description-Content-Type: text/markdown
Requires-Dist: iso8601
Requires-Dist: jsonschema
Requires-Dist: lib-pod5 (~=0.0.43)
Requires-Dist: more-itertools
Requires-Dist: numpy (>=1.20.0)
Requires-Dist: pyarrow (~=8.0.0)
Requires-Dist: pytz
Requires-Dist: packaging
Requires-Dist: pandas
Requires-Dist: vbz-h5py-plugin
Provides-Extra: dev
Requires-Dist: black (==22.3.0) ; extra == 'dev'
Requires-Dist: mypy ; extra == 'dev'
Requires-Dist: myst-parser ; extra == 'dev'
Requires-Dist: pre-commit ; extra == 'dev'
Requires-Dist: pytest (~=6.2) ; extra == 'dev'
Requires-Dist: pytest-cov (~=3.0) ; extra == 'dev'
Requires-Dist: sphinx (==v5.1.1) ; extra == 'dev'
Requires-Dist: sphinx-rtd-theme ; extra == 'dev'

POD5 Python Package
===================

The `pod5` Python package contains the tools and Python API that wrap the compiled bindings
for the POD5 file format provided by `lib_pod5`.

POD5 Format Tools
-----------------

The `pod5` package provides the following tools for inspecting and manipulating
`.pod5` files, as well as for converting between the `.pod5` and `.fast5` file formats.

1. [pod5 update](#pod5%20update)
2. [pod5 inspect](#pod5%20inspect)
3. [pod5 subset](#pod5%20subset)
4. [pod5 repack](#pod5%20repack)
5. [pod5 merge](#pod5%20merge)
6. [pod5 convert fast5](#pod5%20convert%20fast5)
7. [pod5 convert to_fast5](#pod5%20convert%20to_fast5)

pod5 update
-----------

The `pod5 update` tool can be used to update a file in an older pod5 format to the latest available format.

```bash
# View help on pod5 update tools
> pod5 update --help
> pod5 update my-old-pod5-file.pod5 ./migrated_files/
```

pod5 inspect
------------

The `pod5 inspect` tool can be used to extract details and summaries of the contents of `.pod5` files. It provides three subcommands: [`reads`](#pod5%20inspect%20reads), [`read`](#pod5%20inspect%20read), and [`summary`](#pod5%20inspect%20summary).

```bash
# View help on pod5 inspect tools
> pod5 inspect --help
> pod5 inspect {reads, read, summary} --help
```

### pod5 inspect reads

Inspect all reads and print a CSV table with the details of every read in the given `.pod5` files.

```bash
> pod5 inspect reads pod5_file.pod5

# Sample Output:
read_id,channel,well,pore_type,read_number,start_sample,end_reason,median_before,calibration_offset,calibration_scale,sample_count,byte_count,signal_compression_ratio
00445e58-3c58-4050-bacf-3411bb716cc3,908,1,not_set,100776,374223800,signal_positive,205.3,-240.0,0.1,65582,58623,0.447
00520473-4d3d-486b-86b5-f031c59f6591,220,1,not_set,7936,16135986,signal_positive,192.0,-233.0,0.1,167769,146495,0.437
...
```

### pod5 inspect read

Inspect the pod5 file, find a specific read and print its details.

```bash
> pod5 inspect read pod5_file.pod5 00445e58-3c58-4050-bacf-3411bb716cc3

# Sample Output:
File: out-tmp/output.pod5
read_id: 0e5d6827-45f6-462c-9f6b-21540eef4426
read_number:    129227
start_sample:   367096601
median_before:  171.889404296875
channel data:
  channel: 2366
  well: 1
  pore_type: not_set
end reason:
  name: signal_positive
  forced False
calibration:
  offset: -243.0
  scale: 0.1462070643901825
samples:
  sample_count: 81040
  byte_count: 71989
  compression ratio: 0.444
run info
    acquisition_id: 2ca00715f2e6d8455e5174cd20daa4c38f95fae2
    acquisition_start_time: 2021-07-23 13:48:59.780000
    adc_max: 0
    adc_min: 0
    context_tags
      barcoding_enabled: 0
      basecall_config_filename: dna_r10.3_450bps_hac_prom.cfg
      experiment_duration_set: 2880
...
```

### pod5 inspect summary

Inspect the pod5 file and print summary information on the reads in each batch.

pod5 subset
----------

`pod5 subset` is a tool for separating the reads in `.pod5` files into one or more
output files. This tool can be used to create new `.pod5` files which contain a
user-defined subset of reads from the input.

The `pod5 subset` tool requires a mapping that defines which read_ids should be
written to which output. This mapping can be supplied in a `.csv` or `.json` file,
or generated from a tab-separated table (e.g. a basecaller sequencing summary)
together with instructions on how to interpret it.

```bash
# View help
> pod5 subset --help

# Subset input(s) using a pre-defined mapping
> pod5 subset example_1.pod5 --csv mapping.csv
> pod5 subset examples_*.pod5 --json mapping.json

# Subset input(s) using a dynamic mapping created at runtime
> pod5 subset example_1.pod5 --summary summary.txt --columns barcode alignment_genome
```

### Important note on read_id clashes

When providing multiple input `.pod5` files to `pod5 subset`, take care that there are
no read_id UUID clashes between them. If a clash occurs, both reads are written to the output.

### Creating a Subset mapping

The `.csv` or `.json` inputs should define a mapping of destination filename to an array
of read_ids which will be written to the destination.

In the example `.csv` subset mapping below, note that the same output filename can appear on multiple lines. Splitting a long list of read_ids across lines in this way avoids excessively long lines.

```text
# --csv mapping filename to array of read_id
output_1.pod5, 132b582c-56e8-4d46-9e3d-48a275646d3a, 12a4d6b1-da6e-4136-8bb3-1470ef27e311, ...
output_2.pod5, 0ff4dc01-5fa4-4260-b54e-1d8716c7f225
output_2.pod5, 0e359c40-296d-4edc-8f4a-cca135310ab2, 0e9aa0f8-99ad-40b3-828a-45adbb4fd30c
```
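As a minimal sketch of how such a mapping can be read, the following standard-library Python collects the read_ids for each output filename and merges rows that repeat a filename. The function name `parse_subset_csv` is hypothetical, and this is an illustration of the format, not the parser that `pod5 subset` uses. The elided read_ids (`...`) from the example above are simply left out.

```python
import csv
import io
from collections import defaultdict


def parse_subset_csv(text):
    """Collect read_ids per output filename, merging repeated filenames.

    Hypothetical helper for illustration only; not part of the pod5 API.
    """
    mapping = defaultdict(list)
    # skipinitialspace drops the space that follows each comma in the example
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        filename, *read_ids = (cell.strip() for cell in row)
        mapping[filename].extend(rid for rid in read_ids if rid)
    return dict(mapping)


example = """\
output_1.pod5, 132b582c-56e8-4d46-9e3d-48a275646d3a, 12a4d6b1-da6e-4136-8bb3-1470ef27e311
output_2.pod5, 0ff4dc01-5fa4-4260-b54e-1d8716c7f225
output_2.pod5, 0e359c40-296d-4edc-8f4a-cca135310ab2, 0e9aa0f8-99ad-40b3-828a-45adbb4fd30c
"""

subset_mapping = parse_subset_csv(example)
for filename, read_ids in subset_mapping.items():
    print(filename, len(read_ids))
```

Both `output_2.pod5` rows end up in a single list of three read_ids, which is the behaviour the multi-line form is meant to express.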

Below is an example of a `.json` subset mapping. The file must be valid `json`, and
each key must be a unique output filename mapped to an array of read_id strings.

```json
{
    "output_1.pod5": [
        "0000173c-bf67-44e7-9a9c-1ad0bc728e74",
        "006d1319-2877-4b34-85df-34de7250a47b"
    ],
    "output_2.pod5": [
        "00925f34-6baf-47fc-b40c-22591e27fb5c",
        "009dc9bd-c5f4-487b-ba4c-b9ce7e3a711e"
    ]
}
```
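One way to produce such a file is to build the mapping as a Python dictionary and serialise it with the standard `json` module. The sketch below checks the shape described above (string keys, lists of string read_ids) before serialising. The helper name `build_subset_json` is hypothetical and is not part of the `pod5` package.

```python
import json


def build_subset_json(mapping):
    """Validate a filename -> [read_id, ...] mapping and return it as JSON text.

    Hypothetical helper for illustration only; not part of the pod5 API.
    """
    for filename, read_ids in mapping.items():
        if not isinstance(filename, str):
            raise TypeError(f"filename must be a string: {filename!r}")
        if not isinstance(read_ids, list) or not all(
            isinstance(rid, str) for rid in read_ids
        ):
            raise TypeError(f"{filename} must map to a list of read_id strings")
    return json.dumps(mapping, indent=4)


json_mapping = {
    "output_1.pod5": [
        "0000173c-bf67-44e7-9a9c-1ad0bc728e74",
        "006d1319-2877-4b34-85df-34de7250a47b",
    ],
    "output_2.pod5": [
        "00925f34-6baf-47fc-b40c-22591e27fb5c",
        "009dc9bd-c5f4-487b-ba4c-b9ce7e3a711e",
    ],
}

json_text = build_subset_json(json_mapping)
print(json_text)
```

Because the mapping is a dictionary, the keys are unique by construction. The returned text can be written to a file and passed to `--json`.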

### Subsetting from a summary

`pod5 subset` can dynamically generate output targets and collect associated reads
based on a tab-separated file (e.g. sequencing summary) which contains a header row
and a series of columns on which to group unique collections of values. Internally
this process uses the [`pandas.DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function where the `by` parameter is the sequence of column names
specified using the `--columns` argument.

The columns specified in `--columns` should be **categorical** in nature.
This is not enforced, but subsetting on a continuous variable may generate
an excessive number of output files.

Given the following example summary file, observe the resultant outputs given various
arguments:

```text
read_id mux barcode     length
read_a  1   barcode_a   4321
read_b  1   barcode_b   1000
read_c  2   barcode_b   1200
read_d  2   barcode_c   1234
```

```bash
> pod5 subset example_1.pod5 --output barcode_subset --summary summary.txt --columns barcode
> ls barcode_subset
barcode-barcode_a.pod5 # Contains: read_a
barcode-barcode_b.pod5 # Contains: read_b, read_c
barcode-barcode_c.pod5 # Contains: read_d

> pod5 subset example_1.pod5 --output mux_subset --summary summary.txt --columns mux
> ls mux_subset
mux-1.pod5 # Contains: read_a, read_b
mux-2.pod5 # Contains: read_c, read_d

> pod5 subset example_1.pod5 --output barcode_mux_subset --summary summary.txt --columns barcode mux
> ls barcode_mux_subset
barcode-barcode_a_mux-1.pod5 # Contains: read_a
barcode-barcode_b_mux-1.pod5 # Contains: read_b
barcode-barcode_b_mux-2.pod5 # Contains: read_c
barcode-barcode_c_mux-2.pod5 # Contains: read_d
```
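The grouping shown above can be reproduced with a short sketch. To stay dependency-free it uses only the standard library as a stand-in for `pandas.DataFrame.groupby`, so it illustrates the grouping logic and is not the tool's implementation. The name `group_reads` is hypothetical, and the summary text is the example table above written with tab separators.

```python
import csv
import io
from collections import defaultdict

# The example summary table from above, tab-separated with a header row
SUMMARY = (
    "read_id\tmux\tbarcode\tlength\n"
    "read_a\t1\tbarcode_a\t4321\n"
    "read_b\t1\tbarcode_b\t1000\n"
    "read_c\t2\tbarcode_b\t1200\n"
    "read_d\t2\tbarcode_c\t1234\n"
)


def group_reads(summary_text, columns):
    """Group read_ids by each unique combination of values in `columns`.

    Hypothetical helper for illustration only; not part of the pod5 API.
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        key = tuple(row[column] for column in columns)
        groups[key].append(row["read_id"])
    return dict(groups)


by_barcode = group_reads(SUMMARY, ["barcode"])
by_barcode_mux = group_reads(SUMMARY, ["barcode", "mux"])

for key, read_ids in by_barcode_mux.items():
    print(key, read_ids)
```

Each key corresponds to one output file. Grouping on a column such as `length` would give one group per read here, which is why categorical columns are recommended.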

The output filename is generated from a template string. The automatically generated
template is the sequential concatenation of `column_name-column_value` pairs, joined by
underscores and followed by the `.pod5` file extension. The user can set their own filename
template using the `--template` argument. This argument accepts a string in the Python
f-string style where the subsetting variables are used for keyword placeholder substitution.
Keywords should be placed within curly braces.

The default templates used in the examples above are:

```bash
> pod5 subset example_1.pod5 --output barcode_subset --summary summary.txt --columns barcode
# default template used = "barcode-{barcode}.pod5"

> pod5 subset example_1.pod5 --output barcode_mux_subset --summary summary.txt --columns barcode mux
# default template used = "barcode-{barcode}_mux-{mux}.pod5"
```

Custom template example:

```bash
> pod5 subset example_1.pod5 --output barcode_subset --summary summary.txt --columns barcode --template "{barcode}.subset.pod5"
> ls barcode_subset
barcode_a.subset.pod5 # Contains: read_a
barcode_b.subset.pod5 # Contains: read_b, read_c
barcode_c.subset.pod5 # Contains: read_d
```
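The template substitution can be sketched in Python with `str.format`. Both helper names below are hypothetical, and the use of `str.format` is an assumption chosen to match the f-string style placeholders described above. It is not confirmed to be how `pod5 subset` renders filenames.

```python
def default_template(columns):
    """Build "col-{col}" pieces joined by underscores, plus the .pod5 suffix.

    Hypothetical helper for illustration only; not part of the pod5 API.
    """
    return "_".join(f"{column}-{{{column}}}" for column in columns) + ".pod5"


def render_filename(template, values):
    """Substitute the grouping values into the template's placeholders."""
    return template.format(**values)


template = default_template(["barcode", "mux"])
print(template)

default_name = render_filename(template, {"barcode": "barcode_b", "mux": "1"})
custom_name = render_filename("{barcode}.subset.pod5", {"barcode": "barcode_a"})
print(default_name)
print(custom_name)
```

A template that omits one of the grouping columns would map several groups to the same filename, so a custom template should normally reference every column passed to `--columns`.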

pod5 repack
-----------

`pod5 repack` repacks each input `.pod5` file into its own output file of the same name.

``` bash
> pod5 repack pod5s/*.pod5 repacked_pods/
```

pod5 merge
-----------

`pod5 merge` will merge multiple `.pod5` files into one output file.

``` bash
> pod5 merge pod5s/*.pod5 merged.pod5
```

pod5 convert fast5
-----------------------

The `pod5 convert fast5` tool takes one or more `.fast5` files and converts them
to one or more `.pod5` files.

**Some content previously stored in fast5 files is not compatible with the pod5 format and will not be converted.**

``` bash
# View help
> pod5 convert fast5 --help

# Convert fast5 files into a monolithic output file
> pod5 convert fast5 ./input/*.fast5 converted.pod5

# Convert fast5 files into a monolithic output in an existing directory
> pod5 convert fast5 ./input/*.fast5 outputs/
> ls outputs/
outputs/output.pod5 # default name

# Convert each fast5 file to its own pod5 output. The output files are written
# into the output directory at paths relative to the path given to the
# --output-one-to-one argument. Note: This path must be a relative parent of all
# input paths.
> ls input/*.fast5
fast5_1.fast5 fast5_2.fast5 ... fast5_N.fast5
> pod5 convert fast5 ./input/*.fast5 output_pod5s --output-one-to-one input/
> ls output_pod5s/
fast5_1.pod5 fast5_2.pod5 ... fast5_N.pod5

# Note the different --output-one-to-one path, which is now the current working directory.
# The new sub-directory output_pod5s/input is created.
> pod5 convert fast5 ./input/*.fast5 output_pod5s --output-one-to-one ./
> ls output_pod5s/
input/fast5_1.pod5 input/fast5_2.pod5 ... input/fast5_N.pod5

```
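The path arithmetic behind `--output-one-to-one` can be sketched with `pathlib`. The function name `one_to_one_output` is hypothetical, and the code only mirrors the behaviour described in the comments above. It is not taken from the tool's source.

```python
from pathlib import PurePosixPath


def one_to_one_output(fast5_path, output_dir, one_to_one_root):
    """Return where a converted file would land for a given one-to-one root.

    Hypothetical helper for illustration only; not part of the pod5 API.
    Raises ValueError if `one_to_one_root` is not a parent of `fast5_path`.
    """
    relative = PurePosixPath(fast5_path).relative_to(PurePosixPath(one_to_one_root))
    return PurePosixPath(output_dir) / relative.with_suffix(".pod5")


# Root is the input directory: outputs sit directly in output_pod5s/
flat = one_to_one_output("./input/fast5_1.fast5", "output_pod5s", "input/")

# Root is the working directory: the input/ sub-directory is preserved
nested = one_to_one_output("./input/fast5_1.fast5", "output_pod5s", "./")

print(flat)
print(nested)
```

The `ValueError` raised by `relative_to` corresponds to the requirement that the one-to-one path is a parent of every input path.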

pod5 convert to_fast5
---------------------

The `pod5 convert to_fast5` tool takes one or more `.pod5` files and converts them
to multiple `.fast5` files. The default behaviour is to write 4000 reads per output file
but this can be controlled with the `--file-read-count` argument.

``` bash
# View help
> pod5 convert to_fast5 --help

# Convert pod5 files to fast5 files with default 4000 reads per file
> pod5 convert to_fast5 example.pod5 pod5_to_fast5
> ls pod5_to_fast5/
output_1.fast5 output_2.fast5 ... output_N.fast5
```
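As a rough guide to how many `.fast5` files to expect, the sketch below divides the read count by the reads-per-file limit and rounds up. It assumes that all reads are pooled and written in consecutive chunks, which this document does not confirm, and the function name `expected_fast5_file_count` is hypothetical.

```python
import math

# Default reads per output file, as stated above
DEFAULT_FILE_READ_COUNT = 4000


def expected_fast5_file_count(total_reads, file_read_count=DEFAULT_FILE_READ_COUNT):
    """Estimate the number of output files, assuming reads are chunked in order.

    Hypothetical helper for illustration only; not part of the pod5 API.
    """
    if file_read_count <= 0:
        raise ValueError("file_read_count must be positive")
    return math.ceil(total_reads / file_read_count)


default_count = expected_fast5_file_count(10_000)
smaller_files_count = expected_fast5_file_count(10_000, file_read_count=1_000)
print(default_count, smaller_files_count)
```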
