Metadata-Version: 2.1
Name: pzip
Version: 0.9.7
Summary: Cryptographically secure file compression.
Home-page: https://github.com/imsweb/pzip
Author: Dan Watson
Author-email: watsond@imsweb.com
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Security
Classifier: Topic :: System :: Archiving :: Compression
Description-Content-Type: text/markdown
Requires-Dist: cryptography
Requires-Dist: tqdm

![CI](https://github.com/imsweb/pzip/workflows/CI/badge.svg?branch=master)

# PZip

PZip is an encrypted file format (with optional gzip compression), a command-line tool, and a Python file-like
interface.

## Installation

`pip install pzip`

## Command Line Usage

For a full list of options, run `pzip -h`. Basic usage is summarized below:

```
pzip --key keyfile sensitive_data.csv
pzip --key keyfile sensitive_data.csv.pz
```

Piping and outputting to stdout are also supported:

```
tar cf - somedir | pzip -z --key keyfile -o somedir.pz
pzip --key keyfile -c somedir.pz | tar xf -
```

PZip can also generate an encryption password for you automatically:

```
pzip -a sensitive_data.csv
encrypting with password: HgHs4OIm4zGXkch6lTBIqg

pzip -p HgHs4OIm4zGXkch6lTBIqg sensitive_data.csv.pz
```

## Python Usage

```python
import os
from pzip import PZip

key = os.urandom(32)

with PZip("myfile.pz", PZip.Mode.ENCRYPT, key) as f:
    f.write(b"sensitive data")

with PZip("myfile.pz", PZip.Mode.DECRYPT, key) as f:
    print(f.read())
```

To encrypt using a password instead of a random key (and thus use PBKDF2 instead of HKDF):

```python
with PZip("myfile.pz", PZip.Mode.ENCRYPT, password=b"secret") as f:
    f.write(b"hello world")
```

For on-the-fly/streaming encryption, or when writing to non-seekable files, you may pass in the length of the plaintext
up front so that it can be written in the PZip header. Alternatively, if you don't wish to store the plaintext length in
the header for privacy reasons, you can pass `size=0`.

```python
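# `streaming_response` is any writable file-like object (it need not be seekable);
# `key` is the same kind of key used in the examples above.
plaintext = b"hello world"
with PZip(streaming_response, "wb", key, size=len(plaintext)) as f:
    f.write(plaintext)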
```

## Encryption

PZip uses AES-GCM with 128-, 192-, or 256-bit (default) keys. Keys are derived using one of the following, based on
the source key material:

  * PBKDF2-SHA256 with a configurable iteration count (currently 200,000) if the key material is a password
  * HKDF-SHA256 if the key material is a random key

A random 128-bit salt and a random 96-bit nonce (GCM IV) are generated for each file by default, but both may also be
supplied via the Python interface for systems that can more strongly guarantee uniqueness. The key size, nonce size,
iteration count, and salt are stored in the PZip file header, and the nonce/IV is stored immediately after it.
Additionally, a copy of the nonce is prepended to the file contents before encryption, as a way to fail fast when doing
streaming decryption. The decrypted plaintext will still be authenticated via the tag at the end, but a fail-fast
mechanism is important when dealing with large files. The authentication tag is appended after the ciphertext in order
to make this format suitable for on-the-fly streaming encryption.
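
The choice between the two KDFs described above can be sketched with the standard library alone. This is only an
illustration under stated assumptions, not PZip's actual code: `hkdf_sha256` and `derive_key` are hypothetical names,
and this README does not say how PZip sets the HKDF salt and `info` parameters, so the sketch assumes the file salt is
used as the HKDF salt with an empty `info`.

```python
import hashlib
import hmac


def hkdf_sha256(key_material: bytes, salt: bytes, length: int, info: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, key_material, hashlib.sha256).digest()  # extract step
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:  # expand step, one 32-byte block per round
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_key(material: bytes, salt: bytes, key_size: int = 32,
               iterations: int = 200_000, is_password: bool = False) -> bytes:
    """Derive an AES key of `key_size` bytes (16, 24, or 32) from the key material."""
    if key_size not in (16, 24, 32):
        raise ValueError("key size must be 16, 24, or 32 bytes")
    if is_password:
        # Passwords are low-entropy, so they go through the slow, iterated KDF.
        return hashlib.pbkdf2_hmac("sha256", material, salt, iterations, dklen=key_size)
    # Random keys are already high-entropy, so a single HKDF pass is enough.
    # Assumption: the file salt doubles as the HKDF salt and `info` is empty.
    return hkdf_sha256(material, salt, key_size)
```

The iteration count only matters on the password path, which is consistent with the header field being unused when the
key material is a random key.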

## Compression

PZip optionally compresses data using gzip at the default compression level. Nothing about the file format precludes
adding an option in the future to allow configuration of the compression level, or even the compression algorithm.

## File Format

The PZip file format consists of a 36-byte header, followed by a variable-size nonce in plaintext, immediately followed
by the same nonce encrypted. The remainder of the file is encrypted data, except for the last 16 bytes, which are the
AES-GCM authentication tag data. The header is big/network endian, with the following fields/sizes:

  * File identification (magic), 4 bytes - `PZIP`
  * File format version, 1 byte - currently `\x01`
  * Flags, 1 byte:
    * Bit 0 (1): set when the file data is gzip-compressed
    * Bit 1 (2): set when the original key material was a password (use PBKDF2 instead of HKDF)
  * AES key size (in bytes), 1 byte - must be 16, 24, or 32
  * GCM nonce size (in bytes), 1 byte - 12 by default, may be larger
  * KDF iterations, 4 bytes (unsigned int/long) - currently unused if key material was not a password
  * KDF salt, 16 bytes
  * Plaintext length, 8 bytes (unsigned long long) - optional, may be set to 0
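
As a rough illustration of the layout above, the header can be unpacked with Python's `struct` module. This is a
minimal sketch rather than PZip's own parser: the names `HEADER`, `Header`, and `parse_header` are hypothetical, and
`SAMPLE_HEADER` is the header of the example file in this section.

```python
import struct
from typing import NamedTuple

# Big-endian, no padding: magic, version, flags, key size, nonce size,
# KDF iterations, KDF salt, plaintext length (4+1+1+1+1+4+16+8 = 36 bytes).
HEADER = struct.Struct(">4sBBBBI16sQ")


class Header(NamedTuple):
    magic: bytes
    version: int
    flags: int
    key_size: int
    nonce_size: int
    iterations: int
    salt: bytes
    plaintext_length: int

    @property
    def compressed(self) -> bool:
        return bool(self.flags & 1)  # bit 0: data is gzip-compressed

    @property
    def uses_password(self) -> bool:
        return bool(self.flags & 2)  # bit 1: key material was a password


def parse_header(data: bytes) -> Header:
    """Unpack and sanity-check the first 36 bytes of a PZip file."""
    if len(data) < HEADER.size:
        raise ValueError("not enough data for a PZip header")
    header = Header(*HEADER.unpack(data[:HEADER.size]))
    if header.magic != b"PZIP":
        raise ValueError("not a PZip file")
    if header.key_size not in (16, 24, 32):
        raise ValueError("invalid AES key size")
    return header


SAMPLE_HEADER = bytes.fromhex(
    "505A4950" "01" "02" "20" "0C" "00030D40"
    "AD46720C7000FFCC2097105B10D40BB8" "000000000000000B"
)
print(parse_header(SAMPLE_HEADER))
```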

Below is an example of a PZip file containing the plaintext "hello world", encrypted with a key derived from the string
"pzip", with no compression (for readability). The portion sectioned off in double bars (`===`) is encrypted.

```
+-------------------------------------------------+------+-------------+------------------------+
| Bytes                                           | Size | Value       | Description            |
+-------------------------------------------------+------+-------------+------------------------+
| 50 5A 49 50                                     | 4    | PZIP        | File identification    |
| 01                                              | 1    | 1           | Version                |
| 02                                              | 1    | 2           | Flags                  |
| 20                                              | 1    | 32          | AES key size in bytes  |
| 0C                                              | 1    | 12          | Nonce size in bytes    |
| 00 03 0D 40                                     | 4    | 200000      | KDF iterations         |
| AD 46 72 0C 70 00 FF CC 20 97 10 5B 10 D4 0B B8 | 16   | <salt>      | KDF salt               |
| 00 00 00 00 00 00 00 0B                         | 8    | 11          | Plaintext length       |
+-------------------------------------------------+------+-------------+------------------------+
| B2 4F DD E3 FF 21 A8 09 3E 0C 1C 3E             | 12   | <nonce>     | Nonce (unencrypted)    |
+=================================================+======+=============+========================+
| 8B EB 12 D4 81 AD 6B 47 B0 0F 74 70             | 12   | <nonce>     | Nonce (encrypted)      |
| 8E A1 96 74 A9 51 31 47 B9 5C A2                | 11   | hello world | Ciphertext             |
+=================================================+======+=============+========================+
| 12 58 A6 8B ED F1 A9 08 47 3A 10 BC B6 1E 28 24 | 16   | <tag>       | GCM authentication tag |
+-------------------------------------------------+------+-------------+------------------------+
```

You can verify the above example in Python:

```python
>>> import binascii, io, pzip
>>> data = binascii.unhexlify(
...     '505A49500102200C00030d40AD46720C7000FFCC2097105B10D40BB8000000000000000BB24FDDE3FF21A80'
...     '93E0C1C3E8BEB12D481AD6B47B00F74708EA19674A9513147B95CA21258A68BEDF1A908473A10BCB61E2824'
... )
>>> pzip.PZip(io.BytesIO(data), "rb", password=b"pzip").read()
b'hello world'
```
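
Without any key, the same example bytes can also be split into the sections shown in the table, which may help when
inspecting a file by hand. The sketch below assumes only the layout described in this section; `split_pzip` and the
constant names are hypothetical and not part of the `pzip` package.

```python
HEADER_SIZE = 36      # fixed-size header
NONCE_SIZE_OFFSET = 7  # magic (4) + version (1) + flags (1) + key size (1)
TAG_SIZE = 16         # AES-GCM authentication tag


def split_pzip(data: bytes) -> dict:
    """Split raw PZip bytes into header, plaintext nonce, encrypted body, and tag."""
    if len(data) < HEADER_SIZE:
        raise ValueError("not enough data for a PZip header")
    nonce_size = data[NONCE_SIZE_OFFSET]
    nonce_end = HEADER_SIZE + nonce_size
    # The encrypted body must at least hold the encrypted copy of the nonce.
    if len(data) < nonce_end + nonce_size + TAG_SIZE:
        raise ValueError("truncated PZip data")
    return {
        "header": data[:HEADER_SIZE],
        "nonce": data[HEADER_SIZE:nonce_end],
        "encrypted": data[nonce_end:-TAG_SIZE],  # encrypted nonce + ciphertext
        "tag": data[-TAG_SIZE:],
    }


SAMPLE = bytes.fromhex(
    "505A49500102200C00030d40AD46720C7000FFCC2097105B10D40BB8000000000000000BB24FDDE3FF21A80"
    "93E0C1C3E8BEB12D481AD6B47B00F74708EA19674A9513147B95CA21258A68BEDF1A908473A10BCB61E2824"
)
for name, chunk in split_pzip(SAMPLE).items():
    print(f"{name:>9}: {len(chunk):2d} bytes  {chunk.hex().upper()}")
```

With no compression, the encrypted section is the nonce size plus the plaintext length, since AES-GCM ciphertext is the
same length as its input.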

## FAQ

*Why does this exist?*

Nothing PZip does couldn't be done by chaining together existing tools - compressing with `gzip`, deriving a key and
encrypting with `openssl`, generating a MAC (if not using GCM), etc. But at that point, you're probably writing a
script to automate the process, tacking on bits of data here and there (or writing multiple files). PZip simply wraps
that in a nice package and documents a file format. Plus having a Python interface you can pretty much treat as a file
is super nice.

*Why not store the filename?*

Storing the original filename has a number of security implications, both technical and otherwise. At a technical level,
PZip would need to ensure safe filename handling across all platforms with regard to path delimiters, encodings, etc.
Additionally, PZip was designed for a system where user-generated file attachments may contain sensitive information in
the filenames themselves. In reality, having a stored filename is of minimal use anyway, since the default behavior is
to append and remove a `.pz` suffix when encrypting/decrypting. If a `.pz` file was renamed, you would have a conflict
that would likely be resolved by using the actual filename (not the stored filename) anyway.


