Metadata-Version: 2.1
Name: pzip
Version: 0.9.5
Summary: Cryptographically secure file compression.
Home-page: https://github.com/imsweb/pzip
Author: Dan Watson
Author-email: watsond@imsweb.com
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Security
Classifier: Topic :: System :: Archiving :: Compression
Description-Content-Type: text/markdown
Requires-Dist: cryptography
Requires-Dist: tqdm

![CI](https://github.com/imsweb/pzip/workflows/CI/badge.svg?branch=master)

# PZip

PZip is an encrypted file format (with optional gzip compression), a command-line tool, and a Python file-like
interface.

## Installation

`pip install pzip`

## Command Line Usage

For a full list of options, run `pzip -h`. Basic usage is summarized below:

```
pzip --key keyfile sensitive_data.csv
pzip --key keyfile sensitive_data.csv.pz
```

Piping and outputting to stdout are also supported:

```
tar cf - somedir | pzip -z --key keyfile -o somedir.pz
pzip --key keyfile -c somedir.pz | tar xf -
```

PZip can also generate an encryption key for you automatically:

```
pzip -a sensitive_data.csv
encrypting with password: 7xRLoyHgK6J2-4mUkT3JoklSyfSYxHb1EkMABjasnUc

pzip -p 7xRLoyHgK6J2-4mUkT3JoklSyfSYxHb1EkMABjasnUc sensitive_data.csv.pz
```

## Python Usage

```python
import os
from pzip import PZip

key = os.urandom(32)

with PZip("myfile.pz", PZip.Mode.ENCRYPT, key) as f:
    f.write(b"sensitive data")

with PZip("myfile.pz", PZip.Mode.DECRYPT, key) as f:
    print(f.read())
```

For on-the-fly/streaming encryption, or when writing to non-seekable files, you may pass in the plaintext length, which
is written to the PZip header. Alternatively, if you don't wish to store the plaintext length in the header for privacy
reasons, you can pass `size=0`.

```python
plaintext = b"hello world"
with PZip(streaming_response, "wb", key, size=len(plaintext)) as f:
    f.write(plaintext)
```

## Encryption

PZip uses AES-GCM with 128-, 192-, or 256-bit (default) keys. Keys are derived using PBKDF2-SHA256 with a configurable
iteration count (currently 200,000) and a random salt per file. A random 96-bit nonce (GCM IV) is generated by default
for each file, but may also be supplied via the Python interface for systems that can more strongly guarantee
uniqueness. The key size, nonce size, iteration count, and salt are stored in the PZip file header, and the nonce/IV is
stored in plaintext immediately after it. Additionally, an encrypted copy of the nonce is prepended to the encrypted file
contents as a way to fail fast when doing streaming decryption. The decrypted plaintext will still be authenticated via
the tag at the end, but a fail-fast mechanism is important when dealing with large files. The authentication tag is
appended after the ciphertext in order to make this format suitable for on-the-fly streaming encryption.
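The key derivation described above can be sketched with the standard library alone. This is a minimal illustration, not PZip's actual implementation: `derive_key` is a hypothetical helper, and the only details taken from this document are PBKDF2-SHA256, the 200,000 iteration count, the 16-byte salt, and the 96-bit nonce.

```python
import hashlib
import os


def derive_key(password: bytes, salt: bytes, iterations: int = 200_000, key_size: int = 32) -> bytes:
    """Derive an AES key of key_size bytes using PBKDF2-HMAC-SHA256.

    Hypothetical helper for illustration; PZip's internals may differ.
    """
    if key_size not in (16, 24, 32):
        raise ValueError("AES key size must be 16, 24, or 32 bytes")
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations, dklen=key_size)


salt = os.urandom(16)   # random salt, generated once per file
nonce = os.urandom(12)  # random 96-bit GCM nonce, generated once per file
key = derive_key(b"pzip", salt)
```

Because the salt and iteration count are stored in the file header, a reader holding the same password can rerun the derivation and arrive at the same key.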

## Compression

PZip optionally compresses data using gzip at the default compression level. Nothing about the file format precludes
adding an option in the future to allow configuration of the compression level, or even the compression algorithm.

## File Format

The PZip file format consists of a 36-byte header, followed by a variable-size nonce in plaintext, immediately followed
by the same nonce encrypted. The remainder of the file is encrypted data, except for the last 16 bytes, which are the
AES-GCM authentication tag data. The header is big/network endian, with the following fields/sizes:

  * File identification (magic), 4 bytes - `PZIP`
  * File format version, 1 byte - currently `\x01`
  * Flags, 1 byte - currently only bit 0 is set when the file data is gzip-compressed
  * AES key size (in bytes), 1 byte - must be 16, 24, or 32
  * GCM nonce size (in bytes), 1 byte - 12 by default, may be larger
  * PBKDF2 iterations (4 bytes, unsigned int/long)
  * PBKDF2 salt (16 bytes)
  * Plaintext length (8 bytes, unsigned long long) - optional, may be set to 0

Below is an example of a PZip file containing the plaintext "hello world", encrypted with a key derived from the string
"pzip", with no compression (for readability). The portion sectioned off in double bars (`===`) is encrypted.

```
+-------------------------------------------------+------+-------------+------------------------+
| Bytes                                           | Size | Value       | Description            |
+-------------------------------------------------+------+-------------+------------------------+
| 50 5A 49 50                                     | 4    | PZIP        | File identification    |
| 01                                              | 1    | 1           | Version                |
| 00                                              | 1    | 0           | Flags                  |
| 20                                              | 1    | 32          | AES key size in bytes  |
| 0C                                              | 1    | 12          | Nonce size in bytes    |
| 00 03 0D 40                                     | 4    | 200000      | PBKDF2 iterations      |
| AD 46 72 0C 70 00 FF CC 20 97 10 5B 10 D4 0B B8 | 16   | <salt>      | PBKDF2 salt            |
| 00 00 00 00 00 00 00 0B                         | 8    | 11          | Plaintext length       |
+-------------------------------------------------+------+-------------+------------------------+
| B2 4F DD E3 FF 21 A8 09 3E 0C 1C 3E             | 12   | <nonce>     | Nonce (unencrypted)    |
+=================================================+======+=============+========================+
| 8B EB 12 D4 81 AD 6B 47 B0 0F 74 70             | 12   | <nonce>     | Nonce (encrypted)      |
| 8E A1 96 74 A9 51 31 47 B9 5C A2                | 11   | hello world | Ciphertext             |
+=================================================+======+=============+========================+
| 12 58 A6 8B ED F1 A9 08 47 3A 10 BC B6 1E 28 24 | 16   | <tag>       | GCM authentication tag |
+-------------------------------------------------+------+-------------+------------------------+
```

You can verify the above example in Python:

```python
>>> import binascii, io, pzip
>>> data = binascii.unhexlify(
...     '505A49500100200C00030d40AD46720C7000FFCC2097105B10D40BB8000000000000000BB24FDDE3FF21A80'
...     '93E0C1C3E8BEB12D481AD6B47B00F74708EA19674A9513147B95CA21258A68BEDF1A908473A10BCB61E2824'
... )
>>> pzip.PZip(io.BytesIO(data), "rb", b"pzip").read()
b'hello world'
```
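To see how the layout in the table falls out of the raw bytes, the example file can also be sliced by hand without decrypting anything. This sketch uses only the standard library, and the variable names are illustrative. It assumes the 36-byte header and 16-byte tag described above, and reads the nonce size from the header byte at offset 7.

```python
HEADER_SIZE = 36  # fixed-size header
TAG_SIZE = 16     # AES-GCM authentication tag at the end of the file

# The same bytes as the example table, one row per string literal.
data = bytes.fromhex(
    "505A4950" "01" "00" "20" "0C" "00030D40"
    "AD46720C7000FFCC2097105B10D40BB8"
    "000000000000000B"
    "B24FDDE3FF21A8093E0C1C3E"
    "8BEB12D481AD6B47B00F7470"
    "8EA19674A9513147B95CA2"
    "1258A68BEDF1A908473A10BCB61E2824"
)

nonce_size = data[7]  # nonce size is the 8th byte of the header
nonce_end = HEADER_SIZE + nonce_size

header = data[:HEADER_SIZE]
nonce = data[HEADER_SIZE:nonce_end]                       # stored in plaintext
encrypted_nonce = data[nonce_end:nonce_end + nonce_size]  # fail-fast check value
ciphertext = data[nonce_end + nonce_size:-TAG_SIZE]       # encrypted file data
tag = data[-TAG_SIZE:]

# With no compression, the ciphertext is as long as the stored plaintext length.
plaintext_length = int.from_bytes(header[28:36], "big")
```

Since the example is uncompressed, `len(ciphertext)` equals `plaintext_length`, matching the 11 bytes of "hello world".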

## FAQ

*Why does this exist?*

Nothing PZip does couldn't be done by chaining together existing tools - compressing with `gzip`, deriving a key and
encrypting with `openssl`, generating a MAC (if not using GCM), etc. But at that point, you're probably writing a
script to automate the process, tacking on bits of data here and there (or writing multiple files). PZip simply wraps
that in a nice package and documents a file format. Plus having a Python interface you can pretty much treat as a file
is super nice.

*Why not store the filename?*

Storing the original filename has a number of security implications, both technical and otherwise. At a technical level,
PZip would need to ensure safe filename handling across all platforms with regard to path delimiters, encodings, etc.
Additionally, PZip was designed for a system where user-generated file attachments may contain sensitive information in
the filenames themselves. In reality, having a stored filename is of minimal use anyway, since the default behavior is
to append and remove a `.pz` suffix when encrypting/decrypting. If a `.pz` file was renamed, you would have a conflict
that would likely be resolved by using the actual filename (not the stored filename) anyway.


