Metadata-Version: 2.4
Name: repro-tarfile
Version: 0.2.1
Summary: A tiny, zero-dependency replacement for Python's tarfile standard library for creating reproducible/deterministic tar archives.
Project-URL: Documentation, https://github.com/drivendataorg/repro-tarfile#readme
Project-URL: Issues, https://github.com/drivendataorg/repro-tarfile/issues
Project-URL: Source, https://github.com/drivendataorg/repro-tarfile
Author-email: DrivenData <info@drivendata.org>
License: Unless otherwise indicated, this software is copyright of DrivenData and
        licensed under the MIT License. Some portions of this software are copied and
        modified from Python 3.12, which is copyright of the Python Software Foundation
        and licensed under the Python Software Foundation License Version 2. Some other
        portions of this software are copied and modified from Typeshed, which is
        copyright of the Python Software Foundation and licensed under the Apache
        License Version 2.
        
        ==============================================================================
        
        MIT License
        
        Copyright (c) 2024 DrivenData Inc.
        
        Permission is hereby granted, free of charge, to any person obtaining a copy of
        this software and associated documentation files (the “Software”), to deal in
        the Software without restriction, including without limitation the rights to
        use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
        the Software, and to permit persons to whom the Software is furnished to do so,
        subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
        FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
        COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
        IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
        CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
        ==============================================================================
        
        PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
        
        Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
        2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 Python Software Foundation;
        All Rights Reserved
        
        1. This LICENSE AGREEMENT is between the Python Software Foundation
        ("PSF"), and the Individual or Organization ("Licensee") accessing and
        otherwise using this software ("Python") in source or binary form and
        its associated documentation.
        
        2. Subject to the terms and conditions of this License Agreement, PSF hereby
        grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce,
        analyze, test, perform and/or display publicly, prepare derivative works,
        distribute, and otherwise use Python alone or in any derivative version,
        provided, however, that PSF's License Agreement and PSF's notice of copyright,
        i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
        2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 Python Software Foundation;
        All Rights Reserved" are retained in Python alone or in any derivative version
        prepared by Licensee.
        
        3. In the event Licensee prepares a derivative work that is based on
        or incorporates Python or any part thereof, and wants to make
        the derivative work available to others as provided herein, then
        Licensee hereby agrees to include in any such work a brief summary of
        the changes made to Python.
        
        4. PSF is making Python available to Licensee on an "AS IS"
        basis.  PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
        IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
        DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
        FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
        INFRINGE ANY THIRD PARTY RIGHTS.
        
        5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
        FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
        A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
        OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
        
        6. This License Agreement will automatically terminate upon a material
        breach of its terms and conditions.
        
        7. Nothing in this License Agreement shall be deemed to create any
        relationship of agency, partnership, or joint venture between PSF and
        Licensee.  This License Agreement does not grant permission to use PSF
        trademarks or trade name in a trademark sense to endorse or promote
        products or services of Licensee, or any third party.
        
        8. By copying, installing or otherwise using Python, Licensee
        agrees to be bound by the terms and conditions of this License
        Agreement.
        
        ==============================================================================
        
        The "typeshed" project is licensed under the terms of the Apache license, as
        reproduced below.
        
        = = = = =
        
        Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "{}"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright {yyyy} {name of copyright owner}
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
        = = = = =
        
        Parts of typeshed are licensed under different licenses (like the MIT
        license), reproduced below.
        
        = = = = =
        
        The MIT License
        
        Copyright (c) 2015 Jukka Lehtosalo and contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a
        copy of this software and associated documentation files (the "Software"),
        to deal in the Software without restriction, including without limitation
        the rights to use, copy, modify, merge, publish, distribute, sublicense,
        and/or sell copies of the Software, and to permit persons to whom the
        Software is furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in
        all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
        FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
        DEALINGS IN THE SOFTWARE.
        
        = = = = =
License-File: LICENSE
Keywords: deterministic,reproducible,tar,tarfile
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: System :: Archiving
Classifier: Topic :: System :: Archiving :: Compression
Classifier: Topic :: System :: Archiving :: Packaging
Requires-Python: >=3.8
Provides-Extra: cli
Requires-Dist: rptar; extra == 'cli'
Description-Content-Type: text/markdown

# repro-tarfile

[![PyPI](https://img.shields.io/pypi/v/repro-tarfile.svg)](https://pypi.org/project/repro-tarfile/)
[![Conda Version](https://img.shields.io/conda/vn/conda-forge/repro-tarfile.svg)](https://anaconda.org/conda-forge/repro-tarfile)
[![conda-forge feedstock](https://img.shields.io/badge/conda--forge-feedstock-yellowgreen)](https://github.com/conda-forge/repro-tarfile-feedstock)
[![Supported Python versions](https://img.shields.io/pypi/pyversions/repro-tarfile)](https://pypi.org/project/repro-tarfile/)
[![tests](https://github.com/drivendataorg/repro-tarfile/actions/workflows/tests.yml/badge.svg?branch=main)](https://github.com/drivendataorg/repro-tarfile/actions/workflows/tests.yml?query=branch%3Amain)
[![codecov](https://codecov.io/gh/drivendataorg/repro-tarfile/branch/main/graph/badge.svg)](https://codecov.io/gh/drivendataorg/repro-tarfile)

**A tiny, zero-dependency replacement for Python's `tarfile` standard library for creating reproducible/deterministic tar archives.**

"Reproducible" or "deterministic" in this context means that the binary content of the tar archive is identical if you add files with identical binary content in the same order. It means you can reliably check equality of the contents of two tar archives by simply comparing checksums of the archive using a hash function like MD5 or SHA-256.

This Python package provides a `ReproducibleTarFile` class that works exactly like [`tarfile.TarFile`](https://docs.python.org/3/library/tarfile.html#tarfile-objects) from the Python standard library, except that certain archive metadata are set to fixed values. See ["How does repro-tarfile work?"](#how-does-repro-tarfile-work) below for details.

You can also optionally install a command-line program, **rptar**. See ["rptar command line program"](#rptar-command-line-program) below for more information.

_Looking instead to create reproducible/deterministic ZIP archives? Check out our sister package, [repro-zipfile](https://github.com/drivendataorg/repro-zipfile)!_

## Installation

repro-tarfile is available from PyPI. To install, run:

```bash
pip install repro-tarfile
```

It is also available from conda-forge. To install, run:

```bash
conda install repro-tarfile -c conda-forge
```

## Usage

Simply `import repro_tarfile` and use it in the same way you would use regular [`tarfile`](https://docs.python.org/3/library/tarfile.html) from the Python standard library.

```python
import repro_tarfile

with repro_tarfile.open("archive.tar.gz", "w:gz") as tar:
    tar.add("examples/data.txt", arcname="data.txt")
```

Note that files must be written to the archive in the same order to reproduce an identical archive. Be aware that functions like `os.listdir`, `glob.glob`, `Path.iterdir`, and `Path.glob` return files in arbitrary order, so you should call `sorted` on their returned values first.
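As a minimal sketch of that advice, the helper below adds every regular file under a directory in sorted order. The name `add_tree_sorted` is hypothetical and not part of repro-tarfile; it only assumes that `tar` is an open archive object with the standard `add` method.

```python
from pathlib import Path


def add_tree_sorted(tar, root):
    """Add every regular file under root to an open tar archive in sorted order."""
    root = Path(root)
    files = (p for p in root.rglob("*") if p.is_file())
    # Sort on the archive name so the order does not depend on the file system
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        tar.add(path, arcname=path.relative_to(root).as_posix())
```

You would call it inside the `with repro_tarfile.open(...) as tar:` block shown above, for example `add_tree_sorted(tar, "examples")`.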

See [`examples/usage.py`](./examples/usage.py) for an example script that you can run, and [`examples/demo_vs_tarfile.py`](./examples/demo_vs_tarfile.py) for a demonstration in contrast with the standard library's tarfile module.

For more advanced usage, such as customizing the fixed metadata values, see the subsections under ["How does repro-tarfile work?"](#how-does-repro-tarfile-work).

## rptar command-line program

[![PyPI](https://img.shields.io/pypi/v/rptar.svg)](https://pypi.org/project/rptar/)

You can optionally install a lightweight command-line program, **rptar**. This adds a dependency on the [typer](https://typer.tiangolo.com/) CLI framework. You can install it either directly or through the `cli` extra of repro-tarfile. We recommend [pipx](https://github.com/pypa/pipx) for installing Python CLIs into isolated virtual environments, but regular pip works too.

```bash
pipx install rptar
# or
pipx install repro-tarfile[cli]
```

rptar is designed to be a partial drop-in replacement for the ubiquitous [tar](https://linux.die.net/man/1/tar) program. Use `rptar --help` to see the documentation. Here are some usage examples:

```bash
# Archive one file
rptar -czvf archive.tar.gz some_file.txt
# Archive two files
rptar -czvf archive.tar.gz file1.txt file2.txt
# Archive many files with glob
rptar -czvf archive.tar.gz some_dir/*.txt
# Archive directory recursively
rptar -czvf archive.tar.gz some_dir/
```

In addition to the fixed metadata values that repro-tarfile sets, rptar will also always sort all paths being archived.

## How does repro-tarfile work?

Tar archives are not normally reproducible even when containing files with identical content because of metadata. In particular, the usual culprits are:

1. Last-modified timestamps of added files
2. File-system permissions (mode) of added files
3. File owner user and group of added files
4. If using gzip compression, the uncompressed filename in the gzip header
5. If using gzip compression, the last-modified timestamp in the gzip header

`repro_tarfile.ReproducibleTarFile` is a subclass of `tarfile.TarFile` that overrides the `addfile` method (which is also used internally by `add`) with a version that sets the above file metadata to fixed values. It also overrides the `gzopen` method used for gzip compression to override the gzip header values. Note that repro-tarfile does not modify the original files; it only overrides the metadata written to the archive.

You can effectively reproduce what repro-tarfile does in a `.tar.gz` case with something like this:

```python
from gzip import GzipFile
from pathlib import Path
import tarfile

with Path("archive.tar.gz").open("wb") as fp:
    with GzipFile(filename="", fileobj=fp, mode="wb", mtime=315532800) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            # Build the TarInfo from the file on disk, then override its metadata
            tarinfo = tar.gettarinfo("examples/data.txt", arcname="data.txt")
            tarinfo.mtime = 315532800
            tarinfo.mode = 0o644
            tarinfo.uid = 0
            tarinfo.gid = 0
            tarinfo.uname = ""
            tarinfo.gname = ""
            with Path("examples/data.txt").open("rb") as fp2:
                tar.addfile(tarinfo, fp2)
```

It's kind of a pain! We believe repro-tarfile is sufficiently more convenient to justify a small package.
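To illustrate the subclassing approach described above, here is a minimal sketch that overrides `addfile` on `tarfile.TarFile` using only the standard library. It is not repro-tarfile's actual implementation: the names `FixedMetadataTarFile`, `FIXED_MTIME`, and `build_archive` are hypothetical, the values are hard-coded rather than read from environment variables, and gzip handling is left out.

```python
import copy
import io
import tarfile

FIXED_MTIME = 315532800  # 1980-01-01 00:00:00 UTC


class FixedMetadataTarFile(tarfile.TarFile):
    """Hypothetical sketch: a TarFile that writes fixed metadata for every member."""

    def addfile(self, tarinfo, fileobj=None):
        # Work on a copy so the caller's TarInfo object is left untouched
        tarinfo = copy.copy(tarinfo)
        tarinfo.mtime = FIXED_MTIME
        tarinfo.mode = 0o755 if tarinfo.isdir() else 0o644
        tarinfo.uid = 0
        tarinfo.gid = 0
        tarinfo.uname = ""
        tarinfo.gname = ""
        return super().addfile(tarinfo, fileobj)


def build_archive(mtime, uid, uname):
    """Build a one-file archive in memory from deliberately varying metadata."""
    data = b"hello"
    info = tarfile.TarInfo("data.txt")
    info.size = len(data)
    info.mtime, info.uid, info.uname = mtime, uid, uname
    buf = io.BytesIO()
    with FixedMetadataTarFile.open(fileobj=buf, mode="w") as tar:
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Different input metadata should still give identical archive bytes
same = build_archive(1, 1000, "alice") == build_archive(2, 2000, "bob")
```

Because `add` calls `addfile` internally, overriding this one method covers both ways of adding members.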

See the following sections for more details about the replacement metadata values and how to customize them.

### Fixed metadata values

Here's a quick reference table of the fixed metadata values. You can use the associated environment variable to override a value.

| Metadata field               | Default                               | Environment variable      |
|------------------------------|---------------------------------------|---------------------------|
| Last modified timestamp      | `315532800` (1980-01-01 00:00:00 UTC) | `SOURCE_DATE_EPOCH`       |
| File mode                    | `644` (rw-r--r--)                     | `REPRO_TARFILE_FILE_MODE` |
| Directory mode               | `755` (rwxr-xr-x)                     | `REPRO_TARFILE_DIR_MODE`  |
| Owner user ID                | `0`                                   | `REPRO_TARFILE_UID`       |
| Owner group ID               | `0`                                   | `REPRO_TARFILE_GID`       |
| Owner user name              | empty string                          | `REPRO_TARFILE_UNAME`     |
| Owner group name             | empty string                          | `REPRO_TARFILE_GNAME`     |
| Gzip archive filename        | empty string                          |                           |
| Gzip last modified timestamp | `315532800` (1980-01-01 00:00:00 UTC) | `SOURCE_DATE_EPOCH`       |

For deeper explanations, see below.
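If you want to check which values ended up in an archive, the standard library can read them back. The helper below is a small sketch; the name `member_metadata` is hypothetical and not part of repro-tarfile.

```python
import tarfile


def member_metadata(fileobj):
    """Return (name, mtime, mode, uid, gid, uname, gname) for each archive member."""
    # "r:*" lets tarfile detect the compression, if any, on its own
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        return [
            (m.name, m.mtime, oct(m.mode), m.uid, m.gid, m.uname, m.gname)
            for m in tar.getmembers()
        ]
```

For example, open an archive with `open("archive.tar.gz", "rb")` and pass the file object to `member_metadata` to compare the stored fields against the table.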

#### Last-modified timestamps

Tar archives store the last-modified timestamps of added files and directories. The default fixed value used by repro-tarfile is 315532800, which corresponds to 1980-01-01 00:00:00 UTC.

You can customize this value with the `SOURCE_DATE_EPOCH` environment variable. If set, it will be used as the fixed value instead. This should be an integer corresponding to the [Unix epoch time](https://en.wikipedia.org/wiki/Unix_time) of the timestamp you want to set, e.g., `1704067200` for 2024-01-01 00:00:00 UTC. `SOURCE_DATE_EPOCH` is a [standard](https://reproducible-builds.org/docs/source-date-epoch/) created by the [Reproducible Builds project](https://reproducible-builds.org/) for software distributions.
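One way to compute such a value is with the standard library's `datetime`. This is a sketch: the helper name `to_source_date_epoch` is hypothetical, and it assumes the environment variable is set in the same process before the archive is created.

```python
import os
from datetime import datetime, timezone


def to_source_date_epoch(moment):
    """Convert a timezone-aware datetime to an integer Unix timestamp string."""
    return str(int(moment.timestamp()))


# The default used by repro-tarfile corresponds to 1980-01-01 00:00:00 UTC
default_epoch = to_source_date_epoch(datetime(1980, 1, 1, tzinfo=timezone.utc))

# Set the variable before creating the archive
os.environ["SOURCE_DATE_EPOCH"] = to_source_date_epoch(
    datetime(2024, 1, 1, tzinfo=timezone.utc)
)
```

Always pass a timezone-aware datetime here; a naive one would be interpreted in the machine's local time zone and give different values on different systems.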

#### File-system permissions

Tar archives store the file-system permissions of files and directories. The default permissions set for new files or directories can often differ across systems or users without any intentional choices being made. (These default permissions are controlled by something called [`umask`](https://en.wikipedia.org/wiki/Umask).) repro-tarfile will set these to fixed values. By default, the fixed values are `0o644` (`rw-r--r--`) for files and `0o755` (`rwxr-xr-x`) for directories, which matches the common default `umask` of `0o022` for root users on Unix systems. (The [`0o` prefix](https://docs.python.org/3/reference/lexical_analysis.html#integers) is how you write an octal, i.e., base 8, integer literal in Python.)

You can customize these values using the environment variables `REPRO_TARFILE_FILE_MODE` and `REPRO_TARFILE_DIR_MODE`. They should be in three-digit octal [Unix numeric notation](https://en.wikipedia.org/wiki/File-system_permissions#Numeric_notation), e.g., `644` for `rw-r--r--`.
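The relationship between `umask`, the octal numbers, and the symbolic notation can be worked through with the standard library. This is an illustrative sketch only; the variable names are hypothetical.

```python
import stat

DEFAULT_UMASK = 0o022

# New files usually start from 0o666 and directories from 0o777, minus the umask
file_mode = 0o666 & ~DEFAULT_UMASK
dir_mode = 0o777 & ~DEFAULT_UMASK

# Three-digit octal strings, the form expected by the environment variables
file_mode_str = format(file_mode, "o")
dir_mode_str = format(dir_mode, "o")

# stat.filemode renders the symbolic notation, with a leading file-type character
file_symbolic = stat.filemode(stat.S_IFREG | file_mode)
dir_symbolic = stat.filemode(stat.S_IFDIR | dir_mode)

# Parsing a three-digit string back into an integer uses base 8
parsed = int("644", 8)
```

Here `file_mode_str` is `"644"` and `dir_mode_str` is `"755"`, matching the defaults described above.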

#### File owner user and group

In typical file systems, every file and directory has an owner. Tar archives record the user and group information of the owner. If different users or systems are generating identical files and then archiving them, the owner information will likely be different. By default, repro-tarfile uses user and group ID values of `0`, and empty strings for the user and group names. These are the standard values recommended by the [Reproducible Builds project](https://reproducible-builds.org/docs/archives/#users-groups-and-numeric-ids).

You can customize the user and group IDs using the environment variables `REPRO_TARFILE_UID` and `REPRO_TARFILE_GID`. The values should be integers. You can customize the user and group names using the environment variables `REPRO_TARFILE_UNAME` and `REPRO_TARFILE_GNAME`.

#### Gzip header values

The gzip file format includes a header that contains metadata about the compressed file, in this case the tar archive. This header includes the archive filename and the last-modified timestamp of the archive. By default, repro-tarfile sets the archive filename to an empty string and the last-modified timestamp to the same default value used for added files, 315532800, which corresponds to 1980-01-01 00:00:00 UTC.

The `SOURCE_DATE_EPOCH` environment variable used to customize the last-modified timestamp of added files also sets the gzip header's last-modified timestamp. Currently, there is no supported way to customize the archive filename override.
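To see these two header fields directly, the sketch below compresses some bytes with the standard library's `gzip.GzipFile` and reads the header back. The names `gzip_fixed_header` and `FIXED_MTIME` are hypothetical; the byte offsets follow the gzip header layout (flags in byte 3, modification time in bytes 4 to 7).

```python
import gzip
import io
import struct

FIXED_MTIME = 315532800  # 1980-01-01 00:00:00 UTC


def gzip_fixed_header(payload):
    """Gzip-compress bytes with an empty filename and a fixed header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=FIXED_MTIME) as gz:
        gz.write(payload)
    return buf.getvalue()


compressed = gzip_fixed_header(b"example payload")

# Byte 3 holds the flags; bit 0x08 (FNAME) is set only when a filename is stored
has_filename = bool(compressed[3] & 0x08)
# Bytes 4-7 hold the modification time as a little-endian unsigned 32-bit integer
(header_mtime,) = struct.unpack("<I", compressed[4:8])
```

With an empty filename and a fixed `mtime`, compressing the same payload twice gives the same bytes.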

## Why care about reproducible tar archives?

Tar archives are often useful when dealing with a set of multiple files, especially if the files are large and can be compressed. Creating reproducible tar archives is often useful for:

- **Building a software package.** This is a development best practice to make it easier to verify distributed software packages. See the [Reproducible Builds project](https://reproducible-builds.org/) for more explanation.
- **Working with data.** Verify that your data pipeline produced the same outputs, and avoid further reprocessing of identical data.
- **Packaging machine learning model artifacts.** Manage model artifact packages more effectively by knowing when they contain identical models.
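
In each of these cases, the check comes down to comparing checksums of two archives. As a minimal sketch, the helper below hashes a file in chunks with the standard library's `hashlib`; the function name `sha256_of_file` is hypothetical and not part of repro-tarfile.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fp:
        # Read fixed-size chunks until read() returns empty bytes at end of file
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two reproducible archives built from the same files in the same order should give equal digests.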

## Related Tools and Alternatives

- https://diffoscope.org/
    - Can do a rich comparison of archive files and show what specifically differs
- https://salsa.debian.org/reproducible-builds/strip-nondeterminism
    - Perl library for removing nondeterministic metadata from file archives
