Metadata-Version: 2.1
Name: filedep
Version: 0.0.3
Summary: filedep: A small python tool to check file dependency
Home-page: https://github.com/flcong/filedep
Author: Francis Cong
Author-email: flcong@outlook.com
License: MIT
Platform: UNKNOWN
Description-Content-Type: text/markdown
License-File: LICENSE.txt

# filedep: A small python tool to check file dependency

## Motivation

When doing empirical analysis, you may run into the following file dependency
problem.

* Suppose `code.py` (or `code.do`, `code.sas`, `code.m`, `code.R`, etc)
  reads data from `indata.csv`, does some data cleaning, and then saves the 
  intermediary data as `outdata.csv`. 
  
* After using `outdata.csv` to run some statistical tests, you want to change 
  the data cleaning procedure a bit, so you modify `code.py`. 
  
* If there is only one code file and two data files, you will easily remember to
  re-run `code.py` to update the output data `outdata.csv`. 
  
* However, suppose that `outdata.csv` is then used by `code2.py` to write 
  `finaldata.csv`. It is then easy to forget to re-run `code2.py` as well
  to update `finaldata.csv`.
  
* As a result, this may cause the illusion that results change after you run
  the same set of code twice. For example, you forget to update `finaldata.csv`
  initially, but then accidentally update `finaldata.csv` some time later. Then,
  you find that results change after you run the same set of code.

To resolve this issue, I built this simple package to check file dependencies
based on last modified time. Users can define file dependencies, such as `code.py`
using `pre1.csv` and `pre2.csv` as input to write `post1.csv` and `post2.csv`.
Then, the function in the package will check if the last modified times of both
`pre1.csv` and `pre2.csv` are before that of `code.py` and the last modified 
times of both `post1.csv` and `post2.csv` are after that of `code.py`. If any
file dependency is broken, the broken ones will be printed or saved to a file.
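
As a sketch, the chain from the motivating example could be written in the list-of-tuples format described under "Format of file dependency". The directory `PROJECT` is a hypothetical placeholder and not something the package defines; the files are assumed to exist there.

```python
from os.path import join as pj

# Hypothetical project directory; replace with the absolute path of your own.
PROJECT = r'C:\myproject'

# Each tuple is (input files, code file, output files).
deps = [
    # code.py reads indata.csv and writes outdata.csv
    (
        [pj(PROJECT, 'indata.csv')],
        pj(PROJECT, 'code.py'),
        [pj(PROJECT, 'outdata.csv')],
    ),
    # code2.py reads outdata.csv and writes finaldata.csv
    (
        [pj(PROJECT, 'outdata.csv')],
        pj(PROJECT, 'code2.py'),
        [pj(PROJECT, 'finaldata.csv')],
    ),
]
```

Under the rule described above, modifying and re-running `code.py` alone would leave `finaldata.csv` older than expected relative to its inputs: `outdata.csv` would become newer than `code2.py`, so the second dependency would be reported as broken.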

## Installation

Use `pip` to install the package as follows:
```bash
pip install filedep
```

## Usage

Import the package using
```python
import filedep
```

The key function is `check_dep(deps, outfile=sys.stdout, reterr=False)`. 
The first argument is a list of dependencies (defined below). The second 
argument specifies where error information is printed if any file dependency is 
broken. The default is `sys.stdout`. The third argument specifies whether broken 
dependencies are also returned from the function, which is mainly useful for 
testing. The default is `False`, i.e., broken dependencies are only printed. 

The file dependencies have to be provided by the user in the format defined
below. The `template` folder contains a template that defines dependencies
and checks them with the `check_dep()` function.

### Example 1. No broken file dependencies

The following code creates several empty files:
```python
import filedep
import time
import os
from os.path import join as pj

PATH = r'C:\test_check_dep'
if not os.path.exists(PATH):
    os.mkdir(PATH)

def touch(filepath):
    if os.path.exists(filepath):
        os.utime(filepath)
    else:
        with open(filepath, 'a') as f:
            pass

# Touch files in a specific order
touch(pj(PATH, 'pre11.csv'))
time.sleep(.1)
touch(pj(PATH, 'pre12.csv'))
time.sleep(.1)
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post11.csv'))
time.sleep(.1)
touch(pj(PATH, 'post12.csv'))
time.sleep(.1)

# Define dependencies
deps = [
    (
        [
            pj(PATH, 'pre11.csv'),
            pj(PATH, 'pre12.csv')
        ], 
        pj(PATH, 'code1.py'), 
        [
            pj(PATH, 'post11.csv'),
            pj(PATH, 'post12.csv'),
        ],
    ),
]
filedep.check_dep(deps)
```
In `deps`, we define a single dependency as follows: under the directory 
`C:\test_check_dep`, `code1.py` reads `pre11.csv` and `pre12.csv` to produce
`post11.csv` and `post12.csv`. Then, the last modified times of both `pre11.csv`
and `pre12.csv` must be before that of `code1.py` and those of both `post11.csv`
and `post12.csv` must be after that of `code1.py`.

Since the dependency is satisfied by construction, the output is
```
All file dependencies are verified!
```

### Example 2. Broken file dependencies

The following code creates several empty files and defines two broken dependencies:
```python
import filedep
import time
import os
from os.path import join as pj

PATH = r'C:\test_check_dep'
if not os.path.exists(PATH):
    os.mkdir(PATH)

def touch(filepath):
    if os.path.exists(filepath):
        os.utime(filepath)
    else:
        with open(filepath, 'a') as f:
            pass

# Touch files in a specific order
touch(pj(PATH, 'pre11.csv'))
time.sleep(.1)
touch(pj(PATH, 'pre12.csv'))
time.sleep(.1)
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post11.csv'))
time.sleep(.1)
# Note code1.py is newer than post11.csv
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post12.csv'))
time.sleep(.1)

# Define dependencies
deps = [
    (
        [
            pj(PATH, 'pre11.csv'),
        ], 
        pj(PATH, 'code1.py'), 
        [
            pj(PATH, 'post11.csv'),
        ],
    ),  
    (
        [
            pj(PATH, 'pre11.csv'),
            pj(PATH, 'pre12.csv'),
        ], 
        pj(PATH, 'code1.py'), 
        [
            pj(PATH, 'post11.csv'), 
            pj(PATH, 'post12.csv'),
        ],
    )
]
filedep.check_dep(deps)
```
Here, we define 2 dependencies. The second one is the same as that in the 
previous example, but the first one defines a simpler dependency: `code1.py` 
uses `pre11.csv` to produce `post11.csv`. Since by construction `post11.csv` is
"touched" before `code1.py`, both dependencies are broken. Hence, the output is
```
There are 2 broken file dependencies!!! 
[1]
                                           Last Modified Time
  Input:
    C:\test_check_dep\pre11.csv      : 2021-10-14 14:25:11.011976
  Code:
    C:\test_check_dep\code1.py       : 2021-10-14 14:25:11.451668
  Output:
    C:\test_check_dep\post11.csv     : 2021-10-14 14:25:11.342247
[2]
                                           Last Modified Time
  Input:
    C:\test_check_dep\pre11.csv      : 2021-10-14 14:25:11.011976
    C:\test_check_dep\pre12.csv      : 2021-10-14 14:25:11.125543
  Code:
    C:\test_check_dep\code1.py       : 2021-10-14 14:25:11.451668
  Output:
    C:\test_check_dep\post11.csv     : 2021-10-14 14:25:11.342247
    C:\test_check_dep\post12.csv     : 2021-10-14 14:25:11.559796
```
where the last modified date of each file in each broken dependency is shown.


## Format of file dependency

The first argument of `check_dep()` is a list of dependencies. Its format 
should be as follows:

* It is a list of tuples.
* Each tuple has three elements.
    - The first element is a list of `str`.
    - The second element is a `str`.
    - The third element is a list of `str`.
    - Each `str` is an absolute path of an existing file.
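
The rules in this list can be checked before calling `check_dep()`. The following is a minimal sketch, not part of `filedep`; the helper name `validate_deps` is hypothetical, and it only encodes the format rules stated above.

```python
import os

def validate_deps(deps):
    """Return a list of format problems found in `deps` (empty if none)."""
    if not isinstance(deps, list):
        return ['deps must be a list']
    problems = []
    for i, dep in enumerate(deps, start=1):
        # Each dependency must be a tuple with three elements.
        if not (isinstance(dep, tuple) and len(dep) == 3):
            problems.append(f'[{i}] must be a tuple with three elements')
            continue
        inputs, code, outputs = dep
        # First and third elements are lists; the second is a single str.
        if not isinstance(inputs, list) or not isinstance(outputs, list):
            problems.append(f'[{i}] first and third elements must be lists')
            continue
        if not isinstance(code, str):
            problems.append(f'[{i}] second element must be a str')
            continue
        # Every path must be a str naming an existing file by absolute path.
        for path in inputs + [code] + outputs:
            if not isinstance(path, str):
                problems.append(f'[{i}] {path!r} is not a str')
            elif not os.path.isabs(path):
                problems.append(f'[{i}] {path!r} is not an absolute path')
            elif not os.path.isfile(path):
                problems.append(f'[{i}] {path!r} does not exist')
    return problems
```

An empty return value means the list matches the format; otherwise each entry names the offending dependency by its position in the list.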
  
As an example, the following code defines two dependencies:
```python
deps = [
    (
        ['pre1.txt'], 'code1.py', ['post1.txt']
    ),
    (
        ['pre21.txt', 'pre22.txt'], 'code2.py', ['post21.txt', 'post22.txt']
    )
]
```
* The first one says that `code1.py` uses `pre1.txt` as input and outputs 
  `post1.txt`. As a result, the last modified times of the three files
  should satisfy `pre1.txt<=code1.py<=post1.txt`.
* The second one says that `code2.py` uses `pre21.txt` and `pre22.txt` as input
  and outputs `post21.txt` and `post22.txt`. As a result, the last modified times
  of the five files should satisfy 
  `max(pre21.txt,pre22.txt)<=code2.py<=min(post21.txt,post22.txt)` where `max`
  (`min`) represents the maximum (minimum) time.
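
The inequalities above can be sketched with `os.path.getmtime`. This illustrates the rule only and is not necessarily how `filedep` implements it; the names `is_dep_ok` and `broken_deps` are hypothetical, and both file lists are assumed to be non-empty.

```python
import os

def is_dep_ok(inputs, code, outputs):
    """Check max(inputs) <= code <= min(outputs) on last modified times."""
    code_time = os.path.getmtime(code)
    newest_input = max(os.path.getmtime(p) for p in inputs)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return newest_input <= code_time <= oldest_output

def broken_deps(deps):
    """Return the dependencies in `deps` that violate the inequality."""
    return [dep for dep in deps if not is_dep_ok(*dep)]
```

Comparing the code file against the newest input and the oldest output is enough: if those two comparisons hold, they hold for every other input and output as well.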




