Metadata-Version: 2.1
Name: persist-to-disk
Version: 0.0.7
Summary: Persist expensive operations on disk.
Home-page: https://github.com/zlin7/python-persist_to_disk
Author: Zhen Lin
Author-email: zhenlin4@illinois.edu
License: MIT
Description: 
        # Installation
        
        `pip install .` or `pip install persist-to-disk`
        
        **By default, a folder called `.cache/persist_to_disk` is created under your home directory, and will be used to store cache files.**
        If you want to change it, see "Global Settings" below.
        
        # Global Settings
        
        To set global settings (for example, where the cache should go by default), please do the following:
        
        ```
        import persist_to_disk as ptd
        ptd.config.generate_config()
        ```
        Then, you could (optionally) change the settings in the generated `config.ini`:
        
        1. `persist_path`: where to store the cache.
            All projects you have on this machine will have a folder under `persist_path` by default, unless you specify it within the project (See examples below).
        2. `hashsize`: How many hash buckets to use to store each function's outputs. Default=500.
        3. `lock_granularity`:
            How granular the lock is.
            This could be `call`, `func` or `global`.
        
            * `call` means each hash bucket will have one lock, so only processes trying to read from or write to the same hash bucket will share the same lock.
            * `func` means each function will have one lock, so if you have many processes calling the same function they will all be using the same lock.
            * `global` means all processes share the same lock (I tested that it's OK to have a nested mechanism on Unix).
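        
        To make the three options concrete, the sketch below maps each granularity to the name of the lock a process would try to acquire. It is an illustration only, not the library's code: `lock_name` and the lock-name layout are hypothetical.
        
        ```python
        def lock_name(granularity: str, func_name: str, bucket: int) -> str:
            """Return a hypothetical lock name for one cache access."""
            if granularity == "call":
                # One lock per hash bucket of a function.
                return f"{func_name}.{bucket}.lock"
            if granularity == "func":
                # One lock per function, shared by all of its buckets.
                return f"{func_name}.lock"
            if granularity == "global":
                # A single lock shared by every function and process.
                return "global.lock"
            raise ValueError(f"unknown lock_granularity: {granularity!r}")
        
        
        # Do two calls that land in different buckets contend for the same lock?
        for g in ("call", "func", "global"):
            same_lock = lock_name(g, "train_a_model", 3) == lock_name(g, "train_a_model", 7)
            print(g, same_lock)
        ```
        
        Finer granularity means less contention between processes, at the cost of managing more locks.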
        
        
        # Quick Start
        
        ### Basic Example
        Using `persist_to_disk` is very easy.
        For example, if you want to write a general training function:
        ```
        import persist_to_disk as ptd
        import torch
        
        @ptd.persistf()
        def train_a_model(dataset, model_cls, lr, epochs, device='cpu'):
            ...
            return trained_model_or_key
        
        if __name__ == '__main__':
            train_a_model('MNIST', torch.nn.Linear, 1e-3, 30)
        ```
        
        Suppose the above is in a file with path `~/project_name/pipeline/train.py`.
        If we are in `~/project_name` and run `python -m pipeline.train`, a cache folder will be created under `PERSIST_PATH`, like the following:
        ```
        PERSIST_PATH(=ptd.config.get_persist_path())
        ├── project_name-[autoid]
        │   ├── pipeline
        │   │   ├── train
        │   │   │   ├── train_a_model
        │   │   │   │   ├──[hashed_bucket].pkl
        ```
        Note that in the above, `[autoid]` is an auto-generated ID.
        `[hashed_bucket]` will be an int in [0, `hashsize`).
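        
        One way to picture how a call ends up in a bucket is to hash its arguments deterministically and take the result modulo `hashsize`. The sketch below assumes an MD5 digest of the pickled arguments; `bucket_for` is a hypothetical helper, and the hashing that `ptd` actually uses may differ.
        
        ```python
        import hashlib
        import pickle
        
        
        def bucket_for(args: tuple, kwargs: dict, hashsize: int = 500) -> int:
            """Map a call's arguments to a bucket index in [0, hashsize)."""
            # Sort kwargs so that keyword order does not change the digest.
            payload = pickle.dumps((args, sorted(kwargs.items())))
            digest = hashlib.md5(payload).hexdigest()
            return int(digest, 16) % hashsize
        
        
        bucket = bucket_for(("MNIST", 1e-3, 30), {"device": "cpu"})
        print(f"{bucket}.pkl")  # file name of the bucket this call would use
        ```
        
        Several different calls can share one bucket, which is why each `.pkl` file may hold more than one cached result.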
        
        ### Multiprocessing
        Note that `ptd.persistf` can be used with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) directly.
        
        
        # Advanced Settings
        
        ## `config.set_project_path` and `config.set_persist_path`
        
        There are two important paths for each workspace/project: `project_path` and `persist_path`.
        You could set them by calling `ptd.config.set_project_path` and `ptd.config.set_persist_path`.
        
        On a high level, `persist_path` determines *where* the results are cached/persisted, and `project_path` determines the structure of the cache file tree.
        Following the basic example, `ptd.config.set_persist_path(PERSIST_PATH)` will only change the root directory.
        On the other hand, suppose we add the line `ptd.config.set_project_path("./pipeline")` to `train.py` and run it again. A new file structure will then be created under `PERSIST_PATH`, like the following:
        ```
        PERSIST_PATH(=ptd.config.get_persist_path())
        ├── pipeline-[autoid]
        │   ├── train
        │   │   ├── train_a_model
        │   │   │   ├──[hashed_bucket].pkl
        ```
        
        Alternatively, it is also possible that we store some notebooks under `~/project_name/notebook/`.
        In this case, we could set the `project_path` back to `~/project_name`.
        You could check the mapping from projects to autoids in `~/.persist_to_disk/project_to_pids.txt`.
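        
        The effect of `project_path` on the tree can be sketched as a relative-path computation: the sub-directory for a function is the path of its source file relative to the project root, followed by the function name. `cache_subdir` below is a hypothetical helper written for illustration, not the library's path-inference code, and `/home/user` is a made-up home directory.
        
        ```python
        import os
        
        
        def cache_subdir(project_path: str, file_path: str, func_name: str) -> str:
            """Cache sub-directory for a function, relative to the project's cache folder."""
            # Path of the source file relative to the project root, without ".py".
            rel = os.path.relpath(os.path.abspath(file_path), os.path.abspath(project_path))
            module_path, _ = os.path.splitext(rel)
            return os.path.join(module_path, func_name)
        
        
        file_path = "/home/user/project_name/pipeline/train.py"
        # Project root is ~/project_name: the tree keeps the "pipeline" level.
        print(cache_subdir("/home/user/project_name", file_path, "train_a_model"))
        # Project root is ~/project_name/pipeline: the "pipeline" level disappears.
        print(cache_subdir("/home/user/project_name/pipeline", file_path, "train_a_model"))
        ```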
        
        
        
        ## Additional Parameters
        `persistf` takes additional arguments.
        For example, consider the new function below:
        ```
        @ptd.persistf(groupby=['dataset', 'epochs'], expand_dict_kwargs=['model_kwargs'], skip_kwargs=['device'])
        def train_a_model(dataset, model_cls, model_kwargs, lr, epochs, device='cpu'):
            model = model_cls(**model_kwargs)
            model.to(device)
            ... # train the model
            model.save(path)
            return path
        ```
        The kwargs we passed to `persistf` have the following effects:
        
        * `groupby`: We will create more intermediate directories based on what's in `groupby`.
        In the example above, the new cache structure will look like
        ```
        PERSIST_PATH(=ptd.config.get_persist_path())
        ├── project_name-[autoid]
        │   ├── pipeline
        │   │   ├── train
        │   │   │   ├── train_a_model
        │   │   │   │   ├── MNIST
        │   │   │   │   │   ├── 20
        │   │   │   │   │   │   ├──[hashed_bucket].pkl
        │   │   │   │   │   ├── 10
        │   │   │   │   │   │   ├──[hashed_bucket].pkl
        │   │   │   │   ├── CIFAR10
        │   │   │   │   │   ├── 30
        │   │   │   │   │   │   ├──[hashed_bucket].pkl
        ```
        
        * `expand_dict_kwargs`: This simply allows the dictionary to be passed in.
        This is because we cannot hash a dictionary directly, so there are additional preprocessing steps for these arguments within `ptd`.
        Note that you can also set `expand_dict_kwargs='all'` to avoid specifying individual dictionary arguments.
        However, please only do so IF YOU KNOW what you are passing in: a very big nested dictionary can make cache retrieval very slow and use a lot of disk space unnecessarily.
        
        * `skip_kwargs`: This specifies arguments that will be *ignored*.
        For example, if we call `train_a_model(..., device='cpu')` and `train_a_model(..., device='cuda:0')`, the second run will simply read the cache, as `device` is ignored.
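        
        The three options above can be pictured together as one step that turns a call's arguments into directory names plus a hashable key. The sketch below is only an assumption about the general idea: `split_call` is a hypothetical helper, and the preprocessing inside `ptd` may work differently.
        
        ```python
        def split_call(kwargs, groupby=(), expand_dict_kwargs=(), skip_kwargs=()):
            """Return (group_dirs, key) for one call; both are hashable tuples."""
            # `groupby` values become nested directory names, in the order given.
            group_dirs = tuple(str(kwargs[name]) for name in groupby)
            key = []
            for name in sorted(kwargs):
                if name in skip_kwargs:
                    continue  # skipped arguments never affect the key
                value = kwargs[name]
                if name in expand_dict_kwargs:
                    # A dict is unhashable, so flatten it into sorted (key, value) pairs.
                    value = tuple(sorted(value.items()))
                key.append((name, value))
            return group_dirs, tuple(key)
        
        
        call = dict(dataset="MNIST", model_kwargs={"hidden": 64}, lr=1e-3, epochs=20)
        options = (["dataset", "epochs"], ["model_kwargs"], ["device"])
        cpu = split_call({**call, "device": "cpu"}, *options)
        gpu = split_call({**call, "device": "cuda:0"}, *options)
        print(cpu[0])      # directory components taken from `groupby`
        print(cpu == gpu)  # True: `device` is skipped, so both calls share one cache entry
        ```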
        
        ### Other useful parameters:
        * `hash_size`: Defaults to 500.
        If a function has a lot of cache files, you can also increase this if necessary to reduce the number of `.pkl` files on disk.
        
        ## 0.0.7
        ==================
        1. Shared cache vs local cache (the latter specified by `persist_path_local` in the config). This assumes that local reads are faster. It is optional and can be skipped.
        2. Add support for `argparse.Namespace` to support a common practice.
        3. Add support for argument `alt_dirs` for `persistf`.
            For example, suppose the function is called `func1` and its default cache path is `/path/repo-2/module/func1`, and we have a cache from a similar code base at a different location, which looks like `/path/repo-1/module/func1`.
            Then, we could do:
            ```
            @ptd.persistf(alt_dirs=["/path/repo-1/module/func1"])
            def func1(a=1):
                print(1)
            ```
            A call to `func1` will read cache from `repo-1` and write it to `repo-2`.
        4. Add support for argument `alt_root` for `manual_cache`. It could be a function that modifies the default path.
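        
        The `alt_dirs` behavior described in item 3 can be pictured as a read fallback: look in the default directory first, then in each alternative directory, and write the result to the default directory. The sketch below illustrates that idea with plain pickle files in a temporary folder; `load_or_compute` and the `0.pkl` file name are hypothetical, not part of `ptd`.
        
        ```python
        import os
        import pickle
        import tempfile
        
        
        def load_or_compute(compute, primary_dir, alt_dirs=(), name="0.pkl"):
            """Read a result from primary_dir or alt_dirs; make sure primary_dir has it."""
            primary = os.path.join(primary_dir, name)
            for directory in (primary_dir, *alt_dirs):
                path = os.path.join(directory, name)
                if os.path.exists(path):
                    with open(path, "rb") as f:
                        result = pickle.load(f)
                    break
            else:
                # No cache found anywhere: run the function.
                path, result = None, compute()
            if path != primary:
                # The result came from an alternative directory or was just computed.
                os.makedirs(primary_dir, exist_ok=True)
                with open(primary, "wb") as f:
                    pickle.dump(result, f)
            return result
        
        
        with tempfile.TemporaryDirectory() as root:
            repo1 = os.path.join(root, "repo-1", "module", "func1")
            repo2 = os.path.join(root, "repo-2", "module", "func1")
            load_or_compute(lambda: "from repo-1", repo1)  # populate the old cache
            value = load_or_compute(lambda: "recomputed", repo2, [repo1])
            copied = os.path.exists(os.path.join(repo2, "0.pkl"))
            print(value, copied)
        ```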
        
        ## 0.0.6
        ==================
        1. Added the json serialization mode. This could be specified by `hash_method` when calling `persistf`.
        2. If a function is specified to be `cache=ptd.READONLY`, no file lock will be used (to avoid unnecessary conflict).
        
        ## 0.0.5
        ==================
        1. `lock_granularity` can be set differently for each function.
        2. Changed the default cache folder to `.cache/persist_to_disk`.
        
        ## 0.0.4
        ==================
        1. Changed the behavior of `switch_kwarg`. Now, this is not considered an input to the wrapped function. For example, the correct usage is
            ```
            @ptd.persistf(switch_kwarg='switch')
            def func1(a=1):
                print(1)
            func1(a=1, switch=ptd.NOCACHE)
            ```
            Note how `switch` is not an argument of `func1`.
        2. Fix the path inference step, which now finds the absolute paths for `project_path` or `file_path` (the path to the file containing the function) before inferring the structure.
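        
        The `switch_kwarg` behavior from item 1 can be sketched with a toy decorator that removes the switch from the keyword arguments before calling the wrapped function. Everything here is a stand-in for illustration: `cached` is a hypothetical in-memory decorator, `NOCACHE` only mimics `ptd.NOCACHE`, and treating it as "recompute and bypass the cache" is an assumption.
        
        ```python
        import functools
        
        NOCACHE = "nocache"  # hypothetical stand-in for ptd.NOCACHE
        
        
        def cached(switch_kwarg="switch"):
            """Toy in-memory cache controlled by an extra keyword argument."""
            def decorator(func):
                store = {}
        
                @functools.wraps(func)
                def wrapper(*args, **kwargs):
                    # Remove the switch first: func itself never declares it.
                    mode = kwargs.pop(switch_kwarg, None)
                    if mode == NOCACHE:
                        return func(*args, **kwargs)  # bypass the store entirely
                    key = (args, tuple(sorted(kwargs.items())))
                    if key not in store:
                        store[key] = func(*args, **kwargs)
                    return store[key]
                return wrapper
            return decorator
        
        
        calls = []
        
        
        @cached(switch_kwarg="switch")
        def func1(a=1):
            calls.append(a)
            return a + 1
        
        
        func1(a=1)                  # computed and stored
        func1(a=1)                  # served from the store
        func1(a=1, switch=NOCACHE)  # recomputed, store bypassed
        print(len(calls))           # number of times the body actually ran
        ```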
        
        ## 0.0.3
        ==================
        
        1. Added `set_project_path` to config.
Keywords: Cache,Persist
Platform: UNKNOWN
Description-Content-Type: text/markdown
