Metadata-Version: 2.1
Name: ml4ir
Version: 0.0.1
Summary: Machine Learning libraries for Information Retrieval
Home-page: https://www.salesforce.com/
Author: Search Relevance, Salesforce
Author-email: searchrelevancyscrumteam@salesforce.com
License: ASL 2.0
Platform: UNKNOWN
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: click
Requires-Dist: Sphinx
Requires-Dist: coverage
Requires-Dist: awscli
Requires-Dist: flake8
Requires-Dist: python-dotenv (>=0.5.1)
Requires-Dist: flake8-black
Requires-Dist: flake8-mypy
Requires-Dist: pre-commit
Requires-Dist: mypy
Requires-Dist: appnope (==0.1.0)
Requires-Dist: attrs (==19.3.0)
Requires-Dist: backcall (==0.1.0)
Requires-Dist: colorlog (==4.0.2)
Requires-Dist: dask (==2.8.1)
Requires-Dist: decorator (==4.4.1)
Requires-Dist: dill (==0.3.0)
Requires-Dist: distributed (==2.8.1)
Requires-Dist: entrypoints (==0.3)
Requires-Dist: future (==0.18.2)
Requires-Dist: hdfs (==2.5.8)
Requires-Dist: ipykernel (==5.1.3)
Requires-Dist: ipywidgets (==7.5.1)
Requires-Dist: ipython (==7.11.1)
Requires-Dist: ipython-genutils (==0.2.0)
Requires-Dist: jedi (==0.15.2)
Requires-Dist: joblib (==0.14.0)
Requires-Dist: json5 (==0.8.5)
Requires-Dist: jsonschema (==3.2.0)
Requires-Dist: jupyter-client (==5.3.4)
Requires-Dist: jupyter-core (==4.6.1)
Requires-Dist: jupyterlab (==1.2.3)
Requires-Dist: jupyterlab-server (==1.0.6)
Requires-Dist: Keras-Applications (==1.0.8)
Requires-Dist: Keras-Preprocessing (==1.1.0)
Requires-Dist: lime (==0.1.1.36)
Requires-Dist: Markdown (==3.1.1)
Requires-Dist: MarkupSafe (==1.1.1)
Requires-Dist: matplotlib (==3.1.2)
Requires-Dist: nbconvert (==5.6.1)
Requires-Dist: nbformat (==4.4.0)
Requires-Dist: notebook (==6.0.2)
Requires-Dist: numpy (==1.17.4)
Requires-Dist: oauth2client (==3.0.0)
Requires-Dist: oauthlib (==3.1.0)
Requires-Dist: pandas (==0.25.3)
Requires-Dist: plotly (==4.4.1)
Requires-Dist: protobuf (==3.10.0)
Requires-Dist: pycodestyle (==2.5.0)
Requires-Dist: pyflakes (==2.1.1)
Requires-Dist: pytest (==4.6.3)
Requires-Dist: pytest-cov (==2.5.1)
Requires-Dist: pytest-html (==2.1.1)
Requires-Dist: python-dateutil (==2.8.1)
Requires-Dist: parso (==0.5.2)
Requires-Dist: pexpect (==4.7.0)
Requires-Dist: pickleshare (==0.7.5)
Requires-Dist: prompt-toolkit (==3.0.2)
Requires-Dist: ptyprocess (==0.6.0)
Requires-Dist: Pygments (==2.5.2)
Requires-Dist: PyYAML (==5.1)
Requires-Dist: requests (==2.22.0)
Requires-Dist: requests-oauthlib (==1.3.0)
Requires-Dist: rsa (==4.0)
Requires-Dist: scikit-image (==0.16.2)
Requires-Dist: scikit-learn (==0.21.3)
Requires-Dist: scipy (==1.3.2)
Requires-Dist: seaborn (==0.9.0)
Requires-Dist: six (==1.14.0)
Requires-Dist: sklearn (==0.0)
Requires-Dist: swifter (==0.296)
Requires-Dist: tblib (==1.5.0)
Requires-Dist: tensorboard (==2.0.1)
Requires-Dist: tensorflow (==2.0.1)
Requires-Dist: tensorflow-estimator (==2.0.1)
Requires-Dist: tensorflow-hub (==0.7.0)
Requires-Dist: tensorflow-metadata (==0.15.1)
Requires-Dist: tensorflow-probability (==0.8.0)
Requires-Dist: tensorflow-ranking (==0.2.0)
Requires-Dist: tensorflow-serving-api (==2.0.0)
Requires-Dist: tensorflow-text (==2.0.1)
Requires-Dist: tensorflow-transform (==0.15.0)
Requires-Dist: traitlets (==4.3.3)
Requires-Dist: urllib3 (==1.25.7)
Requires-Dist: wandb (==0.8.36)
Requires-Dist: wcwidth (==0.1.8)

# ml4ir: Machine Learning Library for Information Retrieval

## Setup
#### Requirements
* python3.6+
* pip3
* docker (version 18.09+ tested)
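
The requirements above can be checked from Python before installing anything. This is a minimal sketch, not part of ml4ir: the helper name `check_requirements` is hypothetical, and it only checks that `pip3` and `docker` are on the `PATH`, not that docker is at version 18.09 or later.

```python
import shutil
import sys


def check_requirements(min_python=(3, 6), tools=("pip3", "docker")):
    """Return a dict mapping each requirement name to True or False."""
    # Compare only (major, minor) of the running interpreter.
    status = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```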


#### Using PIP
ml4ir can be installed as a pip package using the following command:

```
pip install 'git+https://git@github.com/salesforce/ml4ir#egg=ml4ir&subdirectory=python'
```

This will install ml4ir-0.0.1 (the current version). In the future, when this package is available on PyPI, installation will be as simple as `pip install ml4ir`.
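
To confirm which version ended up in the active environment, the installed distribution metadata can be queried. This is a small sketch that assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library (on 3.6 and 3.7 it is not available); the helper name `installed_version` is hypothetical.

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_version(package="ml4ir"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "ml4ir is not installed in this environment")
```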


#### Docker (Recommended)
We have set up a `docker-compose.yml` file for building and using docker containers to train models.

To run unit tests
```
docker-compose up
```

To invoke ml4ir with custom arguments with docker, run
```
/bin/bash tools/run_docker.sh ml4ir \
    python3 ml4ir/base/pipeline.py \
    <args>
```

For ranking applications, specifically, use
```
/bin/bash tools/run_docker.sh ml4ir \
    python3 ml4ir/applications/ranking/pipeline.py \
    <args>
```
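
When the docker invocation is driven from a script, it helps to build the command as a list so that each argument stays a separate token. The sketch below only assembles and prints the command shown above; `build_docker_command` is a hypothetical helper, and the `--run_id test` arguments are placeholders for `<args>`.

```python
import shlex


def build_docker_command(pipeline="ml4ir/applications/ranking/pipeline.py", args=()):
    """Assemble the run_docker.sh invocation as an argument list."""
    return ["/bin/bash", "tools/run_docker.sh", "ml4ir", "python3", pipeline, *args]


cmd = build_docker_command(args=["--run_id", "test"])
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True) from the
# directory that contains tools/run_docker.sh.
```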

Refer to the Usage section below for details on how to run the ml4ir ranking application.

Check `ml4ir/applications/ranking/scripts/example_run.sh` for a predefined example run.

To run an example invocation of the ranking application with docker:
```
/bin/bash python/ml4ir/applications/ranking/scripts/example_run.sh
```

#### Virtual Environment
Install virtualenv
```
pip3 install virtualenv
```

Create a new python3 virtual environment inside your git repo (it is .gitignored, so it will not be committed)
```
cd $PLACE_YOU_CALLED_GIT_CLONE/ml4ir
python3 -m venv python/env/.ml4ir_venv3
```

Activate virtualenv
```
cd python/
source env/.ml4ir_venv3/bin/activate
```

Install all dependencies (carefully)
```
pip3 install --upgrade setuptools
pip3 install --upgrade pip
pip3 install -r requirements.txt
```

Note: pip currently reports some dependency incompatibilities, including with the AWS packages. These still need to be fixed, but you can ignore them for now.
```
ERROR: botocore 1.14.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement docutils<0.16,>=0.10, but you'll have docutils 0.16 which is incompatible.
ERROR: awscli 1.17.9 has requirement rsa<=3.5.0,>=3.1.2, but you'll have rsa 4.0 which is incompatible.
ERROR: tensorflow-probability 0.8.0 has requirement cloudpickle==1.1.1, but you'll have cloudpickle 1.2.2 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement dill<0.3.2,>=0.3.1.1, but you'll have dill 0.3.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement httplib2<=0.12.0,>=0.8, but you'll have httplib2 0.17.0 which is incompatible.
ERROR: apache-beam 2.18.0 has requirement pyarrow<0.16.0,>=0.15.1; python_version >= "3.0" or platform_system != "Windows", but you'll have pyarrow 0.14.1 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
ERROR: tfx-bsl 0.15.3 has requirement apache-beam[gcp]<2.17,>=2.16, but you'll have apache-beam 2.18.0 which is incompatible.
ERROR: tensorflow-transform 0.15.0 has requirement absl-py<0.9,>=0.7, but you'll have absl-py 0.9.0 which is incompatible.
```
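
To keep track of these conflicts while they are being fixed, the `ERROR:` lines can be parsed into structured records. This is a sketch under the assumption that pip keeps the message format shown above; the `Conflict` type and `parse_conflicts` helper are hypothetical names, and the sample text is copied from two of the lines in the log.

```python
import re
from typing import List, NamedTuple


class Conflict(NamedTuple):
    package: str      # package that declares the requirement
    version: str      # installed version of that package
    requirement: str  # the requirement specifier it declares
    dependency: str   # the dependency that violates it
    installed: str    # installed version of the dependency


_CONFLICT_RE = re.compile(
    r"^ERROR: (\S+) (\S+) has requirement (.+), "
    r"but you'll have (\S+) (\S+) which is incompatible\.$",
    re.MULTILINE,
)


def parse_conflicts(pip_output: str) -> List[Conflict]:
    """Extract one Conflict per matching ERROR line in pip's output."""
    return [Conflict(*m.groups()) for m in _CONFLICT_RE.finditer(pip_output)]


SAMPLE = (
    "ERROR: awscli 1.17.9 has requirement rsa<=3.5.0,>=3.1.2, "
    "but you'll have rsa 4.0 which is incompatible.\n"
    "ERROR: tfx-bsl 0.15.3 has requirement absl-py<0.9,>=0.7, "
    "but you'll have absl-py 0.9.0 which is incompatible.\n"
)

for c in parse_conflicts(SAMPLE):
    print(f"{c.package} {c.version} wants {c.requirement}; installed {c.dependency} {c.installed}")
```

Running `pip check` in the activated environment reports the same kind of conflicts for whatever is currently installed.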

Note that pre-commit hooks are required and are installed as part of the requirements.
If you get an error saying they were not installed, run `pre-commit install` to install the git hooks in your .git/ directory.


Set the PYTHONPATH environment variable
```
export PYTHONPATH=$PYTHONPATH:`pwd`/python
```

## Usage
The entrypoint into the training or evaluation functionality of ml4ir is `ml4ir/base/pipeline.py`. For application-specific overrides, look at `ml4ir/applications/<eg: ranking>/pipeline.py`.

### ml4ir Library
To use ml4ir as a deep learning library to build relevance models, look at the walkthrough under `notebooks/PointwiseRankingDemo.ipynb` or `notebooks/PointwiseRankingDemo.html` (contains architecture diagrams). The notebook walks through building, training, and saving a `RelevanceModel`, covering its entire life cycle from the bottom up. The HTML version also sheds light on the design of ml4ir and the data format used.

### Applications - Ranking
#### Examples
Using TFRecord
```
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--execution_mode train_inference_evaluate
```

Using CSV
```
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/csv \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format csv \
--execution_mode train_inference_evaluate
```
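
The TFRecord and CSV invocations above differ only in the data directory and the `--data_format` value. A small helper can make that explicit when launching runs from a script. This is a sketch: `ranking_args` is a hypothetical function, not part of ml4ir, and it only uses the flags and test-data paths that appear in the examples above.

```python
def ranking_args(data_format, run_id="test", execution_mode="train_inference_evaluate"):
    """Build the argument list for ml4ir/applications/ranking/pipeline.py."""
    base = "ml4ir/applications/ranking/tests/data"
    return [
        "--data_dir", f"{base}/{data_format}",  # tfrecord or csv test data
        "--feature_config", f"{base}/config/feature_config.yaml",
        "--run_id", run_id,
        "--data_format", data_format,
        "--execution_mode", execution_mode,
    ]


command = ["python", "ml4ir/applications/ranking/pipeline.py", *ranking_args("csv")]
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory the examples are run from.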

Running in inference mode using the default serving signature
```
python ml4ir/applications/ranking/pipeline.py \
--data_dir ml4ir/applications/ranking/tests/data/tfrecord \
--feature_config ml4ir/applications/ranking/tests/data/config/feature_config.yaml \
--run_id test \
--data_format tfrecord \
--model_file `pwd`/models/test/final/default \
--execution_mode inference_only
```

NOTE: Make sure to add the right data and feature config before training models.
TODO: describe how to do this

## Running Tests
To run all the Python-based tests under `ml4ir`
```
python3 -m pytest
```

To run specific tests,
```
python3 -m pytest /path/to/test/module
```

## Project Organization
The following structure is a little out of date (TODO(jake) - fix it!)

    ├── LICENSE
    ├── Makefile           <- Makefile with commands like `make data` or `make train`
    ├── README.md          <- The top-level README for developers using this project.
    ├── data
    │   ├── external       <- Data from third party sources.
    │   ├── interim        <- Intermediate data that has been transformed.
    │   ├── processed      <- The final, canonical data sets for modeling.
    │   └── raw            <- The original, immutable data dump.
    │
    ├── docs               <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models             <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
    │                         the creator's initials, and a short `-` delimited description, e.g.
    │                         `1.0-jqp-initial-data-exploration`.
    │
    ├── references         <- Data dictionaries, manuals, and all other explanatory materials.
    │
    ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures        <- Generated graphics and figures to be used in reporting
    │
    ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
    │                         generated with `pip freeze > requirements.txt`
    │
    ├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
    ├── ml4ir              <- Source code for use in this project.
    │   ├── __init__.py    <- Makes ml4ir a Python module
    │   │
    │   ├── data           <- Scripts to download or generate data
    │   │   └── make_dataset.py
    │   │
    │   ├── features       <- Scripts to turn raw data into features for modeling
    │   │   └── build_features.py
    │   │
    │   ├── models         <- Scripts to train models and then use trained models to make
    │   │   │                 predictions
    │   │   ├── predict_model.py
    │   │   └── train_model.py
    │   │
    │   └── visualization  <- Scripts to create exploratory and results oriented visualizations
    │       └── visualize.py
    │
    └── tox.ini            <- tox file with settings for running tox; see tox.testrun.org


--------

<p><small>Project based on the <a target="_blank" href="https://drivendata.github.io/cookiecutter-data-science/">cookiecutter data science project template</a>. #cookiecutterdatascience</small></p>


