Metadata-Version: 2.1
Name: startds
Version: 0.1.2
Summary: A data science experiment framework.
Home-page: https://www.dolphyn.io/
Author: Dolphyn
Author-email: support@dolphyn.io
License: MIT
Project-URL: Documentation, https://github.com/dolphynhq/startds
Project-URL: Source, https://github.com/dolphynhq/startds
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.6
Description-Content-Type: text/markdown

# Start Data Science

_An opinionated, organized way to start and manage data science experiments._

Start Data Science is a template to help you set up experiments. It brings structure to exploratory data analysis (EDA), through to feature extraction, modeling, and resultant outputs whether they're figures, reports, APIs, or apps. 

&nbsp;

The main components are:

- A pre-defined framework creating organization for your experiment
- A pre-compiled `requirements.txt` featuring over 150 commonly used data science libraries
- Extensible scripts with boilerplate for Streamlit, Flask, FastAPI and Cortex.

- (Work in progress) A library of common helpers like writing and reading to S3, methods to clean, transform and extract features from data

- (Work in progress) Adding more open source solutions for apis and apps (eg BentoML)

&nbsp;

The idea of this repo is to provide a comprehensive structure. The user is to delete portions, and manipulate the dag accordingly per their experiment needs. 

&nbsp;

## Getting Started

1. Install the library
```sh
pip install startds
```

2. Create your first experiment
```sh
startds create <exp_name>
```

## Usage  
### Available Commands

- _create_
```sh
startds create <exp_name> [--api flask|fastapi|cortex|all(default)] [--mode eng|ds|all(default)]
```

Creates a new experiment directory structure. where `exp_name` is the name of the new experiment you want to create. This will create a new folder named `exp_name` in the current folder. 

Options that can be provided with this command are 
`--api` and `--mode`. 

`--api` can only take on one of these values : `flask`, `fastapi`, `cortex`, `all` (default value is `all`).
This will accordingly create boilerplate code for that specific api tool in your `_apis` folder.

`--mode` can only take on one of these values : `eng`, `ds`, `all` (default value is `all`).
This affects which folders are created in the `src` folder. 

If `eng` is used as the mode, only folders specific to engineering operations will be created : `_apis`, `_apps`, `_orchestrate`, `_tests`.

if `ds` is used as the mode, only folders specific to data science operations will be created : `clean`, `explore`, 
`transform`, `train`.


If no options are provided, the full experiment directory will be created. 
&nbsp;

- _env_
```sh
startds env  [-f path_to_requirements.txt]
```

This command creates a virtual environment in the directory from where it is run. It is recommended to run this command from the home directory of the new experiment you created with ```startds create```.

Option `-f` can be used to specify your custom `requirements.txt`. In the absence of this option, the default ```requirements.txt``` located at the root directory of your experiment will be used. You can also simply overwrite that default file. 

The default env command initializes a virtual environment for the experiment with over 150 of the most commonly used data science libraries. Note, it installs `airflow` which is required in order to execute the dag.

To start the new virtual environment created, run
```sh
source .venv/bin/activate
```

&nbsp;

### Running the experiment

```python
python run.py
```
Runs `dag.py` which is configured using `airflow` by default. Note: you will require airflow, or you can configure using your preferred orchestrator. The dag can be easily modified to add or remove steps, and/or execute individual components.

&nbsp;

### Running tests

Tests are to be written in the _tests folder inside src folder. ```pytest``` package can be used to run these tests.
Make sure that ```pytest``` is installed and run
```sh
pytest
```
from the root directory or the ```_tests``` directory to run tests


## The resulting directory structure

The directory structure of your new project looks like this: 

```
├── README.md          <- The top-level README for developers using this project.
│
├── Dockerfile         <- Dockerfile to create docker images for K8s or other cloud services
|
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
|
├── setup.py           <- makes project pip installable (pip install -e .) to enable imports of sibling modules in src
|
├── run.py             <- Run file that calls an orchestrator or individual .py files in your project
|
├── exp_name           <- namesake folder inside the exp_name root folder that you created
|   |
|   ├── metadata           <- Metadata that needs to persisted and shared for data sources and models
|   │   └── data.md
|   │   └── models.md
|   |
|   ├── models             <- Trained and serialized models, model predictions, or model summaries
|   |
|   ├── notebooks          <- Folder to keep notebooks in. Import .py modules from src folder
|   │
|   ├── outputs            <- Generated analysis as HTML, PDF, LaTeX, etc.
|   │   └── figures        <- Generated graphics and figures to be used in reporting
|   |
|   ├── src                <- Source code for use in this project.
|   │   ├── __init__.py    <- Makes src a Python module
|   |   |
|   |   ├── _apis          <- Scripts to create APIs for serving models using Flask/FastAPI/others
|   |   |   └── fastapi
|   |   |   |   └── main.py
|   |   |   |   └── Dockerfile
|   |   |   |   └── build.sh
|   |   |   |
|   |   |   └── flask
|   |   |   |   └── main.py
|   |   |   |   └── Dockerfile
|   |   |   |   └── build.sh
|   |   |   |
|   |   |   └── cortex
|   |   |       └── main.py
|   |   |       └── cortex.yaml
|   |   |       └── requirements.txt
|   |   |
|   │   ├── _apps          <- Scripts to create internal ML apps using streamlit, dash etc
|   |   |   └── streamlit
|   |   |       └── main.py
|   |   |       └── Dockerfile
|   |   |       └── build.sh
|   |   |
|   |   ├── _orchestrate       <- Scripts to run different steps of the project using an orchestrator such as airflow
|   │   |   └── airflow
|   |   |       └── dags
|   |   |           └── dag.py
|   |   |
|   │   ├── _tests         <- Scripts to add tests for your experiment
|   │   │   └── test_clean.py
|   |   |
|   │   ├── clean          <- Scripts to connect and clean data
|   │   │   └── clean_data.py
|   |   |   └── connect_data.py
|   |   |
|   │   ├── explore        <- Scripts to create exploratory and results oriented visualizations
|   │   |   └── visualize.py
|   │   |   └── explore.py
|   │   │
|   │   ├── transform      <- Scripts to turn raw data into features for modeling
|   │   │   └── transform_data.py
|   |   |   └── setup_experiment.py
|   │   │
|   │   ├── train          <- Scripts to train models and then use trained models to make predictions
|   │   │   ├── predict_model.py
|   │   │   └── train_model.py
|   |   |   └── model.py
|   │   │
```

&nbsp;


## Note about importing sibling modules

To enable importing sibling modules when writing code in src, it is best to install the root experiment as
a python package 
```sh
pip install -e .
```
You could also modify the ```sys.path``` in each file that wants to import sibling module
Another solution is to run files form the root folder using ```python3 -m absolute_import_path_to_module```

Some reference for this issue
[Sibling package imports](https://stackoverflow.com/questions/6323860/sibling-package-imports/23542795#23542795)

If you have ideas about how to manage this structure better, please let us know. 
&nbsp;


## Contributing to start-data-science

Feel free to open an issue against this repository or [contact us](mailto:support@dolphyn.io) and we'll help point you in the right direction.


## License

Released under the [MIT license](LICENSE).

## Acknowledgements

A huge thanks to the following projects:

### Structure / Inspiration:

[Django](https://github.com/django/django)  
[Cookiecutter Data Science](https://github.com/drivendata/cookiecutter-data-science/)

### Integrations:

[Streamlit](https://github.com/streamlit/streamlit)  
[Cortex](https://github.com/cortexlabs/cortex)  
[FastAPI](https://github.com/tiangolo/fastapi)  
[Flask](https://github.com/pallets/flask)  
[Airflow](https://github.com/apache/airflow)




