Metadata-Version: 2.4
Name: datashadric
Version: 0.3.2
Summary: An exploratory data science toolkit for analysis, machine learning, multimodal ai agents for text and image processing, and visualization (Apache Superset)
Author-email: "Paul Namalomba (GitHub: paulnamalomba)" <kabwenzenamalomba@gmail.com>
Maintainer-email: "Paul Namalomba (GitHub: paulnamalomba)" <kabwenzenamalomba@gmail.com>
License: MIT License
        
        Copyright (c) 2025 datashadric
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/paulnamalomba/datashadric
Project-URL: Bug Reports, https://github.com/paulnamalomba/datashadric/issues
Project-URL: Source, https://github.com/paulnamalomba/datashadric/src/datashadric
Project-URL: Documentation, https://github.com/paulnamalomba/datashadric/README.md
Keywords: data science,machine learning,statistics,visualization,pandas,analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: cmake
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: scipy
Requires-Dist: statsmodels
Requires-Dist: plotly
Requires-Dist: google-generativeai
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.8; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Provides-Extra: viz
Requires-Dist: apache-superset>=5.0.0; extra == "viz"
Requires-Dist: pyarrow>=15.0.0; extra == "viz"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0; extra == "docs"
Dynamic: license-file

# datashadric - Python Toolkit for Machine Learning and Advanced Data Analytics

**Last updated**: November 10, 2025<br>

**Author**: [Paul Namalomba](https://github.com/paulnamalomba)<br>
  - SESKA Computational Engineer<br>
  - Software Developer<br>
  - PhD Candidate (Civil Engineering, Spec. Computational and Applied Mechanics)<br>

**Version**: 0.3.2 (13 March 2026)<br>
**Contact**: [kabwenzenamalomba@gmail.com](mailto:kabwenzenamalomba@gmail.com)<br>

---

[![Python](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/)
[![Pandas](https://img.shields.io/badge/Pandas-1.3%2B-green.svg)](https://pandas.pydata.org/)
[![Google Gemini 2.5](https://img.shields.io/badge/Google-Gemini%202.5%20Flash-orange.svg)](https://ai.google.dev/gemini-api/docs)
[![Matplotlib](https://img.shields.io/badge/Matplotlib-3.4%2B-yellow.svg)](https://matplotlib.org/stable/users/index.html)
[![Seaborn](https://img.shields.io/badge/Seaborn-0.11%2B-purple.svg)](https://seaborn.pydata.org/)
[![Scikit-learn](https://img.shields.io/badge/Scikit--learn-1.0%2B-red.svg)](https://scikit-learn.org/stable/)
[![Statsmodels](https://img.shields.io/badge/Statsmodels-0.12%2B-blueviolet.svg)](https://www.statsmodels.org/stable/index.html)
[![License: MIT](https://img.shields.io/badge/License-MIT-gray.svg)](https://opensource.org/licenses/MIT)

`datashadric` provides a collection of well-organized modules for common data science tasks: data cleaning and exploration, machine learning model building, supervised and unsupervised classification, and statistical analysis and testing. The package is designed for readability and ease of use, so that analysts can write complex data science workflows with less effort.

## Contents

- [datashadric - Python Toolkit for Machine Learning and Advanced Data Analytics](#datashadric---python-toolkit-for-machine-learning-and-advanced-data-analytics)
  - [Contents](#contents)
  - [Features](#features)
  - [Installation](#installation)
  - [Quick Start](#quick-start)
  - [Module Overview](#module-overview)
  - [Dependencies](#dependencies)
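  - [Testing](#testing-testing-the-app-modules)
  - [Examples](#examples-applications-of-certain-data-science-techniques)
  - [Contributing](#contributing-to-the-project)
  - [License](#licensing--copyright)
  - [Support](#have-issues-or-questions)
  - [Build, Release & Deploy Instructions (v0.3.2)](#build-release--deploy-instructions-v032)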
  - [Changelog](#changelog)

---

## Features

- **Machine Learning**: Model training, data ensembling (sampling), model evaluation, and prediction tools.
- **Regression Analysis**: Linear and Logistic regression modeling with diagnostic checks.
- **Data Manipulation**: Pandas-based utilities for cleaning and transforming data and for summarizing its descriptive characteristics.
- **Statistical Analysis**: Hypothesis testing, confidence intervals, and normal (Gaussian) and Bayesian distribution checks, plus sampling utilities.
- **Visualization**: Plotting functions for data exploration, visualization and presentation.
- **Multiple Imputation**: MICE (PMM, norm, logistic regression), Random Forest, and KNN imputation for handling missing data.

## Installation

### From PyPI (recommended)
```bash
pip install datashadric
```

### From Source
```bash
git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install .
```

### Development Installation
```bash
git clone https://github.com/paulnamalomba/datashadric.git
cd datashadric
pip install -e ".[dev]"
```

## Quick Start

```python
import pandas as pd
from datashadric.mlearning import ml_naive_bayes_model
from datashadric.regression import lr_ols_model
from datashadric.dataframing import df_check_na_values
from datashadric.stochastics import df_gaussian_checks
from datashadric.plotters import df_boxplotter
from datashadric.aiagents import ai_analyze_plot_data_with_vision
from datashadric.aiagents import ai_data_insights_summary
from datashadric.imputation import df_mice_impute_pmm, df_impute_knn

# load your data
df = pd.read_csv('your_data.csv')

# check for missing values
na_summary = df_check_na_values(df)

# test for normality
normality_results = df_gaussian_checks(df, 'your_column')

# create visualizations
df_boxplotter(df, 'category_col', 'numeric_col', type_plot=0)

# build machine learning models
model, metrics = ml_naive_bayes_model(df, 'target_column', test_size=0.2)

# perform regression analysis
ols_results = lr_ols_model(df, 'dependent_var', ['independent_var1', 'independent_var2'])
```

## Module Overview

### `mlearning` - Machine Learning
- `ml_naive_bayes_model()`: Train and evaluate Naive Bayes classifiers
- `ml_naive_bayes_metrics()`: Calculate detailed model performance metrics
- `logr_predictor()`: Logistic regression modeling and prediction
- `confusion_matrix_from_predictions()`: Generate confusion matrices

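For readers new to confusion matrices, the following is a minimal, library-free sketch of the idea behind `confusion_matrix_from_predictions()`. It is not datashadric's implementation; the helper name `confusion_counts` and its signature are hypothetical and exist only for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs; hypothetical helper for illustration."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "spam", "spam", "ham", "ham"]
counts = confusion_counts(y_true, y_pred)

# accuracy is the share of pairs on the diagonal (actual == predicted)
correct = sum(n for (actual, predicted), n in counts.items() if actual == predicted)
accuracy = correct / len(y_true)
print(counts[("spam", "spam")], counts[("ham", "spam")], accuracy)
```

Each key of `counts` is one cell of the matrix, so off-diagonal keys such as `("ham", "spam")` show which classes are being confused.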
### `regression` - Regression Analysis
- `lr_ols_model()`: Ordinary Least Squares regression modeling
- `lr_check_homoscedasticity()`: Test regression assumptions
- `lr_check_normality()`: Check residual normality
- `lr_post_hoc_test()`: Post-hoc regression diagnostics

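The diagnostic functions above all operate on the residuals of a fitted model. As a rough sketch of what that means, the example below fits a one-predictor least-squares line in plain Python and computes its residuals. The changelog indicates that datashadric's own functions rely on statsmodels; the helper `fit_simple_ols` here is hypothetical and only illustrates the technique.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) minimising squared error for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
intercept, slope = fit_simple_ols(x, y)

# residuals are the observed values minus the fitted values
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(intercept, slope, residuals)
```

Homoscedasticity checks ask whether the spread of `residuals` stays constant across fitted values, and normality checks ask whether their distribution is approximately Gaussian.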
### `dataframing` - Data Manipulation
- `df_check_na_values()`: Comprehensive missing value analysis
- `df_drop_dupes()`: Remove duplicate rows with reporting
- `df_one_hot_encoding()`: Convert categorical variables to dummy variables
- `df_check_correlation()`: Correlation analysis and visualization

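To make the missing-value and one-hot encoding steps concrete, here is a small sketch in plain Python (no pandas) on a list of dictionaries. The helpers `count_missing` and `one_hot` are hypothetical stand-ins for the idea, not the datashadric functions, whose signatures and output formats may differ.

```python
rows = [
    {"city": "Lusaka", "temp": 24.5},
    {"city": "Cape Town", "temp": None},
    {"city": None, "temp": 19.0},
    {"city": "Lusaka", "temp": 22.0},
]


def count_missing(rows):
    """Return {column: number of None values}; hypothetical helper."""
    columns = rows[0].keys()
    return {col: sum(row[col] is None for row in rows) for col in columns}


def one_hot(rows, column):
    """Return one dict of 0/1 indicators per row, one key per observed category."""
    categories = sorted({row[column] for row in rows if row[column] is not None})
    return [
        {f"{column}_{cat}": int(row[column] == cat) for cat in categories}
        for row in rows
    ]


missing = count_missing(rows)
encoded = one_hot(rows, "city")
print(missing)
print(encoded)
```

In this sketch a row with a missing category gets zeros in every indicator column, which is one common convention; other conventions add a dedicated "missing" column.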
### `stochastics` - Statistical Analysis
- `df_gaussian_checks()`: Test data normality with Shapiro-Wilk and Q-Q plots
- `df_calc_conf_interval()`: Calculate confidence intervals
- `df_calc_moe()`: Compute margin of error
- `df_calc_zscore()`: Z-score calculations

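The quantities computed by these functions are related: the margin of error is a critical value times the standard error, and the confidence interval is the mean plus or minus that margin. The sketch below shows the arithmetic using only the standard library. It assumes a normal (z) critical value rather than Student's t, and the helper names are hypothetical, so results may differ from what datashadric returns.

```python
from statistics import NormalDist, mean, stdev


def z_scores(values):
    """Standardise each value: (value - mean) / sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def margin_of_error(values, confidence=0.95):
    """Critical z value times the standard error of the mean."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z_crit * stdev(values) / len(values) ** 0.5


def confidence_interval(values, confidence=0.95):
    """Return (low, high) as mean minus/plus the margin of error."""
    moe = margin_of_error(values, confidence)
    centre = mean(values)
    return centre - moe, centre + moe


sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
low, high = confidence_interval(sample)
print(low, high, z_scores(sample))
```

For small samples a t-based critical value gives a wider, more conservative interval than the z-based one used here.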
### `plotters` - Visualization
- `df_boxplotter()`: Box plots for outlier detection
- `df_histplotter()`: Histogram creation with customization
- `df_scatterplotter()`: Scatter plot generation
- `df_pairplot()`: Comprehensive pairwise plotting

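Box plots are commonly read for outliers using the 1.5 × IQR whisker rule. As an illustration of that rule only (not of how `df_boxplotter()` is implemented), the following sketch flags points outside the whiskers with the standard library. The helper `iqr_outliers` and its `k` parameter are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [10, 12, 11, 13, 12, 11, 14, 12, 48]
outliers = iqr_outliers(data)
print(outliers)
```

Quartile definitions vary slightly between libraries, so the exact whisker positions in a plot may not match this sketch for every dataset.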
### `imputation` - Multiple Imputation Methods *(new in v0.3.2)*
- `df_mice_impute_pmm()`: MICE with Predictive Mean Matching — imputes from observed donor values
- `df_mice_impute_norm()`: MICE with Bayesian Linear Regression (norm) — smooth posterior-predictive draws
- `df_mice_impute_logistic()`: MICE with Logistic Regression for binary/categorical columns
- `df_impute_random_forest()`: Iterative Random Forest imputation (missForest-style)
- `df_impute_knn()`: K-Nearest Neighbours imputation
- `df_impute_summary()`: Before/after comparison of NaN counts and descriptive statistics

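Predictive Mean Matching can be hard to picture from a one-line description, so here is a deliberately simplified, single-predictor sketch. It fits a regression on the observed rows, finds the `k` observed rows whose predictions are closest to the prediction for a missing row, and copies the real value of one randomly chosen donor. Full MICE implementations also draw the regression parameters from a posterior distribution and iterate over several columns; this sketch skips both. The function `pmm_impute` and its parameters are hypothetical and do not reflect the signature of `df_mice_impute_pmm()`.

```python
import random
from statistics import mean


def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y by predictive mean matching on one predictor x."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean(xi for xi, _ in observed)
    y_bar = mean(yi for _, yi in observed)
    sxx = sum((xi - x_bar) ** 2 for xi, _ in observed)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in observed) / sxx

    def predict(xi):
        return y_bar + slope * (xi - x_bar)

    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            # rank observed rows by how close their prediction is to this row's
            target = predict(xi)
            donors = sorted(observed, key=lambda d: abs(predict(d[0]) - target))[:k]
            yi = rng.choice(donors)[1]  # copy a real observed value from a donor
        filled.append(yi)
    return filled


x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.0, None, 4.2, 4.9, None]
imputed = pmm_impute(x, y)
print(imputed)
```

Because every imputed value is copied from an observed row, PMM never produces values outside the observed range, which is the main practical difference from the `norm` variant.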
## Dependencies

### Core Dependencies
- pandas >= 1.3.0
- numpy >= 1.20.0
- scikit-learn >= 1.0.0
- matplotlib >= 3.4.0
- seaborn >= 0.11.0
- scipy >= 1.7.0
- statsmodels >= 0.12.0
- plotly >= 5.0.0

To install them in one step, run:
```bash
pip install -r requirements/requirements-core.txt
```

### Testing Dependencies
For running tests, you'll need to install additional packages:
```bash
pip install pytest pytest-cov
```

## Testing (Testing the app Modules)

To run the test suite:

```bash
# Install testing dependencies first
pip install pytest pytest-cov

# Run all tests
python -m pytest tests/ -v

# Run tests with coverage report
python -m pytest tests/ --cov=datashadric --cov-report=html --cov-report=term-missing
```
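For contributors new to pytest, a test module is simply a file of functions whose names start with `test_` and that contain plain assertions. The sketch below shows the shape such a file might take. The function under test, `drop_duplicate_rows`, is a hypothetical stand-in defined inline so the example runs on its own; it is not part of datashadric.

```python
def drop_duplicate_rows(rows):
    """Hypothetical stand-in for a function under test: keep first occurrences."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def test_drop_duplicate_rows_keeps_first_occurrence():
    rows = [{"a": 1}, {"a": 1}, {"a": 2}]
    assert drop_duplicate_rows(rows) == [{"a": 1}, {"a": 2}]


def test_drop_duplicate_rows_handles_empty_input():
    assert drop_duplicate_rows([]) == []
```

Saved under `tests/` with a filename beginning `test_`, functions like these are discovered and run by `python -m pytest tests/ -v`.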

## Examples (Applications of certain Data Science techniques)

### Data Cleaning and Exploration
```python
from datashadric.dataframing import df_check_na_values, df_drop_dupes
from datashadric.plotters import df_histplotter

# check data quality
na_report = df_check_na_values(df)
df_clean = df_drop_dupes(df)

# visualize distributions
df_histplotter(df_clean, 'numeric_column', type_plot=0, bins=30)
```

### Statistical Testing (testing data samples)

```python
from datashadric.stochastics import df_gaussian_checks, df_calc_conf_interval

# test normality
normality_test = df_gaussian_checks(df, 'measurement_column')

# calculate confidence intervals
ci = df_calc_conf_interval(df['measurement_column'], confidence=0.95)
```

### Machine Learning Workflows

```python
from datashadric.mlearning import ml_naive_bayes_model, ml_naive_bayes_metrics

# train model
model, initial_metrics = ml_naive_bayes_model(df, 'target', test_size=0.3)

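# detailed evaluation
# X_test and y_test are assumed to be a held-out feature matrix and label
# vector that you have prepared separately; they are not defined above
detailed_metrics = ml_naive_bayes_metrics(model, X_test, y_test)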
```

## Contributing to the Project

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Licensing & Copyright

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

The author retains copyright in the code and documentation in this repository. You are free to use, modify, and distribute the code as long as you comply with the terms of the MIT License.

## Have issues or questions?

If you encounter any problems or have questions, please file an issue on the [datashadric GitHub repository - Issues Page](https://github.com/paulnamalomba/datashadric/issues).

## Build, Release & Deploy Instructions (v0.3.2)

The full build-to-publish workflow is captured in `datashadric-build-test-upload_instructions.ps1` (PowerShell)
and `datashadric-build-test-upload_instructions.bat` (CMD).  The steps below can be run manually in order.

### 1. Clean previous build artefacts
```bash
# Remove old distributions and egg-info
rm -rf dist/ build/ src/*.egg-info
```

### 2. Build the package
```bash
python -m build
```
This produces `.tar.gz` and `.whl` files in the `dist/` directory.

### 3. Validate the build
```bash
twine check dist/*
```
Ensure the output reports no errors or warnings.

### 4. Quick smoke-test
```python
import datashadric
print(datashadric.__version__) # should print 0.3.2 as of 13 March 2026
```
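An alternative check that does not depend on the `__version__` attribute is to read the version recorded in the installed package metadata. This is a sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("datashadric")
if found is None:
    print("datashadric is not installed in this environment")
else:
    print(found)
```

This reports what `pip` actually installed, which helps when checking that a freshly built wheel rather than an older copy is the one on the path.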

### 5. Run the test suite
```bash
python -m pytest tests/ -v --cov=datashadric --cov-report=term-missing
```

### 6. Publish to TestPyPI (optional, recommended)
```bash
twine upload --repository testpypi dist/*
pip install --index-url https://test.pypi.org/simple/ datashadric==0.3.2
```

### 7. Publish to PyPI
```bash
twine upload --repository pypi dist/*
```

### 8. Install locally in editable mode
```bash
pip install -e .
```

### 9. Tag the release in Git
```bash
git add .
git commit -m "Release v0.3.2 — multiple imputation methods"
git tag -a v0.3.2 -m "v0.3.2"
git push origin main --tags
```

> **Note**: If you use the `Manage-GitHub` PowerShell function, you can replace steps 8-9 with:
> ```powershell
> Manage-GitHub -commitMessage "Release v0.3.2" -TagName v0.3.2 -TagMessage "v0.3.2"
> ```

---

## Changelog

*Iterative releases are usually the same release re-bundled with minor improvements, so they are grouped together below.*

### Version: 0.3.0 - 0.3.2 (Iterative Releases)
### Release Date: 12 March 2026 - 13 March 2026
- **New module: `imputation`** — comprehensive multiple imputation methods for handling missing data
  - MICE with Predictive Mean Matching (PMM)
  - MICE with Bayesian Linear Regression (norm)
  - MICE with Logistic Regression for binary/categorical columns
  - Iterative Random Forest imputation (missForest-style, supports numeric and categorical)
  - K-Nearest Neighbours (KNN) imputation
  - Imputation summary utility for before/after comparison
- Added `MODULE_NOTES.md` in `src/datashadric/` documenting every module and function
- Added build, release, and deploy instructions to README
- Version bump to 0.3.1 and then 0.3.2 for minor fixes and documentation updates
- Fixed README formatting and typos
- Fixed broken ANOVA function in `stochastics` module (was using the wrong statsmodels submodules)
- Fixed VIF calculation function in `stochastics` module so that it works correctly with pandas DataFrames and handles the constant term properly
- Fixed broken OLS regression function in `regression` module (was using the wrong statsmodels submodules)
- Updated documentation in `MODULE_NOTES.md` for all modules, especially the new `imputation` module

### Version: 0.2.0 - 0.2.3 (Iterative Releases)
### Release Date: 4 November 2025 - 10 November 2025
- Added image annotation when detecting outliers using AI-assisted bounding box generation
- Enhanced outlier detection and removal functions in data-processor module
- Added AI agents to assist with data analysis and visualization tasks (users must store their API keys in system environment variables)
- Added Apache Superset as an additional visualization dependency
- Minor bug fixes and enhancements in dataframing and plotters modules
- Updated documentation

### Version: 0.1.4
### Release Date: 9 October 2025
- Minor bug fixes
- Minor enhancements to user optionality in many functions for mlearning, stochastics and dataframing modules
- Added user optionality for saving plots to files in plotters module
- Updated documentation

### Version: 0.1.3
### Release Date: 8 October 2025
- Minor bug fixes
- Added print statements for better process tracking in data processing functions
- Added functions for stochastic and machine-learning-based outlier detection and removal
- Updated documentation

### Version: 0.1.2
### Release Date: 6 October 2025
- Enhanced dataframe utilities
- New functions for index and column name retrieval
- Improved documentation and examples

### Version: 0.1.1
### Release Date: 3 October 2025
- Supplemental release 
- Additional functions for outlier detection
- Additional functions for plotting (LOWESS meanline plotter)
- Additional functions for data clustering based on k-means

### Version: 0.1.0 
### Release Date: 2 October 2025
- Initial release
- Core modules: mlearning, regression, dataframing, stochastics, plotters
- Comprehensive documentation and examples
- Minimal test coverage
