Metadata-Version: 2.3
Name: logsage
Version: 0.1.1
Summary: LLM-based library for AI training job failure attribution and recommending auto-resume policy
Author: Haim Elisha
Author-email: helisha@nvidia.com
Requires-Python: >=3.11,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: drain3 (>=0.9.11,<0.10.0)
Requires-Dist: fastapi (>=0.117.1,<0.118.0)
Requires-Dist: langchain (>=0.3.27,<0.4.0)
Requires-Dist: langchain-core (>=0.3.76,<0.4.0)
Requires-Dist: langchain-nvidia-ai-endpoints (>=0.3.18,<0.4.0)
Requires-Dist: nh3 (>=0.3.1,<0.4.0)
Requires-Dist: numpy (>=2.3.3,<3.0.0)
Requires-Dist: pandas (>=2.3.2,<3.0.0)
Requires-Dist: pydantic-settings (>=2.11.0,<3.0.0)
Requires-Dist: requests (>=2.32.5,<3.0.0)
Requires-Dist: uvicorn (>=0.37.0,<0.38.0)
Description-Content-Type: text/markdown

# LogSage

LogSage is an LLM-powered toolkit that analyzes AI training job logs, attributes root causes, and recommends an auto-resume policy to reduce wasted GPU time. It provides:
- Error extraction and de-duplication from SLURM job logs
- Root-cause attribution (e.g., hardware, configuration, memory, communication)
- Action recommendations (restart immediately vs stop) with justification
- Optional node isolation hints for hardware-related failures

## Table of Contents

- [Description](#description)
- [Quickstart](#quickstart)
- [Components](#components)
- [Running the API](#running-the-api)
- [Configuration](#configuration)
- [Testing](#testing)
- [Roadmap and Project Status](#roadmap-and-project-status)
- [Additional Resources](#additional-resources)
- [Contributing](#contributing)

## Description

- **Problem**: Training jobs fail for many reasons (hardware, networking, configuration, data). Manual root-cause analysis is slow and wastes GPU hours.
- **Solution**: LogSage uses log parsing and NVIDIA NIM to extract error patterns, attribute likely causes, and recommend whether to restart or stop the job, with an explanation.
- **How it works (high level)**:
  1. Receive job logs (client-provided)
  2. Extract and cluster error lines; remove noise
  3. Attribute errors with an LLM using structured prompts and heuristics
  4. Recommend restart/stop and, when applicable, suggest temporal isolation of suspect nodes
- **Benefits**: Faster triage, increased data center availability, reduced GPU downtime, and improved error extraction coverage by leveraging LLMs.
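
Step 2 above (extract and cluster error lines) can be illustrated with a minimal sketch. This is not LogSage's implementation: the keyword pattern, the masking rule, the `extract_error_templates` helper, and the sample log lines are all assumptions made up for illustration. The package depends on `drain3`, which suggests the real clustering is more sophisticated than this.

```python
import re
from collections import Counter

# Hypothetical keyword list; LogSage's real extraction rules are not shown here.
ERROR_PATTERN = re.compile(r"error|exception|traceback|failed|fatal", re.IGNORECASE)

# Mask volatile tokens (hex values, numbers) so repeated errors share one template.
VOLATILE_TOKENS = re.compile(r"0x[0-9a-fA-F]+|\d+")


def extract_error_templates(log_text: str) -> Counter:
    """Return a Counter mapping masked error templates to occurrence counts."""
    templates = Counter()
    for line in log_text.splitlines():
        if ERROR_PATTERN.search(line):
            templates[VOLATILE_TOKENS.sub("<*>", line.strip())] += 1
    return templates


# Synthetic log lines, invented for this example.
sample_log = """\
step 100 loss 2.31
node 3: ERROR link timeout on port 17
node 7: ERROR link timeout on port 42
worker failed to load checkpoint shard 0x1f
"""

templates = extract_error_templates(sample_log)
for template, count in templates.most_common():
    print(count, template)
```

The two `link timeout` lines collapse into a single template with a count of 2, which is the kind of de-duplication that keeps the downstream LLM prompt short.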

## Quickstart

Prerequisites:
- Python >= 3.11
- Poetry (recommended) or pip

Setup (using Poetry):
```bash
make install
# or
poetry install --with dev --all-extras
```

Install via pip:
```bash
python3 -m venv myenv
source myenv/bin/activate
pip install -U logsage --index-url=https://urm.nvidia.com/artifactory/api/pypi/nv-shared-pypi-local/simple --extra-index-url https://pypi.org/simple
```

Run the API locally:
```bash
python -m logsage.auto_resume_policy.run_server
# or
uvicorn logsage.auto_resume_policy.server:app --host 0.0.0.0 --port 8000 --reload
```

Open the docs:
- Swagger UI: `http://localhost:8000/docs`
- ReDoc: `http://localhost:8000/redoc`

## Components

- `logsage/auto_resume_policy/`
  - FastAPI server exposing endpoints to create an attribution ID, ingest logs, and retrieve analysis
  - CLI utilities to analyze local log files
  - Recommendations engine that suggests one of three job auto-resume policies: `STOP`, `RESTART`, or `TEMPORAL-ISOLATION+RESTART`
  - Detailed API and usage docs: see [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md)
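
A caller might map the three policy strings listed above onto scheduler actions. The sketch below is an assumption about how a client could consume them: `ResumePolicy` and `plan_actions` are hypothetical names that are not part of LogSage, and only the three string values come from this README.

```python
from enum import Enum


class ResumePolicy(str, Enum):
    """The three policy strings named in this README (the enum itself is hypothetical)."""

    STOP = "STOP"
    RESTART = "RESTART"
    ISOLATE_AND_RESTART = "TEMPORAL-ISOLATION+RESTART"


def plan_actions(policy_value: str) -> dict:
    """Translate a policy string into illustrative scheduler actions.

    Raises ValueError if the string is not one of the three known policies.
    """
    policy = ResumePolicy(policy_value)
    return {
        "resubmit_job": policy is not ResumePolicy.STOP,
        "isolate_nodes": policy is ResumePolicy.ISOLATE_AND_RESTART,
    }


print(plan_actions("TEMPORAL-ISOLATION+RESTART"))
```

Parsing into an enum first means an unexpected policy string fails loudly instead of silently resubmitting a job.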

## Running the API

The FastAPI service is defined in `logsage/auto_resume_policy/server.py` and can be run locally during development:
```bash
python -m logsage.auto_resume_policy.run_server
```
Key endpoints (full specs in module README and Swagger):
- `GET /healthz`: liveness check
- `GET /version`: version info
- `POST /errors/attribution_id`: create an attribution ID for a job
- `POST /errors/logs`: submit logs under the attribution ID
- `POST /errors/attribution`: run attribution and get recommendation

For request/response schemas and cURL examples, see [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md).
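
Before sending logs, a client may want to confirm that the service is reachable. The following is a minimal sketch using only the standard library. The helper names `endpoint_url` and `is_alive` are hypothetical, and the code assumes only that `GET /healthz` returns HTTP 200 when the service is up; the response body is not inspected because its schema is not documented here.

```python
import urllib.request


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL (optionally including a proxy root path) and an endpoint path."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET /healthz answers with HTTP 200, False on any network or HTTP error."""
    try:
        with urllib.request.urlopen(endpoint_url(base_url, "/healthz"), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are both subclasses of OSError.
        return False
```

With a local server running, `is_alive("http://localhost:8000")` would perform the check. If the service sits behind a proxy, the root path can be included in `base_url`.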

## Configuration

Configuration is managed via `logsage/auto_resume_policy/config.py` (Pydantic settings). Key variables:
- `NVIDIA_API_KEY` (required in production): API key for NVIDIA AI Endpoints, used for all LLM calls
- `FAST_API_ROOT_PATH` (optional): Root path when running behind a proxy
- `DEBUG` (optional): `true`/`false` (default `true` locally; set `false` in prod)
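
The variables above could be checked in a pre-flight script before the server starts. This sketch does not reproduce `config.py`, which uses Pydantic settings. The `read_settings` helper is hypothetical, and treating `DEBUG=false` as "production" (and therefore requiring the API key) is an assumption made for this example.

```python
import os
from typing import Mapping, Optional


def read_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the three documented variables from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # DEBUG defaults to true, matching the documented local default.
    debug = env.get("DEBUG", "true").strip().lower() == "true"
    api_key = env.get("NVIDIA_API_KEY")
    # Assumption: DEBUG=false is treated as production, where the key is required.
    if not debug and not api_key:
        raise RuntimeError("NVIDIA_API_KEY must be set when DEBUG is false")
    return {
        "nvidia_api_key": api_key,
        "root_path": env.get("FAST_API_ROOT_PATH", ""),
        "debug": debug,
    }


print(read_settings({"NVIDIA_API_KEY": "example-key", "DEBUG": "false"})["debug"])
```

Passing the environment as a mapping keeps the check easy to test without modifying the process environment.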

## Testing

Run the test suite:
```bash
make test
# or
poetry run pytest
```
Coverage reports are configured in `pyproject.toml`.

## Roadmap and Project Status

- Add API authentication and rate limiting
- Streamed/async log ingestion paths
- Integration with log collectors such as Loki
- Expanded attribution categories and guardrails
- Improve test coverage and add end-to-end examples

## Additional Resources

- Auto-Resume-Policy API & details: [`logsage/auto_resume_policy/README.md`](logsage/auto_resume_policy/README.md)
- Fetcher & deployment details: [`logsage/fetcher/README.md`](logsage/fetcher/README.md)
- Project configuration and developer tooling: `pyproject.toml`, `Makefile`

Internal dashboards (if applicable):
- Grafana (LogSage): `https://grafana.nvidia.com/d/aeutclepcu41sf/logsage?orgId=290`
- Kibana/ES example index (sandbox): `https://gpuwa.nvidia.com/elasticsearch/df-sandbox-ohazai-logsage-test2-202508/_search`

## Contributing

For setup, development guidelines, and versioning information, see the [Contributing Guide](CONTRIBUTING.md).

