Metadata-Version: 2.4
Name: sidedoc
Version: 0.2.0
Summary: AI-native document format that separates content from formatting
Author: Sidedoc Contributors
License: MIT
Project-URL: Homepage, https://sidedoc.io
Project-URL: Repository, https://github.com/jgardner04/sidedoc
Project-URL: Documentation, https://sidedoc.io
Project-URL: Issues, https://github.com/jgardner04/sidedoc/issues
Keywords: document,ai,markdown,docx
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-docx>=1.1.0
Requires-Dist: mistune>=3.0.0
Requires-Dist: click>=8.1.0
Requires-Dist: Pillow>=10.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.11.0; extra == "dev"
Provides-Extra: benchmarks
Requires-Dist: pypandoc>=1.16.2; extra == "benchmarks"
Requires-Dist: tiktoken>=0.12.0; extra == "benchmarks"
Dynamic: license-file

# Sidedoc

[![PyPI version](https://img.shields.io/pypi/v/sidedoc.svg)](https://pypi.org/project/sidedoc/)

An AI-native document format that separates content from formatting, enabling efficient AI interaction with documents while preserving rich formatting for human consumption.

**Status:** ✅ MVP Complete with Table, Hyperlink, and Track Changes support

---

## The Problem

Current document workflows force a tradeoff between AI efficiency and human usability:

1. **Reading documents:** Tools like Document Intelligence extract content for AI reasoning, but this is expensive (high token cost for XML parsing) and loses the connection to the original formatting.

2. **Creating documents:** Tools like Pandoc generate docx from markdown, but this is one-way—there's no maintained link between the AI-friendly representation and the formatted output.

3. **Iterative collaboration:** When AI and humans work on the same document over time, the current model requires repeated extraction and regeneration, which is lossy and expensive.

## Why Sidedoc?

### The Token Efficiency Problem

When AI reads a 10-page Word document:
- **Via Document Intelligence/XML:** 15,000+ tokens (includes XML structure, formatting metadata, schema definitions)
- **Via Sidedoc markdown:** 1,500 tokens (clean content only)

**Result:** 10x cost reduction and faster processing for AI workflows.

For a team processing 100 documents per day, this translates to:
- **Cost savings:** 90% reduction in AI API costs for document reading
- **Speed improvement:** 5-10x faster document processing
- **Context efficiency:** Fit more documents in a single AI context window

### The Lossless Iteration Problem

**Current workflow:**
1. Extract .docx → AI reads XML/text (expensive, slow)
2. AI generates new content
3. Generate new .docx → Formatting details lost or manually reapplied
4. Repeat → Each cycle degrades formatting fidelity

**Sidedoc workflow:**
1. Extract .docx once → Creates .sidedoc with separated content and formatting
2. AI reads markdown (cheap, fast)
3. AI edits markdown
4. Sync → Formatting automatically reapplied from preserved metadata
5. Repeat infinitely → Zero formatting degradation

### Real-World Use Case: Quarterly Report Automation

**Scenario:** Your company has a formatted quarterly report template with logos, custom styles, and corporate branding.

**Traditional approach:**
- Extract content → AI updates with new data → Generate new docx
- Problem: Lose custom formatting, manual reformatting required each quarter
- Cost: High token usage + human time for reformatting

**Sidedoc approach:**
1. Extract company template to `.sidedoc` (once)
2. AI reads markdown (90% fewer tokens), updates with Q2 data
3. Sync changes back - formatting, headers, logos all intact
4. Stakeholders receive familiar, properly formatted Word document
5. Next quarter: Repeat steps 2-4 (no reformatting needed)

## The Solution

Documents should have two representations that stay in sync:

- **Markdown** — optimized for AI (efficient to read/write, low token cost)
- **Formatted docx** — optimized for humans (rich formatting, familiar tools)

Changes to either should propagate to the other. Sidedoc makes this possible.

### Sidedoc vs. Alternatives

| Approach | AI Token Cost | Format Preservation | Iteration Support | Human Usability |
|----------|---------------|---------------------|-------------------|-----------------|
| **Direct .docx** | 🔴 Very High (XML) | N/A | 🔴 Poor | ✅ Excellent |
| **Document Intelligence** | 🔴 Very High | 🔴 Lost | 🔴 None | ❌ No output format |
| **Pandoc (md→docx)** | ✅ Low | 🔴 Lost | 🔴 One-way only | ⚠️ Basic styling |
| **Sidedoc** | ✅ Low | ✅ Perfect | ✅ Lossless | ✅ Excellent |

**Key Insight:** Sidedoc is the only approach that combines low AI token costs with perfect format preservation and lossless iteration, while maintaining full human usability through standard Word documents.

---

## How It Works

Sidedoc extracts a `.docx` into a `.sidedoc/` directory:

```
document.sidedoc/
├── content.md         # Clean markdown (AI reads/writes this)
├── styles.json        # Formatting information per block
├── structure.json     # Block structure and mappings (generated by sync)
├── manifest.json      # Metadata and version info (generated by sync)
└── assets/            # Images and embedded files
```

For sharing, `sidedoc pack` creates a `.sdoc` ZIP archive.

The AI works with `content.md` — pure markdown with no metadata or special markers. The other files preserve formatting information so the original docx can be reconstructed with styling intact.

### The Sync Advantage

Sidedoc's sync capability provides a fundamentally different workflow than traditional document generation:

**Traditional document generation:**
- AI reads document → Generates new content → Creates entirely new .docx
- Problem: All formatting must be specified programmatically or defaults to basic styles
- Result: Original formatting (custom styles, branding, complex layouts) is lost

**Sidedoc sync workflow:**
- AI edits `content.md` → Sync detects changes → Updates .docx intelligently
- **Unchanged blocks:** Keep their original formatting exactly (fonts, colors, spacing)
- **Modified blocks:** Preserve formatting while updating content
- **New blocks:** Receive intelligent defaults based on type (heading styles for headings, etc.)
- **Deleted blocks:** Cleanly removed from output
- Result: Format preservation is automatic, not manual

This means you can maintain a "golden master" document with perfect corporate styling, and AI can update content indefinitely without degrading the formatting.

---

## CLI Commands

```bash
# Extract: Create sidedoc from docx
sidedoc extract document.docx

# Build: Generate docx from sidedoc
sidedoc build document.sidedoc

# Sync: After editing content.md, update the docx
sidedoc sync document.sidedoc

# Validate: Check sidedoc integrity
sidedoc validate document.sidedoc

# Diff: Show changes between content.md and last synced state
sidedoc diff document.sidedoc

# Info: Display sidedoc metadata
sidedoc info document.sidedoc

# Unpack/Pack: Extract or create sidedoc from directory (debugging)
sidedoc unpack document.sidedoc -o ./unpacked/
sidedoc pack ./unpacked/ -o document.sidedoc
```

---

## Example Workflow

```bash
# 1. User has a formatted Word document
$ ls
quarterly_report.docx

# 2. Extract to sidedoc for AI processing
$ sidedoc extract quarterly_report.docx
Created: quarterly_report.sidedoc

# 3. AI reads the markdown content efficiently
$ cat quarterly_report.sidedoc/content.md
# Quarterly Report
We achieved **strong results** in Q3...

# 4. AI edits content.md (via Claude Code or similar)
# ... AI adds a new section, modifies text ...

# 5. Sync changes back to docx
$ sidedoc sync quarterly_report.sidedoc
Synced: 3 blocks modified, 1 block added

# 6. Rebuild the docx for human consumption
$ sidedoc build quarterly_report.sidedoc -o quarterly_report_updated.docx
Created: quarterly_report_updated.docx

# 7. Open in Word - formatting preserved, content updated
```

---

## Supported Elements (MVP)

| Element | Markdown | Notes |
|---------|----------|-------|
| Headings H1-H6 | `#` to `######` | Maps to Heading 1-6 styles |
| Paragraphs | Plain text | Preserves paragraph breaks |
| Bold | `**text**` | Inline formatting |
| Italic | `*text*` | Inline formatting |
| Hyperlinks | `[text](url)` | Clickable links preserved |
| Bulleted lists | `- item` | Single level for MVP |
| Numbered lists | `1. item` | Single level for MVP |
| Images | `![alt](path)` | Copied to assets/ |
| Tables | GFM pipe tables | See Tables section |
| Track Changes | CriticMarkup | See Track Changes section |

## Limitations & Roadmap

Sidedoc currently supports the most common document elements. The following are **not yet supported** but are on the roadmap:

- Nested lists (multi-level indentation)
- Headers and footers
- Footnotes and endnotes
- Comments
- Text boxes and shapes
- Charts and SmartArt

Have a use case that needs one of these? [Open an issue](https://github.com/jgardner04/sidedoc/issues) to help prioritize.

---

## Table Support

Sidedoc extracts Word tables as [GitHub Flavored Markdown](https://github.github.com/gfm/#tables-extension-) pipe tables:

```markdown
| Department | Q1 Revenue | Q2 Revenue |
| --- | --- | --- |
| Sales | $1,200,000 | $1,350,000 |
| Marketing | $800,000 | $920,000 |
```

### Features

- **Column alignment:** `:---` (left), `:---:` (center), `---:` (right)
- **Cell formatting:** Background colors, borders preserved in styles.json
- **Inline formatting:** Bold, italic, and hyperlinks within cells
- **Merged cells:** Horizontal and vertical merges detected and restored
- **Structural edits:** Add/remove rows and columns by editing the markdown
- **Special characters:** Pipe characters escaped as `\|`

### Limitations

- Nested tables are not supported (rare in Word documents)
- Formulas are extracted as static text values
- CriticMarkup (track changes) is supported within table cells

---

## Track Changes Support

Sidedoc supports bidirectional track changes using [CriticMarkup](http://criticmarkup.com/) syntax:

### CriticMarkup Syntax

| Syntax | Meaning | Example |
|--------|---------|---------|
| `{++text++}` | Insertion | `{++added text++}` |
| `{--text--}` | Deletion | `{--removed text--}` |
| `{~~old~>new~~}` | Substitution | `{~~old text~>new text~~}` |

### Extraction (docx → sidedoc)

When extracting a Word document with Track Changes:
- Insertions become `{++text++}` in content.md
- Deletions become `{--text--}` in content.md
- Author and date metadata are preserved in structure.json

Use `--track-changes` to force enable or `--no-track-changes` to accept all changes:

```bash
# Auto-detect track changes (default)
sidedoc extract document.docx

# Force extract track changes as CriticMarkup
sidedoc extract document.docx --track-changes

# Accept all changes (ignore track changes)
sidedoc extract document.docx --no-track-changes
```

### Sync (edited content.md → docx)

AI agents can use CriticMarkup in content.md to propose changes:

```markdown
# Before editing
This is a simple document.

# After AI edits
This is a {++very ++}simple document with {--removed--} content.
```

When syncing, CriticMarkup becomes proper Word track changes:
- `{++text++}` → w:ins elements (insertions)
- `{--text--}` → w:del elements (deletions)
- `{~~old~>new~~}` → deletion followed by insertion

Configure the author name for AI-generated changes:

```bash
# Default author: "Sidedoc AI"
sidedoc sync document.sidedoc -o output.docx

# Custom author name
sidedoc sync document.sidedoc -o output.docx --author "Claude AI"
```

### Limitations

- Track changes only apply to text content (not formatting changes)
- Comments are not yet supported
- Move operations are extracted as delete + insert

---

## Installation

```bash
pip install sidedoc
```

---

## Development

```bash
# Clone the repository
git clone https://github.com/jgardner04/sidedoc.git
cd sidedoc

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest
```

---

## Links

- [GitHub](https://github.com/jgardner04/sidedoc)
- [Author: Jonathan Gardner](https://jonathangardner.io)
