Metadata-Version: 2.4
Name: ml3on-format-core
Version: 0.0.2
Summary: ML3Seq Format
Author: macro
Requires-Python: <3.15,>=3.11
Description-Content-Type: text/markdown
Requires-Dist: ml3macro-utils>=0.0.3

# ML3Seq Format Core

The core implementation of the ML3Seq (JSON + Multi-part hybrid) serialization format.

## Status: On the way out... to make way for the new.

See the status notes in the project [README.md#status](https://codeberg.org/gxyflow/ml3on-format/src/branch/main/README.md#status-on-the-way-out-to-make-way-for-the-new).

## Overview

ML3Seq Format Core provides the fundamental building blocks for the ML3Seq serialization format, which combines JSON for structured data with unescaped multiline blocks for text content. This format is particularly well suited to applications involving large language models (LLMs), where preserving original formatting reduces cognitive load and improves reliability.

## Key Components

### 1. ML3Seq - The Sequence Container

The main container for ML3Seq format data:

```python
from ml3on.core import ML3Seq, ML3SeqItem

# Create a sequence with items
sequence = ML3Seq(
    ML3SeqItem(kind="FILE", name="file1.txt", content="Content here"),
    ML3SeqItem(kind="FILE", name="file2.txt", content="More content")
)

# Serialize to ML3Seq format
ml3seq_string = sequence.as_ml3seq

# Deserialize from ML3Seq format
parsed_sequence = ML3Seq.from_ml3seq(ml3seq_string)
```

### 2. ML3SeqItem - Individual Items

Represents a single item in an ML3Seq:

```python
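from ml3on.core import ML3SeqFormatConfig, ML3SeqItem

# Item-level (de)serialization takes a config; this one uses the default separator prefix
config = ML3SeqFormatConfig()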

# Create an item
item = ML3SeqItem(
    kind="DOCUMENT",
    title="Sample",
    content="This is multiline\ncontent"
)

# Serialize item
ml3seq_item_string = item.to_ml3seq(config)

# Deserialize item
parsed_item = ML3SeqItem.from_ml3seq(ml3seq_item_string, config)
```

### 3. ML3SeqMultilineString - Type-Based Control

A string subclass that provides explicit control over serialization format:

```python
from ml3on.core import ML3SeqMultilineString

# Create a multiline string
ml_string = ML3SeqMultilineString("This will always\nbe in a multiline block")

# Behaves like a regular string
print(len(ml_string))  # String methods work
print(ml_string.upper())  # All string operations supported

# But serializes differently in ML3Seq format
```

### 4. ML3SeqFormatConfig - Configuration Management

Manages ML3Seq format configuration:

```python
from ml3on.core import ML3SeqFormatConfig

# Create configuration
config = ML3SeqFormatConfig(separator_prefix="CUSTOM|")

# Get current separator
separator = config.separator_prefix()

# Convert to dictionary
config_dict = config.to_dict()
```

## Format Specification

### Basic Structure

```
{separatorPrefix}BEGIN:{kind}
{jsonWithNonMultilineValues}
{separatorPrefix}{multilineField1}
{multilineField1Content}
...
{separatorPrefix}{multilineFieldN}
{multilineFieldNContent}
{separatorPrefix}END:{kind}
```
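
As a rough illustration of the layout above, the following standalone sketch emits one item using only the standard library. It is not the package's implementation: the function name `render_item` and the rule used to pick multiline fields (any string value containing a newline) are assumptions made for this example.

```python
import json

DEFAULT_PREFIX = "-~<§"  # default separator prefix documented in this README


def render_item(kind: str, fields: dict, prefix: str = DEFAULT_PREFIX) -> str:
    """Render one item in the BEGIN / JSON / multiline fields / END layout."""
    # Assumption: a string value containing a newline goes into a multiline block.
    multiline = {k: v for k, v in fields.items() if isinstance(v, str) and "\n" in v}
    inline = {k: v for k, v in fields.items() if k not in multiline}

    lines = [f"{prefix}BEGIN:{kind}", json.dumps(inline, ensure_ascii=False)]
    for name, text in multiline.items():
        lines.append(f"{prefix}{name}")
        lines.append(text)  # written as-is, without escaping
    lines.append(f"{prefix}END:{kind}")
    return "\n".join(lines)


rendered = render_item("DOCUMENT", {"title": "Sample", "content": "line one\nline two"})
print(rendered)
```

Keeping the non-multiline values in a single JSON line means that part of an item can still be read with any JSON parser.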

### Complete Sequence Format

```
[configJson]  # Optional configuration
[item1]
[item2]
...
[itemN]
```

### Format Rules

1. **Control Markers**: BEGIN/END markers are only recognized at the start of lines (no leading whitespace)
2. **Field Validation**: Field names must be non-empty strings and cannot start with "BEGIN:" or "END:"
3. **JSON Values**: Non-multiline values are stored as JSON
4. **Multi-line Fields**: Multi-line content appears after field markers without escaping
5. **Immutability**: All data structures are immutable for safety
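
Rules 1 and 2 can be sketched as two small predicates. The helper names `is_valid_field_name` and `is_control_marker` are hypothetical and exist only for this illustration; the package may perform these checks differently.

```python
DEFAULT_PREFIX = "-~<§"  # default separator prefix documented in this README


def is_valid_field_name(name: object) -> bool:
    """Rule 2: a non-empty string that does not start with BEGIN: or END:."""
    return (
        isinstance(name, str)
        and name != ""
        and not name.startswith(("BEGIN:", "END:"))
    )


def is_control_marker(line: str, prefix: str = DEFAULT_PREFIX) -> bool:
    """Rule 1: a marker counts only when the line itself starts with the prefix."""
    return line.startswith((f"{prefix}BEGIN:", f"{prefix}END:"))


print(is_valid_field_name("content"))        # accepted
print(is_valid_field_name("BEGIN:oops"))     # rejected: collides with a marker
print(is_control_marker("-~<§BEGIN:DOC"))    # recognized
print(is_control_marker("  -~<§BEGIN:DOC"))  # ignored: leading whitespace
```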

### Example Formats

#### Simple Item (No Multiline Content)
```ml3seq
-~<§BEGIN:SIMPLE
{"field1": "value1", "field2": "value2"}
-~<§END:SIMPLE
```

#### Item with Multiline Content
```ml3seq
-~<§BEGIN:DOCUMENT
{"title": "Sample", "author": "Test"}
-~<§content
This is the first line of content.
This is the second line of content.
This is the third line of content.
-~<§END:DOCUMENT
```

#### Multiple Items in Sequence
```ml3seq
-~<§BEGIN:ITEM1
{"id": 1, "name": "First"}
-~<§description
First item description
with multiple lines
-~<§END:ITEM1
-~<§BEGIN:ITEM2
{"id": 2, "name": "Second"}
-~<§description
Second item description
also with multiple lines
-~<§END:ITEM2
```

#### With Configuration
```ml3seq
{"separator_prefix": "BOOP|"}
BOOP|BEGIN:CONFIGURED
{"field": "value"}
BOOP|multiline
Multiline content here
BOOP|END:CONFIGURED
```
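
To make the examples above more concrete, here is a minimal standalone reader for such a sequence. It is a sketch under stated assumptions, not the package's parser: the name `parse_sequence` is hypothetical, the optional config is assumed to be a single JSON line at the top, each item's JSON is assumed to fit on one line, and error handling is kept to the bare minimum.

```python
import json

DEFAULT_PREFIX = "-~<§"  # default separator prefix documented in this README


def parse_sequence(text: str) -> list[dict]:
    """Parse a sequence into plain dicts (hypothetical helper, minimal checks)."""
    lines = text.split("\n")
    prefix = DEFAULT_PREFIX
    # Assumption: an optional config object sits alone on the first line.
    if lines and lines[0].startswith("{"):
        prefix = json.loads(lines[0]).get("separator_prefix", DEFAULT_PREFIX)
        lines = lines[1:]

    items = []
    kind = None   # kind of the item being read; None between items
    fields = {}   # values from the item's JSON line
    blocks = {}   # multiline field name -> list of content lines
    name = None   # multiline field currently being collected
    for line in lines:
        if kind is None:
            if line.startswith(prefix + "BEGIN:"):
                kind = line[len(prefix) + len("BEGIN:"):]
                fields, blocks, name = {}, {}, None
            elif line.strip():
                raise ValueError(f"Unexpected line outside an item: {line!r}")
        elif line.startswith(prefix + "END:"):
            if line != f"{prefix}END:{kind}":
                raise ValueError(f"Mismatched END marker: {line!r}")
            fields.update({k: "\n".join(v) for k, v in blocks.items()})
            items.append({"kind": kind, "fields": fields})
            kind = None
        elif line.startswith(prefix):
            name = line[len(prefix):]      # start of a multiline field
            blocks[name] = []
        elif name is None:
            fields.update(json.loads(line))  # the JSON line
        else:
            blocks[name].append(line)        # verbatim content line
    if kind is not None:
        raise ValueError(f"Missing END marker for {kind}")
    return items


sample = "\n".join([
    '{"separator_prefix": "BOOP|"}',
    "BOOP|BEGIN:CONFIGURED",
    '{"field": "value"}',
    "BOOP|multiline",
    "Multiline content here",
    "BOOP|END:CONFIGURED",
])
parsed = parse_sequence(sample)
print(parsed)
```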

## Configuration

### Separator Prefix

The separator prefix can be configured in several ways:

1. **Constructor Argument** (highest priority):
   ```python
   config = ML3SeqFormatConfig(separator_prefix="CUSTOM|")
   ```

2. **Environment Variable**:
   ```bash
   export ML3Seq_FORMAT_SEPARATOR_PREFIX="CUSTOM|"
   ```

3. **Config JSON** (in serialized format):
   ```ml3seq
   {"separator_prefix": "CUSTOM|"}
   ```

4. **Default**: `-~<§` (if none of the above are provided)

### Configuration Precedence

1. Explicit constructor argument
2. Environment variable
3. Config JSON in serialized format
4. Default constant
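
The precedence order above could be resolved along the following lines. This is a sketch only: `resolve_prefix` is a hypothetical helper, the environment variable name is copied from this README, and the real package may resolve its configuration differently.

```python
import json
import os

DEFAULT_PREFIX = "-~<§"
ENV_VAR = "ML3Seq_FORMAT_SEPARATOR_PREFIX"  # name as written in this README


def resolve_prefix(explicit=None, config_json=None, environ=None):
    """Return the first separator prefix found, in the documented order."""
    environ = os.environ if environ is None else environ
    if explicit is not None:            # 1. explicit constructor argument
        return explicit
    if environ.get(ENV_VAR):            # 2. environment variable
        return environ[ENV_VAR]
    if config_json:                     # 3. config JSON in the serialized data
        from_json = json.loads(config_json).get("separator_prefix")
        if from_json is not None:
            return from_json
    return DEFAULT_PREFIX               # 4. default constant


env = {ENV_VAR: "ENV|"}
cfg = '{"separator_prefix": "JSON|"}'
print(resolve_prefix(explicit="CUSTOM|", config_json=cfg, environ=env))
print(resolve_prefix(config_json=cfg, environ=env))
print(resolve_prefix(config_json=cfg, environ={}))
print(resolve_prefix(environ={}))
```

Passing `environ` as a plain dictionary keeps the sketch testable without touching the real process environment.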

## Advanced Usage

### Working with Complex Data

```python
from ml3on.core import ML3Seq, ML3SeqItem

# Nested structures
item = ML3SeqItem(
    kind="COMPLEX",
    metadata={"key1": "value1", "key2": [1, 2, 3]},
    tags=["tag1", "tag2", "tag3"],
    content="Multiline content\nwith multiple lines"
)

# Lists of items
items = [
    ML3SeqItem(kind="ITEM", name="item1"),
    ML3SeqItem(kind="ITEM", name="item2"),
    ML3SeqItem(kind="ITEM", name="item3")
]

sequence = ML3Seq(*items)
```

### Error Handling

```python
from ml3on.core import ML3Seq, ML3SeqFormatConfig, ML3SeqItem

config = ML3SeqFormatConfig()  # default separator prefix

try:
    # Invalid format
    sequence = ML3Seq.from_ml3seq("invalid ml3seq format")
except ValueError as e:
    print(f"Format error: {e}")

try:
    # Missing required fields
    item = ML3SeqItem(kind="")  # Empty kind
    ml3seq_str = item.to_ml3seq(config)
except ValueError as e:
    print(f"Validation error: {e}")

try:
    # Invalid field names
    item = ML3SeqItem(kind="TEST", **{"": "invalid"})
except ValueError as e:
    print(f"Field error: {e}")
```

### Custom Validation

```python
from ml3on.core import ML3SeqItem

def validate_item(item: ML3SeqItem):
    """Custom validation logic"""
    if not item.kind:
        raise ValueError("Item kind cannot be empty")
    
    # Check required fields against the names present on the item
    present = {k for k, _ in item.kv_pairs}
    for field in ("id", "name"):
        if field not in present:
            raise ValueError(f"Missing required field: {field}")
    
    return True
```

## Performance Considerations

### Large Data Handling

```python
# Large multiline strings (10k+ characters)
large_content = "A" * 10000 + "\n" + "B" * 10000
item = ML3SeqItem(kind="LARGE", content=large_content)

# Large sequences (1000+ items)
large_sequence = ML3Seq(*[ML3SeqItem(kind=f"ITEM_{i}") for i in range(1000)])
```

### Memory Efficiency

The implementation uses:
- `frozendict` for immutable dictionaries
- `tuple` for immutable sequences
- Generators where appropriate
- Efficient string handling
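
The immutable style described above can be illustrated with standard-library tools. `frozendict` is a third-party package, so this sketch uses a frozen dataclass and tuples instead; the class `FrozenItem` is a hypothetical stand-in and not the package's `ML3SeqItem`.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any


@dataclass(frozen=True)
class FrozenItem:
    """Hypothetical immutable record: a kind plus a tuple of key-value pairs."""

    kind: str
    kv_pairs: tuple[tuple[str, Any], ...] = ()

    def with_field(self, key: str, value: Any) -> "FrozenItem":
        # "Modifying" builds a new object and leaves the original untouched.
        return FrozenItem(self.kind, self.kv_pairs + ((key, value),))


original = FrozenItem("FILE", (("name", "a.txt"),))
updated = original.with_field("size", 3)

try:
    original.kind = "OTHER"  # assignment on a frozen dataclass raises
except FrozenInstanceError:
    print("items cannot be modified in place")

print(dict(updated.kv_pairs))
```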

### Benchmarking

```python
import time
from ml3on.core import ML3Seq, ML3SeqItem

# Create test data
test_items = [ML3SeqItem(kind=f"ITEM_{i}", value=str(i)) for i in range(1000)]

# Benchmark serialization (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
sequence = ML3Seq(*test_items)
ml3seq_str = sequence.as_ml3seq
end = time.perf_counter()
print(f"Serialization: {end - start:.4f} seconds")

# Benchmark deserialization
start = time.perf_counter()
parsed = ML3Seq.from_ml3seq(ml3seq_str)
end = time.perf_counter()
print(f"Deserialization: {end - start:.4f} seconds")
```

## Integration Patterns

### File System Integration

```python
from ml3on.core import ML3Seq

def save_ml3seq_file(sequence: ML3Seq, filepath: str):
    """Save ML3Seq to file"""
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(sequence.as_ml3seq)

def load_ml3seq_file(filepath: str) -> ML3Seq:
    """Load ML3Seq from file"""
    with open(filepath, 'r', encoding='utf-8') as f:
        return ML3Seq.from_ml3seq(f.read())
```

### Network Integration

```python
import requests
from ml3on.core import ML3Seq

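def send_ml3seq_api_request(url: str, sequence: ML3Seq) -> ML3Seq:
    """Send ML3Seq data via API and parse the ML3Seq response"""
    headers = {'Content-Type': 'text/ml3seq; charset=utf-8'}
    # Encode explicitly: the default separator and the content may be non-ASCII
    body = sequence.as_ml3seq.encode('utf-8')
    response = requests.post(url, data=body, headers=headers, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    return ML3Seq.from_ml3seq(response.text)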
```

### Database Integration

```python
from ml3on.core import ML3Seq

def store_ml3seq_in_database(db_connection, sequence: ML3Seq):
    """Store ML3Seq in database"""
    cursor = db_connection.cursor()
    # "%s" is the placeholder style of drivers such as psycopg2; sqlite3 uses "?"
    cursor.execute(
        "INSERT INTO ml3seq_data (content) VALUES (%s)",
        (sequence.as_ml3seq,)
    )
    db_connection.commit()
```

## Testing

### Running Tests

```bash
# Run all core tests
just test packages/ml3seq-format-core/tests/

# Run specific test file
just test packages/ml3seq-format-core/tests/ml3seq/core/test_sequence.py

# Run with verbose output
just test packages/ml3seq-format-core/tests/ -v
```

### Test Structure

```
packages/ml3seq-format-core/tests/
├── ml3seq/
│   ├── core/
│   │   ├── test_sequence.py        # ML3Seq tests
│   │   ├── test_item.py            # ML3SeqItem tests
│   │   ├── test_multiline.py       # ML3SeqMultilineString tests
│   │   ├── test_multiline__serde.py # Serialization tests
│   │   ├── test_multiline__typing.py # Type tests
│   │   ├── test_multiline__edge_cases.py # Edge case tests
│   │   ├── test_integration.py     # Integration tests
│   │   └── test_edge_cases.py      # Edge cases
│   └── config/
│       ├── test_config.py          # Config tests
│       ├── test_constants.py       # Constants tests
│       └── test_protocol.py        # Protocol tests
```

### Writing Tests

```python
import pytest
from ml3on.core import ML3Seq, ML3SeqItem

def test_basic_serialization():
    """Test basic serialization"""
    item = ML3SeqItem(kind="TEST", field="value")
    sequence = ML3Seq(item)
    
    ml3seq_str = sequence.as_ml3seq
    assert "-~<§BEGIN:TEST" in ml3seq_str
    assert "-~<§END:TEST" in ml3seq_str

def test_round_trip():
    """Test serialization/deserialization round trip"""
    original = ML3Seq(
        ML3SeqItem(kind="TEST", field="value")
    )
    
    ml3seq_str = original.as_ml3seq
    parsed = ML3Seq.from_ml3seq(ml3seq_str)
    
    assert len(parsed.items) == len(original.items)
    assert parsed.items[0].kind == original.items[0].kind
```

## API Reference

### ML3Seq

**Class**: `ML3Seq(*args, config=None, **kwargs)`

**Properties**:
- `config: Optional[ML3SeqFormatConfigProtocol]` - Configuration
- `items: tuple[ML3SeqItem, ...]` - Sequence items
- `separator_prefix: str` - Current separator prefix
- `as_ml3seq: str` - Serialized ML3Seq string

**Methods**:
- `from_ml3seq(cls, value: str) -> ML3Seq` - Parse ML3Seq format string

### ML3SeqItem

**Class**: `ML3SeqItem(kind: str, **kwargs)`

**Properties**:
- `kind: str` - Item type identifier
- `kv_pairs: tuple[tuple[str, Any], ...]` - Key-value pairs

**Methods**:
- `to_ml3seq(config: ML3SeqFormatConfigProtocol) -> str` - Serialize to ML3Seq format
- `from_ml3seq(cls, value: str, config: ML3SeqFormatConfigProtocol) -> ML3SeqItem` - Parse ML3Seq item

### ML3SeqMultilineString

**Class**: `ML3SeqMultilineString(value: str | bytes)`

**Inherits from**: `str`

**Methods**:
- All standard string methods
- `__new__(cls, value)` - Create instance
- `__reduce__()` - Pickle support
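
The general pattern behind such a class, a `str` subclass with `__new__` and `__reduce__`, can be sketched as follows. `MultilineText` is a hypothetical stand-in, and decoding `bytes` input as UTF-8 is an assumption rather than documented behavior of `ML3SeqMultilineString`.

```python
import copy


class MultilineText(str):
    """Hypothetical str subclass used purely as a serialization marker."""

    def __new__(cls, value):
        # Assumption: bytes input is decoded as UTF-8 before becoming a str.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        return super().__new__(cls, value)

    def __reduce__(self):
        # Rebuild through the class so the subtype survives copying and pickling.
        return (type(self), (str(self),))


text = MultilineText(b"first\nsecond")
# copy.copy goes through the same __reduce__ protocol that pickle relies on.
restored = copy.copy(text)
print(type(restored).__name__, restored.splitlines())
```

Because the class adds no behavior of its own, a serializer can use `isinstance` checks to decide how to emit a value while all ordinary string operations keep working.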

### ML3SeqFormatConfig

**Class**: `ML3SeqFormatConfig(separator_prefix=None, **kwargs)`

**Methods**:
- `separator_prefix() -> str` - Get separator prefix
- `to_dict() -> Mapping[str, Any]` - Get config as dictionary

## Best Practices

### 1. Choose Unique Separators

```python
# Good: Unique, unlikely to appear in content
config = ML3SeqFormatConfig(separator_prefix="-~<§")

# Avoid: Common characters that might appear in content
config = ML3SeqFormatConfig(separator_prefix="---")
```

### 2. Validate Input Data

```python
# Validate before serialization
if not isinstance(data, dict):
    raise ValueError("Data must be a dictionary")

# Validate field names
for field_name in data.keys():
    if not isinstance(field_name, str) or not field_name.strip():
        raise ValueError(f"Invalid field name: {field_name}")
```

### 3. Handle Large Data Efficiently

```python
# Process large sequences in chunks
chunk_size = 100
all_items = []

for i in range(0, len(large_data), chunk_size):
    chunk = large_data[i:i + chunk_size]
    items = [ML3SeqItem(kind="DATA", **item) for item in chunk]
    all_items.extend(items)

sequence = ML3Seq(*all_items)
```

### 4. Error Recovery

```python
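import logging

logger = logging.getLogger(__name__)

# Graceful error handling
try:
    sequence = ML3Seq.from_ml3seq(user_provided_string)
except ValueError as e:
    # Fallback to default or alternative format
    # (create_default_sequence is an application-defined helper)
    logger.error("ML3Seq parse error: %s", e)
    sequence = create_default_sequence()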
```

### 5. Configuration Management

```python
# Centralized configuration
DEFAULT_CONFIG = ML3SeqFormatConfig(separator_prefix="-~<§")

def get_ml3seq_config():
    """Get application-wide ML3Seq config"""
    return DEFAULT_CONFIG
```

## Comparison with Other Formats

### ML3Seq vs JSON

**Advantages of ML3Seq:**
- Unescaped multiline content
- Better readability for mixed data
- Explicit structure boundaries
- Reduced cognitive load for LLMs

**When to use JSON:**
- Pure structured data
- Browser compatibility
- Simple configurations
- API responses
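
The "unescaped multiline content" point can be seen with the standard library alone. The snippet below compares how `json.dumps` encodes a small piece of source code with the same text placed verbatim after a field marker; the marker line uses the default prefix documented in this README, and the variable names are illustrative only.

```python
import json

content = 'def greet():\n    print("hi")\n'

# JSON must escape newlines and double quotes inside a string value.
as_json = json.dumps({"content": content})
print(as_json)

# A multiline block carries the same text verbatim after a field marker.
as_block = "-~<§content\n" + content
print(as_block)
```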

### ML3Seq vs YAML

**Advantages of ML3Seq:**
- Explicit multiline blocks
- JSON compatibility for metadata
- Better for mixed structured/unstructured data
- More predictable parsing

**When to use YAML:**
- Human-edited configuration files
- Simple data structures
- When escaping is acceptable

### ML3Seq vs Custom Formats

**Advantages of ML3Seq:**
- Standardized format
- Type safety
- Comprehensive error handling
- Integration with Pydantic
- Well-tested implementation

## Migration Guide

### From JSON to ML3Seq

```python
import json
from ml3on.core import ML3Seq, ML3SeqItem

# Convert JSON to ML3Seq
def json_to_ml3seq(json_str: str) -> ML3Seq:
    data = json.loads(json_str)

    # Treat a single object as a one-element list
    records = data if isinstance(data, list) else [data]

    items = []
    for record in records:
        fields = dict(record)
        # Remove "kind" from the fields so it is not passed twice
        kind = fields.pop("kind", "ITEM")
        items.append(ML3SeqItem(kind=kind, **fields))

    return ML3Seq(*items)
```

### From ML3Seq to JSON

```python
import json
from ml3on.core import ML3Seq

def ml3seq_to_json(ml3seq_str: str) -> str:
    sequence = ML3Seq.from_ml3seq(ml3seq_str)
    
    json_data = []
    for item in sequence.items:
        item_dict = {
            "kind": item.kind,
            # Convert str subclasses such as ML3SeqMultilineString to plain str
            **{k: str(v) if isinstance(v, str) else v for k, v in item.kv_pairs}
        }
        json_data.append(item_dict)
    
    return json.dumps(json_data)
```

## Troubleshooting

### Common Issues

**Issue**: `ValueError: Invalid ML3Seq format`
- **Cause**: Malformed ML3Seq string
- **Solution**: Validate input and check for missing BEGIN/END markers

**Issue**: Fields appearing in wrong format
- **Cause**: Type annotations not properly specified
- **Solution**: Ensure `ML3SeqMultilineString` is used for multiline fields

**Issue**: Separator conflicts
- **Cause**: Separator prefix appears in content
- **Solution**: Choose a more unique separator prefix
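
One way to detect a separator conflict before serializing is to look for content lines that begin with the prefix, since markers are only recognized at the start of a line. The helper `find_separator_conflicts` is hypothetical and not part of the package.

```python
def find_separator_conflicts(content: str, prefix: str = "-~<§") -> list[int]:
    """Return the 1-based numbers of content lines that start with the prefix."""
    return [
        number
        for number, line in enumerate(content.split("\n"), start=1)
        if line.startswith(prefix)
    ]


markdown_text = "Intro\n--- a horizontal rule\n  --- indented, so not at line start"
print(find_separator_conflicts(markdown_text, prefix="---"))
print(find_separator_conflicts(markdown_text))  # default prefix
```

An empty result suggests the chosen prefix is safe for that piece of content; a non-empty one points at the lines that would be misread as markers.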

### Debugging Tips

```python
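from ml3on.core import ML3SeqFormatConfig, ML3SeqItem

config = ML3SeqFormatConfig()  # default separator prefix

# Debug serialization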
def debug_serialize(item):
    print(f"Item kind: {item.kind}")
    print(f"Item fields: {[(k, type(v).__name__) for k, v in item.kv_pairs]}")
    
    ml3seq_str = item.to_ml3seq(config)
    print(f"ML3Seq output:\n{ml3seq_str}")
    return ml3seq_str

# Debug deserialization
def debug_deserialize(ml3seq_str):
    print(f"Input ML3Seq:\n{ml3seq_str}")
    
    lines = ml3seq_str.split('\n')
    print(f"Lines: {len(lines)}")
    for i, line in enumerate(lines):
        print(f"{i}: {repr(line)}")
    
    return ML3SeqItem.from_ml3seq(ml3seq_str, config)
```

## Performance Optimization

### Caching

```python
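from functools import lru_cache

from ml3on.core import ML3SeqFormatConfig, ML3SeqItem

config = ML3SeqFormatConfig()  # default separator prefix

@lru_cache(maxsize=100)
def get_cached_ml3seq(kind: str, fields: tuple[tuple[str, str], ...]) -> str:
    """Cache frequently used ML3Seq items"""
    # lru_cache needs hashable arguments, so fields arrive as a tuple of pairs, not a dict
    item = ML3SeqItem(kind=kind, **dict(fields))
    return item.to_ml3seq(config)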
```

### Batch Processing

```python
def process_batch(items_data):
    """Process items in batches"""
    items = []
    for data in items_data:
        item = ML3SeqItem(**data)
        items.append(item)
    
    # Single serialization call
    sequence = ML3Seq(*items)
    return sequence.as_ml3seq
```

### Memory Management

```python
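def process_large_file(filepath):
    """Process large files efficiently"""
    with open(filepath, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(4096)  # 4096 characters at a time
            if not chunk:
                break

            # process_chunk is an application-defined placeholder; fixed-size
            # chunks can end mid-item, so it must buffer partial items itself
            yield process_chunk(chunk)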
```

## Security Considerations

### Input Validation

```python
# Validate ML3Seq input
def safe_parse_ml3seq(ml3seq_str: str, max_size=1000000):
    """Safely parse ML3Seq with size limits"""
    if len(ml3seq_str) > max_size:
        raise ValueError(f"ML3Seq input too large: {len(ml3seq_str)} characters")
    
    # Check for suspicious patterns
    if "eval(" in ml3seq_str or "import " in ml3seq_str:
        raise ValueError("Potentially unsafe ML3Seq content")
    
    return ML3Seq.from_ml3seq(ml3seq_str)
```

### Field Name Sanitization

```python
def sanitize_field_names(data: dict) -> dict:
    """Sanitize field names before creating ML3Seq items"""
    sanitized = {}
    for key, value in data.items():
        # Remove potentially dangerous characters
        safe_key = ''.join(c for c in key if c.isalnum() or c in '_-')
        if safe_key:
            sanitized[safe_key] = value
    return sanitized
```

## Future Enhancements

### Planned Features

1. **Streaming Support**: For very large datasets
2. **Schema Validation**: Integration with JSON Schema
3. **Performance Optimizations**: For specific use cases
4. **Additional Formats**: Alternative serialization options
5. **Enhanced Error Recovery**: Better handling of malformed input

### Potential Improvements

- **Binary Data Support**: Safe handling of binary content
- **Compression**: Built-in compression options
- **Encryption**: Secure serialization options
- **Versioning**: Format version management
- **Extensions**: Plugin system for custom features

## Documentation

- [Main README](../../README.md): Project overview
- [Pydantic Integration](../ml3seq-format-pydantic/README.md): Pydantic usage
- [Build System](../../BUILD_README.md): Building and versioning

## Support

For issues, questions, or contributions:
- **Issues**: Report bugs and request features
- **Discussions**: Ask questions and share ideas
- **Pull Requests**: Contribute improvements

## License

MIT License - Open source and free to use.

## Changelog

See [VERSION](../../VERSION) file for version history.

## Contributing

Contributions are welcome! Please:
1. Follow existing code patterns
2. Add comprehensive tests
3. Update documentation
4. Maintain backward compatibility
5. Follow the project's coding standards

## Examples

See the [pydantic package](../ml3seq-format-pydantic/README.md) for comprehensive usage examples with Pydantic models.
