Metadata-Version: 2.4
Name: charsetrs
Version: 0.1.0
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
License-File: LICENSE
Summary: A Python library with Rust bindings for charset detection
Author: Mario Taddeucci
License: MIT
Requires-Python: >=3.10, <3.14
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# charsetrs

A fast Python library with Rust bindings for detecting file character encodings and normalizing files.

## Features

- **Simple API**: Just two functions, `analyse()` and `normalize()`
- **Fast encoding detection** using Rust
- **Newline detection**: Detects LF, CRLF, or CR newline styles
- **File normalization**: Convert encoding and newlines in-place using streaming
- **Memory efficient**: Constant memory usage (~56KB) for files of any size
- **Supports large files**: Process 10GB+ files on 512MB RAM systems
- **Supports multiple encodings**: UTF-8, UTF-16, ASCII, Latin-1, Windows-1252, Windows-1256 (Arabic), CP949 (Korean), and more
- **Configurable sample size**: Control memory usage vs accuracy trade-off

## Installation

### Development Installation

```bash
# Install dependencies
uv sync

# Build and install in development mode
uv run maturin develop
```

### Production Build

```bash
uv run maturin build --release
```

## Usage

### Basic Usage

```python
import charsetrs

# Analyse file encoding and newline style
result = charsetrs.analyse("file.txt")
print(f"Encoding: {result.encoding}")  # e.g., 'utf_8'
print(f"Newlines: {result.newlines}")  # e.g., 'LF', 'CRLF', or 'CR'

# Normalize file to UTF-8 with LF newlines (in-place modification)
charsetrs.normalize(
    "file.txt",
    encoding="utf-8",
    newlines="LF"
)
```

### Working with Large Files

The library uses streaming to efficiently handle files of any size with constant memory usage (~56KB):

```python
import charsetrs

# Use only 512KB for detection (faster, less memory)
result = charsetrs.analyse("large_file.txt", max_sample_size=512*1024)

# Use 2MB for detection (more accurate)
result = charsetrs.analyse("large_file.txt", max_sample_size=2*1024*1024)

# Normalize large file with custom sample size
# Memory usage: ~56KB regardless of file size (10GB+ files supported)
charsetrs.normalize(
    "large_file.txt",
    encoding="utf-8",
    newlines="LF",
    max_sample_size=1024*1024
)
```
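
To make the idea of bounded-memory streaming concrete, here is a minimal standard-library sketch. It is not charsetrs's Rust implementation: the function name `stream_transcode`, the 64KB chunk size, and the write-to-temporary-file-then-replace strategy are all assumptions chosen for illustration.

```python
import codecs
import os
import tempfile


def stream_transcode(path, source_encoding, target_encoding="utf-8", chunk_size=64 * 1024):
    """Re-encode a file chunk by chunk so memory use is bounded by chunk_size."""
    # Incremental codecs keep partial multi-byte sequences between chunks.
    decoder = codecs.getincrementaldecoder(source_encoding)()
    encoder = codecs.getincrementalencoder(target_encoding)()

    # Create the temporary file next to the original so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                final = not chunk  # an empty read means end of file
                text = decoder.decode(chunk, final=final)
                dst.write(encoder.encode(text, final=final))
                if final:
                    break
        os.replace(tmp_path, path)  # swap the converted file into place
    except BaseException:
        os.unlink(tmp_path)  # do not leave a half-written temporary file behind
        raise
```

Only one chunk and its decoded text are held in memory at a time, which is why the file size does not affect peak memory in this sketch.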

### Newline Normalization

Convert between different newline styles (in-place modification):

```python
import charsetrs

# Convert Windows-style (CRLF) to Unix-style (LF)
charsetrs.normalize("windows.txt", encoding="utf-8", newlines="LF")

# Convert to Windows-style (CRLF)
charsetrs.normalize("unix.txt", encoding="utf-8", newlines="CRLF")

# Convert to old Mac-style (CR)
charsetrs.normalize("file.txt", encoding="utf-8", newlines="CR")
```
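
As a rough illustration of how a newline style can be classified from a byte sample, the sketch below counts CRLF pairs, lone LF, and lone CR, then picks the most frequent. The helper `detect_newlines` is hypothetical, the heuristic charsetrs actually uses may differ, and this version only makes sense for ASCII-compatible encodings (not UTF-16 or UTF-32).

```python
def detect_newlines(sample: bytes) -> str:
    """Return 'LF', 'CRLF', or 'CR' based on which terminator is most common."""
    crlf = sample.count(b"\r\n")
    lf = sample.count(b"\n") - crlf  # LF bytes that are not part of a CRLF pair
    cr = sample.count(b"\r") - crlf  # CR bytes that are not part of a CRLF pair
    counts = {"LF": lf, "CRLF": crlf, "CR": cr}
    # On a tie (including a sample with no newlines) the first key, 'LF', wins.
    return max(counts, key=counts.get)
```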

### Supported Encodings

- UTF-8, UTF-16 (LE/BE), UTF-32
- ISO-8859-1 (Latin-1)
- Windows code pages: 1252, 1256 (Arabic), 1255 (Hebrew), 1253 (Greek), 1251 (Cyrillic), 1254 (Turkish), 1250 (Central European)
- CP949 (Korean), EUC-KR
- Shift_JIS, EUC-JP (Japanese)
- Big5, GBK, GB2312 (Chinese)
- KOI8-R, KOI8-U (Cyrillic)
- Mac encodings (Roman, Cyrillic)
- ASCII
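
The examples in this README show `analyse()` reporting names such as `'utf_8'`, which resemble Python codec names. Assuming the reported names are valid Python codec names (an assumption, not something this document confirms), the standard `codecs.lookup` function can check whether two spellings refer to the same codec. The helper `same_codec` is hypothetical.

```python
import codecs


def same_codec(a: str, b: str) -> bool:
    """Return True if both names resolve to the same Python codec.

    Raises LookupError if either name is unknown to Python.
    """
    return codecs.lookup(a).name == codecs.lookup(b).name
```

This can be useful for skipping a conversion when the detected encoding already matches the target, for example when comparing `'utf_8'` with `'utf-8'`.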

## API Reference

### `charsetrs.analyse(file_path, max_sample_size=None)`

Analyse the encoding and newline style of a file.

**Parameters:**
- `file_path` (str or Path): Path to the file
- `max_sample_size` (int, optional): Maximum number of bytes to read for detection. When `None`, the default of 1MB is used

**Returns:**
- `AnalysisResult`: Object with `encoding` and `newlines` attributes

**Example:**
```python
result = charsetrs.analyse("file.txt")
print(result.encoding)  # 'utf_8'
print(result.newlines)  # 'LF'
```

### `charsetrs.normalize(file_path, encoding="utf-8", newlines="LF", max_sample_size=None)`

Normalize a file by converting its encoding and newline style in-place using streaming.

This function modifies the file in-place with constant memory usage (~56KB), making it suitable for very large files (10GB+) on memory-constrained systems (512MB RAM).

**Parameters:**
- `file_path` (str or Path): Path to the file to normalize
- `encoding` (str, optional): Target encoding (default: 'utf-8')
- `newlines` (str, optional): Target newline style - 'LF', 'CRLF', or 'CR' (default: 'LF')
- `max_sample_size` (int, optional): Maximum number of bytes to read for detection. When `None`, the default of 1MB is used

**Raises:**
- `ValueError`: If the encoding conversion fails or the `newlines` value is invalid
- `IOError`: If the file cannot be read or written
- `LookupError`: If the target encoding is invalid
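
One way to handle the exceptions listed above is sketched below. The wrapper `try_normalize` and its status strings are hypothetical. It takes the function to call as a parameter so the snippet runs without the package installed; in practice you would pass `charsetrs.normalize`.

```python
def try_normalize(normalize_fn, path, **options):
    """Call normalize_fn(path, **options) and map documented errors to a status string."""
    try:
        normalize_fn(path, **options)
    except LookupError:
        return "unknown target encoding"
    except ValueError:
        return "conversion failed or invalid newlines value"
    except OSError:  # IOError is an alias of OSError in Python 3
        return "file could not be read or written"
    return "ok"
```

A call would then look like `try_normalize(charsetrs.normalize, "input.txt", encoding="utf-8", newlines="LF")`.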

**Example:**
```python
charsetrs.normalize(
    "input.txt",
    encoding="utf-8",
    newlines="LF"
)
```

### `AnalysisResult`

A frozen dataclass containing analysis results:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class AnalysisResult:
    encoding: str                        # e.g., 'utf_8', 'cp1252'
    newlines: Literal["LF", "CRLF", "CR"]  # Detected newline style
```
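
Because the dataclass is frozen, instances are read-only and hashable. The sketch below shows what that means in practice, using a local stand-in class that mirrors the definition above so the snippet runs without the package installed.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Literal


@dataclass(frozen=True)
class AnalysisResult:  # local stand-in, not imported from charsetrs
    encoding: str
    newlines: Literal["LF", "CRLF", "CR"]


result = AnalysisResult(encoding="utf_8", newlines="LF")

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    result.encoding = "cp1252"
except FrozenInstanceError:
    print("AnalysisResult is read-only")

# dataclasses.replace builds a modified copy and leaves the original untouched.
crlf_variant = replace(result, newlines="CRLF")
```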

## Testing

Run the test suite:

```bash
uv run pytest tests/
```

Run specific tests:

```bash
# Test the API
uv run pytest tests/test_charsetrs_api.py -v

# Test with sample files
uv run pytest tests/test_full_detection.py -v
```

## Development Tasks

The project uses taskipy for common development tasks:

```bash
# Run tests
uv run task test

# Format all code (Python + Rust)
uv run task format

# Check formatting and linting (Python + Rust)
uv run task lint

# Format only Rust code
uv run task format_rust

# Lint only Rust code (formatting + clippy)
uv run task lint_rust
```

## Project Structure

```
.
├── src/
│   ├── charsetrs/         # Python package
│   │   └── __init__.py    # Python API
│   └── charsetrs_core/    # Rust source code
│       └── lib.rs         # Rust encoding detection
├── tests/                 # Test suite
│   ├── test_charsetrs_api.py
│   ├── test_full_detection.py
│   └── data/              # Sample files in various encodings
├── pyproject.toml         # Python project configuration
└── Cargo.toml             # Rust project configuration
```

## Performance

The library uses streaming to efficiently handle large files:
- **Constant memory usage**: ~56KB regardless of file size
- **Suitable for large files**: Process 10GB+ files on 512MB RAM systems
- **Default detection**: Reads 1MB sample for encoding detection
- **Configurable**: Adjust `max_sample_size` based on your needs
- **Single-pass processing**: Linear time complexity, O(n) in the file size

For more details, see [MEMORY_EFFICIENCY.md](MEMORY_EFFICIENCY.md).

## License

MIT
