Metadata-Version: 2.4
Name: ubase-core
Version: 0.2.0
Summary: Universal base conversion with significant leading zero support
Author: Andrew Lehti
License-Expression: MIT
Keywords: base,unicode,encoding,conversion,emoji
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# ubase-core

```bash
ubase-core encode 1111111110 10 1
```

`ubase-core` is a universal base conversion library for encoding integers into arbitrary alphabets and decoding them back into integers.

Unlike traditional converters that are limited to fixed alphabets such as base2, base16, or base64, `ubase-core` lets you control how the alphabet is built. You can use curated Unicode sets, full Unicode order, emoji ranges, exclusions, and custom seeded alphabets. Bases 2 through 62 work out of the box.

The package provides both:

- a Python API
- a command-line interface (CLI)

---

# Install

```bash
pip install ubase-core
```

---

# Import

Top-level import:

```python
from ubase import uBase
```

Top-level import with the helper functions:

```python
from ubase import uBase, abc, effBase
```

Direct module import:

```python
from ubase.core import uBase, abc, effBase
```

---

# Quick Start

```python
from ubase import uBase

x = 1111111110

enc = uBase(x, 16, 1)    # encode the integer in base 16, significance mode on
dec = uBase(enc, 16, 1)  # decode the result with the same b and s

print(enc)
print(dec)
```

---

# Core API

```python
uBase(n, b, s, m=0, x=None, u=0, safe=1)
```

## Parameters

| parameter | meaning |
|---|---|
| `n` | integer to encode, or encoded value to decode |
| `b` | requested base, or selector `0` / `1` |
| `s` | significance mode: `1` on, `0` off |
| `m` | mode: `0` symbols, `1` codepoint hex, `2` raw digit indices |
| `x` | exclusions and/or custom seeded alphabet specification |
| `u` | Unicode mode: `0` curated, `1` full Unicode order |
| `safe` | safe filter: `1` on, `0` off |

---

# Features

## Standard base conversion

- Encodes integers into positional base systems.
- Decodes encoded values back into integers.
- Works with any base `>= 2`.
- Supports very large integers.

Example:

```text
1111110 -> base16 -> 10f446
10f446 -> base16 -> 1111110
```
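The round trip above can be reproduced with plain positional arithmetic. The following is a minimal sketch, not the library's implementation: `encode_plain`, `decode_plain`, and the `HEX` alphabet are hypothetical names chosen for illustration, and the sketch corresponds to ordinary positional notation with no significance offset.

```python
HEX = "0123456789abcdef"  # assumed alphabet, for illustration only


def encode_plain(n: int, alphabet: str) -> str:
    """Encode a non-negative integer in plain positional notation."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(alphabet)
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, r = divmod(n, base)      # peel off the least significant digit
        digits.append(alphabet[r])
    return "".join(reversed(digits))


def decode_plain(text: str, alphabet: str) -> int:
    """Decode a string produced by encode_plain with the same alphabet."""
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = 0
    for ch in text:
        n = n * base + index[ch]
    return n


print(encode_plain(1111110, HEX))   # 10f446
print(decode_plain("10f446", HEX))  # 1111110
```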

## Configurable alphabets

The digit alphabet used for encoding can be customized.

Supported sources include:

- the default curated alphabet
- full Unicode order
- emoji reservoirs
- custom seed strings
- filtered alphabets with excluded characters or range ids

This allows you to build alphabets such as:

- base16 using `0123456789abcdef`
- base64-style alphabets
- emoji-heavy encodings
- Unicode-heavy encodings
- fully custom symbol systems

## Custom alphabet seeds

To seed the alphabet with your own ordered symbols, start `x` with `-1`.

```python
x = [-1, "0123456789abcdef"]
```

Behavior:

- those characters are placed at the beginning of the alphabet
- duplicates are removed by first appearance
- remaining symbols are filled from the normal generator only if more are needed

Combined example:

```python
x = [-1, "0123456789abcdef", ("x", "0OIl"), 21]
```

This means:

- seed with `0123456789abcdef`
- exclude the characters `0OIl`
- exclude Unicode range id `21`
- fill the remainder normally
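The seeding rules above can be pictured with a small standard-library sketch. It is an assumption-laden illustration, not the library's code: `build_alphabet` is a hypothetical helper, the fallback pool of digits and ASCII letters stands in for the real generator, exclusions are assumed to apply to seed characters as well, and Unicode range ids are not modeled.

```python
import string
from itertools import chain


def build_alphabet(size, seed="", exclude="",
                   fallback=string.digits + string.ascii_letters):
    """Hypothetical helper: seed first, then fill from a fallback pool."""
    banned = set(exclude)
    seen = set()
    alphabet = []
    for ch in chain(seed, fallback):
        if len(alphabet) == size:
            break
        if ch in banned or ch in seen:
            continue                 # skip exclusions and later duplicates
        seen.add(ch)
        alphabet.append(ch)
    return "".join(alphabet)


# The duplicate "a" is dropped, "0" is excluded, the rest comes from the pool.
print(build_alphabet(8, seed="abca", exclude="0OIl"))   # abc12345

# If the pool runs dry, the result is simply shorter than requested.
print(build_alphabet(100, seed="xyz", fallback="0123"))  # xyz0123
```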

## Character exclusions

You can remove characters from the alphabet.

Exclude visually confusing characters:

```python
x = "0OIl"
```

Exclude Unicode ranges:

```python
x = [21, 24]
```

Exclude both:

```python
x = [21, "0OIl"]
```

## Unicode alphabet support

Two modes exist:

### Curated mode (`u=0`)

- Uses a curated set of Unicode blocks.
- Avoids many problematic characters.
- Best for most use cases.

### Full Unicode mode (`u=1`)

- Iterates through the entire Unicode codepoint range.
- Allows extremely large bases.
- May include characters that render differently depending on terminal or font support.

## Unicode safety filter

Optional filtering removes problematic Unicode categories.

```python
safe = 1
```

Filters include:

- control characters
- surrogate ranges
- variation selectors
- private-use ranges
- non-characters
- whitespace-like characters covered by the configured filter rules
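To get a feel for what such a filter rejects, here is a rough approximation built on Python's `unicodedata` module. The function `looks_safe` and the category set are assumptions made for this sketch; the library's actual rules may be broader or narrower (for example, category `Cn` below also rejects unassigned codepoints, not only non-characters).

```python
import unicodedata

# General categories rejected by this sketch:
# Cc control, Cs surrogate, Co private use, Cn unassigned / non-character,
# Zs / Zl / Zp space, line, and paragraph separators.
BLOCKED_CATEGORIES = {"Cc", "Cs", "Co", "Cn", "Zs", "Zl", "Zp"}


def is_variation_selector(cp: int) -> bool:
    """Variation Selectors block and Variation Selectors Supplement."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def looks_safe(ch: str) -> bool:
    """Hypothetical approximation of a safe-character test."""
    if is_variation_selector(ord(ch)) or ch.isspace():
        return False
    return unicodedata.category(ch) not in BLOCKED_CATEGORIES


print(looks_safe("A"))          # True
print(looks_safe(chr(0x07)))    # False (control character)
print(looks_safe(chr(0xFE0F)))  # False (variation selector)
print(looks_safe(chr(0xE000)))  # False (private use)
```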

## Multiple output formats

The encoder can return results in three formats.

### Symbol mode (`m=0`)

Normal encoded output.

```python
uBase(1111110, 16, 1)
```

### Codepoint hex mode (`m=1`)

Outputs the encoded string as space-separated Unicode codepoint hex values.

```python
uBase(1111110, 16, 1, 1)
```

Useful for:

- stable transport through shells or text systems
- debugging character output
- inspecting Unicode-heavy alphabets
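A codepoint hex view is easy to reproduce by hand when inspecting output. The helpers below are a sketch with hypothetical names (`to_hex_view`, `from_hex_view`); the exact formatting `uBase` emits (letter case, zero padding) is not specified here, so lowercase unpadded hex is an assumption.

```python
def to_hex_view(text: str) -> str:
    """Render each character as its codepoint in hex, separated by spaces."""
    return " ".join(f"{ord(ch):x}" for ch in text)


def from_hex_view(view: str) -> str:
    """Rebuild the string from a space-separated codepoint hex view."""
    return "".join(chr(int(token, 16)) for token in view.split())


sample = "a\U0001F600"              # "a" followed by the grinning face emoji
view = to_hex_view(sample)
print(view)                          # 61 1f600
print(from_hex_view(view) == sample)  # True
```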

### Digit index mode (`m=2`)

Returns the raw alphabet index sequence.

```python
uBase(1111110, 16, 1, 2)
```

Useful for:

- debugging
- testing
- analysis
- building a separate render layer
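The "separate render layer" idea can be sketched as follows: compute the digit indices once, then map them onto whichever alphabet you like. The names `to_digits` and `render` are hypothetical, and the digits are computed with plain positional arithmetic, so they are not guaranteed to match what `uBase` returns when `s=1`.

```python
def to_digits(n: int, base: int) -> list[int]:
    """Plain positional digit indices, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1]


def render(digits: list[int], alphabet: str) -> str:
    """Map digit indices onto the symbols of an alphabet."""
    return "".join(alphabet[d] for d in digits)


digits = to_digits(1111110, 16)
print(digits)                              # [1, 0, 15, 4, 4, 6]
print(render(digits, "0123456789abcdef"))  # 10f446
print(render(digits, "0123456789ABCDEF"))  # 10F446
```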

## Significance-preserving width mode

### `s = 0`

Standard positional encoding. No width offset. No preserved leading digit positions.

### `s = 1`

Significance-preserving width mode.

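This mode offsets the value by the size of the shorter-length buckets, so each encoded length covers its own range of integers. The length itself then carries information, and values round-trip without losing leading positions that hold the alphabet's zero symbol.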

Use the same `s` value for encode and decode.
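The exact offset scheme is not spelled out here, so the following is only one plausible reading: a length-bucket encoding in which all strings of length 1 come first, then all strings of length 2, and so on. `encode_sig` and `decode_sig` are hypothetical names, and the output of this sketch should not be expected to match `uBase` digit for digit.

```python
def encode_sig(n: int, alphabet: str) -> str:
    """Length-bucket encoding: every string, leading zeros included, is unique."""
    base = len(alphabet)
    length, bucket = 1, base
    while n >= bucket:          # skip past all shorter-length buckets
        n -= bucket
        length += 1
        bucket *= base
    digits = []
    for _ in range(length):     # fixed width, so leading zeros are kept
        n, r = divmod(n, base)
        digits.append(alphabet[r])
    return "".join(reversed(digits))


def decode_sig(text: str, alphabet: str) -> int:
    """Inverse of encode_sig for the same alphabet."""
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    value = 0
    for ch in text:
        value = value * base + index[ch]
    # Add back the sizes of all shorter-length buckets.
    return value + sum(base ** k for k in range(1, len(text)))


print([encode_sig(n, "01") for n in range(7)])
# ['0', '1', '00', '01', '10', '11', '000']
print(decode_sig("000", "01"))  # 6
```

In this scheme `"0"`, `"00"`, and `"000"` decode to different integers, which is the sense in which leading zero positions stay significant.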

## Automatic base clamping

If the requested base is larger than the usable alphabet size after filtering, exclusions, and seed rules, the converter does not fail. It clamps to the largest usable base under the current settings.

You can inspect the actual base used with:

```python
from ubase import effBase

print(effBase(758327457298))
```
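Conceptually, the clamping described above amounts to taking the smaller of the requested base and the number of usable symbols. A minimal sketch, assuming a hypothetical `clamp_base` helper and an already-filtered alphabet string:

```python
def clamp_base(requested: int, usable_alphabet: str) -> int:
    """Return the base that can actually be served by the usable alphabet."""
    return min(requested, len(usable_alphabet))


usable = "0123456789abcdef"        # stand-in for a filtered alphabet
print(clamp_base(4096, usable))    # 16 (request exceeds the alphabet)
print(clamp_base(10, usable))      # 10 (request fits, so it is kept)
```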

## Escape parsing inside seed and exclude strings

Seed and exclude strings support escaped Unicode values.

Supported prefixes:

- `\u`
- `\U`
- `\x`

The parser reads hex digits until the next whitespace or the end of the string.

Examples:

```python
r"\u263A"
r"\U1F600"
r"abc\U1F600 def"
r"\x41 \x42 \x43"
```

Outputs:

- `☺`
- `😀`
- `abc😀def`
- `ABC`

### Important rule

Whitespace ends the escape value.

Because the letters `a` through `f` are valid hex digits, a parser without a delimiter cannot reliably tell where the codepoint stops and where normal text begins. This is also why whitespace is never a valid symbol in any base alphabet.
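The whitespace-delimited rule can be imitated with a short regular expression. This is a sketch of the documented behavior, not the library's parser: `expand_escapes` is a hypothetical name, and raising `ValueError` on a malformed escape is a choice made for this example only.

```python
import re

# Backslash, one of u / U / x, then everything up to the next whitespace
# (or the end of the string). One trailing whitespace character is consumed
# as the delimiter.
_ESCAPE = re.compile(r"\\[uUx](\S+)\s?")


def expand_escapes(text: str) -> str:
    """Expand u / U / x escapes into the characters they name."""
    # int(..., 16) raises ValueError when the run contains non-hex characters,
    # which is what happens if the whitespace delimiter is left out.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(expand_escapes(r"\u263A"))          # ☺
print(expand_escapes(r"abc\U1F600 def"))  # abc😀def
print(expand_escapes(r"\x41 \x42 \x43"))  # ABC
```

Without the delimiter, an input such as `r"\x41\x42"` is read as a single run (`41\x42`), which is not valid hex and fails in this sketch.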

## Blocked characters

These never enter the final alphabet, even if you include them in a custom seed:

- backslash `\`
- forward slash `/`
- single quote `'`
- double quote `"`
- backtick `` ` ``

Whitespace is also removed from the alphabet.

---

# Base Behavior

## Standard bases

For `b >= 2`, the converter resolves an alphabet of up to that many symbols.

## Special selectors

| value | meaning |
|---|---|
| `b >= 2` | normal requested base |
| `b = 0` | base62 reservoir plus emoji reservoir |
| `b = 1` | emoji reservoir |

---

# Helper Functions

## `abc(b, x=None, u=0, safe=1)`

Returns the resolved alphabet string.

```python
from ubase import abc

print(abc(64))
```

## `effBase(b, x=None, u=0, safe=1)`

Returns the effective base after all filtering and clamping.

```python
from ubase import effBase

print(effBase(4096, [-1, "0123456789abcdef"], 1, 1))
```

---

# Example Usage

```python
from ubase import uBase

# basic encode / decode
print(uBase(n=1111110, b=10, s=1))
print(uBase(n=uBase(1111110, 10, 1), b=10, s=1))

# hexadecimal-style base
print(uBase(n=1111110, b=16, s=1))
print(uBase(n=uBase(1111110, 16, 1), b=16, s=1))

# base64-style alphabet
print(uBase(n=1111110, b=64, s=1))

# emoji alphabet
print(uBase(n=1111110, b=1, s=1))

# base62 + emoji reservoir
print(uBase(n=1111110, b=0, s=1))

# return codepoint hex view
print(uBase(n=1111110, b=16, s=1, m=1))

# return raw digit indices
print(uBase(n=1111110, b=16, s=1, m=2))

# decode from digit indices
digits = uBase(n=1111110, b=16, s=1, m=2)
print(uBase(n=digits, b=16, s=1, m=2))

# custom seeded alphabet
print(uBase(
    n=1111110,
    b=16,
    s=1,
    x=[-1, "0123456789abcdef"]
))

# seeded alphabet with exclusions
print(uBase(
    n=1111110,
    b=128,
    s=1,
    x=[-1, "0123456789abcdef", ("x", "0OIl"), 21]
))

# full Unicode ordering
print(uBase(
    n=1111110,
    b=128,
    s=1,
    u=1
))

# disable Unicode safety filter
print(uBase(
    n=1111110,
    b=128,
    s=1,
    u=1,
    safe=0
))

# exclusion of characters
print(uBase(
    n=1111110,
    b=64,
    s=1,
    x="0OIl"
))

# exclusion of Unicode ranges
print(uBase(
    n=1111110,
    b=128,
    s=1,
    x=[21, 24]
))

# escaped Unicode inside seed
print(uBase(
    n=1111110,
    b=64,
    s=1,
    x=[-1, r"abc\U1F600 def"]
))

# very large base request (will clamp automatically)
print(uBase(
    n=1111110,
    b=1000000,
    s=1
))

# round-trip demonstration
x = 1111110
encoded = uBase(x, 64, 1)
decoded = uBase(encoded, 64, 1)

print("encoded:", encoded)
print("decoded:", decoded)
```

---

# CLI

The CLI uses subcommands.

```bash
ubase-core --version
```

## Command summary

| command | purpose |
|---|---|
| `encode` | encode an integer |
| `decode` | decode symbols, hex view, or digit indices |
| `alphabet` | show the resolved alphabet or a preview |
| `effbase` | show the effective base |
| `info` | show resolved configuration and alphabet preview |
| `roundtrip` | encode in all modes and verify round-trip correctness |

## CLI basics

### Encode

```bash
ubase-core encode 1111111110 16 1
```

### Decode

```bash
ubase-core decode 4247d0a76 16 1
```

### Hex output mode

```bash
ubase-core encode 1111111110 16 1 -m 1
```

### Digit-index mode

```bash
ubase-core encode 1111111110 16 1 -m 2
```

Decode digit indices back:

```bash
ubase-core decode "4 2 4 7 13 0 10 7 6" 16 1 -m 2
```

Comma-separated input also works:

```bash
ubase-core decode "4,2,4,7,13,0,10,7,6" 16 1 -m 2
```

## CLI options

| option | meaning |
|---|---|
| `-m`, `--mode` | `0` symbols, `1` codepoint hex, `2` digit indices |
| `-u`, `--unicode` | `0` curated, `1` full Unicode |
| `--safe` | safe filter: `1` on, `0` off |
| `--unsafe` | shortcut for `--safe 0` |
| `--seed` | custom seed segment, repeatable |
| `--exclude` | exclude characters, repeatable |
| `--range` | exclude Unicode range id, repeatable |
| `-x`, `--spec` | legacy mixed exclusions: ints exclude ranges, strings exclude chars |

## Invalid values

The core normalizes invalid values for `s`, `m`, `x`, `u`, and `safe` to:

- `s = 1`
- `m = 0`
- `x = None`
- `u = 0`
- `safe = 1`

The CLI passes values through unchanged, so the same normalization applies.

## CLI custom seed examples

### Hex-style alphabet seed

```bash
ubase-core encode 1111111110 16 1 --seed 0123456789abcdef
```

### Base64-style seed

```bash
ubase-core encode 1111111110 64 1 --seed 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-_
```

### Seed plus exclusions

```bash
ubase-core encode 1111111110 128 1 --seed 0123456789abcdef --exclude 0OIl --range 21 -u 1
```

### Multiple seed segments

Order is preserved across repeated `--seed` flags.

```bash
ubase-core encode 1111111110 128 1 --seed abc --seed 123 --seed XYZ
```

## CLI escape examples

Use shell quoting when the seed contains backslashes or spaces.

### Unicode smiley

```bash
ubase-core alphabet 64 --seed 'abc\u263A def'
```

### Unicode emoji

```bash
ubase-core alphabet 64 --seed 'abc\U1F600 def'
```

### Repeated seeds instead of internal spaces

```bash
ubase-core alphabet 64 --seed abc --seed '\U1F600' --seed def
```

## CLI alphabet inspection

### Preview alphabet

```bash
ubase-core alphabet 128
```

### Preview in hex

```bash
ubase-core alphabet 128 --hex-view
```

### Full alphabet

```bash
ubase-core alphabet 128 --full
```

### Show effective base only

```bash
ubase-core effbase 758327457298
```

### Show full config and preview

```bash
ubase-core info 128 --seed 0123456789abcdef --exclude 0OIl --range 21 -u 1
```

## CLI round-trip verification

This command encodes the input in all three modes and decodes each result back.

```bash
ubase-core roundtrip 1111111110 128 1 --seed 0123456789abcdef --exclude 0OIl --range 21 -u 1
```

It prints:

- requested base
- effective base
- glyph encoding
- hex-view encoding
- digit-index encoding
- all three decoded integers
- pass/fail status

A failing round trip exits with a non-zero exit code.

---

# Notes

## Use the same configuration for decode

To reverse a value correctly, decode with the same:

- `b`
- `s`
- `m`
- `x`
- `u`
- `safe`

## `m = 2` returns alphabet indices

Digit-index mode returns positions in the resolved alphabet, not human-readable decimal digits.

## Terminal rendering varies

Some Unicode glyphs may render differently or not at all depending on terminal, font, and platform.

Use `m = 1` or `m = 2` when you need a stable textual transport form.

## Shell quoting matters

When a CLI argument contains spaces, backslashes, or escape sequences, quote it.

Examples:

```bash
--seed 'abc\U1F600 def'
--exclude '0OIl'
```

---

# License

MIT License
