Metadata-Version: 2.4
Name: unicode-text-to-identifier
Version: 0.1.0a0
Summary: A tool that converts arbitrary text (like user input or file names) into valid Python identifiers while preserving as much of the original meaning as possible.
Author-email: Jifeng Wu <jifengwu2k@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/jifengwu2k/unicode-text-to-identifier
Project-URL: Bug Tracker, https://github.com/jifengwu2k/unicode-text-to-identifier/issues
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=2
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: enum34; python_version < "3.4"
Requires-Dist: typing; python_version < "3.5"
Dynamic: license-file

A tool that converts arbitrary text (like user input or file names) into valid Python identifiers while preserving as much of the original meaning as possible.

## Installation

```bash
pip install unicode-text-to-identifier
```

## Usage

```python
# coding=utf-8
from unicode_text_to_identifier import unicode_text_to_identifier

assert unicode_text_to_identifier(u"") == u"_"
assert unicode_text_to_identifier(u" \r\n\t") == u"_"
assert unicode_text_to_identifier(u"123abc") == u"_123abc"
assert unicode_text_to_identifier(u"&abc 123") == u"_abc_123"
assert unicode_text_to_identifier(u"  hello  world  $") == u"_hello_world__"
assert unicode_text_to_identifier(u"测试@unicode") == u"测试_unicode"
```

## How It Works

First, `get_unicode_character_type()` categorizes each Unicode character into one of four types:

- `LETTER_OR_UNDERSCORE` (Unicode category starting with 'L' or underscore `_`)
- `DECIMAL_DIGIT` (Unicode category 'Nd')
- `SPACE_OR_CONTROL` (Unicode categories starting with 'Z' or 'Cc')
- `OTHER` (all other characters)

Then, it implements the following conversion rules using a state machine:

- The first character must be a letter/underscore. If the first character is a digit, prepend `_` to make it valid.
- Subsequent valid characters (letters/underscores/digits) are kept as-is.
- Other characters are replaced with underscores, but whitespace/control character sequences are collapsed into single underscores.
- Ensures the output is non-empty (appends `_` if empty).

## Contributing

Contributions are welcome! Please submit pull requests or open issues on the GitHub repository.

## License

This project is licensed under the [MIT License](LICENSE).
