Metadata-Version: 2.1
Name: ideograph
Version: 1.0.0
Summary: Tool for finding ideographic (e.g. Han) characters from their components
Home-page: https://github.com/iwsfutcmd/ideograph
Author: Ben Yang
Author-email: benayang@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Natural Language :: Japanese
Classifier: Natural Language :: Korean
Classifier: Natural Language :: Vietnamese
Description-Content-Type: text/markdown

# ideograph

A tool to look up ideographs by their components. It currently covers only Han characters, but it could be expanded to include other ideographic scripts such as Tangut or Sumero-Akkadian Cuneiform.

## Installation

```bash
$ pip install ideograph
```

## Usage

*ideograph* exposes a single function, `find()`, which takes a string of ideograph components and returns a set of the ideographs that contain all of those components.

Characters in the component string that are not ideographic components are ignored.

Note that the current implementation is quite strict and relies on visual distinction for components rather than etymological connection: e.g. "人" ≠ "亻".
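
To make the "人" ≠ "亻" point concrete, the short sketch below uses only the Python standard library to show that the two forms are separate Unicode code points, so a lookup keyed on one will not match the other. The helper name `describe` is made up for this illustration and is not part of *ideograph*.

```python3
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# Visually related, but encoded separately, so they are distinct components.
for ch in ("人", "亻"):
    print(ch, describe(ch))
```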

*ideograph* can either be used from the command line:

```bash
$ ideograph 木日勿
䵘楊鸉𣝻𣿘𥂸𥠜𦼴𩁒𪳷𬬍
```

or imported as a Python package:

```python3
>>> import ideograph
>>> ideograph.find("木日勿")
{'𣿘', '𣝻', '𥠜', '𪎥', '𩁒', '𪎧', '𥟘', '𣓗', '楊', '𣓾', '𬬍', '𪳷', '𦼴', '鸉', '䵘', '𥂸'}
```
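
Because `find()` returns a set, the order in which results are displayed is not guaranteed. If a stable order is wanted, one option is to sort the result by code point. The sketch below assumes every element is a single-character string; `sorted_by_code_point` is a hypothetical helper, not part of the package, and the sample set is just a subset of the result shown above.

```python3
def sorted_by_code_point(chars):
    """Join a collection of single characters into a string ordered by code point."""
    return "".join(sorted(chars, key=ord))


# A few of the characters from the example result above.
sample = {"𣿘", "𣝻", "楊", "鸉", "䵘"}
print(sorted_by_code_point(sample))
```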

## Data

Character components are derived from the [cjkvi-ids database](https://github.com/cjkvi/cjkvi-ids) (included in this Git repository as a submodule), specifically the `ids-cdp.txt` data file. Some components do not currently have a Unicode code point assigned to them, so they are given code points in the Private Use Area of Unicode. As a result, the `find()` function may return some of these Private Use Area characters, which will not display meaningfully without a font that supports them.
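
If Private Use Area characters are unwanted in a result, they can be filtered out after the fact. The following is a minimal sketch using the standard library's `unicodedata` module, where Private Use characters have the general category `Co`. The helper names are hypothetical and are not part of *ideograph*.

```python3
import unicodedata


def is_private_use(ch):
    """True if the character belongs to a Unicode Private Use Area."""
    return unicodedata.category(ch) == "Co"


def drop_private_use(chars):
    """Return a new set without Private Use Area characters."""
    return {ch for ch in chars if not is_private_use(ch)}


# "\ue000" is the first code point of the BMP Private Use Area.
print(drop_private_use({"楊", "\ue000"}))
```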

Data is stored in a SQLite database, which can be regenerated from the cjkvi-ids data by running the `generate_data.py` script.
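
The actual schema is not described here, so the following is only a sketch of how an "ideographs containing all of these components" query could be expressed against SQLite. The table name `decomposition`, its columns, the function `find_in_db`, and the toy rows are all assumptions made for illustration; they do not reflect what `generate_data.py` produces.

```python3
import sqlite3


def find_in_db(conn, components):
    """Return ideographs that contain every distinct character in `components`."""
    comps = sorted(set(components))
    if not comps:
        return set()
    placeholders = ",".join("?" * len(comps))
    query = (
        "SELECT ideograph FROM decomposition "
        f"WHERE component IN ({placeholders}) "
        "GROUP BY ideograph "
        "HAVING COUNT(DISTINCT component) = ?"
    )
    rows = conn.execute(query, (*comps, len(comps)))
    return {row[0] for row in rows}


# Hypothetical, simplified data: one row per (ideograph, component) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE decomposition (ideograph TEXT, component TEXT)")
conn.executemany(
    "INSERT INTO decomposition VALUES (?, ?)",
    [
        ("楊", "木"), ("楊", "日"), ("楊", "勿"),
        ("林", "木"),
        ("明", "日"), ("明", "月"),
    ],
)
print(find_in_db(conn, "木日勿"))
```

Grouping by ideograph and comparing the count of matched components against the number requested keeps the query to a single pass, whatever the number of components.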

