Metadata-Version: 2.1
Name: bnunicodenormalizer
Version: 0.0.1
Summary: Bangla Unicode Normalization Toolkit
Home-page: https://github.com/mnansary/bnUnicodeNormalizer
Author: Bengali.AI
Author-email: research.bengaliai@gmail.com
License: MIT
Keywords: bangla,unicode,text normalization
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Education
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
License-File: LICENSE

# bnUnicodeNormalizer
Bangla Unicode Normalization
# Install
```bash
pip install bnunicodenormalizer
```
# Usage
```python
# import
from bnunicodenormalizer import Normalizer 
# initialize
bnorm=Normalizer()
# normalize
word='াআমাকো'
print(f"Non-norm:{word}; Norm:{bnorm(word)}")
```
> Non-norm:াআমাকো; Norm:আমাকো
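
The example above normalizes a single word. For longer text, one option is to split on whitespace and normalize each token in turn. The helper below is a minimal sketch, not part of the package: `normalize_text` is a hypothetical name, and it assumes the callable it receives returns the normalized string for one word, as the output above suggests for `bnorm`.

```python
def normalize_text(text, normalize_word):
    """Apply a word-level normalizer to every whitespace-separated token.

    `normalize_word` is any callable that maps one word to its normalized
    form (assumption: the `bnorm` object created above can be passed here).
    """
    return " ".join(normalize_word(token) for token in text.split())


def strip_leading_aa_kar(word):
    # Stand-in normalizer for demonstration only: drops a leading 'া'.
    return word.lstrip("\u09be")


sentence = "\u09be\u0986\u09ae\u09be\u0995\u09cb  \u09a6\u09c1\u0987"
print(normalize_text(sentence, strip_leading_aa_kar))
```

Note that this collapses runs of whitespace into single spaces, which may or may not be acceptable for a given corpus.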

# Cases

**In all examples (a) is the non-normalized form and (b) is the normalized form**
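
The "breaks as" lists in the examples are the individual Unicode code points of each string. They can be reproduced with the standard library alone; the sketch below uses two hypothetical helper names, `breaks` and `describe`, and does not depend on this package.

```python
import unicodedata


def breaks(text):
    """Return the code points of `text` as a list of one-character strings."""
    return list(text)


def describe(text):
    """Return 'U+XXXX NAME' for every code point, useful when glyphs look identical."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


# 'আরো' written with two vowel signs (ে + া) versus the single sign ো.
broken = "\u0986\u09b0\u09c7\u09be"
fixed = "\u0986\u09b0\u09cb"

print(broken == fixed)  # False
print(breaks(broken))
print(describe(fixed))
```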

*  Broken vowel and consonant diacritics
``` 
# Example-1: 
(a)'আরো'==(b)'আরো' ->  False 
    (a) breaks as:['আ', 'র', 'ে', 'া']
    (b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে ->  False
    (a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
    (b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি ->  False
    (a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
    (b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']
```
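
For comparison, the first two examples can also be repaired with standard Unicode NFC composition, because 'ো' and 'ৌ' have canonical decompositions into the two-part sequences. This is only an illustration using `unicodedata`; it is not a claim about how this package works internally. Example-3 ('ৄ' to 'ৃ') is a character substitution that NFC does not perform.

```python
import unicodedata

broken_o = "\u0986\u09b0\u09c7\u09be"               # আ র ে া
broken_au = "\u09aa\u09c7\u09d7\u0981\u099b\u09c7"  # প ে ৗ ঁ ছ ে

# NFC composes ে + া into ো and ে + ৗ into ৌ.
fixed_o = unicodedata.normalize("NFC", broken_o)
fixed_au = unicodedata.normalize("NFC", broken_au)

print(list(fixed_o))
print(list(fixed_au))
```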
* Broken nukta unicode
```        
# Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় ->  False
    (a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
    (b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
# Example-2:
(a)রযে়ছে==(b)রয়েছে ->  False
    (a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
    (b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
# Example-3:
(a)জ়ন্য==(b)জন্য ->  False
    (a) breaks as:['জ', '়', 'ন', '্', 'য']
    (b) breaks as:['জ', 'ন', '্', 'য']
``` 
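
The three nukta examples can be sketched with a small mapping. Standard NFC does not help here: it decomposes 'য়' into 'য' + '়' rather than composing it. The code below is a simplified illustration under stated assumptions, not the package's implementation: `fix_nukta` is a hypothetical name, and it assumes only 'ড', 'ঢ' and 'য' take a nukta and that any other nukta should be dropped.

```python
import re

NUKTA = "\u09bc"
# Assumption: only these base letters have a precomposed nukta form.
NUKTA_MAP = {"\u09a1": "\u09dc", "\u09a2": "\u09dd", "\u09af": "\u09df"}
VOWEL_SIGNS = "\u09be-\u09cc"


def fix_nukta(word):
    # Move a nukta typed after a vowel sign back next to its base letter.
    word = re.sub(f"([{VOWEL_SIGNS}])({NUKTA})", r"\2\1", word)
    out = []
    for ch in word:
        if ch != NUKTA:
            out.append(ch)
        elif out and out[-1] in NUKTA_MAP:
            out[-1] = NUKTA_MAP[out[-1]]  # compose, e.g. য + ় -> য়
        # otherwise the nukta is invalid here and is dropped
    return "".join(out)


print(list(fix_nukta("\u09b0\u09af\u09c7\u09bc\u099b\u09c7")))
```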
* Invalid hosontos that come before or after vowels and modifiers
```
# Example-1:
(a)দুই্টি==(b)দুইটি-->False
    (a) breaks as ['দ', 'ু', 'ই', '্', 'ট', 'ি']
    (b) breaks as ['দ', 'ু', 'ই', 'ট', 'ি']
# Example-2:
(a)এ্তে==(b)এতে-->False
    (a) breaks as ['এ', '্', 'ত', 'ে']
    (b) breaks as ['এ', 'ত', 'ে']
# Example-3:
(a)নেট্ওয়ার্ক==(b)নেটওয়ার্ক-->False
    (a) breaks as ['ন', 'ে', 'ট', '্', 'ও', 'য়', 'া', 'র', '্', 'ক']
    (b) breaks as ['ন', 'ে', 'ট', 'ও', 'য়', 'া', 'র', '্', 'ক']
# Example-4:
(a)এস্আই==(b)এসআই-->False
    (a) breaks as ['এ', 'স', '্', 'আ', 'ই']
    (b) breaks as ['এ', 'স', 'আ', 'ই']
```
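
A hosonto joins consonants, so one that touches an independent vowel or a modifier has nothing to join. A minimal regex sketch of this rule follows; `drop_invalid_hosonto` is a hypothetical name and the character ranges are an assumption based on the Bengali Unicode block, not taken from the package.

```python
import re

HOSONTO = "\u09cd"
# Modifiers (U+0981-U+0983) and independent vowels (U+0985-U+0994).
NON_CONJOINING = "\u0981-\u0983\u0985-\u0994"
PATTERN = re.compile(
    f"(?<=[{NON_CONJOINING}]){HOSONTO}|{HOSONTO}(?=[{NON_CONJOINING}])"
)


def drop_invalid_hosonto(word):
    """Remove a hosonto directly after or before a vowel or modifier."""
    return PATTERN.sub("", word)


# নেট্ওয়ার্ক: the hosonto before ও is removed, the one in র্ক is kept.
print(list(drop_invalid_hosonto("\u09a8\u09c7\u099f\u09cd\u0993\u09df\u09be\u09b0\u09cd\u0995")))
```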
* Invalid hosonto directly after a vowel diacritic
``` 
# Example-1: 
(a)'চু্ক্তি'==(b)'চুক্তি' ->  False 
    (a) breaks as:['চ', 'ু', '্', 'ক', '্', 'ত', 'ি']
    (b) breaks as:['চ', 'ু','ক', '্', 'ত', 'ি']
# Example-2:
(a)'যু্ক্ত'==(b)'যুক্ত' ->   False
    (a) breaks as:['য', 'ু', '্', 'ক', '্', 'ত']
    (b) breaks as:['য', 'ু', 'ক', '্', 'ত']
# Example-3:
(a)'কিছু্ই'==(b)'কিছুই' ->   False
    (a) breaks as:['ক', 'ি', 'ছ', 'ু', '্', 'ই']
    (b) breaks as:['ক', 'ি', 'ছ', 'ু','ই']
```
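
In each of these examples the stray hosonto directly follows a vowel diacritic, so a lookbehind is enough to find it. This is a sketch with a hypothetical function name and an assumed range for the vowel signs.

```python
import re

# Assumption: Bengali vowel signs are U+09BE-U+09CC plus the AU length mark U+09D7.
AFTER_VOWEL_SIGN = re.compile("(?<=[\u09be-\u09cc\u09d7])\u09cd")


def drop_hosonto_after_vowel_sign(word):
    """Remove a hosonto that immediately follows a vowel diacritic."""
    return AFTER_VOWEL_SIGN.sub("", word)


print(list(drop_hosonto_after_vowel_sign("\u099a\u09c1\u09cd\u0995\u09cd\u09a4\u09bf")))
```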
* 'ত'+hosonto written in place of 'ৎ'
``` 
# Example-1:
(a)বুত্পত্তি==(b)বুৎপত্তি-->False
    (a) breaks as ['ব', 'ু', 'ত', '্', 'প', 'ত', '্', 'ত', 'ি']
    (b) breaks as ['ব', 'ু', 'ৎ', 'প', 'ত', '্', 'ত', 'ি']
# Example-2:
(a)উত্স==(b)উৎস-->False
    (a) breaks as ['উ', 'ত', '্', 'স']
    (b) breaks as ['উ', 'ৎ', 'স']
```
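
Note that in Example-1 the second 'ত'+hosonto (in 'ত্ত') is left untouched, so the replacement has to depend on what follows. The sketch below encodes that with a negative lookahead. The list of consonants that may validly follow 'ত'+hosonto is an assumption made for illustration and is not taken from the package; `fix_khanda_ta` is a hypothetical name.

```python
import re

# Assumption: consonants that form a valid conjunct after 'ত' + hosonto
# (ত্ত, ত্থ, ত্ন, ত্ব, ত্ম, ত্য, ত্র). Illustrative, not exhaustive.
VALID_AFTER_TA = "\u09a4\u09a5\u09a8\u09ac\u09ae\u09af\u09b0"
TA_HOSONTO = re.compile(f"\u09a4\u09cd(?![{VALID_AFTER_TA}])")


def fix_khanda_ta(word):
    """Replace 'ত' + hosonto with 'ৎ' unless a valid conjunct follows."""
    return TA_HOSONTO.sub("\u09ce", word)


print(list(fix_khanda_ta("\u09ac\u09c1\u09a4\u09cd\u09aa\u09a4\u09cd\u09a4\u09bf")))
```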
* Unwanted consecutive double diacritics
```
# Example-1: 
(a)'যুুদ্ধ'==(b)'যুদ্ধ' ->  False 
    (a) breaks as:['য', 'ু', 'ু', 'দ', '্', 'ধ']
    (b) breaks as:['য', 'ু', 'দ', '্', 'ধ']
# Example-2:
(a)'দুুই'==(b)'দুই' ->   False
    (a) breaks as:['দ', 'ু', 'ু', 'ই']
    (b) breaks as:['দ', 'ু', 'ই']
# Example-3:
(a)'প্রকৃৃতির'==(b)'প্রকৃতির' ->   False
    (a) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ৃ', 'ত', 'ি', 'র']
    (b) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ত', 'ি', 'র']
# Example-4:
(a)'আমাকোা'==(b)'আমাকো' ->   False
    (a) breaks as:['আ', 'ম', 'া', 'ক', 'ে', 'া', 'া']
    (b) breaks as:['আ', 'ম', 'া', 'ক', 'ো']
```
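
Collapsing an immediately repeated diacritic is a one-line backreference. Example-4 additionally needs the remaining 'ে' + 'া' pair composed into 'ো', which standard NFC does. The following is a sketch under those assumptions, with a hypothetical function name.

```python
import re
import unicodedata

# A vowel sign followed by one or more copies of itself.
REPEATED_SIGN = re.compile(r"([\u09be-\u09cc\u09d7])\1+")


def collapse_repeated_signs(word):
    """Keep one copy of a repeated vowel sign, then compose split signs."""
    collapsed = REPEATED_SIGN.sub(r"\1", word)
    return unicodedata.normalize("NFC", collapsed)


# আ ম া ক ে া া  ->  আ ম া ক ো
print(list(collapse_repeated_signs("\u0986\u09ae\u09be\u0995\u09c7\u09be\u09be")))
```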


* Vowels followed by vowel diacritics
```
# Example-1:
(a)উুলু==(b)উলু-->False
    (a) breaks as ['উ', 'ু', 'ল', 'ু']
    (b) breaks as ['উ', 'ল', 'ু']
# Example-2:
(a)আর্কিওোলজি==(b)আর্কিওলজি-->False
    (a) breaks as ['আ', 'র', '্', 'ক', 'ি', 'ও', 'ো', 'ল', 'জ', 'ি']
    (b) breaks as ['আ', 'র', '্', 'ক', 'ি', 'ও', 'ল', 'জ', 'ি']

# Also normalizes 'এ' typed in place of 'ত্র'
# Example-1:
(a)একএে==(b)একত্রে-->False
    (a) breaks as ['এ', 'ক', 'এ', 'ে']
    (b) breaks as ['এ', 'ক', 'ত', '্', 'র', 'ে']
# Example-2:
(a)একএ==(b)একত্র-->False
    (a) breaks as ['এ', 'ক', 'এ']
    (b) breaks as ['এ', 'ক', 'ত', '্', 'র']
```             
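
The first two examples in the block above (a vowel diacritic attached to an independent vowel) can be sketched with a single substitution. The 'এ' to 'ত্র' replacement is context dependent and is deliberately not covered here. The function name and character ranges are assumptions for illustration, and the rule is a simplification: it drops every diacritic after an independent vowel.

```python
import re

# An independent vowel (U+0985-U+0994) followed by one or more vowel signs.
VOWEL_THEN_SIGN = re.compile(r"([\u0985-\u0994])[\u09be-\u09cc\u09d7]+")


def drop_sign_after_vowel(word):
    """Remove vowel diacritics that directly follow an independent vowel."""
    return VOWEL_THEN_SIGN.sub(r"\1", word)


print(list(drop_sign_after_vowel("\u0989\u09c1\u09b2\u09c1")))
```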
* Repeated consonant diacritics (folas)
```
# Example-1:
(a)গ্র্রামকে==(b)গ্রামকে-->False
    (a) breaks as ['গ', '্', 'র', '্', 'র', 'া', 'ম', 'ক', 'ে']
    (b) breaks as ['গ', '্', 'র', 'া', 'ম', 'ক', 'ে']
```
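
A repeated fola shows up as the same hosonto + consonant pair twice in a row, which a backreference can collapse. The set of fola consonants below (ন, ব, ম, য, র, ল) is an assumption for this sketch, and the function name is hypothetical.

```python
import re

# Hosonto followed by a fola consonant, then the same pair repeated.
REPEATED_FOLA = re.compile(r"(\u09cd[\u09a8\u09ac\u09ae\u09af\u09b0\u09b2])\1+")


def collapse_repeated_fola(word):
    """Keep a single copy of an immediately repeated fola."""
    return REPEATED_FOLA.sub(r"\1", word)


print(list(collapse_repeated_fola("\u0997\u09cd\u09b0\u09cd\u09b0\u09be\u09ae\u0995\u09c7")))
```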
* Removes invalid starts and ends (for example, a word that begins with a vowel diacritic, as in the usage example above)
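
The usage example ('াআমাকো' becoming 'আমাকো') shows a leading vowel diacritic being removed. The sketch below generalizes that under two assumptions that the source does not spell out: a word may not start with a modifier, nukta, vowel sign or hosonto, and it may not end with a hosonto. `trim_invalid_edges` is a hypothetical name.

```python
import re

# Assumption: combining marks that cannot begin a word.
INVALID_START = re.compile(r"^[\u0981-\u0983\u09bc\u09be-\u09cd\u09d7]+")
# Assumption: a trailing hosonto has nothing to join and is dropped.
INVALID_END = re.compile(r"\u09cd+$")


def trim_invalid_edges(word):
    """Strip invalid leading marks and a trailing hosonto."""
    return INVALID_END.sub("", INVALID_START.sub("", word))


print(list(trim_invalid_edges("\u09be\u0986\u09ae\u09be\u0995\u09cb")))
```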

Change Log
===========

0.0.1 (15/02/2022)
-------------------
- First Release

