Metadata-Version: 2.1
Name: pycantonese
Version: 3.1.0.dev3
Summary: PyCantonese: Cantonese Linguistics and NLP in Python
Home-page: https://pycantonese.org
Author: Jackson L. Lee
Author-email: jacksonlunlee@gmail.com
License: MIT License
Download-URL: https://pypi.org/project/pycantonese/#files
Project-URL: Bug Tracker, https://github.com/jacksonllee/pycantonese/issues
Project-URL: Source Code, https://github.com/jacksonllee/pycantonese
Keywords: computational linguistics,natural language processing,NLP,Cantonese,linguistics,corpora,speech,language,Chinese,Jyutping
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Natural Language :: Cantonese
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Human Machine Interfaces
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: pylangacq (<1.0.0,>=0.12.0)
Requires-Dist: wordseg (==0.0.2)

PyCantonese: Cantonese Linguistics and NLP in Python
====================================================



Full Documentation: https://pycantonese.org

|

.. image:: https://badge.fury.io/py/pycantonese.svg
   :target: https://pypi.python.org/pypi/pycantonese
   :alt: PyPI version

.. image:: https://img.shields.io/pypi/pyversions/pycantonese.svg
   :target: https://pypi.python.org/pypi/pycantonese
   :alt: Supported Python versions

.. image:: https://circleci.com/gh/jacksonllee/pycantonese/tree/master.svg?style=svg
   :target: https://circleci.com/gh/jacksonllee/pycantonese/tree/master
   :alt: Build

|

.. start-sphinx-website-index-page

PyCantonese is a Python library for Cantonese linguistics and natural language
processing (NLP). Currently implemented features (more to come!):

- Accessing and searching corpus data
- Parsing and conversion tools for Jyutping romanization
- Stop words
- Word segmentation
- Part-of-speech tagging

Quick Examples
--------------

With PyCantonese imported:

.. code-block:: python

    >>> import pycantonese as pc

1. Word segmentation

.. code-block:: python

    >>> pc.segment("廣東話好難學？")  # Is Cantonese difficult to learn?
    ['廣東話', '好', '難', '學', '？']

2. Conversion from Cantonese characters to Jyutping

.. code-block:: python

    >>> pc.characters_to_jyutping('香港人講廣東話')  # Hongkongers speak Cantonese
    [("香港人", "hoeng1gong2jan4"), ("講", "gong2"), ("廣東話", "gwong2dung1waa2")]

3. Finding all verbs in the HKCanCor corpus

   In this example,
   we search for the regular expression ``'^V'`` for all words whose
   part-of-speech tag begins with "V" in the original HKCanCor annotations:

.. code-block:: python

    >>> corpus = pc.hkcancor() # get HKCanCor
    >>> all_verbs = corpus.search(pos='^V')
    >>> len(all_verbs)  # number of all verbs
    29012
    >>> from pprint import pprint
    >>> pprint(all_verbs[:10])  # print 10 results
    [('去', 'V', 'heoi3', ''),
     ('去', 'V', 'heoi3', ''),
     ('旅行', 'VN', 'leoi5hang4', ''),
     ('有冇', 'V1', 'jau5mou5', ''),
     ('要', 'VU', 'jiu3', ''),
     ('有得', 'VU', 'jau5dak1', ''),
     ('冇得', 'VU', 'mou5dak1', ''),
     ('去', 'V', 'heoi3', ''),
     ('係', 'V', 'hai6', ''),
     ('係', 'V', 'hai6', '')]

4. Parsing Jyutping for (onset, nucleus, coda, tone)

.. code-block:: python

    >>> pc.parse_jyutping('gwong2dung1waa2')  # 廣東話
    [('gw', 'o', 'ng', '2'), ('d', 'u', 'ng', '1'), ('w', 'aa', '', '2')]

Download and Install
--------------------

PyCantonese requires Python 3.6 or above.
To download and install the stable, most recent version::

    $ pip install --upgrade pycantonese

To test your installation in the Python interpreter:

.. code-block:: python

    >>> import pycantonese as pc
    >>> pc.__version__  # show version number

Links
-----

* Source code: https://github.com/jacksonllee/pycantonese
* Bug tracker, feature requests: https://github.com/jacksonllee/pycantonese/issues
* Email: Please contact `Jackson Lee <https://jacksonllee.com>`_.
* Social media: Updates, tips, and more are posted on the Facebook page below.



|

How to Cite
-----------

PyCantonese is authored and mainteined by `Jackson L. Lee <https://jacksonllee.com>`_.

A talk introducing PyCantonese:

Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data.
Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15. 2015.
`Notes+slides <https://pycantonese.org/papers/Lee-pycantonese-2015.html>`_

License
-------

MIT License. Please see ``LICENSE.txt`` in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from
its source in terms of format. The original dataset has a CC BY license.
Please see ``pycantonese/data/hkcancor/README.md``
in the GitHub source code for details.

The rime-cantonese data (release 2020.09.09) is
incorporated into PyCantonese for word segmentation and
characters-to-Jyutping conversion.
This data has a CC BY 4.0 license.
Please see ``pycantonese/data/rime_cantonese/README.md``
in the GitHub source code for details.

Logo
----

The PyCantonese logo is the Chinese character 粵 meaning Cantonese,
with artistic design by albino.snowman (Instagram handle).

Acknowledgments
---------------

Individuals who have contributed feedback, bug reports, etc.
(in alphabetical order of last names if known):

- @cathug
- Litong Chen
- @g-traveller
- Rachel Han
- Ryan Lai
- Charles Lam
- Hill Ma
- @richielo
- @rylanchiu
- Stephan Stiller
- Tsz-Him Tsui

.. end-sphinx-website-index-page

Changelog
---------

Please see ``CHANGELOG.md``.

Setting up a Development Environment
------------------------------------

The latest code under development is available on Github at
`jacksonllee/pycantonese <https://github.com/jacksonllee/pycantonese>`_.
You need to have `Git LFS <https://git-lfs.github.com/>`_ installed on your system.
To obtain this version for experimental features or for development:

.. code-block:: bash

   $ git clone https://github.com/jacksonllee/pycantonese.git
   $ cd pycantonese
   $ git lfs pull
   $ pip install -r dev-requirements.txt
   $ pip install -e .

To run tests and styling checks:

.. code-block:: bash

   $ pytest -vv --doctest-modules --cov=pycantonese pycantonese docs
   $ flake8 pycantonese
   $ black --check --line-length=79 pycantonese

To build the documentation website files:

.. code-block:: bash

    $ python build_docs.py

