This corpus as distributed with PyCantonese is a reformatted version of the
Hong Kong Cantonese Corpus (HKCanCor) by Kang Kwong Luke <kkluke@ntu.edu.sg>,
hosted at <http://compling.hss.ntu.edu.sg/hkcancor/>.

HKCanCor is released under a CC BY license
<http://creativecommons.org/licenses/by/4.0/legalcode>.

If this corpus is used, the following should be cited:

K. K. Luke and May L.Y. Wong (2015) The Hong Kong Cantonese Corpus:
Design and Uses. Journal of Chinese Linguistics (to appear).

This reformatted version of HKCanCor comes with 61 text files:
- 58 corpus data files, whose filenames all begin with ``FC-``
- ``FILE_INFO`` containing the metadata for each corpus data file
- ``SPEAKERS`` for speakers' information
- ``README`` (this file)

None of the 61 text files has a filename extension (such as .txt).

********************************

For each corpus data file:

Each non-empty line is a sentence, with words separated by the space character.
Example:

SP:FC-001_v2-A 喂_wai3/e 遲_ci4/a 啲_di1/u 去_heoi3/v 唔_m4/d 去_heoi3/v

Each sentence is preceded by the speaker identifier, introduced by ``SP:``
(SP = speaker); in the example just above, it is ``FC-001_v2-A``.
Each word carries three pieces of information:
Cantonese/Chinese character(s), Jyutping romanization, and part-of-speech tag.
They are arranged in this format using the underscore ``_`` and the forward
slash ``/``: character(s)_Jyutping/POS
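
The line format described above can be read with a few string operations. The
following is a minimal sketch in Python, not PyCantonese's own reader; the names
``Word`` and ``parse_sentence`` are hypothetical, and the code assumes every
word token contains both ``_`` and ``/``.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    characters: str
    jyutping: str
    pos: str


def parse_sentence(line: str) -> Tuple[str, List[Word]]:
    """Split one corpus line into its speaker identifier and its words."""
    tokens = line.split()
    if not tokens or not tokens[0].startswith("SP:"):
        raise ValueError("expected a line starting with 'SP:'")
    speaker = tokens[0][len("SP:"):]
    words = []
    for token in tokens[1:]:
        # Assumption: the last "/" precedes the POS tag and the last "_"
        # before it precedes the Jyutping romanization.
        rest, _, pos = token.rpartition("/")
        characters, _, jyutping = rest.rpartition("_")
        words.append(Word(characters, jyutping, pos))
    return speaker, words


speaker, words = parse_sentence(
    "SP:FC-001_v2-A 喂_wai3/e 遲_ci4/a 啲_di1/u 去_heoi3/v 唔_m4/d 去_heoi3/v"
)
for word in words:
    print(speaker, word.characters, word.jyutping, word.pos)
```

Splitting from the right keeps the speaker identifier's own underscore and
hyphens out of the way, since only word tokens are split on ``_`` and ``/``.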

For the part-of-speech tagset, please refer to the HKCanCor website
<http://compling.hss.ntu.edu.sg/hkcancor/>.

********************************

For ``SPEAKERS``:

There are altogether 147 distinct speakers. Each line in the file has the
information for one speaker, in the form of:

<speaker-identifier> <gender>-<age>-<origin>

For example, the first line of the file is as follows:

FC-001_v2-A F-34-HK

This particular speaker has the identifier ``FC-001_v2-A``, is female, is
34 years old, and is from Hong Kong.

For speakers with missing information, the question mark ``?`` is used, e.g.,
``FC-020_v-A F-22-?`` with an unknown origin.

Note: In the original HKCanCor source corpus, a few speakers are coded with an
indeterminate age, e.g., ``30/35``, ``25/30``. These have been rendered with a
single number that lies within the given range, e.g., ``30/35`` becomes 32.
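
As an illustration of the ``SPEAKERS`` line format, here is a minimal Python
sketch. The names ``Speaker`` and ``parse_speaker`` are hypothetical, and the
code assumes that any of the three fields may be ``?`` and that ages are
otherwise whole numbers.

```python
from typing import NamedTuple, Optional


class Speaker(NamedTuple):
    identifier: str
    gender: Optional[str]
    age: Optional[int]
    origin: Optional[str]


def parse_speaker(line: str) -> Speaker:
    """Parse one line of the form '<speaker-identifier> <gender>-<age>-<origin>'."""
    identifier, info = line.split()
    # Split at most twice so that a hyphen inside <origin> would be kept.
    gender, age, origin = info.split("-", 2)
    return Speaker(
        identifier=identifier,
        gender=None if gender == "?" else gender,
        age=None if age == "?" else int(age),
        origin=None if origin == "?" else origin,
    )


first = parse_speaker("FC-001_v2-A F-34-HK")
unknown_origin = parse_speaker("FC-020_v-A F-22-?")
print(first)
print(unknown_origin)
```

Missing values are mapped to ``None`` here so that downstream code does not
have to compare against the ``?`` placeholder.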

********************************

For ``FILE_INFO``:

Each line has the metadata for one corpus data file, in the form of:

<filename> <tape_number> <date_of_recording> <list_of_speakers>

For example, the first line is as follows:

FC-001_v2 001 300497 AB

The <date_of_recording> field uses the DDMMYY date format, so ``300497`` in the
example stands for 30 April 1997; ``000000`` is used for an unknown date.

The <list_of_speakers> field lists the letter codes for the speakers of the
corpus data file in question. For the example here with data file ``FC-001_v2``,
there are two speakers, FC-001_v2-A and FC-001_v2-B.

Note: In the original HKCanCor source corpus, some data files have metadata
categories of ``Notes`` and ``UN``. These are ignored in the present reformatted
version.
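
The ``FILE_INFO`` line format can be handled along the following lines. This is
a minimal Python sketch; the names ``FileInfo`` and ``parse_file_info`` are
hypothetical. It assumes that full speaker identifiers are formed as
<filename>-<letter>, as in the example above, and it relies on Python's ``%y``
convention for two-digit years (69-99 are read as 1969-1999).

```python
from datetime import date, datetime
from typing import List, NamedTuple, Optional


class FileInfo(NamedTuple):
    filename: str
    tape_number: str  # kept as text to preserve leading zeros
    recorded_on: Optional[date]
    speakers: List[str]


def parse_file_info(line: str) -> FileInfo:
    """Parse '<filename> <tape_number> <date_of_recording> <list_of_speakers>'."""
    filename, tape_number, raw_date, speaker_codes = line.split()
    if raw_date == "000000":
        recorded_on = None  # unknown recording date
    else:
        recorded_on = datetime.strptime(raw_date, "%d%m%y").date()
    # Each letter in <list_of_speakers> is one speaker of this data file.
    speakers = [f"{filename}-{code}" for code in speaker_codes]
    return FileInfo(filename, tape_number, recorded_on, speakers)


info = parse_file_info("FC-001_v2 001 300497 AB")
print(info.recorded_on, info.speakers)
```

The reconstructed identifiers match the first field of the ``SPEAKERS`` file,
so the two metadata files can be joined on them.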

---------------------------------------

Last update of this README file: 2015-01-19

PyCantonese
Author: Jackson Lee <jsllee.phon@gmail.com>
Website: <http://pycantonese.github.io/>

