Metadata-Version: 2.1
Name: phonemizer
Version: 2.0.1
Summary: Simple text to phonemes converter for multiple languages
Home-page: https://github.com/bootphon/phonemizer
Author: Mathieu Bernard
Author-email: mathieu.a.bernard@inria.fr
License: GPL3
Keywords: linguistics G2P phoneme festival espeak TTS
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: joblib
Requires-Dist: segments
Requires-Dist: attrs (>=18.1)

[![Travis (.org)](https://img.shields.io/travis/bootphon/phonemizer)](
https://travis-ci.org/bootphon/phonemizer)
[![Codecov](https://img.shields.io/codecov/c/github/bootphon/phonemizer)](
https://codecov.io/gh/bootphon/phonemizer)
[![GitHub release (latest SemVer)](https://img.shields.io/github/v/release/bootphon/phonemizer)](
https://github.com/bootphon/phonemizer/releases/latest)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/phonemizer)](
https://pypi.org/project/phonemizer)
[![DOI](https://zenodo.org/badge/56728069.svg)](
https://doi.org/10.5281/zenodo.1045825)

# Phonemizer -- *foʊnmaɪzɚ*

* Simple text to phonemes converter for multiple languages, based on
  [festival](http://www.cstr.ed.ac.uk/projects/festival),
  [espeak-ng](https://github.com/espeak-ng/espeak-ng/)
  and [segments](https://github.com/cldf/segments).

* Provides both the `phonemize` command-line tool and the Python function
  `phonemizer.phonemize`

* **espeak-ng** is a text-to-speech software supporting multiple
  languages and IPA (Internatinal Phonetic Alphabet) output. See
  https://github.com/espeak-ng/espeak-ng. Alternatively you can use
  the orginal [espeak](http://espeak.sourceforge.net/) program
  (*espeak-ng* is a fork of *espeak* supporting much more languages
  and significant improvements).

* **festival** is also a text-to-speech software. Currently only
  American English is supported and festival uses a custom phoneset
  (http://www.festvox.org/bsv/c4711.html), but festival is the only
  backend supporting tokenization at the syllable level. See
  http://www.cstr.ed.ac.uk/projects/festival.

* **segments** is a Unicode tokenizer that build a phonemization from
  a grapheme to phoneme mapping provided as a file by the user. See
  https://github.com/cldf/segments.


## Installation

**You need python>=3.6.** If you really need to use python2, use an [older
version](https://github.com/bootphon/phonemizer/releases/tag/v1.0) of
phonemizer.

### Dependencies

* You need to install festival and espeak-ng on your system. Visit
  [this festival link](http://www.festvox.org/docs/manual-2.4.0/festival_6.html#Installation)
  and [that espeak-ng one](https://github.com/espeak-ng/espeak-ng#espeak-ng-text-to-speech)
  for installation guidelines. On Debian/Ubuntu simply run:

        $ sudo apt-get install festival espeak-ng

* Alternatively you may want to use `espeak` instead of `espeak-ng`,
  see [here](http://espeak.sourceforge.net/download.html) for
  instalaltion instructions.

### Phonemizer

* The simplest way is using pip:

        $ pip install phonemizer

* **OR** install it from sources with:

        $ git clone https://github.com/bootphon/phonemizer
        $ cd phonemizer
        $ python setup.py build
        $ [sudo] python setup.py install

  If you experiment an error such as `ImportError: No module named
  setuptools` during installation, refeer to [issue
  11](https://github.com/bootphon/phonemizer/issues/11).


### Docker image

Alternatively you can run the phonemizer within docker, using the
provided `Dockerfile`. To build the docker image, have a:

    $ git clone https://github.com/bootphon/phonemizer
    $ cd phonemizer
    $ sudo docker build -t phonemizer .

Then run an interactive session with:

    $ sudo docker run -it phonemizer /bin/bash


## Command-line examples

For a complete list of available options, have a:

    $ phonemize --help

See the installed backends with the `--version` option:

    $ phonemize --version
    phonemizer-2.0
    available backends: festival-2.5.0, espeak-ng-1.49.3, segments-2.0.1


### Input/output exemples

* from stdin to stdout:

        $ echo "hello world" | phonemize
        həloʊ wɜːld

* from file to stdout

        $ echo "hello world" > hello.txt
        $ phonemize hello.txt
        həloʊ wɜːld

* from file to file

        $ phonemize hello.txt -o hello.phon --strip
        $ cat hello.phon
        həloʊ wɜːld


### Token separators

You can specify separators for phonemes, syllables (festival only) and
words.

    $ echo "hello world" | phonemize -b festival -w ' ' -p ''
    hhaxlow werld

    $ echo "hello world" | phonemize -b festival -p ' ' -w ''
    hh ax l ow w er l d

    $ echo "hello world" | phonemize -b festival -p '-' -s '|'
    hh-ax-l-|ow-| w-er-l-d-|

    $ echo "hello world" | phonemize -b festival -p '-' -s '|' --strip
    hh-ax-l|ow w-er-l-d

    $ echo "hello world" | phonemize -b festival -p ' ' -s ';esyll ' -w ';eword '
    hh ax l ;esyll ow ;esyll ;eword w er l d ;esyll ;eword

You cannot specify the same separator for several tokens (for instance
a space for both phones and words):

    $ echo "hello world" | phonemize -b festival -p ' ' -w ' '
    fatal error: illegal separator with word=" ", syllable="" and phone=" ",
    must be all differents if not empty


### Options

* **Espeak** us-english is the default

        $ echo "hello world" | phonemize
        həloʊ wɜːld
        $ echo "hello world" | phonemize -l en-us -b espeak
        həloʊ wɜːld

* use **Festival** US English instead

        $ echo "hello world" | phonemize -l en-us -b festival
        hhaxlow werld

* In French, using **espeak**

        $ echo "bonjour le monde" | phonemize -b espeak -l fr-fr
        bɔ̃ʒuʁ lə- mɔ̃d

        $ echo "bonjour le monde" | phonemize -b espeak -l fr-fr -p ' ' -w ';eword '
        b ɔ̃ ʒ u ʁ ;eword l ə- ;eword m ɔ̃ d ;eword

* In Japanese, using **segments**

        $ echo 'konnichiwa' | phonemize -b segments -l japanese
        konnitʃiwa

        $ echo 'konnichiwa' | phonemize -b segments -l ./phonemizer/share/japanese.g2p
        konnitʃiwa

* **Espeak** can output SAMPA phonemes instead of IPA ones (this is only supported
  by espeak-ng, not by the original espeak)

        $ echo "hello world" | phonemize -l en-us -b espeak --sampa
        h@loU w3:ld

* **Espeak** can output the stresses on phonemes (this is not supported by festival
  or segments backends)

        $ echo "hello world" | phonemize -l en-us -b espeak --with-stress
        həlˈoʊ wˈɜːld

* **Espeak** can switch languages during phonemization (below from French to
  English), use the ``--language-switch`` option to deal with it

        $ echo "j'aime le football" | phonemize -l fr-fr -b espeak --language-switch keep-flags
        [WARNING] fount 1 utterances containing language switches on lines 1
        [WARNING] extra phones may appear in the "fr-fr" phoneset
        [WARNING] language switch flags have been kept (applying "keep-flags" policy)
        ʒɛm lə- (en)fʊtbɔːl(fr)

        $ echo "j'aime le football" | phonemize -l fr-fr -b espeak --language-switch remove-flags
        [WARNING] fount 1 utterances containing language switches on lines 1
        [WARNING] extra phones may appear in the "fr-fr" phoneset
        [WARNING] language switch flags have been removed (applying "remove-flags" policy)
        ʒɛm lə- fʊtbɔːl

        $ echo "j'aime le football" | phonemize -l fr-fr -b espeak --language-switch remove-utterance
        [WARNING] removed 1 utterances containing language switches (applying "remove-utterance" policy)


### Supported languages

* Languages supported by festival are:

        en-us	->	english-us

* Languages supported by the segments backend are:

        chintang  -> ./phonemizer/share/chintang.g2p
	    cree	  -> ./phonemizer/share/cree.g2p
	    inuktitut -> ./phonemizer/share/inuktitut.g2p
	    japanese  -> ./phonemizer/share/japanese.g2p
	    sesotho	  -> ./phonemizer/share/sesotho.g2p
	    yucatec	  -> ./phonemizer/share/yucatec.g2p

  Instead of a language you can also provide a file specifying a
  grapheme to phoneme mapping (see the files above for exemples).

* Languages supported by espeak are (espeak-ng supports even more of
  them), type `phonemize --help` for an exhaustive list:

        af	->	afrikaans
        an	->	aragonese
        bg	->	bulgarian
        bs	->	bosnian
        ca	->	catalan
        cs	->	czech
        cy	->	welsh
        da	->	danish
        de	->	german
        el	->	greek
        en	->	default
        en-gb	->	english
        en-sc	->	en-scottish
        en-uk-north	->	english-north
        en-uk-rp	->	english_rp
        en-uk-wmids	->	english_wmids
        en-us	->	english-us
        en-wi	->	en-westindies
        eo	->	esperanto
        es	->	spanish
        es-la	->	spanish-latin-am
        et	->	estonian
        fa	->	persian
        fa-pin	->	persian-pinglish
        fi	->	finnish
        fr-be	->	french-Belgium
        fr-fr	->	french
        ga	->	irish-gaeilge
        grc	->	greek-ancient
        hi	->	hindi
        hr	->	croatian
        hu	->	hungarian
        hy	->	armenian
        hy-west	->	armenian-west
        id	->	indonesian
        is	->	icelandic
        it	->	italian
        jbo	->	lojban
        ka	->	georgian
        kn	->	kannada
        ku	->	kurdish
        la	->	latin
        lfn	->	lingua_franca_nova
        lt	->	lithuanian
        lv	->	latvian
        mk	->	macedonian
        ml	->	malayalam
        ms	->	malay
        ne	->	nepali
        nl	->	dutch
        no	->	norwegian
        pa	->	punjabi
        pl	->	polish
        pt-br	->	brazil
        pt-pt	->	portugal
        ro	->	romanian
        ru	->	russian
        sk	->	slovak
        sq	->	albanian
        sr	->	serbian
        sv	->	swedish
        sw	->	swahili-test
        ta	->	tamil
        tr	->	turkish
        vi	->	vietnam
        vi-hue	->	vietnam_hue
        vi-sgn	->	vietnam_sgn
        zh	->	Mandarin
        zh-yue	->	cantonese


## Licence

**Copyright 2015-2019 Mathieu Bernard**

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.


