Metadata-Version: 2.0
Name: spsim
Version: 0.1.2
Summary: A spelling similarity measure for cognate identification.
Home-page: https://github.com/luismsgomes/spsim
Author: Luís Gomes
Author-email: luismsgomes@gmail.com
License: MIT
Keywords: text bilingual cognate mt machine translation
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.5
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: docopt
Requires-Dist: munkres
Requires-Dist: openfile
Requires-Dist: stringology

=======
 spsim
=======

``spsim`` is a Python 3 module that implements a spelling similarity measure
for identifying cognates across languages, taking into account spelling
differences that are characteristic of each language pair, as described
in [Gomes2011]_.

Note: in the examples below, ``$`` denotes the Bash prompt, and a Linux,
macOS or similar \*nix environment is assumed.

Install as usual::

    $ pip3 install spsim

Example command line usage::

    $ # first let's get some pairs of words that may be cognates:
    $ wget http://research.variancia.com/spsim/maybe_enpt.txt
    $ cat maybe_enpt.txt
    pharmacy    farmácia
    arithmetic  aritmética

    $ # If we don't give any example cognates, SpSim will be equivalent to
    $ #             1 - edit_distance / max_len_of_strings
    $ # Note that by default spsim matches accented characters, i.e. a == á
    $ echo "" > empty.txt
    $ spsim empty.txt maybe_enpt.txt
    pharmacy    farmácia    0.5
    arithmetic  aritmética  0.8

    $ # now let's get some example cognates:
    $ wget http://research.variancia.com/spsim/examples_enpt.txt
    $ cat examples_enpt.txt
    alcohol     álcool
    alpha       alfa
    anomaly     anomalia
    mathematics matemática
    methodology metodologia
    metric      métrica
    morphine    morfina
    photos      fotos

    $ # by giving these examples to spsim, it will learn to ignore certain differences:
    $ spsim examples_enpt.txt maybe_enpt.txt
    pharmacy    farmácia    1.0
    arithmetic  aritmética  1.0
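
For reference, the baseline scores obtained without example cognates
(``1 - edit_distance / max_len_of_strings``) can be reproduced in a few lines
of plain Python. This is a minimal sketch and not spsim's own implementation:
the helper names ``strip_accents``, ``edit_distance`` and
``baseline_similarity`` are hypothetical, and folding accents through Unicode
decomposition is only an assumption about how ``a == á`` might be matched::

    import unicodedata


    def strip_accents(text):
        """Drop combining marks so that, e.g., 'á' compares equal to 'a'."""
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")


    def edit_distance(a, b):
        """Levenshtein distance with unit-cost insert, delete, substitute."""
        previous = list(range(len(b) + 1))
        for i, ch_a in enumerate(a, start=1):
            current = [i]
            for j, ch_b in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                   # delete ch_a
                    current[j - 1] + 1,                # insert ch_b
                    previous[j - 1] + (ch_a != ch_b),  # substitute or match
                ))
            previous = current
        return previous[-1]


    def baseline_similarity(a, b):
        """1 - edit_distance / max_len, computed on accent-folded strings."""
        a, b = strip_accents(a), strip_accents(b)
        longest = max(len(a), len(b))
        if longest == 0:
            return 1.0  # assumption: two empty strings count as identical
        return 1 - edit_distance(a, b) / longest


    print(baseline_similarity("pharmacy", "farmácia"))      # 0.5
    print(baseline_similarity("arithmetic", "aritmética"))  # 0.8

This baseline has no notion of language-pair-specific spelling differences, so
it cannot reach the ``1.0`` scores that spsim reports once it has been given
``examples_enpt.txt``.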


.. [Gomes2011] Measuring Spelling Similarity for Cognate Identification,
    Luís Gomes and Gabriel Pereira Lopes
    in *Progress in Artificial Intelligence, 15th Portuguese Conference in
    Artificial Intelligence, EPIA 2011, Lisboa, Portugal, October 2011*,
    http://www.springerlink.com/content/gtl56j3l06906020/



