Metadata-Version: 1.1
Name: SoMaJo
Version: 1.1.2
Summary: A tokenizer for German web and social media texts.
Home-page: http://pypi.python.org/pypi/SoMaJo/
Author: Thomas Proisl, Peter Uhrig
Author-email: thomas.proisl@fau.de
License: GNU General Public License v3 or later (GPLv3+)
Description: ======
        SoMaJo
        ======
        
        Introduction
        ============
        
        SoMaJo is a state-of-the-art tokenizer for German web and social media
        texts that won the `EmpiriST 2015 shared task
        <https://sites.google.com/site/empirist2015/>`_ on automatic
        linguistic annotation of computer-mediated communication / social
        media. As such, it is particularly well-suited to perform tokenization
        on all kinds of written discourse, for example chats, forums, wiki
        talk pages, tweets, blog comments, social networks, SMS and WhatsApp
        dialogues.
        
        In addition to tokenizing the input text, SoMaJo can also output token
        class information for each token, i.e. if it is a number, an XML tag,
        an abbreviation, etc.::
        
          echo 'Wow, superTool!;)' | tokenizer -c -t -
          Wow	regular
          ,	symbol
          super	regular
          Tool	regular
          !	symbol
          ;)	emoticon
        
        *New in version 1.1.0*: SoMaJo can output additional information for
        each token that can help to reconstruct the original untokenized text
        (to a certain extent)::
        
          echo 'der beste Betreuer? - >ProfSmith! : )' | tokenizer -c -e -
          der	
          beste	
          Betreuer	SpaceAfter=No
          ?	
          ->	SpaceAfter=No, OriginalSpelling="- >"
          Prof	SpaceAfter=No
          Smith	SpaceAfter=No
          !	
          :)	OriginalSpelling=": )"
        
        The ``-t`` and ``-e`` options can also be used in combination, of course.
        
        The system is described in greater detail in Proisl and Uhrig (2016).
        
        Installation
        ============
        
        SoMaJo can be easily installed using pip::
        
          pip install SoMaJo
        
        Usage
        =====
        
        Using the tokenizer executable
        ------------------------------
        
        You can use the tokenizer as a standalone program from the command
        line. General usage information is available via the ``-h`` option::
          
          tokenizer -h
        
        To tokenize a text file according to the guidelines of the EmpiriST
        2015 shared task, just call the tokenizer like this::
          
          tokenizer -c <file>
        
        If you do not want to split camel-cased tokens, simply drop the ``-c``
        option::
          
          tokenizer <file>
        
        The tokenizer can also output token class information for each token,
        i.e. if it is a number, an XML tag, an abbreviation, etc.::
          
          tokenizer -t <file>
        
        If you want to be able to reconstruct the untokenized input to a
        certain extent, SoMaJo can also provide you with additional details
        for each token, i.e. if the token was followed by whitespace or if it
        contained internal whitespace (according to the tokenization
        guidelines, things like “: )” get normalized to “:)”)::
        
          tokenizer -e <file>
        
        Using the module
        ----------------
        
        You can easily incorporate the tokenizer into your own Python
        projects. All you have to do is import ``somajo.Tokenizer``, create a
        ``Tokenizer`` object and call its ``tokenize`` method::
        
          from somajo import Tokenizer
        
          # note that paragraphs are allowed to contain newlines
          paragraph = "der beste Betreuer?\n- >ProfSmith! : )"
          
          tokenizer = Tokenizer(split_camel_case=True, token_classes=False, extra_info=False)
          tokens = tokenizer.tokenize(paragraph)
        
          print("\n".join(tokens))
        
        Evaluation
        ==========
        
        SoMaJo was the system with the highest average F₁ score in the
        EmpiriST 2015 shared task. The performance of the current version on
        the two test sets is summarized in the following table\ [1]_:
        
        ====== ========= ====== =====
        Corpus Precision Recall F₁
        ====== ========= ====== =====
        CMC    99.62     99.56  99.59
        Web    99.83     99.92  99.87
        ====== ========= ====== =====
        
        .. [1] Training and test sets are available from the `official website
               <https://sites.google.com/site/empirist2015/home/gold>`_.
        
        References
        ==========
        
        - Proisl, Thomas, Peter Uhrig (2016): “SoMaJo: State-of-the-art
          tokenization for German web and social media texts.” In: Proceedings
          of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared
          Task. Berlin: Association for Computational Linguistics (ACL),
          91–96. `PDF <http://aclweb.org/anthology/W16-2607>`_.
        
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Natural Language :: German
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
