# CHANGELOG #

## Version 1.6.0, 2018-03-05 ##

- XML declarations are recognized as single tokens.
- Additional “nasty” characters (zero-width joiners and non-joiners,
  left-to-right and right-to-left marks) are removed from the input.
- The input is normalized to Unicode normal form C (NFC).
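
The cleanup described in the last two items can be illustrated with the standard library alone. This is a minimal sketch, not SoMaJo's actual implementation: the name `clean` and the exact set of code points are assumptions chosen to match the characters listed above.

```python
import unicodedata

# Assumption: these four code points correspond to the characters named above.
INVISIBLE = {
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u200e",  # left-to-right mark
    "\u200f",  # right-to-left mark
}
# Mapping a code point to None makes str.translate delete it.
_DELETE = dict.fromkeys(map(ord, INVISIBLE))


def clean(text: str) -> str:
    """Drop the invisible characters, then normalize to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", text.translate(_DELETE))
```

With NFC, a base letter followed by a combining accent (for example `e` plus U+0301) is composed into the single precomposed character, so visually identical inputs end up as identical strings.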

## Version 1.5.0, 2017-10-23 ##

- Bugfix: Removed trailing space from last token in
  paragraph/sentence.
- SoMaJo should be run as 'somajo-tokenizer'. The 'tokenizer' command
  is deprecated.
- XML entities (&amp;, &#75;, &#x7f;) are recognized as single tokens.
- Some abbreviations (usw., usf., etc., uvam.) indicate sentence
  boundaries if they are followed by a potential sentence start.
- A log message now reports tokenization speed.
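
To illustrate what treating XML entities as single tokens means, here is a rough sketch. The pattern `XML_ENTITY` and the function `split_keeping_entities` are hypothetical and much simpler than a real tokenizer; they only cover named, decimal, and hexadecimal references like the three examples above.

```python
import re

# Hypothetical pattern: decimal, hexadecimal, or named character references.
XML_ENTITY = re.compile(r"&(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def split_keeping_entities(text):
    """Split on whitespace, emitting each XML entity as its own token."""
    tokens = []
    for chunk in text.split():
        pos = 0
        for match in XML_ENTITY.finditer(chunk):
            if match.start() > pos:
                # Text between the previous entity and this one.
                tokens.append(chunk[pos:match.start()])
            tokens.append(match.group())
            pos = match.end()
        if pos < len(chunk):
            # Trailing text after the last entity.
            tokens.append(chunk[pos:])
    return tokens
```

Without such a rule, a naive tokenizer might split `&amp;` into `&`, `amp`, and `;`.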

## Version 1.4.4, 2017-08-03 ##

This release improves sentence splitting for sentences ending in
German closing quotation marks (“).

## Version 1.4.3, 2017-08-02 ##

This bugfix release fixes a bug that occurred in 1.4.2 when using the
option -e on some inputs containing control characters and other
“nasty” characters.

## Version 1.4.2, 2017-07-31 ##

Control characters and other “nasty” characters (soft hyphens and
zero-width spaces) are removed from the input.

## Version 1.4.1, 2017-07-28 ##

Added support for Unicode emoticons and various other Unicode symbols.

## Version 1.4.0, 2017-07-13 ##

SoMaJo can now perform sentence splitting (using the new option
--split_sentences).

## Version 1.3.1, 2017-07-04 ##

SoMaJo is now hosted on GitHub, and the changes made in this version
reflect that move.

## Version 1.3.0, 2016-09-02 ##

Matching of items containing “+” or “&” or being written in camel case
has been optimized a bit. Now the tokenizer runs roughly three to four
times faster.

## Version 1.2.0, 2016-09-01 ##

Two new options have been added: With -s/--paragraph_separator, you can specify
how paragraphs are delimited in the input data, i.e. by empty lines or
by single newlines. The --parallelization option makes it possible to
use a pool of worker processes to speed up tokenization.
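
The difference between the two paragraph delimiting modes can be sketched as follows. The function `split_paragraphs` and the mode names `"empty_lines"` and `"single_newlines"` are assumptions made for this illustration; they are not taken from SoMaJo's code.

```python
import re


def split_paragraphs(text, separator="empty_lines"):
    """Split text into paragraphs (hypothetical helper, not SoMaJo's API)."""
    if separator == "empty_lines":
        # A paragraph ends at one or more blank (whitespace-only) lines.
        parts = re.split(r"\n\s*\n", text)
    elif separator == "single_newlines":
        # Every line is its own paragraph.
        parts = text.split("\n")
    else:
        raise ValueError(f"unknown separator: {separator!r}")
    # Discard empty paragraphs and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]
```

For input that has one paragraph per line, the single-newline mode yields one paragraph per line, whereas the empty-line mode would merge consecutive lines into one paragraph.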

## Version 1.1.2, 2016-08-25 ##

The example in the documentation is now self-contained: Sample input
has been added and the output will be printed.

## Version 1.1.1, 2016-08-19 ##

The link in the Evaluation section of the Readme now points to the
complete gold standard data.

## Version 1.1.0, 2016-08-19 ##

SoMaJo can now output additional information about the original
spelling of the tokens, i.e. if a token was followed by whitespace or
if a token contained internal whitespace (according to the
tokenization guidelines, things like “: )” get normalized to “:)”). To
use this feature, provide the tokenizer script with the ``-e`` option.

## Version 1.0.3, 2016-08-18 ##

This version works around a bug in the regex module that caused
exponential runtimes on certain inputs.
