# CHANGELOG #

## Version 1.8.3, 2018-11-02 ##

- Fixed a bug that caused abbreviations with internal dots but without
  final dot to be split up erroneously (e.g. E.ON).

## Version 1.8.2, 2018-10-26 ##

- Fixed a bug with degree measurements in English (°F, etc.).
- Fixed a bug that caused SoMaJo to hang when an XML tag occured
  within a token that is allowed to contain whitespace.

## Version 1.8.1, 2018-07-30 ##

- Fixed the following bug: When using option -e, “nasty” characters
  between whitespace within tokens that are allowed to contain
  whitespace (e.g. XML tags) caused SoMaJo to hang.
- Added zero-width no-break space (FEFF) to “nasty” characters.

## Version 1.8.0, 2018-07-04 ##

- New language: SoMaJo can tokenize English texts (using the new
  option -l/--language).
- Small improvements to tokenization (URLs, emoticons, number
  compounds, …).

## Version 1.7.0, 2018-03-22 ##

SoMaJo has now full XML support. To tokenize an XML file, use the
option -x/--xml. Via the option --tag (can be used multiple times),
you can specify which tags always constitute sentence breaks, e.g.
title, h1 or p tags in an HTML file.

## Version 1.6.0, 2018-03-05 ##

- XML declarations are recognized as single tokens.
- Additional “nasty” characters (zero-width joiners and non-joiners,
  left-to-right and right-to-left marks) are removed from the input.
- The input is normalized to Unicode normal form C (NFC).

## Version 1.5.0, 2017-10-23 ##

- Bugfix: Removed trailing space from last token in
  paragraph/sentence.
- SoMaJo should be run as 'somajo-tokenizer'. The 'tokenizer' command
  is deprecated.
- XML entities (&amp;, &#75;, &#x7f;) are recognized as single tokens.
- Some abbreviations (usw., usf., etc., uvam.) indicate sentence
  boundaries if they are followed by a potential sentence start.
- We also print a log message that indicates tokenization speed.

## Version 1.4.4, 2017-08-03 ##

This release improves sentence splitting for sentences ending in
German closing quotation marks (“).

## Version 1.4.3, 2017-08-02 ##

This is a bugfix release that fixes a bug that occured in 1.4.2 when
using the option -e on some inputs containing control characters and
other “nasty” characters.

## Version 1.4.2, 2017-07-31 ##

Control characters and other “nasty” characters (soft hyphens and
zero-width spaces) are removed from the input.

## Version 1.4.1, 2017-07-28 ##

Added support for Unicode emoticons and various other Unicode symbols.

## Version 1.4.0, 2017-07-13 ##

SoMaJo can now perform sentence splitting (using the new option
--split_sentences).

## Version 1.3.1, 2017-07-04 ##

SoMaJo is now hosted on Github and the changes made in this version
reflect that change.

## Version 1.3.0, 2016-09-02 ##

Matching of items containing “+” or “&” or being written in camel case
has been optimized a bit. Now the tokenizer runs roughly three to four
times faster.

## Version 1.2.0, 2016-09-01 ##

Two new options added: With -s/--paragraph_separator, you can specify
how paragraphs are delimited in the input data, i.e. by empty lines or
by single newlines. The --parallelization option makes it possible to
use a pool of worker processes to speed up tokenization.

## Version 1.1.2, 2016-08-25 ##

The example in the documentation is now self-contained: Sample input
has been added and the output will be printed.

## Version 1.1.1, 2016-08-19 ##

The link in the Evaluation section of the Readme now points to the
complete gold standard data.

## Version 1.1.0, 2016-08-19 ##

SoMaJo can now output additional information about the original
spelling of the tokens, i.e. if a token was followed by whitespace or
if a token contained internal whitespace (according to the
tokenization guidelines, things like “: )” get normalized to “:)”). To
use this feature, provide the tokenizer script with the -e option.

## Version 1.0.3, 2016-08-18 ##

This version works around a bug in the regex module that caused
exponential runtimes on certain inputs.
