Metadata-Version: 1.2
Name: pystempel
Version: 1.0.1
Summary: Polish stemmer.
Home-page: https://github.com/dzieciou/pystempel
Author: Maciej Gawinecki
Author-email: mgawinecki@gmail.com
License: See documentation
Project-URL: Source, https://github.com/dzieciou/pystempel
Description: Stempel Stemmer
        ===============
        
        Python port of Stempel, an algorithmic stemmer for the Polish language, originally written in Java.
        
        The original stemmer was implemented as part of the `Egothor Project`_, carried over virtually unchanged into the
        `Stempel Stemmer Java library`_ by Andrzej Białecki, and later included in `Apache Lucene`_,
        a free and open-source search engine library.
        
        .. _Egothor Project: https://www.egothor.org/product/egothor2/
        .. _Stempel Stemmer Java library: http://www.getopt.org/stempel/index.html
        .. _Apache Lucene: https://lucene.apache.org/core/3_1_0/api/contrib-stempel/index.html
        
        This package also includes a high-quality stemming table for Polish with 20,000 training sets,
        pretrained by Andrzej Białecki.
        
        The port does not include code for compiling stemming tables.
        
        
        
        .. _sjp.pl: https://sjp.pl/slownik/en/
        
        
        How to use
        ----------
        
        Install in your local environment:
        
        .. code:: console
        
          pip install pystempel
        
        Use in your code:
        
        .. code:: python
        
          >>> from stempel import StempelStemmer
          >>> stemmer = StempelStemmer.default()
          >>> for word in ['książki', 'książki', 'książkami', 'książkowa', 'książkowymi']:
          ...   print(stemmer.stem(word))
          ...
          książek
          książek
          książek
          książkowy
          książkowy
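        
        The same ``stem`` call can be applied to running text. The helper below is a minimal sketch, not
        part of pystempel: ``stem_text`` is a hypothetical name, the regular-expression tokenizer is a
        simplification, and the fallback to the original token assumes the stemmer may return an empty
        result for words it cannot reduce.
        
        .. code:: python
        
          import re
          
          
          def stem_text(stemmer, text):
              """Lower-case ``text``, split it into word tokens, and stem each token.
          
              ``stemmer`` is any object with a ``stem(word)`` method, for example
              the object returned by ``StempelStemmer.default()``.
              """
              # \w+ matches Unicode letters, so Polish diacritics stay inside tokens.
              tokens = re.findall(r"\w+", text.lower())
              # Assumption: keep the original token when the stemmer returns nothing for it.
              return [stemmer.stem(token) or token for token in tokens]
        
        To try it with the bundled table, pass ``StempelStemmer.default()`` as the first argument.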
        
        
        Choosing between port and wrapper
        ---------------------------------
        
        If you work on an NLP project in Python, you can choose between a Python port and a Python wrapper.
        A Python port is what pystempel aims to be: a translation of the Java implementation into Python.
        A Python wrapper is what I used in `tests`_: Python functions that call the original Java implementation of
        the stemmer. You can read more about wrappers and ports in this `Stack Overflow comparison post`_. Here, I
        compare both approaches to help you decide:
        
        * **Same accuracy**. I verified the Python port by comparing its output with the output of the
          original Java implementation for 331,224 words from the Free Polish dictionary
          (`sjp.pl`_); it returns the same output for 100% of the words.
        * **Similar performance**. On that dataset, both stemmer versions achieved comparable performance.
          The Python port completed stemming in 4.4 seconds and the Python wrapper in 5 seconds (Intel Core
          i5-6000 3.30 GHz, 16GB RAM, Windows 10, OpenJDK).
        * **Different setup**. The Python wrapper additionally requires installing Cython and pyjnius.
          It also makes `debugging harder`_, because you have to switch between two programming languages.
        
        .. _Stack Overflow comparison post: https://stackoverflow.com/questions/10113218/how-to-decide-when-to-wrap-port-write-from-scratch
        .. _debugging harder: https://stackoverflow.com/questions/6970359/find-an-efficient-way-to-integrate-different-language-libraries-into-one-project
        .. _tests: tests/
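        
        A comparison of this kind can be sketched as follows. This is an illustration only, not the code
        from the test suite: ``compare_stemmers`` and ``time_stemmer`` are hypothetical helpers, and each
        stemmer is assumed to be wrapped as a plain callable that takes a word and returns its stem.
        
        .. code:: python
        
          import time
          
          
          def compare_stemmers(stem_a, stem_b, words):
              """Return (word, stem_a result, stem_b result) for every disagreement."""
              mismatches = []
              for word in words:
                  a, b = stem_a(word), stem_b(word)
                  if a != b:
                      mismatches.append((word, a, b))
              return mismatches
          
          
          def time_stemmer(stem, words):
              """Return the wall-clock seconds needed to stem every word once."""
              start = time.perf_counter()
              for word in words:
                  stem(word)
              return time.perf_counter() - start
        
        An empty list from ``compare_stemmers`` means the two implementations agree on every word in the
        sample. Timings from ``time_stemmer`` depend on the machine and on start-up costs such as JVM
        initialization, so treat them as rough indications.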
        
        Development setup
        -----------------
        
        To set up a development environment, you will need `Anaconda`_ installed.
        
        .. _Anaconda: https://anaconda.org/
        
        .. code:: console
        
            conda create -n stempel-stemmer
            conda activate stempel-stemmer
            conda install -c conda-forge --file requirements.txt
        
        To run tests:
        
        .. code:: console
        
            curl https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-stempel/8.1.1/lucene-analyzers-stempel-8.1.1.jar > stempel-8.1.1.jar
            python -m pytest ./
        
        To run benchmark:
        
        .. code:: console
        
            python tests/test_benchmark.py
        
        Licensing
        ---------
        
        Most of the code is covered by the `Egothor Open Source License`_, an Apache-style license. The rest of
        the code and the pretrained stemming table are covered by the `Apache License 2.0`_. Unit tests use the
        Free Polish dictionary for spell-checking from `sjp.pl`_, which is also covered by the
        `Apache License 2.0`_.
        
        .. _Egothor Open Source License: https://www.egothor.org/product/egothor2/
        .. _Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
        
        
        
        
        Other languages
        ---------------
        
        * `Estem`_ is an Erlang wrapper (not a port) for the Stempel stemmer.
        
        .. _Estem: https://github.com/arcusfelis/estem
        
        
Keywords: NLP,natural language processing,computational linguistics,stemming,linguistics,language,natural language,text analytics
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Human Machine Interfaces
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.7
