Metadata-Version: 2.1
Name: neolo
Version: 0.1.2
Summary: Text Analysis Software
Home-page: https://github.com/jcrowgey/neolo
Author: Joshua Crowgey
Author-email: jcrowgey@uw.edu
License: BSD
Description: neolo
        =====
        
        Text Analysis Software for Saulo Brandão.  Developed by Joshua Crowgey
        in summer 2014.
        
        ```
        usage: neolo [-h] [--dicts DICT [DICT ...]] [--mltd] [--msttr] [--hdd]
                     [--verbose] [--wordlen] [--wordtypes] [--hapax] [--punc-ratio]
                     [--no-hyphen] [--no-apostrophe] [--sents [ABBREV]]
                     [--stemming LANGUAGE]
                     TEXT
        
        Extract lexical statistics from a text file.
        
        positional arguments:
          TEXT                  the text you want to investigate
        
        optional arguments:
          -h, --help            show this help message and exit
          --dicts DICT [DICT ...]
                                a list of reference texts to compute neologism count
          --mltd                measure of lexical textual diversity
          --msttr               mean segmental type-token ratio
          --hdd                 HD-D probabilistic TTR
          --verbose, -v         increase the verbosity (can be repeated: -vvv)
          --wordlen, -w         print the distribution of words by length
          --wordtypes, -t       print the distribution of wordtypes (unigrams) by
                                count
          --hapax, -x           print the list of hapax legomena
          --punc-ratio, -p      print the ratio of punctuation tokens out of total
                                tokens
          --no-hyphen, -y       remove the hyphen (-) from the list of punctuation
                                symbols used in tokenization
          --no-apostrophe, -a   remove the apostrophe (') from the list of punctuation
                                symbols used in tokenization
          --sents [ABBREV], -s [ABBREV]
                                print sentence length statistics, uses an (optional)
                                abbreviations file containing strings which don't end
                                sentences (eg: Mr.). One abbreviation per line, include
                                relevant punctuation. Note that items in the
                                abbreviations file will also be protected during later
                                tokenization.
          --stemming LANGUAGE, -m LANGUAGE
                                stem words using NLTK prior to processing them
        ```
        
        Neologism Count
        ---------------
        The name of this program reflects its original functionality.  Neologism
        count is computed by referencing known wordlists or dictionaries.  Word types
        found in the text under consideration which are not found in the reference
        dictionaries/wordlists are considered neologisms.
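        
        As a rough illustration of the idea (a minimal sketch, not neolo's actual
        implementation), the count amounts to a set difference between the word
        types of the text and the word types of the references.  The function
        names and the simple regular-expression tokenizer below are assumptions
        made for this example; neolo's own tokenization options (hyphens,
        apostrophes, stemming) are not modeled here.
        
        ```python
        import re
        
        
        def word_types(text):
            """Return the set of downcased word types in text (simplified tokenizer)."""
            return set(re.findall(r"[a-z]+", text.lower()))
        
        
        def neologisms(text, reference_texts):
            """Return the sorted word types of text found in none of the references."""
            known = set()
            for ref in reference_texts:
                known |= word_types(ref)
            return sorted(word_types(text) - known)
        
        
        # Hypothetical inputs: two lines of text and a tiny one-word-per-line wordlist.
        sample = "Mary had a little lamb,\nHer pleece was white as snow."
        wordlist = "a\nas\nfleece\nhad\nher\nlamb\nlittle\nmary\nsnow\nwas\nwhite\n"
        
        print(neologisms(sample, [wordlist]))
        ```
        
        Here the only word type missing from the hypothetical wordlist is
        "pleece", so it is the only item returned.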
        
        To show a simple example, suppose you have a text file called mary.txt 
        which contains the following traditional poem:
        
        ```
        Mary had a little lamb,
        Her fleece was white as snow.
        Everywhere that mary went,
        the lamb was sure to go.
        ```
        
        Supposing you're using the Debian distribution of GNU/Linux, there is a
        list of English words stored in /usr/share/dict/words that you can use as a
        reference.  You can ask neolo to check mary.txt for neologisms using
        the --dicts option, which takes a list of one or more filenames
        to use as references in calculating neologisms.
        
        ```
        user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
        Opening texts/mary.txt with encoding:  utf-8 
        Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
        Counting and sorting words in text: texts/mary.txt ...done.
        Opening /usr/share/dict/words with encoding:  utf-8 
        Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
        Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
        Neologism list:
        
        Statistics:
        -----------
        Text size: 21 tokens in 18 types.
        Number of hapax legomena: 15
        TTR (type-token ratio): 0.8571428571428571
        HTR (hapax-token ratio): 0.7142857142857143
        HTyR (hapax-type ratio): 0.8333333333333334
        Neologisms:  0 types not found in 1 dictionaries
        Dictionaries contained 234937 tokens in 233615 types.
        ```
        
        As you can see, there are no words in mary.txt which aren't in the reference
        wordlist file, so neolo says "Neologisms:  0 types not found in 1 dictionaries".
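        
        The ratios in the Statistics block can be reproduced from three counts:
        tokens, types, and hapax legomena (types that occur exactly once).  The
        sketch below is illustrative only; the function name and the simplified
        tokenizer are assumptions, not neolo's code, and it assumes a non-empty
        text.
        
        ```python
        import re
        from collections import Counter
        
        
        def lexical_stats(text):
            """Compute token/type/hapax counts and the three ratios for text."""
            tokens = re.findall(r"[a-z]+", text.lower())  # simplified tokenizer
            counts = Counter(tokens)
            n_tokens = len(tokens)
            n_types = len(counts)
            n_hapax = sum(1 for c in counts.values() if c == 1)
            return {
                "tokens": n_tokens,
                "types": n_types,
                "hapax": n_hapax,
                "TTR": n_types / n_tokens,   # type-token ratio
                "HTR": n_hapax / n_tokens,   # hapax-token ratio
                "HTyR": n_hapax / n_types,   # hapax-type ratio
            }
        
        
        poem = (
            "Mary had a little lamb,\n"
            "Her fleece was white as snow.\n"
            "Everywhere that mary went,\n"
            "the lamb was sure to go.\n"
        )
        
        for name, value in lexical_stats(poem).items():
            print(name, value)
        ```
        
        For the poem this gives 21 tokens, 18 types and 15 hapax legomena, the
        same counts neolo reports, so the ratios are 18/21, 15/21 and 15/18.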
        
        However, if you edit mary.txt so that the poem's second line reads
        "Her pleece was white as snow." instead, neolo lists pleece under
        "Neologism list:" along with its regular output.
        
        ```
        user@computer:~/src/neolo$ ./neolo texts/mary.txt --dicts /usr/share/dict/words
        Opening texts/mary.txt with encoding:  utf-8 
        Tokenizing, downcasing, stemming text: texts/mary.txt ... done.
        Counting and sorting words in text: texts/mary.txt ...done.
        Opening /usr/share/dict/words with encoding:  utf-8 
        Tokenizing, downcasing, stemming dict files: ['/usr/share/dict/words'] ... done.
        Counting and sorting words in dictonaries: ['/usr/share/dict/words'] ...done.
        Neologism list:
        pleece
        
        Statistics:
        -----------
        Text size: 21 tokens in 18 types.
        Number of hapax legomena: 15
        TTR (type-token ratio): 0.8571428571428571
        HTR (hapax-token ratio): 0.7142857142857143
        HTyR (hapax-type ratio): 0.8333333333333334
        Neologisms:  1 types not found in 1 dictionaries
        Dictionaries contained 234937 tokens in 233615 types.
        ```
        
        MLTD
        ----
        
        MSTTR
        -----
        
        HD-D
        ----
        
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Programming Language :: Python :: 3.5
Requires-Python: >=3.0
Description-Content-Type: text/markdown
