Metadata-Version: 1.1
Name: delune
Version: 0.3.1
Summary: DeLune Python Object Storage and Search Engine
Home-page: https://gitlab.com/hansroh/delune
Author: Hans Roh
Author-email: hansroh@gmail.com
License: GPLv3
Download-URL: https://pypi.python.org/pypi/delune
Description-Content-Type: UNKNOWN
Description: ========
        DeLune
        ========
        
        .. contents:: Table of Contents
        
        
        Introduction
        ============
        
        DeLune (formerly Wissen) is a simple full-text search engine and Python object storage (similar to the NoSQL document concept), written in Python for the logic and in C for the core index/search module.
        
        I studied an earlier version of Lucene_ along with Lupy_ and CLucene_, and then wrote my own search engine as an exercise.
        
        Its file format, numeric compression algorithm and indexing process are quite similar to those of early Lucene versions (I don't know about recent versions at all). But the querying and result-fetching parts are built from my imagination. As a result it's entirely unorthodox and possibly inefficient (I am a typical nerd and work-alone programmer ;-)
        
        DeLune is a kind of hybrid of a search engine and a NoSQL document database.
        
        DeLune stores Python objects serialized with pickle and compressed, so if you use DeLune as a Python module, you can store and get documents directly.
        
        DeLune may be useful when a delay of a few minutes is acceptable for insert, update and delete operations. For example, it is a good fit for legacy content, or for content generated by you rather than by your customers.
        
        Like most full-text search engines, DeLune always and only appends data; it never modifies existing files. So insert, update and delete operations have a high disk-write cost. Sometimes one small deletion may trigger massive disk writes for optimization (even though the cost of the deletion itself is very low).
        
        Anyway, if you need real-time changes to your data, DO NOT USE DeLune, or complement it with another type of NoSQL database or an RDBMS.
        
        DeLune supports storing multiple documents for polymorphic use cases like listing and detail views. This is inefficient in terms of storage, but helps read performance.
        
        DeLune's search mechanism is similar to the DNA-RNA-Protein working model, which can be translated into 'Index File - Temporary Small Replication Buffer - Query Result'.
        
        * Every searcher (Cell) has a single group of index file handlers (the DNA group in the nucleus)
        * Each thread has multiple small memory buffers (RNA) for replicating the needed parts of the index
        * The query class (Ribosome) creates a query result (Protein) by synthesizing the buffers' information (RNAs). Each thread has its own memory space for creating the result set (Protein), not shared with other threads.
        * Repeat from the 2nd step if more results are expected
        
        It also provides a RESTful API for storing, indexing and searching through `Skitai App Engine`_:
        
        * Works in multi-process environments
        * Master-slave replication
        * Sharding, map-reducing and load-balancing using Skitai features (not yet tested)
        
        .. _Lucene: https://lucene.apache.org/core/
        .. _Lupy: https://pypi.python.org/pypi/Lupy
        .. _CLucene: http://clucene.sourceforge.net/
        
        
        Installation
        =============
        
        DeLune contains a C extension, so a C compiler is required, except on Python 3.5 (in which case a pre-compiled module will be installed).
         
        .. code:: bash
        
          pip install delune
        
        On POSIX systems, some packages may be required:
        
        .. code:: bash
            
          apt-get install gcc zlib1g-dev
        
        
        Quick Start
        ============
        
        All text field values should be of str type; otherwise an encoding should be specified.
        
        Indexing and Searching
        -------------------------
        
        Here's an example that indexes only one document.
        
        .. code:: python
        
          import delune
          
          # indexing
          analyzer = delune.standard_analyzer (max_term = 3000)
          col = delune.collection ("./col", delune.CREATE, analyzer)
          indexer = col.get_indexer ()
          
          song = "violin sonata in c k.301"
          composer = u"wolfgang amadeus mozart"
          birth = 1756
          home = "50.665629/8.048906" # Latitude / Longitude of Salzburg
          genre = "01011111" # (rock serenade jazz piano symphony opera quartet sonata)
          
          document = delune.document ()
          
          # object to return, any object serializable by pickle
          document.document ([song, {'composer': 'mozart'}])
          
          # text content used to generate an auto snippet for the given query terms
          document.snippet (song)
          
          # add searchable fields
          document.field ("default", song, delune.TEXT)
          # is same as
          document.field ("default", "$document0.0", delune.TEXT)
          
          document.field ("composer", composer, delune.TEXT)
          # is same as
          document.field ("composer", "$document0.1.composer", delune.TEXT)
          
          document.field ("birth", birth, delune.INT16)
          document.field ("birth2", birth, delune.FNUM, format = "4.0")
          document.field ("genre", genre, delune.BIT8)
          document.field ("home", home, delune.COORD)
          
          indexer.add_document (document)
          indexer.close ()
          
          # searching
          analyzer = delune.standard_analyzer (max_term = 8)
          col = delune.collection ("./col", delune.READ, analyzer)
          searcher = col.get_searcher ()
          print (searcher.query (u'violin', offset = 0, fetch = 2, sort = "tfidf", summary = 30))
          searcher.close ()
          
        
        Result will be like this:
        
        .. code:: python
          
          {
           'code': 200, 
           'time': 0, 
           'total': 1,
           'result': [
            [
             ['violin sonata in c k.301', 'wolfgang amadeus mozart'], # content
             '<b>violin</b> sonata in c k.301', # auto snippet
             14, 0, 0, 0 # additional info
            ]
           ],   
           'sorted': [None, 0], 
           'regex': 'violin|violins',   
          }
        
        A DeLune document can be any picklable Python object; DeLune stores documents in a compressed pickle format. But if you want to fetch partial documents by key or index, the document skeleton should be a list or dictionary, although the inner data types can still be any picklable objects. I think that if your data needs many more reads than writes/updates, DeLune can serve as both a simple schemaless data store and a full-text search engine. DeLune's RESTful API and replication are described later in this document.
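        
        As a rough illustration of the "compressed pickle" idea, here is a minimal sketch using only the standard library. The helper names 'pack' and 'unpack' are hypothetical and the use of zlib is an assumption; DeLune's actual storage format may differ.
        
        .. code:: python
        
          import pickle
          import zlib
          
          def pack (obj):
            # serialize with pickle, then compress the resulting bytes
            return zlib.compress (pickle.dumps (obj))
          
          def unpack (blob):
            # reverse order: decompress, then unpickle
            return pickle.loads (zlib.decompress (blob))
          
          stored = pack (["violin sonata in c k.301", {"composer": "mozart"}])
          restored = unpack (stored)
          # a list or dict skeleton lets you pick out one part by index or key
          print (restored [1]["composer"])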
        
        Learning and Classification
        ---------------------------
        
        Here's an example that guesses one of 'play golf' or 'go to bed' from weather conditions.
        
        .. code:: python
        
           import delune
           
           analyzer = delune.standard_analyzer (max_term = 3000)
           
           # learning
           
           mdl = delune.model ("./mdl", delune.CREATE, analyzer)
           learner = mdl.get_learner ()
           
           document = delune.labeled_document ("Play Golf", "cloudy windy warm")
           learner.add_document (document)  
           document = delune.labeled_document ("Play Golf", "windy sunny warm")
           learner.add_document (document)  
           document = delune.labeled_document ("Go To Bed", "cold rainy")
           learner.add_document (document)  
           document = delune.labeled_document ("Go To Bed", "windy rainy warm")
           learner.add_document (document)   
           learner.close ()
           
           mdl = delune.model ("./mdl", delune.MODIFY, analyzer)
           learner = mdl.get_learner ()
           learner.listbydf () # show all terms with DF (Document Frequency)
           learner.close ()
           
           mdl = delune.model ("./mdl", delune.MODIFY, analyzer)
           learner = mdl.get_learner ()
           learner.build (dfmin = 2) # build corpus DF >= 2
           learner.close ()
           
           mdl = delune.model ("./mdl", delune.MODIFY, analyzer)
           learner = mdl.get_learner ()
           learner.train (
             cl_for = delune.ALL, # for which classifier
             selector = delune.CHI2, # feature selecting method
             select = 0.99, # how many features?
             orderby = delune.MAX, # feature ranking by what?
             dfmin = 2 # exclude DF < 2
           )
           learner.close ()
           
           
           # guessing
           
           mdl = delune.model ("./mdl", delune.READ, analyzer)
           classifier = mdl.get_classifier ()
           print (classifier.guess ("rainy cold", cl = delune.NAIVEBAYES))
           print (classifier.guess ("rainy cold", cl = delune.FEATUREVOTE))
           print (classifier.guess ("rainy cold", cl = delune.TFIDF))
           print (classifier.guess ("rainy cold", cl = delune.SIMILARITY))
           print (classifier.guess ("rainy cold", cl = delune.ROCCHIO))
           print (classifier.guess ("rainy cold", cl = delune.MULTIPATH))
           print (classifier.guess ("rainy cold", cl = delune.META))
           classifier.close ()
           
        
        Result will be like this:
        
        .. code:: python
        
          {
            'code': 200, 
            'total': 1, 
            'time': 5,
            'result': [('Go To Bed', 1.0)],
            'classifier': 'meta'  
          }
        
        
        Export RESTful API Using Skitai
        ==================================
        
        **New in version 0.12.14**
        
        You can use RESTful API with `Skitai App Engine`_.
        
        First of all, you need to install skitai:
        
        .. code:: bash
        
          pip3 install -U skitai
        
        Then copy the code below and save it as app.py.
        
        .. code:: python
          
          import os
          import delune
          import skitai  
          
          if __name__ == "__main__":
            pref = skitai.pref ()
            pref.use_reloader = 1
            pref.debug = 1
            
            config = pref.config
            config.sched = "0/5 * * * *"  
            config.local = "http://127.0.0.1:5000/v1"
            
            config.remote = os.environ.get ("DELUNE_ORIGIN")
            config.enable_mirror = False
            
            config.resource_dir = skitai.joinpath ('resources')
            config.enable_index = True
            
            config.logpath = None
            skitai.trackers ('delune:collection')
            skitai.mount ("/v1", delune, "app", pref)
            skitai.run (  
              workers = 2,
              threads = 4,
              port = 5000      
            )
        
        This app runs an indexing job every 5 minutes in the background.
        
        If you want a read-only replica, set the origin server in your account's environment:
        
        .. code:: bash  
        
          export DELUNE_ORIGIN=http://192.168.1.200:5000/v1
        
        All collections will be replicated from the http://192.168.1.200:5000/v1 API every 5 minutes.
        
        Then run the app.
        
        .. code:: bash
        
          python app.py -v
        
        Here's an example of a client-side indexing script using the API.
        
        .. code:: python
        
          colopt = {
            'version': 1,
            'data_dir': [
              'models/0/books',
              'models/1/books',
              'models/2/books'
            ],
            'analyzer': {
              "ngram": 0,
              "stem_level": 1,
              "strip_html": 0,
              "make_lower_case": 1
            },
            'indexer': {
              'force_merge': 0,
              'max_memory': 10000000,
              'max_segments': 10,
              'lazy_merge': (0.3, 1),
            },
            'searcher': {
              'max_result': 2000,
              'num_query_cache': 200
            }
          }
          
          import delune
          import requests
          session = requests.Session ()
          
          # check current collections
          r = session.get ('http://127.0.0.1:5000/v1/').json ()
          if 'books' not in r ["collections"]:  
            # collection does not exist, so create it
            session.post ('http://127.0.0.1:5000/v1/books', colopt)
          
          dbc = db.connect (...)
          cursor = dbc.cursor ()
          cursor.execute (...)
          
          numdoc = 0
          while 1:
            row = cursor.fetchone ()
            if not row: break
            doc = delune.document (row._id)
            doc.document ({"author": row.author, "title": row.title , "abstract": row.abstract})
            doc.snippet (row.abstract)
            doc.field ('default', "%s %s" % (row.title, row.abstract), delune.TEXT, 'en')
            doc.field ('title', row.title, delune.TEXT, 'en')
            doc.field ('author', row.author, delune.STRING)
            doc.field ('isbn', row.isbn, delune.STRING)
            doc.field ('year', row.year, delune.INT16) 
               
            session.post ('http://127.0.0.1:5000/v1/books/documents', doc.as_json ())
            numdoc += 1
            if numdoc % 1000 == 0:
              # commit the document queue every 1000 documents
              session.get ('http://127.0.0.1:5000/v1/books/commit')
          
          cursor.close ()
          dbc.close ()
        
        doc.document (object) sets the document object to be returned. Multiple documents can be stored, and you can select one of them with the 'nthdoc' parameter.
        
        .. code:: python
        
          session.get (
            "http://127.0.0.1:5000/v1/books/search?"
            "q=title:book"
            "&nthdoc=1"
          )
        
        This is useful for returning different document formats for a search view or a detail view.
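        
        When building such URLs by hand, the query string should be escaped. A minimal sketch using only the standard library follows; the helper 'search_url' is hypothetical, and the endpoint and parameter names are taken from the example above.
        
        .. code:: python
        
          from urllib.parse import urlencode
          
          def search_url (base, collection, **params):
            # urlencode escapes reserved characters such as ':' in 'title:book'
            return "%s/%s/search?%s" % (base, collection, urlencode (params))
          
          url = search_url ("http://127.0.0.1:5000/v1", "books", q = "title:book", nthdoc = 1)
          print (url)
        
        With requests, passing the same values as params = {...} to session.get should produce an equivalent request.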
        
        All APIs are:
        
        .. code:: python
          
          # add new collection with options
          session.post ("http://127.0.0.1:5000/v1", colopt)
          # get collection status and options
          session.get ("http://127.0.0.1:5000/v1/books")
          # modify collection options
          session.patch ("http://127.0.0.1:5000/v1/books", colopt)
          # remove collection but preserve all index files
          session.delete ("http://127.0.0.1:5000/v1/books")
          # remove collection with all index files
          session.delete ("http://127.0.0.1:5000/v1/books?side_effect=data")
          # undo remove collection with all index files
          session.get ("http://127.0.0.1:5000/v1/books?side_effect=undo")
          
          # get collection locks
          session.get ("http://127.0.0.1:5000/v1/books/locks")
          # create 'custom' lock
          session.post ("http://127.0.0.1:5000/v1/books/locks/custom")
          # delete 'custom' lock
          session.delete ("http://127.0.0.1:5000/v1/books/locks/custom")
          
          # add new document
          session.post (
            "http://127.0.0.1:5000/v1/books/documents",
            doc.as_json ()
          )
          # modify document
          session.patch (
            "http://127.0.0.1:5000/v1/books/documents/" + row._id,
            doc.as_json ()
          )
          # delete document by document_id
          session.delete ("http://127.0.0.1:5000/v1/books/documents/" + row._id)
          
          # truncate all documents from collection
          session.delete ("http://127.0.0.1:5000/v1/books/documents?truncate_confirm=books")
          
          # search
          session.get (
            "http://127.0.0.1:5000/v1/books/search?"
            "q=title:book"
            "&offset=0"
            "&limit=10"
            "&snippet=30" # number of desired snippet words
            "&lang=en" # language code
            "&partial=author,title" # fetch partial elements of document
            "&nthdoc=0" # get nth document stored
          )
          # guess
          session.get (
            "http://127.0.0.1:5000/v1/books/guess?"
            "q=title:book"
            "&clf=naivebayes" # classifier
            "&top=1" # number of top scored results
            "&lang=en"
          )
          # delete documents by search
          session.delete ("http://127.0.0.1:5000/v1/books/search?q=title:book")
          
          # commit document queue
          session.get ('http://127.0.0.1:5000/v1/books/commit')
          # remove document queue
          session.get ('http://127.0.0.1:5000/v1/books/rollback')  
        
        Note: DeLune doesn't check the uniqueness of document IDs. This means that if you post multiple documents with the same document ID, DeLune will index all of them regardless of the ID. If you want to keep IDs unique, you SHOULD use the 'patch' method, NOT 'post'.
          
        For more details about the API, see `app.py`_.
             
        .. _`Skitai App Engine`: https://pypi.python.org/pypi/skitai
        .. _`app.py`: https://gitlab.com/hansroh/delune/blob/master/delune/export/skitai/app.py
        
        
        Limitations
        ==============
        
        Before you test DeLune, you should know some of its limitations.
        
        - DeLune search cannot sort by string type fields, but can sort by int/bit/coord types and TFIDF ranking.
        
        - DeLune classification doesn't aim for accuracy but for real-time (meaning within 1 second) guessing performance. So I used relatively simple and fast classification algorithms. If you need accuracy, it's not a good fit for you.
        
        
        Configure DeLune
        ==================
        
        Configuration is not necessary for indexing/learning, but it is required for searching/guessing. The reason is that DeLune allocates memory per thread for searching and classifying at initialization.
        
        .. code:: python
        
          delune.configure (
            numthread, 
            logger, 
            io_buf_size = 4096, 
            mem_limit = 256
          )
        
         
        - numthread: number of threads that access DeLune collections and models. If set to 8, you can open multiple collections (or models) and access them with 8 threads. If a 9th thread tries to access DeLune, an error will be raised
        
        - logger: *see next chapter*
        
        - io_buf_size = 4096: size in bytes of the flash buffer for replicating index files
        
        - mem_limit = 256: memory limit per thread, but it's not absolute. It can be exceeded during a calculation if needed, but once the calculation has finished, the memory is returned ASAP.
        
        
        Finally, when your app terminates, call shutdown.
        
        .. code:: python
        
          delune.shutdown ()
          
        
        Logger
        ========
        
        .. code:: python
        
          from delune.lib import logger
          
          logger.screen_logger ()
          
          # it will create the file '/var/log/delune.log', rotated on a daily basis
          logger.rotate_logger ("/var/log", "delune", "daily")
          
        
        Standard Analyzer
        ====================
        
        An analyzer is needed for the TEXT and TERM types.
        
        Basic usage is:
        
        .. code:: python
        
          analyzer = delune.standard_analyzer (
            max_term = 8,
            numthread = 1,
            ngram = True, # True or False
            stem_level = 1, # 0, 1 or 2 (2 is only applied to English)
            make_lower_case = True, # True or False
            stopwords_case_sensitive = True, # True or False
            ngram_no_space = True, # True or False
            strip_html = True, # True or False
            contains_alpha_only = True, # True or False
            stopwords = ["word1", "word2"] # list of custom stopwords
          )
        
        - stem_level: 0 or 1; only the 'en' language has level 2, for hard stemming
        
        - make_lower_case: lowercase all text
        
        - stopwords_case_sensitive: only takes effect if make_lower_case is False
        
        - ngram_no_space: if False, '泣斬 馬謖' will be tokenized to _泣, 泣斬, 斬\_, _馬, 馬謖, 謖\_. But if True, an additional bi-gram 斬馬 will be created between 斬\_ and _馬.
        
        - strip_html
        
        - contains_alpha_only: remove terms which don't contain any alphabetic characters; this option is useful for full-text training in some cases
        
        - stopwords: DeLune has only an English stopword list. You can supply custom stopwords. Stopwords should be unicode or utf8 encoded bytes
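        
        The ngram_no_space behavior described above can be sketched in plain Python. This is only an illustration of the tokenization rule, not DeLune's C implementation; the function name 'bigrams' is hypothetical.
        
        .. code:: python
        
          def bigrams (text, no_space = False):
            words = text.split ()
            grams = []
            for i, word in enumerate (words):
              # pad each word so that its first and last characters get boundary bi-grams
              padded = "_" + word + "_"
              grams.extend (padded [j:j + 2] for j in range (len (padded) - 1))
              if no_space and i + 1 < len (words):
                # extra bi-gram bridging the space between two adjacent words
                grams.append (word [-1] + words [i + 1][0])
            return grams
          
          print (bigrams ("泣斬 馬謖"))
          print (bigrams ("泣斬 馬謖", no_space = True))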
        
        DeLune has several stemmers and n-gram methods for international languages, which can be used this way:
        
        .. code:: python
        
          analyzer = delune.standard_analyzer (ngram = True, stem_level = 1)
          col = delune.collection ("./col", delune.CREATE, analyzer)
          indexer = col.get_indexer ()
          document.field ("default", song, delune.TEXT, lang = "en")
        
        
        Implemented Stemmers
        ---------------------
        
        Except for the English stemmer, all stemmers can be obtained at `IR Multilingual Resources at UniNE`__.
        
          - ar: Arabic
          - de: German
          - en: English
          - es: Spanish
          - fi: Finnish
          - fr: French
          - hu: Hungarian
          - it: Italian
          - pt: Portuguese
          - sv: Swedish
         
        .. __: http://members.unine.ch/jacques.savoy/clef/index.html
        
        
        Bi-Gram Index
        ----------------
        
        If ngram is set to True, these languages will be indexed with bi-grams.
        
          - cn: Chinese
          - ja: Japanese
          - ko: Korean
        
        Also note that if a word contains only alphabetic characters, the English stemmer will be used.
        
        
        Tri-Gram Index
        ---------------
        
        For other languages, the English stemmer will be used if a word consists entirely of alphabetic characters. And if ngram is set to True, words containing multibyte characters will be indexed with tri-grams.
        
        **Methods Spec**
        
          - analyzer.index (document, lang)
          - analyzer.freq (document, lang)
          - analyzer.stem (document, lang)
          - analyzer.count_stopwords (document, lang)
        
        
        Collection
        ==================
        
        Collection manages index files, segments and properties.
        
        .. code:: python
        
          col = delune.collection (
            indexdir = [dirs], 
            mode = [ CREATE | READ | APPEND ], 
            analyzer = None,
            logger = None 
          )
        
        - indexdir: a path or a list of paths, for using multiple disks efficiently
        - mode
        - analyzer
        - logger: not necessary if a logger was configured by delune.configure
        
        Collection has 2 major classes: indexer and searcher.
        
        
        
        Indexer
        ---------
        
        To search documents, the text must first be indexed to build an inverted index for fast term queries.
        
        .. code:: python
        
          indexer = col.get_indexer (
            max_segments = 10, # int
            force_merge = False, # True or False
            max_memory = 10000000, # 10 MB
            optimize = True # True or False
          )
        
        - max_segments: maximum number of index segments; if exceeded, segments will be merged. Also note that during indexing, up to 3 times max_segments segments can be created, and when indexer.close () is called, it automatically tries to merge them until the number of segments is appropriate
        
        - force_merge: when indexer.close () is called, forcibly try to merge into a single segment. But it fails if the index is too big: > 2 GB on a 32-bit OS, > 10 GB on 64-bit
        
        - max_memory: if exceeded, a new segment is created during indexing
        
        - optimize: when indexer.close () is called, segments will be merged into as optimal a number as possible
        
        
        To add a document to the indexer, create a document object:
        
        .. code:: python
        
          document = delune.document ()     
        
        DeLune handles 3 objects as completely different objects with no relationship between them:
        
        - returning content
        - snippet generating field
        - searchable fields
        
        
        **Returning Content**
        
        DeLune serializes returning content with pickle, so you can set any picklable object.
        
        .. code:: python
        
          document.document ({"userid": "hansroh", "preference": {"notification": "email", ...}})
          
          # or
          
          document.document ([32768, "This is sample ..."])
        
        
        **Snippet Generating Field**  
        
        This field should be unicode or utf8 encoded bytes.
        
        .. code:: python
        
          document.snippet ("This is sample...")
        
        
        **Searchable Fields**
        
        A document also receives searchable fields:
        
        .. code:: python
        
          document.field (name, value, ftype = delune.TEXT, lang = "un", encoding = None)
          
          document.field ("default", "violin sonata in c k.301", delune.TEXT, "en")
          document.field ("composer", "wolfgang amadeus mozart", delune.TEXT, "en")
          document.field ("lastname", "mozart", delune.STRING)
          document.field ("birth", 1756, delune.INT16)
          document.field ("genre", "01011111", delune.BIT8)
          document.field ("home", "50.665629/8.048906", delune.COORD6)
          
          
        - name: if 'default', this field will be searched by a simple query string; otherwise use 'name:query_text'
        - value: unicode/utf8 encoded text; otherwise the encoding arg should be given
        - ftype: *see below*
        - encoding: give e.g. 'iso8859-1' if value is not unicode/utf8
        - lang: language code for standard_analyzer, "un" (unknown) is default
          
        Available field types are:
        
          - TEXT: analyzable full-text, result-not-sortable
          
          - TERM: analyzable full-text, but position data will not be indexed, so phrases can't be searched, result-not-sortable
          
          - STRING: exact string match, like nation codes, result-not-sortable
          
          - LIST: comma separated STRINGs, result-not-sortable
          
          - FNUM: formatted number. The value should be int or float and the format parameter is required. The format is "digit.digit": the number of digits of the integer part (with leading zeros) and the number of digits of the fractional part. This makes it possible to search ranges efficiently.
          
          - COORDn, n=4,6,8 decimal precision: comma separated string 'latitude,longitude'. Latitude and longitude should be floats in the ranges -90 ~ 90 and -180 ~ 180. n is the precision of the coordinates: n=4 is 10m radius precision, 6 is 1m and 8 is 10cm. result-sortable
          
          - BITn, n=8,16,24,32,40,48,56,64: bitwise operation, a bit-marked string of length n is required, result-sortable
          
          - INTn, n=8,16,24,32,40,48,56,64: range, int required, result-sortable
        
        Note1: Make sure COORD, INT and BIT fields are present in every document even if they have no value, because these types depend on the document's indexed sequence ID. If they have no value, please set the value to None; do NOT omit the fields.
        
        Note2: FNUM 100.12345 with format="5.3" is internally converted into "00100.123", and a negative value will be -00100.123. MAKE SURE your values are within -99999.999 and 99999.999.
          
        Repeat add_document as needed, then close the indexer.
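        
        The FNUM conversion in Note2 can be sketched as follows. The function 'fnum_encode' is hypothetical and only mirrors the documented examples; in particular, this sketch rounds the fractional part, while DeLune itself may truncate.
        
        .. code:: python
        
          def fnum_encode (value, fmt):
            # fmt is "digit.digit": width of the integer part and of the fractional part
            int_digits, frac_digits = (int (n) for n in fmt.split ("."))
            # total width includes the decimal point when there is a fractional part
            width = int_digits + (1 + frac_digits if frac_digits else 0)
            text = "%0*.*f" % (width, frac_digits, abs (value))
            return "-" + text if value < 0 else text
          
          print (fnum_encode (100.12345, "5.3"))  # 00100.123
          print (fnum_encode (-100.12345, "5.3")) # -00100.123
          print (fnum_encode (1756, "4.0"))       # 1756
        
        Because every non-negative value is padded to the same width, comparing the encoded strings gives the same order as comparing the numbers, which is what makes range search on such a field cheap.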
        
        .. code:: python
        
          for ...:
            document = delune.document ()
            ...
            indexer.add_document (document)
          
          # close once, after all documents have been added
          indexer.close ()
        
        If searchers using this collection run in another process or thread, they are automatically reloaded within a few seconds to apply the changed index.
        
        
        Searcher
        ---------
        
        To run a searcher, you should call delune.configure () first and then create the searcher.
        
        .. code:: python
          
          searcher = col.get_searcher (
            max_result = 2000,
            num_query_cache = 200
          ) 
          
        - max_result: maximum number of search results returned. Default is 2000; if set to 0, results are unlimited
        
        - num_query_cache: default is 200; if exceeded, the least recently accessed entries are removed
        
        
        Query is simple:
        
        .. code:: python
        
          searcher.query (
            qs, 
            offset = 0, 
            fetch = 10, 
            sort = "tfidf", 
            summary = 30, 
            lang = "un"
          )
          
        - qs: string (unicode) or utf8 encoded bytes. For the detailed query syntax, see below
        - offset: start position of the returned records
        - fetch: number of records from offset
        - sort: "(+-)tfidf" or "(+-)field name". The field should be of int/bit type. '-' means descending (high score/value first) and is the default if not specified. If sort is "", records are returned in reverse indexing order
        - summary: number of terms for snippet
        - lang: default is "un" (unknown)
        
        
        To delete indexed documents:
        
        .. code:: python
        
          searcher.delete (qs)
        
        All matching documents will be deleted immediately. And if searchers using this collection run in another process or thread, these searchers are automatically reloaded within a few seconds.
        
        Finally, close the searcher.
        
        .. code:: python
        
          searcher.close ()
        
        
        **Query Syntax**
        
          - violin composer:mozart birth:1700~1800 
          
            search 'violin' in the default field and 'mozart' in the composer field, with birth in the range 1700 to 1800
            
          - violin allcomposer:wolfgang mozart
          
            search 'violin' in the default field; all terms after allcomposer will be searched in the composer field
           
          - violin -sonata birth2:1700~1800
            
            does not contain 'sonata' in the default field, and birth2 is between '1700' and '1800'
              
          - violin -sonata birth:~1800
          
            does not contain 'sonata' in the default field, and birth is at most 1800
          
          - violin -composer:mozart
          
            does not contain 'mozart' in the composer field
          
          - violin or piano genre:00001101/all
          
            matches documents whose 5th, 6th and 8th bits are all 1. /any and /none are also available
            
          - violin or ((piano composer:mozart) genre:00001101/any)
          
            supports unlimited nesting of '()' and 'or' operators
          
          - (violin or ((allcomposer:mozart wolfgang) -amadeus)) sonata (genre:00001101/none home:50.6656,8.0489~10000)
          
            search within 10 km of the home location coordinate (50.6656, 8.0489)
          
          - "violin sonata" genre:00001101/none home:50.6656/8.0489~10
          
            search the exact phrase "violin sonata"
          
          - "violin^3 piano" -composer:"ludwig van beethoven"
        
            search the loose phrase "violin piano" within 3 terms
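        
        The /all, /any and /none modifiers for BIT fields in the query syntax above can be read as plain bitwise tests. A minimal sketch of that reading follows; 'match_bits' is a hypothetical helper, and this interpretation of the three modes is an assumption based on the description above, not DeLune's actual code.
        
        .. code:: python
        
          def match_bits (value, mask, mode = "all"):
            # both arguments are bit-marked strings such as "01011111"
            v, m = int (value, 2), int (mask, 2)
            if mode == "all":
              return (v & m) == m # every masked bit is set
            if mode == "any":
              return (v & m) != 0 # at least one masked bit is set
            if mode == "none":
              return (v & m) == 0 # no masked bit is set
            raise ValueError ("unknown mode: %s" % mode)
          
          genre = "01011111" # the genre value indexed in Quick Start
          print (match_bits (genre, "00001101", "all"))
          print (match_bits (genre, "00001101", "none"))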
        
            
        Model
        =============
        
        Model manages index, train files, segments and properties.
        
        .. code:: python
        
          mdl = delune.model (
            indexdir = [dirs],
            mode = [ CREATE | READ | MODIFY | APPEND ], 
            analyzer = None, 
            logger = None
          )
        
        
        Learner
        ---------
        
        Building a model with DeLune takes 3 steps.
        
        - Step I. Index documents to learn
        - Step II. Build corpus
        - Step III. Select features and save the trained model
        
        **Step I. Index documents** 
        
        The learner uses delune.labeled_document, not delune.document. You can add extra searchable fields if you need them. The label is the name of the category.
        
        .. code:: python
          
          learner = mdl.get_learner ()
          for label, document in trainset:
          
            labeled_document = delune.labeled_document (label, document)
            # additional searchable fields if you need
            labeled_document.field (name, value, ftype = TEXT, lang = "un", encoding = None)
            learner.add_document (labeled_document)
        
          learner.close ()
        
        
        **Step II. Building Corpus** 
        
        Document frequency (DF) is one of the major factors for a classifier. Low-DF terms are important for searching but not for classifying. An important part of learning is selecting valuable terms, but terms with very low DF are not very helpful for classifying a new document, because they also have a low probability of appearing in it.
        
        So for efficient learning and classification, it's useful to eliminate terms with too low or too high DF. For example, let's assume you index 30,000 web pages for learning and there are about 100,000 terms. If you build the corpus with all terms, learning takes a very long time. But if you remove terms with DF < 10 or DF > 7000, 75% - 80% of all terms will be removed.
        
        .. code:: python  
          
          # reopen model with MODIFY
          mdl = delune.model (indexdir, MODIFY)
          learner = mdl.get_learner ()
          
          # show terms ordered by DF for examination
          learner.listbydf (dfmin = 10, dfmax = 7000)
          
          # build corpus and save
          learner.build (dfmin = 10, dfmax = 7000)
          
        As a result, the corpus is built with about 25,000 terms. Build time grows with the number of terms.
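        
        The effect of dfmin and dfmax can be illustrated without DeLune. The sketch below is plain Python, not the library's code; the helper names are hypothetical, and it assumes that a bound of 0 means "no limit".
        
        .. code:: python
        
          from collections import Counter
          
          def document_frequency(documents):
              """Count, for every term, the number of documents that contain it."""
              df = Counter()
              for terms in documents:
                  df.update(set(terms))  # a term counts once per document
              return df
          
          def filter_by_df(df, dfmin=0, dfmax=0):
              """Keep terms with dfmin <= DF <= dfmax; 0 disables a bound (assumption)."""
              return {
                  term: count for term, count in df.items()
                  if (not dfmin or count >= dfmin) and (not dfmax or count <= dfmax)
              }
          
          docs = [
              ["violin", "sonata", "mozart"],
              ["violin", "concerto", "mozart"],
              ["violin", "piano"],
              ["violin", "sonata"],
          ]
          df = document_frequency(docs)
          # 'violin' (DF 4) is too common, 'concerto' and 'piano' (DF 1) too rare
          print(sorted(filter_by_df(df, dfmin=2, dfmax=3)))  # ['mozart', 'sonata']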
        
        
        **Step III. Feature Selecting and Saving Model** 
        
        Features are the most valuable terms for classifying new documents. It is important to understand that neither too many nor too few features give the best result. Selecting good features may be the most important part of classification.
        
        For example, my work on classifying URLs into 2 classes shows the results below. The classifier is NAIVEBAYES, the selector is GSS and the min DF is 2. The train set is 20,000 and the test set is 2,000.
        
          - features 3,000 => 82.9% matched, 73 documents unclassified
          - features 2,000 => 82.9% matched, 73 documents unclassified
          - features 1,500 => 83.4% matched, 75 documents unclassified
          - features 1,000 => 83.6% matched, 79 documents unclassified
          - features   500 => 83.1% matched, 86 documents unclassified
          - features   200 => 81.1% matched, 108 documents unclassified
          - features    50 => 76.0% matched, 155 documents unclassified
          - features    10 => 58.7% matched, 326 documents unclassified
        
        The results show that going over 2,000 or under 1,000 features leaves classification quality unchanged or degrades it. For most classifiers, too few features increase the unclassified ratio; for NAIVEBAYES in particular, too many features also increase the unclassified ratio because of the way it calculates.
        
        .. code:: python  
          
          mdl = delune.model (indexdir, MODIFY)
          learner = mdl.get_learner ()
          
          learner.train (
            cl_for = [
              ALL (default) | NAIVEBAYES | FEATUREVOTE | 
              TFIDF | SIMILARITY | ROCCHIO | MULTIPATH
            ],
            select = number of features if value is > 1 or ratio,
            selector = [
              CHI2 | GSS | DF | NGL | MI | TFIDF | IG | OR | 
              OR4P | RS | LOR | COS | PPHI | YULE | RMI
            ],
            orderby = [SUM | MAX | AVG],
            dfmin = 0, 
            dfmax = 0
          )
          learner.close ()
          
        - cl_for: the classifier to train for. If not specified, the selected features are used as the default for every classifier that has no feature set of its own. So train () can be called repeatedly, once for each classifier.
        
        - select: number of features if the value is > 1, otherwise a ratio of all terms. Generally, it probably need not exceed 7,000 features for classifying web pages or news articles into 20 classes.
        
        - selector: mathematical term-scoring algorithm for selecting features, considering the relation between term and term or between term and label, as well as DF, term frequency (TF), etc.
        
        - orderby: final scoring method, one of sum, max or average value
        
        - dfmin, dfmax: even though terms were already removed by build (), more can be removed here to get the optimal result for a specific classifier
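        
        To make the select, selector and orderby options more concrete, here is a stand-alone sketch of feature selection with a chi-square score. It only illustrates the idea and is not DeLune's implementation: ``chi2`` uses the usual 2x2 contingency formula, and ``select_features`` is a hypothetical helper that combines per-label scores by sum, max or average and then applies a count or a ratio.
        
        .. code:: python
        
          def chi2(a, b, c, d):
              """Chi-square of a term/label 2x2 table.
          
              a: documents with the term and the label
              b: documents with the term and another label
              c: documents without the term, with the label
              d: documents without the term and with another label
              """
              n = a + b + c + d
              denom = (a + c) * (b + d) * (a + b) * (c + d)
              return n * (a * d - c * b) ** 2 / denom if denom else 0.0
          
          def select_features(scores, select, orderby="max"):
              """scores maps each term to its list of per-label scores."""
              combine = {"sum": sum, "max": max,
                         "avg": lambda v: sum(v) / len(v)}[orderby]
              ranked = sorted(scores, key=lambda t: combine(scores[t]), reverse=True)
              # select > 1 is an absolute count, otherwise a ratio of all terms
              count = int(select) if select > 1 else int(len(ranked) * select)
              return ranked[:count]
          
          # two labels, 50 documents each; counts are invented for the example
          scores = {
              "violin": [chi2(40, 10, 10, 40), chi2(10, 40, 40, 10)],
              "piano": [chi2(30, 20, 20, 30), chi2(20, 30, 30, 20)],
              "the": [chi2(25, 25, 25, 25), chi2(25, 25, 25, 25)],
          }
          print(select_features(scores, 2))    # ['violin', 'piano']
          print(select_features(scores, 0.5))  # ['violin']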
        
        
        To remove the training data for a specific classifier:
        
        .. code:: python  
          
          mdl = delune.model (indexdir, MODIFY)
          learner = mdl.get_learner ()
          
          learner.untrain (cl_for)
          learner.close ()
        
        
        **Finding Best Training Options**
        
        Because data sets differ in their attributes, it is hard to say which options are best in general. Getting the best result requires repeating the process between train () and guess () a number of times, and that is not an easy process.
        
        - index ()
        - build ()
        - train (initial options)
        - measure results with guess ()
        - append additional documents, build () if needed
        - train (other options)
        - measure results again with guess ()
        - ...
        - find the optimal training options for your data set
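        
        The loop above can be sketched as a simple search over candidate options. The function below is hypothetical and independent of DeLune: ``train_and_measure`` is a callable you would write yourself to call train () with the given options, run guess () on the test set and return a score. The scores in the example are made up so that the sketch runs on its own.
        
        .. code:: python
        
          from itertools import product
          
          def find_best_options(feature_counts, selectors, train_and_measure):
              """Try every (features, selector) pair and return the best one."""
              best_options, best_score = None, float("-inf")
              for features, selector in product(feature_counts, selectors):
                  score = train_and_measure(features, selector)
                  if score > best_score:
                      best_options, best_score = (features, selector), score
              return best_options, best_score
          
          # made-up scores standing in for a real train () / guess () round trip
          fake_scores = {
              (1000, "GSS"): 0.81, (1000, "OR"): 0.84,
              (2000, "GSS"): 0.79, (2000, "OR"): 0.83,
          }
          best = find_best_options([1000, 2000], ["GSS", "OR"],
                                   lambda f, s: fake_scores[(f, s)])
          print(best)  # ((1000, 'OR'), 0.84)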
        
        To measure result accuracy, your prepared data should be split into a train set for train () and a test set for guess (), so that you can measure something like `precision and recall`_.
        
        For example, there were 27,000 web pages in the training set and 2,700 in the test set, for classifying pages as spam or not. There were 199,183 indexed terms in total; I eliminated 94% of them by DF < 30 or DF > 7000, and only 10,221 terms remained.
        
        - F: selected features by OR(Odds Ratio) MAX
        - NB: NAIVEBAYES, RO: ROCCHIO
        - Numbers mean: matched % excluding unclassified documents (number of unclassified documents)
        
          - F 7,000: NB 97.2 (1,100), RO 95.4 (50)
          - F 5,000: NB 97.4 (493), RO 94.8 (69) 
          - F 4,000: NB 96.6 (282), RO 91.6 (96)
          - F 3,000: NB 93.2 (214), RO 86.2 (151)
          - F 2,000: NB 89.4 (293), RO 80.1 (281)
        
        Which would you choose? In my case, I chose F 5,000 with ROCCHIO because of its low unclassified ratio. But if speed were more important, I might choose F 3,000 with NAIVEBAYES.
        
        Anyway, once everything is done and you have found the optimal parameters, you can optimize the classifier model.
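        
        The two numbers reported in the table above can be computed as in the following sketch. It is a hypothetical helper, not part of DeLune, and it assumes that an unclassified document is represented by None in the list of guessed labels.
        
        .. code:: python
        
          def measure(expected, guessed):
              """Return (matched % excluding unclassified, number unclassified)."""
              classified = [(e, g) for e, g in zip(expected, guessed) if g is not None]
              unclassified = len(expected) - len(classified)
              if not classified:
                  return 0.0, unclassified
              matched = sum(1 for e, g in classified if e == g)
              return 100.0 * matched / len(classified), unclassified
          
          expected = ["spam", "ham", "spam", "ham", "spam"]
          guessed = ["spam", "ham", "ham", None, "spam"]
          print(measure(expected, guessed))  # (75.0, 1)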
        
        .. code:: python
        
          mdl = delune.model (indexdir, delune.MODIFY, an)
          learner = mdl.get_learner ()
          learner.optimize ()
          learner.close ()
        
        Note that once optimize () has been called,
        
        - you cannot add additional training documents
        - you cannot rebuild corpus by calling build () again
        - but you can still call train () any time
        
        The reason is that when low/high DF terms are eliminated by optimize (), the related index files are also shrunk irreversibly for performance. So if you need to do any of this work again, you must start over from step I.
        
        If you don't optimize, the SIMILARITY and ROCCHIO classifiers will be inefficient (it has NO influence on the NAIVEBAYES, TFIDF and FEATUREVOTE classifiers). But if you think regular retraining is more important than speed, you should not optimize.
        
        .. _`precision and recall`: https://en.wikipedia.org/wiki/Precision_and_recall
        
        
        **Feature Selecting Methods**
        
          - CHI2 = Chi Square Statistic
          - GSS = GSS Coefficient 
          - DF = Document Frequency
          - CF = Category Frequency
          - NGL = NGL
          - MI = Mutual Information
          - TFIDF = Term Frequency - Inverse Document Frequency
          - IG = Information Gain
          - OR = Odds Ratio
          - OR4P = Kind of Odds Ratio(? can't remember)
          - RS = Relevancy Score
          - LOR = Log Odds Ratio
          - COS = Cosine Similarity 
          - PPHI = Pearson's PHI
          - YULE = Yule
          - RMI = Residual Mutual Information
          
        I personally prefer OR, IG and GSS selectors with MAX method.
        
        
        Classifier
        ------------
          
        Finally,
        
        .. code:: python  
          
          classifier = mdl.get_classifier ()
          classifier.guess (
            qs, 
            lang = "un", 
            cl = [ 
              NAIVEBAYES (Default) | FEATUREVOTE | ROCCHIO | 
              TFIDF | SIMILARITY | META | MULTIPATH
            ],
            top = 0,
            cond = ""
          )
          
          classifier.cluster (
            qs, 
            lang = "un"    
          )
          
          classifier.close ()
          
        - qs: full text stream to classify
        
        - lang
        
        - cl: which classifier to use; NAIVEBAYES is the default
        
        - top: how many high-scored results to return; the default 0 means only the highest-scored result(s)
        
        - cond: conditional document-selecting query. Some classifiers, like ROCCHIO and SIMILARITY, calculate over lots of documents, so it's useful to shrink the number of documents. This only works when you have added searchable fields using labeled_document.field (...).
        
        **Implemented Classifiers**
        
          - NAIVEBAYES: Naive Bayes Probability, default guessing
          - FEATUREVOTE: Feature Voting Classifier
          - ROCCHIO: Rocchio Classifier
          - TFIDF: Max TFIDF Score
          - SIMILARITY: Max Cosine Similarity
          - MULTIPATH: Experimental Multi Path Classifier; terms of the document being classified are clustered into multiple sets by co-word frequency before guessing
          - META: merges and decides among multiple results guessed by the NAIVEBAYES, FEATUREVOTE and ROCCHIO classifiers
        
        If you need speed above all, NAIVEBAYES is a good choice. NAIVEBAYES is an old theory, but it still performs very well in both speed and accuracy if given a proper training set.
        
        For more detail on each classifier algorithm, please search the web.
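        
        For readers unfamiliar with the technique, here is a minimal multinomial naive Bayes with add-one smoothing in plain Python. It is a textbook sketch meant to show why the method is fast (only counting and a few logarithms), not DeLune's NAIVEBAYES implementation; the class name and its methods are hypothetical.
        
        .. code:: python
        
          import math
          from collections import Counter, defaultdict
          
          class TinyNaiveBayes:
              """Multinomial naive Bayes with add-one smoothing (illustration only)."""
          
              def __init__(self):
                  self.label_docs = Counter()              # label -> number of documents
                  self.term_counts = defaultdict(Counter)  # label -> term -> count
                  self.vocabulary = set()
          
              def add_document(self, label, terms):
                  self.label_docs[label] += 1
                  self.term_counts[label].update(terms)
                  self.vocabulary.update(terms)
          
              def guess(self, terms):
                  total_docs = sum(self.label_docs.values())
                  scores = {}
                  for label, ndocs in self.label_docs.items():
                      counts = self.term_counts[label]
                      denom = sum(counts.values()) + len(self.vocabulary)
                      score = math.log(ndocs / total_docs)  # log prior
                      for term in terms:
                          if term in self.vocabulary:       # skip unseen terms
                              score += math.log((counts[term] + 1) / denom)
                      scores[label] = score
                  # None when nothing has been learned yet
                  return max(scores, key=scores.get) if scores else None
          
          nb = TinyNaiveBayes()
          nb.add_document("classical", ["violin", "sonata", "mozart"])
          nb.add_document("classical", ["piano", "concerto"])
          nb.add_document("rock", ["guitar", "drums", "band"])
          print(nb.guess(["violin", "concerto"]))  # classical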
        
        
        **Optimizing Each Classifier**
        
        To give detailed options to a classifier, use setopt (classifier, option name = option value, ...).
        
        
        .. code:: python  
        
          classifier = mdl.get_classifier ()
          classifier.setopt (delune.ROCCHIO, topdoc = 200)
          
        The SIMILARITY and ROCCHIO classifiers basically have to compare with all indexed documents, but DeLune can compare with selected documents only, using the 'topdoc' option. That number of documents is selected by highest TFIDF score, for classification performance. The default topdoc value is 100. If you set it to 0, DeLune compares with all documents that have at least one of the features. In my experience, though, there is no critical difference except speed.
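        
        The idea behind topdoc can be sketched as a two-stage comparison: first a cheap pre-ranking keeps a limited number of candidate documents, then the more expensive cosine similarity is computed only for those. The code below is an illustration under stated assumptions, not DeLune's code: the function names are hypothetical, and plain term overlap stands in for the TFIDF score used for pre-ranking.
        
        .. code:: python
        
          import math
          from collections import Counter
          
          def cosine(a, b):
              """Cosine similarity of two term -> weight mappings."""
              dot = sum(w * b.get(t, 0) for t, w in a.items())
              norm_a = math.sqrt(sum(w * w for w in a.values()))
              norm_b = math.sqrt(sum(w * w for w in b.values()))
              return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
          
          def guess_by_similarity(query_terms, documents, topdoc=100):
              """documents is a list of (label, terms); returns the closest label."""
              query = Counter(query_terms)
              # keep only documents sharing at least one term with the query
              candidates = [(label, Counter(terms)) for label, terms in documents
                            if query.keys() & set(terms)]
              # cheap pre-ranking; term overlap stands in for the TFIDF score
              candidates.sort(key=lambda d: sum((query & d[1]).values()), reverse=True)
              if topdoc:  # 0 means compare with every candidate
                  candidates = candidates[:topdoc]
              if not candidates:
                  return None
              return max(candidates, key=lambda d: cosine(query, d[1]))[0]
          
          docs = [
              ("classical", ["violin", "sonata", "mozart"]),
              ("rock", ["guitar", "band"]),
              ("classical", ["piano", "sonata"]),
          ]
          print(guess_by_similarity(["violin", "sonata"], docs, topdoc=1))  # classical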
        
        Currently available options are:
        
        * ALL
        
          - verbose = False
        
        * ROCCHIO
        
          - topdoc = 100
        
        * MULTIPATH
        
          + subcl = [ FEATUREVOTE (default) | NAIVEBAYES | ROCCHIO ]
          + scoreby = [ IG (default) | MI | OR | R ]
          + choiceby = [ AVG (default) | MIN ], which value to use when scoring between a term and each term in a cluster
          + threshold = 1.0, float value for creating a new cluster. It is measured with Information Gain, and its range differs somewhat with the number of training documents.
        
        
        Document Cluster
        -----------------
        
        TODO
        
        .. code:: python  
        
          cluster = mdl.get_dcluster ()
          
        
        Term Cluster
        -------------
        
        TODO
        
        .. code:: python  
        
          cluster = mdl.get_tcluster ()
          
            
        
        Handling Multiple Searchers & Classifiers
        ===========================================
        
        When creating multiple searchers and classifiers, delune.task might be useful.
        Here's a script named 'config.py':
        
        .. code:: python
        
          import delune
          from delune.lib import logger
          
          def start_delune (numthreads, logger):    
            delune.configure (numthreads, logger)
                
            analyzer = delune.standard_analyzer ()
            col = delune.collection ("./data1", delune.READ, analyzer)
            delune.assign ("data1", col.get_searcher (max_result = 2000))
            
            analyzer = delune.standard_analyzer (max_term = 1000, stem = 2)
            mdl = delune.model ("./data2", delune.READ, analyzer)
            delune.assign ("data2", mdl.get_classifier ())
          
        The first argument of assign () is the alias for the searcher or classifier.
        
        Once you call config.start_delune () in any script, you can just import delune and use it in other Python scripts.
        
        .. code:: python
        
          import delune
          
          delune.query ("data1", "mozart sonatas")
          delune.guess ("data2", "mozart sonatas")
          
          # close and resign  
          delune.close ("data1")
          delune.resign ("data1")
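        
        The alias mechanism relies on ordinary Python behaviour: a module is imported once per process, so state kept at module level is visible to every script that imports it. A minimal sketch of that pattern follows. It is not DeLune's implementation, and ``assign``, ``get`` and ``resign`` here are hypothetical stand-ins defined in the example itself.
        
        .. code:: python
        
          # hypothetical module-level registry, for illustration only
          _resources = {}
          
          def assign(alias, resource):
              """Register an object (e.g. a searcher or classifier) under an alias."""
              _resources[alias] = resource
          
          def get(alias):
              """Look up the object registered under an alias."""
              return _resources[alias]
          
          def resign(alias):
              """Drop the alias and hand the object back so the caller can close it."""
              return _resources.pop(alias, None)
          
          assign("data1", {"kind": "searcher"})  # a dict stands in for a real searcher
          print(get("data1")["kind"])            # searcher
          print(resign("data1") is not None)     # True
          print(resign("data1"))                 # None, the alias is already gone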
        
        
        At the end of your app, call delune.shutdown ().
          
        .. code:: python
        
          import delune
          
          delune.shutdown ()
        
        
        Links
        ======
        
        - `GitLab Repository`_
        - Bug Report: `GitLab issues`_
        
        .. _`GitLab Repository`: https://gitlab.com/hansroh/delune
        .. _`GitLab issues`: https://gitlab.com/hansroh/delune/issues
        
        
        
        Change Log
        ============
          
          DeLune
          
          0.3 (Sep 15, 2017)
          
          - add index field aliasing to document
          - add string range searching, add new field type: ZFn
          - add multiple documents storing feature. As a result, DeLune has read-only access to Wissen collections
          
          0.2 (Sep 14, 2017)
          
          - fix minor bugs
          
          0.1 (Sep 13, 2017)
          
          - change package name from Wissen to DeLune
          
          Wissen Period
          
          0.13
          
          - fix using lock
          - add truncate collection API
          - fix updating document
          - change replicating way to use sticky session connection with origin server
          - fix file creation mode on posix
          - fix using lock with multiple workers
          - change wissen.document method names
          - fix index queue file locking
          
          0.12 
          
          - add biword arg to standard_analyzer
          - change export package name from appack to package
          - add Skito-Saddle app
          - fix analyzer.count_stopwords return value
          - change development status to Alpha
          - add wissen.assign(alias, searcher/classifier) and query(alias), guess(alias)
          - fix threads count and memory allocation
          - add example for Skitai App Engine app to manual
          
          0.11 
          
          - fix HTML strip and segment merging etc.
          - add MULTIPATH classifier
          - add learner.optimize ()
          - make learner.build & learner.train efficient
          
          0.10 - change version format, remove all str*_s ()
          
          0.9 - support Python 3.x
        
          0.8 - change license from BSD to GPL V3
        
Platform: posix
Platform: nt
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Indexing
