ulif.openoffice.cachemanager -- A cache manager
===============================================

A cache manager tries to cache converted files, so that already
converted documents do not have to be converted again.

Cache Manager
=============

A cache manager expects a ``cache_dir`` parameter where it can store
the cached files. If this parameter is set to ``None`` no caching will
be performed at all:

    >>> from ulif.openoffice.cachemanager import CacheManager
    >>> cm = CacheManager(cache_dir=None)

If we pass a path that already exists and is a file, the cache
manager will complain but will still be constructed.

If we pass a path that does not exist, it will be created:

    >>> ls('home')

    >>> cm = CacheManager(cache_dir='home/mycachedir')
    >>> ls('home')
    d  mycachedir

The cache manager can register files, look for already created
conversions and pass them back if found.

We look up a document which, as the cache is still empty, cannot be
found. We create a dummy file for this purpose:

    >>> import os
    >>> open('dummysource.doc', 'w').write('Just a dummy file.')
    >>> docsource = os.path.abspath('dummysource.doc')
    >>> docsource_contents = open(docsource, 'r').read()

    >>> cm.contains(docsource, suffix='pdf')
    False

We can also pass the file contents as an argument (the example is
currently commented out):

    >>> #cm.contains(extension = 'pdf', data = docsource_contents)

    False

The cache is keyed on MD5 sums of source files. As the directory
listings further down show, a copy of each source is kept next to its
results.
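
As an illustration, the MD5 sum of a source file can be computed with
the standard library alone. The helper name ``md5_of_file`` and its
``chunk_size`` parameter are assumptions made for this sketch; they
are not part of ``CacheManager``:

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, 'rb') as fd:
        # Read in chunks so that large documents need not fit into memory.
        for chunk in iter(lambda: fd.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string has 32 hexadecimal characters, like the directory
names that appear in the listings of the cache directory.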

We can pass ``level`` to the constructor if we want a directory level
other than 1:

    >>> cm = CacheManager(cache_dir='home/mycachedir', level=2)
    >>> cm.level
    2

This results in a different organization of the cached files and
directories inside the caching directory. See the section below to
learn more about this purely internal feature.

Setting the level after creation of a cache manager is not
recommended.
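
To give a rough idea of what a directory level could mean, here is a
sketch of how a hash might be mapped to a nested directory. With level
1 it reproduces the layout seen in the listings in this document (the
first two characters of the hash, then the full hash). How deeper
levels are really laid out is not shown here, so the rule "each level
consumes the next two characters" is purely an assumption, as is the
helper name ``hash_dir_path``:

```python
import os


def hash_dir_path(cache_dir, hash_digest, level=1):
    """Build the directory path for a hash (hypothetical helper)."""
    # Assumption: every level takes the next two characters of the digest.
    prefixes = [hash_digest[i * 2:i * 2 + 2] for i in range(level)]
    return os.path.join(cache_dir, *(prefixes + [hash_digest]))
```

Under this assumption, ``level=2`` would simply add one more
two-character directory between the cache directory and the hash
directory.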

Feeding the cache manager
=========================

We can register conversion results with the cache manager, which will
then be available later on.

Caching files
-------------

To demonstrate this, we create a dummy source file and a dummy
converted file:

    >>> import os
    >>> open('dummysource.doc', 'w').write('Just a dummy file.')
    >>> docsource = os.path.abspath('dummysource.doc')
    >>> docsource_contents = open(docsource, 'r').read()

    >>> open('dummyresult.pdf', 'w').write('I am not a real PDF.')
    >>> pdfresult = os.path.abspath('dummyresult.pdf')

Now we can create a cache manager and register our stuff:

    >>> cm = CacheManager(cache_dir='home/mycachedir')
    >>> cm.registerDoc(source_path=docsource,
    ...                to_cache=pdfresult)

This creates the needed directories inside the cache directory and
stores in them all contents of the directory in which the file to
cache resides:

    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/')
    -  data
    d  results
    d  sources

The 'data' file contains some pickled management information.

While in 'sources' all sources with the same hash are stored, the
'results' dir contains all results belonging to a certain source:

    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
    -  source_1


    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
    -  result_1_default

Caching files with a 'suffix'
-----------------------------

We can, however, also store a file with a certain 'suffix' in order to
cache several results for one source. For example, we might want to
cache a PDF and an HTML version of the same file.

To do so, we have to provide a suffix on doc registration:

    >>> cm.registerDoc(source_path=docsource,
    ...                to_cache=pdfresult,
    ...                suffix='pdf')

    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
    -  source_1


    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
    -  result_1__pdf
    -  result_1_default

The cache manager notices that the delivered source is the same as
the first time, and so only stores the new result, with the suffix in
its name.

This becomes more obvious when we register a certain result file as
an HTML result:

    >>> cm.registerDoc(source_path=docsource,
    ...                to_cache=pdfresult,
    ...                suffix='html')

    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/sources')
    -  source_1


    >>> ls('home/mycachedir/08/08867237840fabae77b838e9c9226eb2/results')
    -  result_1__html
    -  result_1__pdf
    -  result_1_default

It is up to the caller to choose any suffix she likes.
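
The file names in the listings above follow a simple pattern: a
running source number, followed either by ``_default`` or by two
underscores and the suffix. A minimal sketch of that naming rule,
inferred only from the listings (``result_name`` is a hypothetical
helper, not a ``CacheManager`` method):

```python
def result_name(source_num, suffix=None):
    """Return a result file name following the pattern seen in the listings."""
    if suffix is None:
        # No suffix given: the result is stored under the 'default' name.
        return 'result_%d_default' % source_num
    # With a suffix, two underscores separate source number and suffix.
    return 'result_%d__%s' % (source_num, suffix)
```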


Getting cache results
=====================

When we want to get the result for some input file, we can do so:

    >>> cm.getCachedFile(docsource)
    '/sample-buildout/home/mycachedir/.../results/result_1_default'

    >>> cm.getCachedFile(docsource, suffix='pdf')
    '/sample-buildout/home/mycachedir/.../results/result_1__pdf'

    >>> cm.getCachedFile(docsource, suffix='html')
    '/sample-buildout/home/mycachedir/.../results/result_1__html'

If a file has not been cached yet, we get ``None``:

    >>> cm.getCachedFile(docsource, suffix='blah') is None
    True
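
The lookup behaviour shown above (a path on a hit, ``None`` on a miss)
could be sketched roughly as follows. This is not the code of
``getCachedFile``; the function ``lookup_result`` and its parameters
are assumptions, and it expects the caller to already know the hash
directory and the source number:

```python
import os


def lookup_result(hash_dir, source_num, suffix=None):
    """Return the path of a cached result, or None if there is none."""
    if suffix is None:
        name = 'result_%d_default' % source_num
    else:
        name = 'result_%d__%s' % (source_num, suffix)
    path = os.path.join(hash_dir, 'results', name)
    # A miss is signalled by None instead of raising an exception.
    return path if os.path.isfile(path) else None
```

Returning ``None`` on a miss lets callers decide with a plain ``is
None`` check whether a new conversion is needed.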

Collision Handling
==================

The cache manager relies very much on hash (MD5) digests to find a
cached document quickly. However, hash collisions can occur.

We create a cache manager with a trivial hash algorithm to see this:

    >>> from ulif.openoffice.cachemanager import CacheManager
    >>> class NotHashingCacheManager(CacheManager):
    ...   def getHash(self, path=None):
    ...     return 'somefakedhash'

    >>> cm_dir = 'home/newcachedir'
    >>> cm = NotHashingCacheManager(cache_dir=cm_dir)

We create two sources to store:

    >>> import os
    >>> open('dummysource1.doc', 'w').write('Just a dummy file.')
    >>> open('dummysource2.doc', 'w').write('Another dummy file.')
    >>> docsource1 = os.path.abspath('dummysource1.doc')
    >>> docsource2 = os.path.abspath('dummysource2.doc')

Now we create some dummy result files and register them, two for each
source:

    >>> open('dummyresult1.pdf', 'w').write('Fake result 1')
    >>> open('dummyresult1.html', 'w').write('Fake result 2')
    >>> open('dummyresult2.pdf', 'w').write('Fake result 3')
    >>> open('dummyresult2.html', 'w').write('Fake result 4')
    >>> result1 = os.path.abspath('dummyresult1.pdf')
    >>> result2 = os.path.abspath('dummyresult1.html')
    >>> result3 = os.path.abspath('dummyresult2.pdf')
    >>> result4 = os.path.abspath('dummyresult2.html')

    >>> cm.registerDoc(source_path=docsource1,
    ...                to_cache=result1,
    ...                suffix='pdf')
    >>> cm.registerDoc(source_path=docsource1,
    ...                to_cache=result2,
    ...                suffix='html')
    >>> cm.registerDoc(source_path=docsource2,
    ...                to_cache=result3,
    ...                suffix='pdf')
    >>> cm.registerDoc(source_path=docsource2,
    ...                to_cache=result4,
    ...                suffix='html')

All these sources give the same hash and are therefore stored in the
same basket:

    >>> ls(cm_dir, 'so', 'somefakedhash', 'sources')
    -  source_1
    -  source_2

    >>> cat(cm_dir, 'so', 'somefakedhash', 'sources', 'source_1')
    Just a dummy file.

    >>> cat(cm_dir, 'so', 'somefakedhash', 'sources', 'source_2')
    Another dummy file.

All results are linked to their respective source via a number in the
filename:

    >>> ls(cm_dir, 'so', 'somefakedhash', 'results')
    -  result_1__html
    -  result_1__pdf
    -  result_2__html
    -  result_2__pdf

    >>> cat(cm_dir, 'so', 'somefakedhash', 'results', 'result_1__pdf')
    Fake result 1

    >>> cat(cm_dir, 'so', 'somefakedhash', 'results', 'result_2__pdf')
    Fake result 3
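
One way to tell colliding sources apart is to compare a candidate
file byte by byte with the stored ``source_N`` files and to use the
number of the matching one. The following is only a sketch of that
idea, not necessarily how ``CacheManager`` does it;
``find_source_number`` is a hypothetical helper:

```python
import filecmp
import os


def find_source_number(sources_dir, source_path):
    """Return N of the stored ``source_N`` equal to `source_path`, or None."""
    for name in sorted(os.listdir(sources_dir)):
        if not name.startswith('source_'):
            continue
        stored = os.path.join(sources_dir, name)
        # shallow=False forces a comparison of the actual file contents.
        if filecmp.cmp(stored, source_path, shallow=False):
            return int(name.split('_', 1)[1])
    # No stored source has the same contents.
    return None
```

With the two sources registered above, a file containing ``Another
dummy file.`` would match ``source_2``, so its results would be the
ones named ``result_2__...``.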
