Metadata-Version: 2.1
Name: thoth-storage
Version: 0.19.17
Summary: Storage and database adapters available in project Thoth
Home-page: https://github.com/thoth-station/storages
Author: Fridolin Pokorny
Author-email: fridolin@redhat.com
Maintainer: Francesco Murdaca
Maintainer-email: fmurdaca@redhat.com
License: GPLv3+
Platform: UNKNOWN
Requires-Dist: click
Requires-Dist: voluptuous
Requires-Dist: boto3
Requires-Dist: thoth-common
Requires-Dist: amun
Requires-Dist: python-dateutil
Requires-Dist: thoth-python
Requires-Dist: pyyaml
Requires-Dist: methodtools
Requires-Dist: sqlalchemy
Requires-Dist: psycopg2-binary
Requires-Dist: sqlalchemy-utils
Requires-Dist: alembic

Thoth Storages
--------------

This package provides a library called `thoth-storages
<https://pypi.org/project/thoth-storages>`_ used in project `Thoth
<https://thoth-station.ninja>`_.  The library exposes core queries and methods
for the PostgreSQL database as well as adapters for manipulating Ceph via its
S3-compatible API.

Installation and Usage
======================

The library can be installed via pip or Pipenv from
`PyPI <https://pypi.org/project/thoth-storages>`_:

.. code-block:: console

   pipenv install thoth-storages

The library does not provide any CLI; it is rather a low-level library
supporting other parts of Thoth.

You can run the prepared test suite via the following commands:

.. code-block:: console

  pipenv install --dev
  pipenv run python3 setup.py test

  # To generate docs:
  pipenv run python3 setup.py build_sphinx

Running PostgreSQL locally
==========================

You can use the `docker-compose.yaml` file present in this repository to run a local PostgreSQL instance (make sure you have `podman-compose <https://github.com/containers/podman-compose>`_ installed):

.. code-block:: console

  $ podman-compose up

After running the command above, you should be able to access a local PostgreSQL instance at `localhost:5432`. This is also the default configuration for the PostgreSQL adapter, so you don't need to provide `GRAPH_SERVICE_HOST` explicitly. The default configuration uses a database named `postgres` which can be accessed using the `postgres` user and the `postgres` password (SSL is disabled).
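
As a rough illustration of these defaults, the sketch below assembles a PostgreSQL connection URL from them. Only `GRAPH_SERVICE_HOST` is named in this document; the helper name and the constants are assumptions made for this example and are not part of the `thoth-storages` API.

.. code-block:: python

  import os

  # Defaults described above. The constant names are hypothetical.
  DEFAULT_HOST = "localhost"
  DEFAULT_PORT = 5432
  DEFAULT_USER = "postgres"
  DEFAULT_PASSWORD = "postgres"
  DEFAULT_DATABASE = "postgres"


  def local_connection_url(environ=None):
      """Build a PostgreSQL URL, letting GRAPH_SERVICE_HOST override the host."""
      environ = os.environ if environ is None else environ
      host = environ.get("GRAPH_SERVICE_HOST", DEFAULT_HOST)
      # SSL is disabled in the local setup, hence sslmode=disable.
      return (
          f"postgresql://{DEFAULT_USER}:{DEFAULT_PASSWORD}"
          f"@{host}:{DEFAULT_PORT}/{DEFAULT_DATABASE}?sslmode=disable"
      )

Such a URL can be handy when connecting with other tools, for example `psql`, to the same local instance.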

The provided `docker-compose.yaml` also has PGweb enabled to provide a UI for the database content. To access it, visit `http://localhost:8081/ <http://localhost:8081>`_.

The provided `docker-compose.yaml` does not use any volume. After your containers restart, the content will not be available anymore.

If you would like to experiment with PostgreSQL programmatically, you can use the following code snippet as a starting point:

.. code-block:: python

  from thoth.storages import GraphDatabase

  graph = GraphDatabase()
  graph.connect()
  # To clear database:
  # graph.drop_all()
  # To initialize schema in the graph database:
  # graph.initialize_schema()

Generating migrations and schema adjustment in deployment
=========================================================

If you make any changes to the data model of the main PostgreSQL database, you need
to generate migrations. These migrations state how to adjust an already existing
database that holds data in deployments. For this purpose, `Alembic migrations
<https://alembic.sqlalchemy.org>`_ are used. Alembic can (`partially
<https://alembic.sqlalchemy.org/en/latest/autogenerate.html#what-does-autogenerate-detect-and-what-does-it-not-detect>`_)
automatically detect what has changed and how to adjust an already existing
database in a deployment.

Alembic uses incremental version control, where each migration is versioned and
states how to migrate from the previous state of the database to the desired next
state. These versions are present in the `alembic/versions` directory and are
automatically generated with the procedure described below.
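
To make the idea of incremental versions concrete, here is a toy model of a linear revision history in which each revision records its parent. It is only a sketch for illustration: the revision ids are hypothetical, and this is not how Alembic is implemented (Alembic also supports branching histories).

.. code-block:: python

  def upgrade_path(revisions, current, target):
      """Return the revision ids to apply, oldest first, to reach target.

      ``revisions`` maps a revision id to its parent id (None for the first
      revision). ``current`` is None for an empty database.
      """
      path = []
      cursor = target
      # Walk from the target back to the current revision via parent links.
      while cursor != current:
          if cursor is None:
              raise ValueError(f"{current!r} is not an ancestor of {target!r}")
          path.append(cursor)
          cursor = revisions[cursor]
      path.reverse()
      return path


  # Hypothetical revision ids, oldest to newest: a1 -> b2 -> c3
  revisions = {"a1": None, "b2": "a1", "c3": "b2"}
  pending = upgrade_path(revisions, current="a1", target="c3")

In this model, a database at revision ``a1`` needs ``b2`` and then ``c3`` applied, in that order, to reach the newest state.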

If you make any changes, follow the steps below to generate a new version:

1. Make sure your local PostgreSQL instance is running (follow the `Running
   PostgreSQL locally` instructions above):

  .. code-block:: console

    $ podman-compose up

2. Run Alembic CLI to generate versions for you:

  .. code-block:: console

    # Make sure you have your environment setup:
    # pipenv install --dev
    # Make sure you are running the most recent version of schema:
    $ PYTHONPATH=. pipenv run alembic upgrade head
    # Actually generate a new version:
    $ PYTHONPATH=. pipenv run alembic revision --autogenerate -m "Added row to calculate sum of sums which will be divided by 42"

3. Review migrations generated by Alembic. Note `NOT all changes are
   automatically detected by Alembic
   <https://alembic.sqlalchemy.org/en/latest/autogenerate.html#what-does-autogenerate-detect-and-what-does-it-not-detect>`_.

4. Make sure generated migrations are part of your pull request so changes are
   propagated to deployments:


  .. code-block:: console

    $ git add thoth/storages/data/alembic/versions/

5. In a deployment, use Management API and its `/graph/initialize` endpoint to
   propagate database schema changes (Management API has to have
   recent schema changes present, which are delivered with new `thoth-storages`
   releases).

6. If you are running locally and would like to propagate changes, run the following Alembic command to upgrade the schema to the latest version:

  .. code-block:: console

    $ PYTHONPATH=. pipenv run alembic upgrade head


  If you would like to update the schema programmatically, run the following Python code:

  .. code-block:: python

    from thoth.storages import GraphDatabase

    graph = GraphDatabase()
    graph.connect()
    graph.initialize_schema()

Generate schema images
======================

You can use the shipped ``thoth-storages`` CLI to automatically generate schema images out of the current models:

.. code-block:: console

  # First, make sure you have dev packages installed:
  pipenv install --dev
  PYTHONPATH=. pipenv run python3 ./thoth-storages generate-schema

The command above will produce two images named ``schema.png`` and
``schema_cache.png``. The first PNG file shows the schema of the main PostgreSQL
instance and the latter one, as the name suggests, shows what the cache schema
looks like.


If the command above fails with the following exception:

.. code-block:: python

  FileNotFoundError: [Errno 2] "dot" not found in path.

make sure you have the `graphviz` package installed:

.. code-block:: console

  dnf install -y graphviz

Creating own performance indicators
===================================

You can create your own performance indicators. To do so, create a script
which tests the desired functionality of a library. An
example is the matrix multiplication script present in the `performance
<https://github.com/thoth-station/performance/blob/master/tensorflow/matmul.py>`_
repository. This script can be supplied to Dependency Monkey to validate a
certain combination of libraries in the desired runtime and buildtime environment,
or directly to Amun API, which will run the given script using the desired software
and hardware configuration. Please follow the instructions on how to create a
performance script shown in the `README of performance repo
<https://github.com/thoth-station/performance>`_.

To create relevant models, adjust the `thoth/storages/graph/models_performance.py` file
and add your model. Describe parameters (reported in the `@parameters` section of
the performance indicator result) and the result (reported in `@result`). The name of
the class should match the `name` which is reported by the performance indicator run.

.. code-block:: python

  class PiMatmul(Base, BaseExtension, PerformanceIndicatorBase):
      """A class for representing a matrix multiplication micro-performance test."""

      # Device used during performance indicator run - CPU/GPU/TPU/...
      device = Column(String(128), nullable=False)
      matrix_size = Column(Integer, nullable=False)
      dtype = Column(String(128), nullable=False)
      reps = Column(Integer, nullable=False)
      elapsed = Column(Float, nullable=False)
      rate = Column(Float, nullable=False)

All the models use `SQLAlchemy <https://www.sqlalchemy.org/>`_.
See `docs <https://docs.sqlalchemy.org/>`_ for more info.
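
To show how a reported result relates to the model columns, the following sketch merges the `@parameters` and `@result` sections of a result document into keyword arguments matching the ``PiMatmul`` columns above. The document layout, the split of fields between the two sections, the values and the helper name are all assumptions made for this example; check the output of a real performance indicator run for the actual shape.

.. code-block:: python

  def model_kwargs(document):
      """Merge ``@parameters`` and ``@result`` into keyword arguments."""
      parameters = document["@parameters"]
      result = document["@result"]
      # A key reported in both sections would be ambiguous, so refuse it.
      overlap = set(parameters) & set(result)
      if overlap:
          raise ValueError(f"Keys reported twice: {sorted(overlap)}")
      return {**parameters, **result}


  # Hypothetical result document with made-up values.
  document = {
      "name": "PiMatmul",
      "@parameters": {
          "device": "cpu",
          "matrix_size": 512,
          "dtype": "float32",
          "reps": 2000,
      },
      "@result": {"elapsed": 1.5, "rate": 0.75},
  }
  kwargs = model_kwargs(document)

The keys of ``kwargs`` then line up one-to-one with the column names declared on the model class.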

Online debugging of queries
===========================

You can log all the queries that are sent to a PostgreSQL instance. To do so, set the following environment variable:

.. code-block:: console

  export THOTH_STORAGES_DEBUG_QUERIES=1

Logging adapter and cache statistics
====================================

You can log information about the PostgreSQL adapter together with statistics on
the graph cache and memory cache usage (the logger has to have at least level
`INFO` set). To do so, set the following environment variable:

.. code-block:: console

  export THOTH_STORAGES_LOG_STATS=1

These statistics will be printed once the database adapter is destructed.
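
If exporting variables in a shell is inconvenient, the same settings can be applied from Python. The sketch below assumes that the variables are read when the adapter is created or used, so it sets them first; that timing is an assumption, not something this document states.

.. code-block:: python

  import logging
  import os

  # Assumption: set the variables before the database adapter is created.
  os.environ["THOTH_STORAGES_DEBUG_QUERIES"] = "1"
  os.environ["THOTH_STORAGES_LOG_STATS"] = "1"

  # Statistics are logged at INFO level, so the root logger must let
  # INFO messages through.
  logging.basicConfig()
  logging.getLogger().setLevel(logging.INFO)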

Creating backups from Thoth deployment
======================================

You can use the `pg_dump` and `psql` utilities to create dumps and restore
the database content from dumps. These tools are pre-installed in the container image
which is running PostgreSQL, so the only thing you need to do is execute
`pg_dump` in a PostgreSQL container in Thoth's deployment to create a dump, use
`oc cp` to retrieve the dump (or directly use `oc exec` and create the dump from the
cluster) and subsequently use `psql` to restore the database content. The
prerequisite for this is to have access to the running container (edit rights).

.. code-block:: console

  # Execute the following commands from the root of this Git repo:
  # List PostgreSQL pods running:
  $ oc get pod -l name=postgresql
  NAME                 READY     STATUS    RESTARTS   AGE
  postgresql-1-glwnr   1/1       Running   0          3d
  # Open remote shell to the running container in the PostgreSQL pod:
  $ oc rsh -t postgresql-1-glwnr bash
  # Perform dump of the database:
  (cluster-postgres) $ pg_dump > pg_dump-$(date +"%s").sql
  (cluster-postgres) $ ls pg_dump-*.sql   # Remember the current dump name
  (cluster-postgres) pg_dump-1569491024.sql
  (cluster-postgres) $ exit
  # Copy the dump to the current dir:
  $ oc cp thoth-test-core/postgresql-1-glwnr:/opt/app-root/src/pg_dump-1569491024.sql  .
  # Start local PostgreSQL instance:
  $ podman-compose up --detach
  <logs will show up>
  $ psql -h localhost -p 5432 --username=postgres < pg_dump-1569491024.sql
  password: <type password "postgres" here>
  <logs will show up>

Syncing results of jobs run in the cluster
==========================================

Each job in the cluster reports a JSON document which states necessary information
about the job run (metadata) and the actual job results. These results are stored
on the object storage `Ceph <https://ceph.io/>`_ via its S3-compatible API and are
later synced via graph syncs to the knowledge graph. The component responsible for
graph syncs is `graph-sync-job
<https://github.com/thoth-station/graph-sync-job>`_, which is written generically
enough to sync any data and report metrics about synced data, so you don't need
to provide such logic for each new workload registered in the system. To sync
results of your own jobs (workloads) run in the cluster, implement the
related syncing logic in `sync.py
<https://github.com/thoth-station/storages/blob/master/thoth/storages/sync.py>`_
and register a handler in ``_HANDLERS_MAPPING`` in the same file. The mapping
maps a prefix of the document id to the handler (function) which is responsible
for syncing data into the knowledge base (please mind the signatures of existing
syncing functions to automatically integrate with the ``sync_documents`` function
which is called from ``graph-sync-job``).
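
The prefix-based dispatch described above can be sketched as follows. Only the name ``_HANDLERS_MAPPING`` comes from this document; the prefix, the handler name, its signature and the lookup helper are hypothetical, so consult the existing functions in `sync.py` for the real signatures.

.. code-block:: python

  def sync_my_workload_documents(document_ids):
      """Hypothetical handler: pretend to sync documents, return how many."""
      return len(document_ids)


  # Maps a document id prefix to the function responsible for syncing it.
  _HANDLERS_MAPPING = {
      "my-workload-": sync_my_workload_documents,
  }


  def find_handler(document_id):
      """Return the handler registered for the prefix of ``document_id``."""
      for prefix, handler in _HANDLERS_MAPPING.items():
          if document_id.startswith(prefix):
              return handler
      raise ValueError(f"No handler registered for document {document_id!r}")


  handler = find_handler("my-workload-1a2b3c")
  synced = handler(["my-workload-1a2b3c"])

Keeping the lookup keyed on the prefix means a new workload only needs a new entry in the mapping, with no changes to the calling code.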



