==============================
Fixing DistKV network problems
==============================

As the DistKV network is fully asynchronous, there's no way to avoid
getting into trouble – there's no arbitration of inconsistent data.

This document explains how to get back out, if necessary.

Missing data
============

See the `Server protocol <server_protocol>` for details on how DistKV
works. From that document it's obvious that when a node increments its
``tick`` but the associated data gets lost (e.g. if the node or its Serf
agent crashes), you have a problem.

Worse: a server will not start if the "missing" list is non-empty. The
problem is that stale data causes difficult-to-resolve inconsistencies
when written to. TODO: allow the server to be in maintainer-only mode when
that happens.

First, run ``distkv client internal state -ndmrk``. Your output will look
somewhat like this::

    deleted:  # Ticks known to be deleted
      test1:
      - 12
    known:  # Ticks known to be superseded
      test1:
      - 1
      - - 3
        - 10
      test2:
      - 1
    missing:  # Ticks we need to worry about
      test1:
      - 2
    node: test1  # the server we just asked
    nodes:  # all known nodes and their ticks
      test1: 12
      test2: 1
    remote_missing: {}  # used in recovery
    tock: 82  # DistKV's global event counter
    
This is not healthy: The ``missing`` element contains data. You can
manually mark the offending data as stale::

   one $ distkv client internal mark test1 2
   known:
      test1:
      - - 1
        - 11
      test2:
      - 1
    node: test1
    tock: 92  # If this is not higher than before, clean your glasses ;-)
    one $

This shows that the offending ``tick`` has been successfully added to the
``known`` list. Calling ``distkv client internal state -m`` verifies that
the list is now empty.

Use the ``--broadcast`` flag to send this message to all DistKV servers,
not just the one you're a client of.

This action will allow the bad record to re-surface when the node that has
the record reconnects, assuming that there is one. You can use the ``mark``
command's ``--deleted`` flag to ensure that it will be discarded instead.

