Metadata-Version: 2.1
Name: correlate
Version: 0.5
Summary: Correlates two sets of data by matching unique or fuzzy
Home-page: https://github.com/larryhastings/correlate/
License: UNKNOWN
Author: Larry Hastings
Author-email: larry@hastings.org
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3 :: Only

# correlate

## A clever brute-force correlator for kinda-messy data

##### Copyright 2019-2020 by Larry Hastings


## Overview

Let's say you have two sets of data that really represent the same data,
just in different forms.
As an example, maybe your first dataset is Wikipedia's list of all episodes of a TV
show, and your second dataset is a directory full of video files of
that TV show.  The episode *"Neurostim"* appears in both datasets but
it's represented differently in each.

Now let's say you want to match them up with each other--you want
to match up the values in the first dataset with their equivalents
in the second dataset. The thing is, it's real-world data--and
it's probably a little messy.  Perhaps the two datasets aren't in
the exact same order.  And while some matches are obvious, others are less so.
Maybe one dataset has some values not present in the other and vice-versa.

What do you do?  Sure, you could correlate the two datasets by hand.
But what if the datasets are really big?
And what do you do if they get updated?  Do you want to update the
correlation by hand too?

**correlate** solves this problem for you.  It correlates values between
two messy but strongly-related datasets.

How it works: you submit your two datasets to
**correlate**, showing it each value and the "keys" that map to that value.
You then shen set it to work.  It thinks for a while, then produces its best
guess as to how to match the two sets.  And its best guess is... hey, that's
pretty good!

In essense, **correlate** uses the uniqueness of keys as clues to pair up
its matches.  If there's a key present in both datasets, but it only maps
to one value in each dataset, odds are good that those two values
are an excellent match.

### A Real-Life Example

There's a podcast I like.  I download it as MP3 files
using an RSS feed, 1990s-style. But the metadata in the RSS
feed is junk--the episode titles are inconsistent,
and the episode numbers are almost wholly absent.

This podcast also has a
list of episodes on its website. This data is *much* cleaner,
including nice proper (unique!) episode numbers.  And it's easily
scraped.  But it's still not perfect.
The two lists of episodes aren't exactly the same, and even the episodes
that are present in both are sometimes reordered.

Obviously, I want to take the MP3s from the RSS feed,
and match them up with the nice clean metadata scraped from the website.
This gets me the best of both worlds.

But there are more than *six hundred* episodes of this
particular podcast!  Matching those by hand would be a *lot* of work.
And we get a new episode every week.
And sometimes they actually add back in old episodes, or update the
metadata on old episodes--changes which would mess up any hand-built
ordering.  And I might want to listen to more than one podcast from
this same website someday!
So I really didn't want to do all of this by hand.

Happily, after applying just a bit of intelligence to the two
datasets, **correlate** did a perfect job.

### Why correlate Works So Well

The insight that inspired **correlate** is this:
unique keys in the two datasets are probably very good matches.
Let's say the key `"egyptian"` maps to value *A1* in `dataset_a`
and value *B1*  in `dataset_b`--and it *only* maps to those two
values.  In that case, *A1* and *B1* are probably a match.

This leads to a virtuous cycle.
Let's say the word `"nickel"` maps to two values in each of the
two datasets: *A1* and *A2*, and *B1* and *B2*.
We could match those four values in two possible ways:
*A1* -> *B1* and *A2* -> *B2*,
or *A1* -> *B2* and *A2* -> *B1*.
But the key `"egyptian"` already showed that *A1* and *B1* are a
good match.  If we've already removed those two values from
consideration, we're left with *A2* -> *B2*.
And now *that* looks like a good match, too.

In short, **correlate** capitalizes on the *relative uniqueness*
of keys.


## Getting Started With correlate

### Requirements

**correlate** requires Python 3.6 or newer.  It has no
other dependencies.

If you want to run the **correlate** test suite,
you'll need to install the `rapidfuzz` package.


### The High-Level Conceptual Model

To correlate two datasets with **correlate**,
you first create a `correlate.Correlator` object.
This object contains two members
`dataset_a` and `dataset_b`; these represent
the two datasets you want to correlate.

You fill each dataset with *keys* and *values*.

A *value* is (nearly) any Python object.
Each value should represent one value from your dataset.
**correlate** doesn't examine your values--they're completely
opaque to **correlate**.

A *key* is a Python object that represents some metadata
about a value.  Keys "map" to values; **correlate** examines
keys, using matching keys in the two datasets to
match up the values between them.

Keys are usually strings--for example, the individual words
from the title of some movie, TV show, song, or book.
But keys don't have to be strings.  Instances of lots of
Python data types can be used as keys.

Once you've filled in the `correlate.Correlator` object
with your data, you call its `correlate()` method.  This
computes the matches.  It returns a
`correlate.CorrelateResult` containing
those matches, and lists of any objects from
the two datasets that didn't get matched.

The matches are returned as a list of `correlator.CorrelatorMatch`
objects.  Each object contains three members:

* `value_a`, a reference to an object from `dataset_a`,

* `value_b`, a reference to an object from `dataset_b`,

* and a floating-point `score`.

Each `CorrelatorMatch` object tells you that
**correlator** thinks that this `value_a`
maps to this `value_b`.  The `score` is a sort of mathematical
confidence level--it's a direct result of the keys and
other metadata you provide to **correlate**.  The list of
matches is sorted by `score`--higher scores first, as higher
scores represent higher confidence in the match.

That's the basics.  But **correlate** supports some very sophisticated
behavior:

* When mapping a key to a value, you may specify an optional *weight*,
which represents the relative importance of this key.  The default
weight is 1.  Higher scores indicate a higher significance; a weight of
2 tells **correlate** that this key mapped to this value is twice as
significant.

* A key can map to a value multiple times.  Each mapping can have its own weight.

* If both datasets are ordered, this ordering can optionally influence the match scores.
**correlate** calls this *ranking.*

* Keys can be "fuzzy", meaning two keys can be a partial match rather than a binary yes/no.
Fuzzy keys in **correlate** must inherit from a custom abstract base class called
`correlate.FuzzyKey`.

## Sample Code And infer_mv

If you want to get a feel for what it's like to work with **correlate**,
the package ships with some sample code you can inspect.  Take a look
at the scripts in the `tests` and `utilities` directories.

In particular, `utilities` contains a script called `infer_mv`.
`infer_mv` takes a source directory and a list of files and directories
to rename, and produces a mapping from the former to the latter.
In other words, when you run it, you're saying
"here's a source directory and a list of files and directories to rename.
For each file in the list of things to rename, find the
filename in the source directory that most closely resembles that file,
and rename the file so it's exactly like that other filename from the
source directory."  (If you ask `infer_mv` to rename a directory,
it renames all the files and directories inside that directory, recursively.)

This is useful if, for example, you have a directory where
you've already renamed the files the way you you like them, but then
you get a fresh copy from somewhere.  Simply run `infer_mv` with your
existing directory as the "source directory" and the fresh copy
as the "files".  `infer_mv` will figure out how to rename the
fresh files so they have the filenames how you like them.

Note that `infer_mv` doesn't actually do the work of renaming!
Instead, `infer_mv` prints out a *shell script* that, if executed,
performs the renames.
Why?  It's always a good idea to check over the output of **correlate**
before you commit to it.

You should use `infer_mv` like so:

    % infer_mv ../old_path *
    # look at output, if it's all fine run
    % infer_mv ../old_path * | sh

Or you can direct the output of `infer_mv` into a file, then
edit the file, then execute that.  Or something else!
Whatever works for you!


## Terminology And Requirements

### Values

Values are Python objects that represent individual elements of your two
datasets.  **correlate** doesn't examine values, and it makes very few
demands on them.  Here are the rules for values:

* Values must support `==`.
* Value comparison must be *reflexive,* *symmetric,* *transitive,* and *consistent*.
  For all these examples, `a` `b` and `c` represent values:
    * *reflexive:* A value must always compare as equal to itself.  `a == a` must evaluate to `True`.
    * *symmetric:* If `a == b` is `True`, then `b == a` must also be `True`.
    * *transitive:* If `a == b` is `True`, and `b == c` is `True`, then `a == c` must also be `True`.
    * *consistent:* If `a == b` is `True`, it must always be `True`,
       and if `a == b` is `False` it must always be `False`.

### Keys

Keys are Python objects that **correlate** uses to find matches between
the two datasets.  If a key maps to a value in `dataset_a` and also
a value in `dataset_b`, those two values might be a good match.

Keys must obey all the same rules as values.  In addition,
keys must be *hashable.*

#### Exact Keys

An "exact" key is what **correlate** calls any key that isn't a "fuzzy" key.
Strings, integers, floats, complex, `datetime` objects--they're all fine to use
as **correlate** keys, and instances of many more types too.

When considering matches, exact keys are binary--either they're an exact match
or they don't match at all.  If you need to understand partial matches you'll have
to use "fuzzy" keys.


#### Fuzzy Keys

A "fuzzy" key is a key that supports a special protocol for performing "fuzzy"
comparisons--comparisons where the result can represent imperfect or partial matches.

Technically speaking, a **correlate** "fuzzy" key
is an instance of a subclass of `correlate.FuzzyKey`.  If a key is an instance of
a subclass of that base class, it's a "fuzzy" key, and if it isn't, it's an "exact" key.

Fuzzy keys must follow the rules for keys above.
Also, the type of your fuzzy keys must also obey the same rules as keys;
they must be hashable, they must support `==`,
and their comparison must be reflexive, symmetric, transitive, and consistent.

In addition, fuzzy keys must support a method called `compare` with this signature:
`self.compare(other)`.  `other` will be another fuzzy key of the same type.  Your `compare`
function should return a number between (and including) `0` and `1`, indicating how close
a match `self` is to `other`.  If `compare` returns `1`, it's saying this is a perfect
match, that the two values are identical; if it returns `0`, it's a perfect mismatch,
telling **correlate** that the two keys have nothing in common.

**correlate** requires that `compare` also obey the four mathematical constraints required
of comparisons between keys.  In the following rules, `a` and `b` are fuzzy keys of the
same type.  `compare` must conform to these familiar four rules:

* *reflexive:* `a.compare(a)` must return `1` (or `1.0`).
* *symmetric:* If `a.compare(b)` returns *x*, then `b.compare(a)` must also return *x*.
* *transitive:* If `a.compare(b)` returns *x*, and `b.compare(c)` returns *x*,
  then `a.compare(c)` must also return *x*.
* *consistent:* If `a.compare(b)` returns *x*, it must *always* return *x*.

It's important to note: fuzzy keys of two *different* types are automatically
considered different to each other.  **correlate** won't even bother calling
`compare()` on them--it automatically assigns the comparison a fuzzy score of `0`.
This is true even for subclasses; if you declare `class MyFuzzyNumber(correlate.FuzzyKey)`
and also `class MyFuzzyInteger(MyFuzzyNumber)`,
**correlate** will never compare an instance of `MyFuzzyNumber` and `MyFuzzyKey`
to each other--it automatically assumes they have nothing in common.

(Internally
**correlate** stores fuzzy keys of different types segregated from each other.
This is a vital optimization!)

On a related note, **correlate** may optionally never
*actually* call `a.compare(a)`, either.  That is, if the exact same key
maps to a value in both `dataset_a` and `dataset_b`, **correlate**
is permitted to skip calling `compare()` and instead automatically
assign the comparison a fuzzy score of `1`.  **correlate** currently
does this--but this is not guaranteed,
as it's only a small optimization, and conditions may change.


## API

`Correlator(default_weight=1)`

> The correlator class.  `default_weight` is the weight used
> when you map a key to a value without specifying an explicit weight.

`Correlator.dataset_a`

`Correlator.dataset_b`

> Instances of `Correlator.Dataset` objects representing the two sets of data you want to correlate.  Initially empty.

`Correlator.datasets`

> A list containing the two datasets: `[dataset_a, dataset_b]`


`Correlator.correlate(*,
            minimum_score=0,
            score_ratio_bonus=1,
            ranking=BestRanking,
            ranking_bonus=0,
            ranking_factor=0,
            key_reuse_penalty_factor=1,
            reuse_a=False,
            reuse_b=False)`

> Correlates the two datasets.  Returns a `correlate.CorrelatorResult` object.
>
> `minimum_score` is the minimum permissible score for a match.  It must be
> greater than or equal to 0.
>
> `score_ratio_bonus` specifies the weight of a bonus awarded to a match based on the ratio of
> the actual score computed between these two values divided by the maximum possible score.
>
> `ranking` specifies which approch to computing ranking **correlate** should use.
> The default value of `BestRanking` means **correlate** will try all approaches
> and choose the one with the highest cumulative score across all matches.
> Other values include `AbsoluteRanking` and `RelativeRanking`.
>
> `ranking_bonus` specifies the weight of the bonus awarded to a match
> based on the proximity of the two values in their respective datasets, as specified
> by their rankings.  The closer the two values are to the same position in their
> respective datasets, the higher a percentage of the `ranking_bonus` will be awarded.
>
> `ranking_factor` specifies the ratio of the base score of a match that is multiplied
> by the proximity of the two values in their respective datasets.  If you ues `ranking_factor=0.4`,
> then a match only automatically keeps 60% of its original score; some percentage
> of the remaining 40% will be re-awarded based on the proximity of the two values.
>
> (You can't use both a nonzero `ranking_bonus` and a nonzero `ranking_factor` in the
> same correlation.  Pick at most one!)
>
> `key_reuse_penalty_factor` is a multiplier applied to the score calculated
> for a key each time a key is re-mapped to a value.  The second time a key is used,
> its score is multiplied by `key_reuse_penalty_factor`; the third time,
> by `key_reuse_penalty_factor**2`, and so on.
> The default value of 1 means every use of a key gets the same score.
> `key_reuse_penalty_factor` should be greater than or equal to 0, and less than or equal to 1.
>
> `reuse_a` permits values in `dataset_a` to be matched to more than one value in `dataset_b`.
> `reuse_b` is the same but for values in `dataset_b` matching `dataset_a`.
> If you set both reuse flags to True, the `correlate.CorrelatorResult.matches`
> list returned will contain *every* possible match.

`Correlate.Dataset()`

> The class for objects representing a dataset.  Behaves somewhat like
> a write-only dict.

`Correlator.Dataset.set(key, value, weight=default_weight)`

> Adds a new correlation.
>
> You can use `Dataset[key] = value` as a shortcut for `Dataset.set(key, value)`.

`Correlator.Dataset.set_keys(keys, value, weight=default_weight)`

> Map multiple keys to a single value, all using the same weight.
> `keys` must be an iterable containing keys.

`Correlator.Dataset.value(value, *, ranking=None)`

> Annotates a value with extra metadata.  Currently only one metadatum
> is supported: `ranking`.
>
> `ranking` represents the position of this value in the dataset,
> if the dataset is ordered.  `ranking` should be an integer
> representing the ranking; if this value is the 19th in the dataset,
> you should supply `ranking=19`.

`Correlator.str_to_keys(s)`

> A convenience function.
> Converts string `s` into a list of string keys using a reasonable approach.
> Lowercases the string, converts some common punctuation into spaces, then splits
> the string at whitespace boundaries.  Returns a list of strings.


### Getting Good Results Out Of Correlate

Unfortunately, you can't always expect perfect results with **correlate**
every time.  You'll usually have to play with it at least a little.

#### Ranking

Naturally, the first step with **correlate** is to plug in your data.
I strongly encourage you to add ranking information if possible.

If the two datasets are ordered, and equivalent items should appear in
roughly the same place in each of the two datasets, ranking information
can make a *sizeable* improvement in the quality of your matches.
To use ranking information, you set the `ranking` for each value in each dataset
that you can, and specify either `ranking_bonus` or `ranking_factor` when
running `correlate()`.
Which one you use kind of depends on how much confidence you have in the
ordering of your datasets.  If you think your ranking information is pretty
accurate, you should definitely use `ranking_factor`; this exerts a much
stronger influence on the matches.
If you have a low confidence in the ordering of your datasets,
choose `ranking_bonus`, which only provides a little nudge.

#### Minimum Score

Once you've plugged in all your data, you should run the correlation,
print out the result in sorted order with the best
matches on top, then scroll to the bottom and see what the *worst*
5% or 10% of matches look like.

If literally all your matches are already perfect--congratulations!
You're *already* getting good results out of **correlate** and you
can stop reading here!  But if you're not that lucky, you've got more
work to do.

The first step in cleaning up **correlate's** output is usually
to stop it from making bad matches by setting a `minimum_score`.

When you have bad matches, it's usually because the two datasets don't map
perfectly to each other.  If there's a value in `dataset_a` that really
has no good match in `dataset_b`, well, **correlate** doesn't really
have a way of knowing that.   So it may match that value to something
anyway.

Look at it this way: the goal of **correlate** is to find matches between
the two datasets.  If it's made all the good matches it can, and there's
only one item left in each of the the two datasets, and they have *anything*
in common at all, **correlate** will match those two values together
out of sheer desparation.

However!  Bad matches like these tend to have a very low score.
And usually all those bad matches are clumped together
at the bottom.  There'll probably be an inflection point
where the scores drop off significantly and the matches go from good to bad.

This is what `minimum_score` is for.  `minimum_score` tells **correlate**
the minimum permissible score for a match.  When you have a clump of bad
matches at the bottom, you simply set `minimum_score` to be somewhere
between the highest bad match and the lowest good match--and
your problem is solved!

(Technically, **minimum_score** isn't actually the *minimum* score.
It's ever-so-slightly *less* than the lowest
permitted score.  As in, for a match to be considered viable, its score
must be *greater than* **minimum_score.**  The default value for
**minimum_score** is 0, which means **correlate** will keep
any match with a positive score.)

Unfortunately it's hard to predict what to set `minimum_score` to in advance.
Its value really depends on your data set--how many keys you have, how good
the matches are, what weights you're using, everything.  It's much more
straightforward to run the correlation, look over the output, find where the
correlations turn bad, and set a minimum score.  With large data sets there's
generally a sudden and obvious dropoff in score, associated with **correlate**
making poor matches.  That makes it pretty easy: set the minimum score so it
keeps the last good match and forgets the rest.  But there's no predicting what
that score will be in advance--every data set is different, and it's really
an emergent property of your keys and weights--so
you'll have to calibrate it correctly for each correlation you run.

(Sometimes there are good matches mixed in with the bad ones at the bottom.
When that happens, the first step is generally to fix *that,* so that the
bad ones are all clumped together at the bottom.  I can't give any general-purpose
advice on what to do here; all I can say is, start experimenting with changes
to your datasets.  Change your keys, adjust your weights, run the correlation
again and see what happens.  Usually when I do this, I realize something I can
do to improve the data I feed in to **correlate**, and I can fix the problem
externally.)

#### Weights

If you're still not getting the results you want, the next adjustment you
should consider is increasing the weight of
keys that provide a clear signal.  If the datasets you're comparing
have some sort of unique identifier associated with each value--like
an episode number, or release date--you should experiment with giving those
keys a heavier weight.  Heavily-weighted keys like this can help
**correlate** zero in on the best matches right away.

It's up to you what that weight
should be; I sometimes use weights as heavy as 5 for super-important keys,
which means this one single key will have the same weight as 5 normal
keys.  Note that a weight of 5 on the mapping in `dataset_a` and
`dataset_b` means that, if those keys match, they'll have a base score
of 25!  If that key only appears once in each dataset, that will almost
*certainly* result in a match.

But weighting can be a dual-edged sword.  If your data has mistakes
in it, a heavy weighting of this bad data magnifies those mistakes.  One bad
heavily-weighted key on the wrong value can inflate the score of a bad match
over the correct match.  And that can have a domino effect--if *A1* should match
to *B1*, but it get mapped to *B43* instead, that means *B1* is probably
going to get mismatched too.  Which deprives another value of its correct
match.  And so on and so on and so on.

#### Too-Common Keys

Similarly, if there are super-common keys that aren't going to help with
the correlation, consider throwing them away and not even feeding them in as
data.  Keys that map to most or all of the values in a dataset add little
clarity, and will mainly serve just to make **correlate** slower.
I usually throw away the word "The", and the name of the podcast or show.
(When correlating filenames, I may throw away the file extension too.)

Then again, often leaving them in won't hurt anything, and it can occasionally
be helpful!  The way **correlate** works, it considers multiple maps of a
key to a value as different things--if you map the key `"The"` to a value
twice, **correlate** understands that those are two separate mappings.
And if there's only one value in each dataset that has two `"The"` mappings,
that can be a very strong signal indeed.  So it's really up to you.
Throwing away largely-redundant keys is a speed optimization, but it
shouldn't affect the quality of your matches.

(The best of both worlds: for common keys, try throwing away the *first* one.)

#### Check Your Inputs

As always, it's helpful to make sure your code is doing what you intend it to.
Several times I've goofed up the mechanism I use to feed data sets into
**correlate**; for example, instead of feeding in words as keys, I've occasionally
fed in the individual characters in those words as keys.  (Like, instead of
the single key `"booze"`, I accidentally fed in the five keys
`'b'`, `'o'`, `'o'`, `'z'`, and `'e'`.)
However, the **correlate** algorithm works so well,
it still did a shockingly good job!  (Though it was a *lot* slower.)

I've learned to double-check that I'm inputting the mappings and weights I meant
to, with a debugger or with turning on the debug print statements in **correlate**
itself.  Making sure you gave **correlate** the right data can make it not only
much more accurate, it might make it faster too!

#### Normalize Strings

When using strings as keys from real-world sources, I recommend you
*normalize* the strings:
lowercase the strings, remove most or all punctuation, break the strings up into
individual keys at word boundaries.  In the real world, punctuation
and capitalization can both be inconsistent, so throwing it away can help
dispel those sorts of inconsistencies.  **correlate**
provides a utility function called `correlate.str_to_keys()` that does this
for you--but you can use any approach you like.

You might also consider *interning* your strings.  In my limited experimentation
this provided a small but measurable speedup.

#### Sharpen Your Fuzzy Keys

If you're using fuzzy keys, the most important thing you can do is *sharpen
your keys.*  Fuzzy string-matching libraries have a naughty habit of scoring
not-very-similar strings as not *that* much less than almost-exactly-the-same
strings.  If you give that data unaltered to **correlate,** that "everything
looks roughly the same" outlook will be reflected in your final results as
mediocre matches.  In those cases it's best to force your fuzzy
matches to extremes.  The best technique is simply to have a minimum score
for fuzzy maches in these cases.  Squaring or cubing the resulting
score is also a cheap way to attenuate weaker fuzzy scores while preserving
the stronger fuzzy scores.


### What Do These Scores Mean?

The scores you seee in the results are directly related to the data you
gave to **correlate**.  The scores really only have as much or as
little meaning as you assign to them.

If you don't enjoy the unpredictable nature of **correlate** scores,
consider calling `normalize()` on your Correlate result object.
This normalizes the scores as follows: the highest score measured
will be adjusted to 1.0, `minimum_score` will be adjusted to 0.0,
and every other score will be adjusted linearly between those two.

Mathematically:

    score = the original score for this match
    highest_score = highest score of any match
    minimum_score = the minimum_score passed in to correlate()
    delta = highest_score - minimum_score
    normalized_score = (score - minimum_score) / delta



## The Algorithm

> If the implementation is hard to explain, it's a bad idea.
> --*The Zen Of Python* by Tim Peters

At the heart of **correlate** is a brute-force algorithm.

**correlate** computes every possible "match"--every mapping of a value in
`dataset_a` to a value in `dataset_b` where the two values have keys in common.
(That's why it takes so long to compute--it's what computer scientists would
call an `O(n**2)` algorithm.)
For exact keys, it uses set intersections to ignore pairs of values that have
nothing in common, discarding early matches it knows will have a score of 0.
Sadly, it can't do that for fuzzy keys, which is why
fuzzy keys tend to slow **correlate** down even more.

For each key that matches between the two values, it computes a score.  It
then adds all those scores together, computing the final cumulative
score for the "match", which it may modifiy based on the various bonuses
and factors.
It then iterates over these scores in sorted order, highest score first.
For every match where neither of the two values have been used
in a match yet, it counts that as a "match" and adds it to the output.
(Assuming `reuse_a` and `reuse_b` are both `False`.)

One important detail: **correlate** is 100%
deterministic.  Randomness can creep in around the edges in Python
programs; for example, if you ever iterate over a dictionary,
the order you will see the keys will vary from run to run.
**correlate** eliminates these sources of randomness.
Given the exact same inputs, it performs the same operations
in the same order and produces the same result, every time.

There are a number of concepts involved with how the **correlate**
algorithm works, each of which I'll explain in exhausting detail
in the following sub-sections.

### Rounds

If you call **correlate** as follows:

    c = correlate.Correlator()
    o = object()
    c.dataset_a.set('a', o)
    c.dataset_a.set('a', o)

then the key `'a'` really *is* mapped to value `o` twice,
and those two mappings can have different weights.
It's best to think of repeated keys like this as actually
being two different keys--identical, but distinct.
It's ike having files with the same filename in two
different directories.  They have the same *filename,* but
they're not the same *file.*

**correlate** calls groups of these multiple mappings *"rounds"*.
A "round" contains all the keys from the Nth time they were
repeated; round 0 contains every key, round 1 contains all the
keys that were repeated twice, round 2 contains all the keys
that were repeated three times, etc.
Rounds are per-value, and there are as
many rounds as the maximum number of redundant mappings of
any single key to any particular value in a dataset.

Consider this example:

    c = correlate.Correlator()
    o = object()
    c.dataset_a.set('a', o, weight=1)
    c.dataset_a.set('a', o, weight=3)
    c.dataset_a.set('a', o, weight=5)
    c.dataset_a.set('b', o)
    c.dataset_a.set('b', o)
    c.dataset_a.set('c', o)

    o2 = object()
    c.dataset_b.set('d', o2)
    c.dataset_b.set('d', o2)
    c.dataset_b.set('e', o2)
    c.dataset_b.set('f', o2)

Here, the value `o` in `dataset_a` would have three rounds:

* Round 0 would contain the keys `{'a', 'b', 'c'}`.
* Round 1 would contain the keys `{'a', 'b'}`.
* Round 2 would contain only one key,`{'a'}`.

And `o2` in `dataset_b` would have only two rounds:

* Round 0 would contain the keys `{'d', 'e', 'f'}`.
* Round 1 would contain only one key,`{'d'}`.

And conceptually the `"a"` in round 0 is a different key
from the `"a"` in round 1, etc.

For exact keys, rounds are directly matched iteratively
to each other; the exact keys in round 0 for a value in
`dataset_a` are matched to the round 0 exact keys for a value in
`dataset_b`, round 1 in `dataset_a` is matched to
round 1 in `dataset_b`, and so on.
If one side runs out of rounds early, you stop; if you compute
the intersection of a round and they have nothing in common,
you stop.

One invariant property: each subsequent round has a subset of
the keys before it.  The set of keys in round **N+1** *must*
be a subset of the keys in round **N**.

What about weights?  Higher weights are sorted to lower rounds.
The weight for a key *k* in round **N-1** *must* be greater than
or equal to the weight of *k* in round **N**.
In the above example, the `'a'` in round 0 has weight 5, in round 1
it has weight 3, and in round 2 it has weight 1.
(It doesn't matter what order you insert them in, **correlate**
internally stores the weights in sorted order.)

Thus, round 0 always contains every exact key mapped to
a particular value, with their highest weights.

Rounds can definitely help find the best matches.  If the
key `"The"` maps to most of your values once,
that's not particularly interesting, and it won't affect the scores
very much one way or another.  But if there's only one
value in each dataset that `"The"` maps to *twice,*
that's a very strong signal indeed!  **correlate** does an
*excellent* job of noticing unique-ness like that and factoring
it into the scoring.



### Scoring Exact Keys

For each match it considers, **correlate** computes the intersection of
the keys that map to each of those two values in the two datasets, then
computes a score based on each of those key matches.
The formula used for scoring matches between exact keys
is the heart of **correlate**, and was the original inspiration for
writing the library in the first place.

**correlate** stores the keys per-round in `set()` objects, and computes
this intersection with the marvelously fast `set.intersection()`.

**correlate** computes the score for an individual key as follows:

    key = a key present in both dataset_a and dataset_b
    value_a = a value from dataset_a that key maps to
    value_b = a value from dataset_b that key maps to
    round_number = which round this key is in
    weight_a = weight of the mapping from key -> value_a in dataset_a in round round_number
    weight_b = weight of the mapping from key -> value_b in dataset_b in round round_number
    values_a = the count of values in dataset_a that this key maps to in round round_number
    values_b = the count of values in dataset_b that this key maps to in round round_number
    score_a = weight_a / values_b
    score_b = weight_b / values_a
    semifinal_score = score_a * score_b * (key_reuse_penalty_factor ** round_number)

This `semifinal_score` is then further adjusted based on ranking information,
if used.  (See below.)

Thus, the fewer values a key maps to in a dataset, the higher it scores.
A key that's only mapped once in each dataset scores 4x
higher than a key mapped twice in each dataset.

This scoring formula has a virtuous-feeling mathematical
property I call *"conservation of score".*  Each key that
you add to a round in a dataset adds 1 to the total cumulatve
score of all possible matches; when you map a key to multiple
values, you divide this score up evenly between those values.
For example, if the key `x` is mapped to three values in `dataset_a`
and four values in `dataset_b`, each of those possible matches
only gets 1/12 of that key's score, and the final cumulative
score for all matches only goes up by 1/12.  So a key always
adds 1 to the sum of all scores across all *possible* matches,
but only increases the actual final score by the amount of
signal it *actually* communicated.

Also, now you see why repeated keys can be so interesting.
They add 1 for *each round* they're in, but that score is only
divided by the number of values they're mapped to *in that round!*

### Fuzzy Keys

Once upon a time, **correlate** was small and beautiful and
marvelously fast.  But that version could only support exact keys.
By the time fuzzy keys were completely implemented and feature-complete
and working great, **correlate** was much more complex and... "practical".
It's because fuzzy keys introduce a lot of complex behavior, resulting in
tricky scenarios that just couldn't happen with exact keys.

Consider this example:

>    Your two datasets represent lists of farms.  Both datasets list
>    animals, but might have generic information ("horse") or might
>    have specifics ("Clydesdale").  You create a fuzzy key subclass called
>    `AnimalKey` that can handle matching these together;
>    `AnimalKey("Horse/Clydesdale")` matches `AnimalKey("Horse")`,
>    though with a score less than 1 because it isn't a perfect match.
>
>    The same farm, *Farm X*, is present in both datasets:
>
>    * In `dataset_a`, the keys `AnimalKey("Horse/Clydesdale")`
>    and `AnimalKey("Horse/Shetland Pony")` map to Farm X.
>
>    * In `dataset_b`, the key `AnimalKey("Horse")` maps to Farm X *twice.*
>
>    Question: should one of the `"Horse"` keys match `"Horse/Clydesdale"`,
>    and the other `"Horse"` key match `"Horse/Shetland Pony"`?
>
>    Of course they should!

The scoring used for fuzzy keys is conceptually the same as the scoring
for exact keys, including the concept of "rounds".  In practice, fuzzy key
scoring is much more complicated; there are some multipliers I elided
in the description for exact keys because they're always 1, and some other
things that are easy to compute for exact keys that we must do the hard way
for fuzzy keys.  (There's a whole section at the end of this document about
the history of fuzzy key scoring in **correlate**, in case you're interested.)

Also, it's reasonable for a single value in a dataset to have multiple fuzzy
keys of the same type, which means that now we could have multiple keys
in one dataset in contention for the same key in the other dataset.
In the above example with farms and horses, **correlate** will need to
compare both `AnimalKey("Horse/Clydesdale")` and `AnimalKey("Horse/Shetland Pony")`
from `dataset_a` to `AnimalKey("Horse")` in `dataset_b`.

But **correlate** doesn't add up every possible fuzzy score generated by a
key; when computing the final score, a fuzzy key is only matched against
one other fuzzy key.  If fuzzy keys *FA1* and *FA2* map to value *VA*
in `dataset_a`, and fuzzy key *FB* maps to value *VB* in `dataset_b`,
**correlate** will consider *FA1* -> *FB* and *FA2* -> *FB*
and only keep the match with the highest score.  This match "consumes"
those two keys, and they can't be matched again.  (Again: when I say "key"
here, I mean "this key in this round".)

Wait!  It gets even *more* complicated!
It's entirely possible for a key in one round in `dataset_a` to be
matched to a key from a *different* round in `dataset_b`, again like the
sample of the farms and horses above.  That's right: fuzzy keys can match
keys from *different rounds!*  Computing proper fuzzy scores thus
requires tracking separately how many reuses of each key
we've used so far so that *key_reuse_penalty_factor* can be computed
properly.  Where exact keys use very precise "rounds", fuzzy keys
require a more dynamic approach.  In essense, an unused key in round *N*
can "survive" to rounds *N+1*.  That's what the above example with
farms and ponies is showing us; in round 0, if `"Horse/Clysedale"` in `dataset_a`
gets matched to `"Horse"` in `dataset_b`, `"Horse/Shetland Pony"`
in `dataset_a` goes unmatched and continues on to round 1.  This also
made scoring more complicated.  (For more on this, check out the test
suite.  There's a regression test that exercises this exact behavior.)

So here's how **correlate** computes fuzzy matches.  For each iteration,
it computes all possible fuzzy matches between all viable keys,
stores the results in a list, sorts the list with highest score first,
and keep the highest-scored fuzzy match between two keys that haven't
been "used" yet.  Every key that doesn't get "used" proceeds on to the
next iteration, if there is one, and if that same key was mapped in that
round too the early one displaces the later one.  If fuzzy key *FKA*
is mapped twice to value *VA*, and the first *FKA* doesn't get
used in round 0, that mapping sticks around to the next iteration,
and the second mapping continues to wait.  Complicated, huh!

(One additional subtle point: the *weights* of these fuzzy key mappings
doesn't influence this aspect of scoring; they're only used in calculating
the semi-final score, below.  **correlate** prefers the *best* matches,
not the *most important* matches.)

At last, here's the scoring algorithm for fuzzy keys:

    value_a = a value from dataset_a
    value_b = a value from dataset_b
    key_a = a fuzzy key in dataset_a that maps to value_a
    key_b = a fuzzy key in dataset_b that maps to value_b
    round_a = the round number for this mapping of key_a -> value_a in dataset_a
    round_b = the round number for this mapping of key_b -> value_b in dataset_b
    weight_a = weight of this mapping from key_a -> value_a in round_a in dataset_a
    weight_b = weight of this mapping from key_b -> value_b in round_b in dataset_b
    fuzzy_score = the result of key_a.compare(key_b)
    cumulative_a = the cumulative score of all successful matches between key_a and all fuzzy keys in dataset_b
    cumulative_b = the cumulative score of all successful matches between key_b and all fuzzy keys in dataset_a
    reuse_penalty_a = key_reuse_penalty_factor ** round_a
    reuse_penalty_b = key_reuse_penalty_factor ** round_b
    score_ratio_a = (fuzzy_score * reuse_penalty_a) / cumulative_a
    score_ratio_b = (fuzzy_score * reuse_penalty_b) / cumulative_b
    unweighted_score = fuzzy_score * score_ratio_a * score_ratio_b
    score_a = weight_a * unweighted_score_a
    score_b = weight_b * unweighted_score_b
    semifinal_score = score_a * score_b

`cumulative_a` is the sum of all `fuzzy_score` scores for matches using `key_a`.
`unweighted_score` is used when choosing which matches to keep per-round.
`semifinal_score` is the actual score added to the total for this match.
Again, this `semifinal_score` is further permuted by ranking, if used.

It's hard to see it in all that messy math, but fuzzy keys maintain
*"conservation of score"* too.  The total score contributed by a fuzzy key
across all possible matches is guaranteed to be no less than 0 and
no more than 1.  But the amount of score it actually contributes
to the final cumulative score is dependent on how many of those
matches are actually used.  A fuzzy key that, when matched,
always produced a `fuzzy_score` of 1 would behave identically
to an exact key with respect to *"conservation of score"*.


### Score Ratio Bonus

There's a "bonus" score calculated using `score_ratio_bonus`.  It's scored for the
overall mapping of a value in `dataset_a` to a value in `dataset_b`.
This bonus is one of the last things computed for a match, just before ranking.

The bonus is calculated as follows:

    value_a = a value from dataset_a
    value_b = a value from dataset_b
    actual_a = total actual score for all keys that map to value_a in dataset_a
    actual_b = total actual score for all keys that map to value_b in dataset_b
    possible_a = total possible score for all keys that map to value_a in dataset_a
    possible_b = total possible score for all keys that map to value_b in dataset_b
    bonus_weight = score_ratio_bonus * (actual_a + actual_b) / (possible_a + possible_b)

This bonus calculated with `score_ratio_bonus` clears up the
ambiguity when the set of keys mapping to one value is a subset of the keys
mapping to a different value in the same dataset.  The higher percentage
of keys that match, the larger this bonus will be.

Consider this example:

    c = correlate.Correlator()
    c.dataset_a.set('breakin', X)

    c.dataset_b.set('breakin', Y)
    c.dataset_b.set_keys(['breakin', '2', 'electric', 'boogaloo'], Z)

Which is the better match, `X->Y` or `X->Z`?
In early versions of **correlate**, both matches got the exact same score.
So it was the luck of the draw as to which match **correlate** would choose.
`score_ratio_bonus` disambiguates this scenario.  It awards a larger bonus
to `X->Y` than it does to `X->Z`,  because a higher percentage of the keys
matched between `X` and `Y`.
That small boost is usually all that's needed to let **correlate**
disambiguate these situations and pick the correct match.

Two things to note.  First, when I say "keys", this is another situation
where the same key mapped twice to the same value is conceptually considered
to be two different keys.
In the example in the **Rounds** subsection above, where `value_a` is `o` and
`value_b` is `o2`, `possible_a` would be 6 and `possible_b` would be 4.

Second, the scores used to compute `actual` and `possible` are *unweighted.*
If a match between two fuzzy keys resulted in a fuzzy score of `0.3`,
that adds `0.3` to both `actual_a` and `actual_b`, but each of those fuzzy
keys adds `1.0` to `possible_a` and `possible_b` respectively.
All modifiers, like weights and `key_reuse_penalty_factor`,
are ignored when computing `score_ratio_bonus`.


### Choosing Which Matches To Keep: The "Match Boiler"

As mentioned previously, **correlate** iterates over all the matches
sorted by score, and keeps the matches with the highest score where
neither `value_a` nor `value_b` has been matched yet.

Actually that's no longer quite true.
Late in development of **correlate**, I realized there was a small
problem lurking with this approach.  Happily it had a relatively
easy fix, and the fix didn't make **correlate** any slower in the
general case.

Here's the abstract problem: if you're presented with a list of match
objects called `matches`,
where each item has three attributes `value_a`, `value_b`, and `score`,
how would you compute an optimal subset of `matches` such that:

* every discrete value of `value_a` and `value_b` appears only once, and
* the sum of the `score` attributes is maximized?

Finding the perfectly optimal subset would be expensive--not
quite `O(N**2)` but close.  It'd require a brute-force approach,
where, for every item in `matches` that shared a value of `value_a` or
`value_b` with another item, you'd have to run an experiment to
"look ahead" and see what happens if you chose that item.  You
compute the resulting score for each of these items, then
keep the item that resulted in the highest cumulative score.

**correlate** needs to solve this problem in two different contexts:

* when processing the list of fuzzy matches produced during each "round" of fuzzy matching, and
* when processing the overall list of matches computed between `dataset_a` and `dataset_b`
  at the end of the correlation process.

But **correlate** can't use this hypothetical "optimal" algorithm,
because it'd take too long--you probably want your correlation
to finish before our sun turns into a red giant.
Instead, **correlate** used a considerably
cheaper "greedy" algorithm.  As already described several times:
**correlate** sorts
the list of matches by `score`, then iterates down the list highest
to lowest.  For every item, if we haven't already "kept" another item
with the same `value_a` or `value_b`, "keep" this item in the `results`
list, and remember the `value_a` and `value_b`.

This approach completes in linear time.  In theory, it's not guaranteed
to produce optimal results.  In practice, with real-world data, it should.
Except there *is* a plausible problem lurking in this approach!

Consider: what if two values in the list are both viable, and they have the
*same* score, and they have either `value_a` or `value_b` in common?  It's
ambiguous as to which match **correlate** will choose.  But choosing the
wrong one *could* result in objectively less-than-optimal scoring.

Here's a specific example:

* `dataset_a` contains fuzzy keys `fka1` and `fka2`.
* `dataset_b` contains fuzzy keys `fkbH` and `fkbL`.
  Any match containing `fkbH` has a higher score than any match containing `fkbL`.
  (H = high scoring, L = low scoring.)
* The matches`fka1->fkbH` and `fka2->fkbH` have the same score.
* The match `fka1->fkbL` has a lower score than `fka2->fkbL`.

**correlate** should prefer `fka2->fkbL` to `fka1->fkbL`.
But it can only pick that match if it previously picked `fka1->fkbH`.
And there's no guarantee that it would!  If two items in the list have
the same score, it's ambiguous which one **correlate** would choose.
To handle this properly it needs to look ahead and experiment.

My solution for this is what I call the "match boiler", or the "boiler"
for short.  The boiler uses a hybrid approach: it uses the greedy linear
algorithm when scores are unique.  If it encounters a run of items
with matching scores, and where any of those items have `value_a` or
`value_b` in common, it recursively tries the experiment where it keeps
each of those items in turn.  It does the look-ahead, and sums the score
from each recursive experiment, then keeps the experiment with the highest
score.

With the "match boiler" in place, **correlate** seems to produce optimal
results even in these rare ambigous situations.

(Using the boiler to analyze fuzzy rounds got pretty complicated!
It had to support a callback each time it kept a value, which let
the fuzzy keys submit new matches for consideration from subsequent
rounds.)


Even with the boiler, you can still contrive scenarios where **correlate**
will produce arguably sub-optimal results.  If `a1` and `a2` are values
in `dataset_a`, and `b1` and `b2` are values in `dataset_b`, and the
matches have these scores:

    a1->b1 == 10
    a1->b2 == 9
    a2->b1 == 8
    a2->b2 == 1

In this scenario, the boiler will pick `a1->b1`, which means it's
left with `a2->b2`.  Total score: 11.  But if it had picked `a1->b2`,
that means it would get to pick `a2->b1`, and the total score would be 17!
Is that better?  In an abstract, hypothetical scenario like this, it's
hard to say for sure.

In practice I don't think this is really a problem.  Handling the
ambiguous scenario where items had identical scores is already
"gilding the lily", considering how rare it happens with real-world data.
Anyway, when would real data behave in this contrived way?  How
could `a1` score so highly against `b1` and `b2`, but `a1` scores
high against `b1` but low against `b2`?
This scenario simply isn't realistic, and fixing it would likely be
computationally expensive for the general case.  So it's just not
worth it.  Relax, YAGNI.


### Ranking

Ranking information can help a great deal.
If a value in `dataset_a` is near the beginning, and the order
of values is significant, then we should prefer matching it to values
in `dataset_b` near the beginning too.  Matching the first
value in `dataset_a`
against the last value in `dataset_b` is probably bad.

Conceptually it works as follows: when scoring a match,
measure the distance between
the two values and let that distance influence the score.  The closer
the two values are to each other, the higher the resulting score.

But how do you compute that delta?  What do the ranking numbers mean?
**correlate** supports two possible interpretations
of the rankings, what we'll call *absolute* and *relative* ranking.
These two approaches differ in how they compare the ranking numbers,
as follows:

* *Absolute* ranking assmes the ranking numbers are the same
  for both datasets.  `ranking=5` in `dataset_a` is a perfect
  match to `ranking=5` in `dataset_b`.
* *Relative* ranking assumes that the two datasets represent the
  same range of data, and uses the ratio of the ranking of a value
  divided by the highest ranking set in that dataset to compute
  its relative position.  If the highest ranking we saw in a
  particular dataset was `ranking=150`,
  then a value that has `ranking=12` set is calculated to be 8%
  of the way from the beginning to the end.  This percentage
  is calculated similarly for both datasets, and the distance
  between two values is the distance between these two percentages.

For example, if `dataset_a` had 100 items ranked 1 to 100,
and `dataset_b` had 800 items ranked 1 to 800,
a value to `dataset_a` with `ranking=50` in *absolute* ranking
would be considered closest to a value in `dataset_b` with `ranking=50`,
but when using *relative* ranking
it'd be considered closest to a value in `dataset_b` with `ranking=400`.

Which one does **correlate** use?  It's configurable with the `ranking`
parameter to `correlate()`.  By default it uses "best" ranking.
"Best" ranking means **correlate** will compute scores using *both*
methods and choose the approach with the highest overall score.
You can override this by supplying a different value to `ranking`
but this shouldn't be necessary.  (Theoretically it should be faster
to use a specific ranking approach.  Unfortunately this hasn't been
optimized yet, so using only one ranking doesn't really speed things up.)


Ranking is the last step in computing the score of a match.
As for how ranking affects the score, it depends on whether you
use `ranking_bonus` or `ranking_factor`.

Both approaches start with these four calculations:

    semifinal_scores_sum = sum of all "semifinal" scores above
    ranking_a = the ranking value computed for value_a
    ranking_b = the ranking value computed for value_b
    ranking_score = 1 - abs(ranking_a - ranking_b)

`ranking_bonus` is then calculated per-match as follows:

    bonus = ranking_score * ranking_bonus
    final_score = semifinal_scores_sum + bonus

`ranking_factor` is also calculated per-match, as follows:

    unranked_portion = (1 - ranking_factor) * semifinal_scores_sum
    ranked_portion = ranking_factor * semifinal_scores_sum * ranking_score
    final_score = unranked_portion + ranked_portion

(If you don't use either, the final score for the match is effectively
`semifinal_scores_sum`.)

Obviously, `ranking` must be set on both values in both datasets to
properly compute `ranking_score`.  If it's not set on *both* values
being considered for a match, **correlate** still applies
`ranking_bonus` or `ranking_factor` as usual, but it skips the
initial four calculations and just uses a `ranking_score` of 0.


## Final Random Topics

### Debugging

When all else fails... what next?

**correlate** can optionally produce an enormous amount of debug output.
The main feature is showing every match it tests, and the score arrived
at for that match, including every step along the way.  This log output
quickly gets very large; even a comparison of 600x600 elements will produce
tens of megabytes of output.

Unfortunately, producing this much debugging output incurred a measurable
performance penalty--even when you had logging turned off!  It was mostly
in computing the "f-strings" for the log, but also simply the calls to
the logging functions added overhead too.

The solution I settled on is: by default, each of the debug print statements
is commented out.
**correlate** ships with a custom script preprocessor called
`debug.py` that can toggle debugging on and off, by uncommenting and
re-commenting the debug code.

How does it know which lines to uncomment?  Each line of the logging-only
code ends with a special string: "`#debug`".

To turn on this logging, run the `debug.py` script in the same directory
as **correlate's** `__init__.py` script.  Each time you run it, it'll
toggle the debug print statements.
Note that the debug feature in **correlate** requires Python 3.8 or higher,
because it frequently uses the beloved "equals sign inside f-strings" syntax.

By default the logging is sent to stdout.  If you want to override where
it's sent, write your own `print` function, and assign it to your
`Correlator` object before calling `correlate()`.

The format of the log is undocumented and subject to change.  Good luck!
The main thing you'll want to do is figure out the "index" of the values
in datasets a and b that you want to compare, then search for `" (index_a) x (index_b) "`.
For example, if the match you want to see is between value index 35 in `dataset_a`
and value index 51 in `dataset_b`, search in the log for `" 35 x 51 "`.
(The leading and trailing spaces means your search will skip over,
for example, `235 x 514`.)


### Alternate Fuzzy Scoring Approaches That Didn't Work

The math behind fuzzy scoring is a bit surprising, at least to me.
If you boil down the formula to its constituent factors,
you'll notice one of the factors is `fuzzy_score` *cubed.*
Why is it *cubed?*

The simplest answer: that's the first approach that worked
properly.  To really understand why, you'll need to understand the
history of fuzzy scoring in **correlate**--all the approaches
I tried first that *didn't* work.

Initially, the score for a fuzzy match was simply the fuzzy score
multiplied by the weights and `key_reuse_penalty_factor`.
This was always a dumb idea; it meant fuzzy matches had *way* more
impact on the score than they should have.  This was particularly
true when you got down to the last 10% or 20% of your matches,
by which point the score contributed by exact keys had
fallen off a great deal.  This approach stayed in for what is,
in retrospect, an embarassingly long time; I'd convinced myself
that fuzzy keys were innately more *interesting* than exact keys,
and so this comparative importance was fitting.

Once I realized how dumb that was, the obvious approach was to score
them identically to exact keys--divide the fuzzy score by the product
of the number of keys this *could* have matched against in each
of the two datasets.  This was obviously wrong right away.
In the "YTJD" test, every value had the same fuzzy keys, depending
on the test: every value always had a fuzzy date key, and depending
on the test it might have a fuzzy title key and/or a fuzzy episode number
key too.  So each of the 812 values in the first dataset had one
fuzzy key for each type, and each of the 724 values in the second dataset
did too.  Even if we got a perfect fuzzy match, the maximum score
for a fuzzy match was now `1.0 / (812 * 724)` which is `0.0000017`.  So now
we had the opposite problem: instead of being super important, even a perfect
fuzzy match contributed practically nothing to the final score.

After thinking about it for a while,
I realized that the exact key score wasn't *really* being
divided by the number of *keys* in the two datasets, per se;
it was being divided by the total possible *score* contributed
by that key in each of the two datasets.  So instead of
dividing fuzzy scores by the number of keys, they should be divided
by the cumulative fuzzy score of all matches involving those two keys.
That formula looks like
`fuzzy_score / (sum_of_fuzzy_scores_for_key_in_A * sum_of_fuzzy_scores_for_key_in_B)`.

This was a lot closer to correct!  But this formula had a glaring new problem.
Let's say that in your entire correlation, `dataset_a` only had one
fuzzy key that maps to a single value,
and `dataset_a` only had one fuzzy key that also only maps to a single value.
And let's say the fuzzy score you get from matching those two keys
is `0.000001`--a really terrible match.
Let's plug those numbers into our formula.  We get `0.000001 / (0.000001 * 0.000001)`
which is `1000000.0`.  A million!  That's crazy!  We've taken an absolutely
terrible fuzzy match and inflated its score to be nonsensically high.
Clearly that's not right.

This leads us to the formula that actually works.  The insight here
is that the same formula needs to work identically for exact keys.
If you take this formula and compute it where every `fuzzy_score`
is 1 (or 0), it produces the same result as the formula for exact keys.
The final trick is that we can multiply by `fuzzy_score` wherever we need
to, because multiplying by 1 doesn't change anything.  That means
the resulting formula will still be consistent with the exact keys
scoring formula.  And what worked was the formula where we multiply by
`fuzzy_score` three times!

Here again is the formula used to compute the score for a fuzzy match,
simplified to ignore weights and rounds:

    value_a = a value from dataset_a
    value_b = a value from dataset_b
    key_a = a fuzzy key in dataset_a that maps to value_a
    key_b = a fuzzy key in dataset_b that maps to value_b
    fuzzy_score = the result of key_a.compare(key_b)
    cumulative_a = the cumulative score of all successful matches between key_a and all fuzzy keys in dataset_b
    cumulative_b = the cumulative score of all successful matches between key_b and all fuzzy keys in dataset_a
    score_ratio_a = fuzzy_score / cumulative_a
    score_ratio_b = fuzzy_score / cumulative_b
    unweighted_score = fuzzy_score * score_ratio_a * score_ratio_b

The final trick really was realizing what `score_ratio_a` represents.
Really, it represents the ratio of how much *this* fuzzy match for `key_a`
contributed to the sum of *all* fuzzy matches for `key_a`
across all successful matches in `dataset_a`.
