Metadata-Version: 2.1
Name: kdelikelihood
Version: 0.1
Summary: A python package to compute kernel-density-estimated likelihood
Home-page: https://git.ufz.de/fischsam/kde-likelihood
Author: Samuel M. Fischer
Author-email: samuel.fischer@ufz.de
License: GNU Lesser General Public License v3 (LGPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Classifier: Operating System :: OS Independent
Description-Content-Type: text/markdown
License-File: LICENSE

# KDE-Likelihood

A python package to compute kernel-density-estimated likelihood.

## Installation

To compile and use this package, you need python 3, pip, and a c++ compiler. On Linux, the compiler should not be a concern; on Windows, you need the Visual Studio 2019 compiler, which you may get by downlaoding and installing the [Buildtools for Visual Studio](https://visualstudio.microsoft.com/de/vs/older-downloads/).

If you have python and pip installed and available on the `PATH`, navigate to the project folder in which you find `setup.py`. Execute

```
pip install .
```




## Usage example:

```python

import numpy as np
from kdelikelihood import ParallelLikelihoodComputer, get_bandwidth_by_silverman


# Some model for simulating the data
def model(parameters, sampleSize=10000):
    result = np.zeros((sampleSize, 3))

    # Suppose the first column contains some count data
    result[:, 0] = np.random.poisson(parameters[0], size=sampleSize)

    # Suppose the second column contains some positive data
    result[:, 1] = np.random.randn(sampleSize) ** 2 * parameters[1] + result[:, 0]

    # Suppose the third column contains some unconstrained data
    result[:, 2] = np.random.randn(sampleSize) * result[:, 1] * parameters[2]

    return result


# The observed data (each row is an independent sample)
observedData = np.array(
    [
        [1.0, 1.11921911, -0.71177555],
        [6.0, 6.02855478, 8.37566854],
        [3.0, 9.57724975, 6.41826056],
        [5.0, 5.64429256, 6.24338937],
        [8.0, 9.57156202, 7.87944723],
        [4.0, 4.84531121, -0.65146309],
        [7.0, 7.03391557, -8.34715355],
        [3.0, 3.04287902, 0.92252008],
        [7.0, 7.00074299, 0.98825773],
        [5.0, 5.83488089, -6.66077578],
        [6.0, 6.26802035, -6.32007646],
        [4.0, 4.00099543, -5.72002288],
        [2.0, 2.61069588, 1.98710941],
        [7.0, 8.08870818, -11.84123493],
        [4.0, 4.20187965, 2.96009114],
        [2.0, 3.21568321, -0.51097464],
        [3.0, 9.08856868, 10.7647393],
        [5.0, 5.00020691, -4.58271643],
        [5.0, 11.4222031, 5.95202002],
        [5.0, 5.64400458, -4.22623137],
        [6.0, 7.08425759, -3.3731975],
        [4.0, 4.59795136, -4.37783711],
        [4.0, 7.20161146, -3.36315821],
        [3.0, 3.00051945, -2.39977869],
        [1.0, 3.05606476, -1.18190334],
        [2.0, 2.18191751, 2.29985054],
        [5.0, 5.11884191, -0.74355017],
        [5.0, 5.92561278, -8.76376851],
        [2.0, 4.19454749, 13.87107752],
        [5.0, 5.00930303, -0.34471651],
    ]
)

# The type of data we are considering in each column:
# 0: unconstrained real numbers
# 1: non-negative real numbers
# 2: natural numbers
modes = [2, 1, 0]

# The size of the sample we are generating for each parameter set
# to estimate the likelihood
sampleSize = 10000

# Bandwidths. Smaller values mean reduced bias if the sample is large enough.
# Bigger values mean reduced stochasticity of the result.
# As a start, we could use the rule of thumb of Silverman
bandwidths = get_bandwidth_by_silverman(observedData, sampleSize)

# Tolerance for the result
# This is not really a rigorous quantity. Bigger values make
# computations faster; smaller values increase the accuracy
atol = 1e-10

# To avoid the "curse of dimensionality", we can consider parts of the data
# as independent. However, since in our examples the data are interdependent,
# we say they all belong to the same group. That is, we have one group only.
dataGroups = 1

# Set up a likelihood computer based on the observed data
with ParallelLikelihoodComputer(observedData, bandwidths, modes, dataGroups, atol) as cmp:
    # Some parameters:
    # (here, this are the true parameters used to generate the data set, but
    #  in practice, we would not know of course)
    parameters = [5, 1, 1]

    # Get some data set from a simulation (each row is a "quasi-independent" sample)
    simulatedData = model(parameters)

    # Compute the likelihood
    logLikelihood = cmp.compute_log_likelihood(simulatedData)

    print("Ln(Likelihood):", logLikelihood)

    # To optimize this, we could use the suggested workflow implemented in the package,
    # which uses the package py-bobyqa (https://github.com/numericalalgorithmsgroup/pybobyqa/)
    from kdelikelihood import maximize_log_likelihood

    # Define the objective function
    def log_likelihood(parameters):
        # Generate a sample from the model
        sample = model(parameters, sampleSize)

        # Return the negative log-likelihood, as pybobyqa is a
        # minimization algorithm
        return cmp.compute_log_likelihood(sample)

    # Define some bounds for the parameters
    bounds = [(1e-10, 20), (0, 20), (0, 20)]


    result = maximize_log_likelihood(log_likelihood, bounds)

    print(result)

    # For comparison, the likelihood estimated based on the true parameters:
    print("Likelihood of the original parameters:", log_likelihood(parameters))

```
