Metadata-Version: 2.4
Name: ntqr
Version: 0.7
Summary: Tools for the logic of evaluation using unlabeled data
Keywords: evaluation,algebra,machine learning,logic,ai safety
Author-email: Andres Corrada-Emmanuel <andres.corrada@dataengines.com>, Walker Lee <walker.lee@dataengines.com>, Adam Sloat <adam.sloat@dataengines.com>
Description-Content-Type: text/markdown
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: Jupyter
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Mathematics
License-File: LICENSE.txt
Requires-Dist: sympy>=1.13.2
Requires-Dist: typing_extensions>=4.12.2
Requires-Dist: matplotlib>=3.10.0
Requires-Dist: sphinx>=8.2.3 ; extra == "sphinx"
Requires-Dist: myst-parser>=4.0.0 ; extra == "sphinx"
Requires-Dist: myst-nb>=1.2.0 ; extra == "sphinx"
Requires-Dist: matplotlib>=3.10.0 ; extra == "sphinx"
Project-URL: Documentation, https://ntqr.readthedocs.org/en/latest
Project-URL: Home, https://github.com/andrescorrada/IntroductionToAlgebraicEvaluation/tree/main/python
Provides-Extra: sphinx

# Logic tools to make your AI safer

![NTQR](./img/NTQRpt24.png)

```console
~$: pip install ntqr
```

:::{figure-md}
![Prevalence estimates](./img/threeLLMsBIGBenchMistakeMultistepArithmetic.png)

**The only evaluations possible for three LLMs (Claude, Mistral, ChatGPT) that
graded a fourth one (PaLM2) doing the multistep-arithmetic test in the BIG-Bench-Mistake
dataset. Using the axiom for the single binary classifier, we can reduce each LLMs possible
evaluations as grader of the PaLM2 to the circumscribed planes inside the space of
all possible evaluations.**
:::


Evaluation of noisy decision makers in unsupervised settings is a fundamental
safety engineering problem. This library contains algorithms that treat this
problem from a logical point of view by considering the axiomatic relationships
that must govern how groups may be performing given how they agree and disagree
in their answers to a common test.

A simple demonstration of the power of logic to clarify possible group evaluations
for noisy experts is given by the example of two of them taking a common test
and disagreeing on at least one answer. We can immediately deduce that it is
impossible for **both** to be 100% correct. Notice the power of this elimination
argument. We do not need to know anything about the test or its correct answers.
By just looking at how they agree and disagree, we can immediately deduce what
group evaluations are impossible. The NTQR package carries this out by formulating
algebraic logical axioms that must be obeyed by all evaluations of a given type
or model.

Using logic as an aid for safety is well known in other industries. Formal
software verification is one requirement for certifying the software that
carries out emergency shutdowns of nuclear plants. Autonomous vehicles are
field tested using linear temporal logic.

But logic seems out of place with how we engineer AI agents nowadays,
mostly as statistical algorithms that depend and rely on a well developed set
of ideas from probability theory. In addition, for some readers it may remind
them of the symbolic AI systems that were developed before probability theory
became the basic tool for building algorithms. They used knowledge bases and
tried to reason about the world and their actions. The logic of unsupervised
evaluation we are talking about is none of those.

It is not a logic about how to make decisions. That is what the symbolic
systems try to do. They do this using a **world model**. These are hard
define except in simulated worlds. Evaluations are not like that. A binary
test is a binary test in any domain. The logic of unsupervised evaluation
requires **evaluation models**, not world models.
And these are trivial to specify and construct.

The upcoming version (0.7) can now handle an arbitrary number of classifiers
and labels. When evaluations are carried out in the integer space of correct
and incorrect label responses, the axioms of unsupervised evaluation become
linear. These axioms, in addition to other logical constraints on the counts,
define a linear optimization problem for the set of possible group evaluations.
The current version of NTQR is not taking advantage of this fact. Future
versions will incorporate optimized solvers such as
[PuLP](https://coin-or.github.io/pulp/).

Brief guide:
1. Formal verification of unsupervised evaluations: The NTQR package is
  working out the logic for verifying unsupervised evaluations - what are
  the group evaluations that are logically consistent with how the test takers
  agree and disagree on multiple-choice exams? The page "Formal verification of
  evaluations" explains this further.
2. A way to stop infinite monitoring chains: Who grades the graders? Montioring
  unsupervised test takers raises the specter of endless graders or monitors.
  By having a logic of unsupervised evaluation, we can stop those infinite
  chains. We can verify that pairs of classifiers are misaligned, for example.
  Take a look at the "Logical Alarms" Jupyter notebook.
3. Jury evaluation theorems: Jury decision theorems - when does the crowd
  decide wisely? - go as far back as Condorcet's 1785 theorem proving that
  majority voting makes the crowd wiser for better than average jury members.
  The NTQR package contains jury evaluation theorems - when does the crowd
  evaluate itself wisely? It turns out it does this better than majority voting
  can decide. This has important consequences for how we shoud design
  safer Ai systems. Check out the "Evaluation is easier than decision"
  page.

>**Warning**
This library is under heavy development and is presently meant only
for research and educational purposes. AI or any safety engineering is
not solvable by any one tool. These tools are meant to be part of a broader
safety monitoring system and are not meant as standalone solutions.
NTQR algorithms are meant to complement, not supplant, other safety tools.


