Metadata-Version: 2.3
Name: darum
Version: 1
Summary: DAfny Resource Usage Measurement
Author: Horacio Mijail Anton Quiles
Author-email: hmijail@gmail.com
Requires-Python: >=3.12,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: ansi2html (>=1.9.2)
Requires-Dist: hvplot (>=0.9.2)
Requires-Dist: psutil (>=6.0.0)
Requires-Dist: pygments (>=2.18.0)
Requires-Dist: quantiphy (>=2.19)
Requires-Dist: sh (>=2.0.7)
Description-Content-Type: text/markdown

# DAfny Resource Usage Measurement

A set of tools to detect and help diagnose brittle verification.

*Darum? Ach, warum!*

## What does Darum do? 

It analyzes Dafny's verification costs to find brittleness and helps the user diagnose it.

## Brittleness?
Dafny verifies code by translating it internally into assertions, that then are verified by the Z3 solver.

For a long time, the common advice to help Dafny code verify successfully was to add context information to the code through manual assertions, so that Z3 would have enough information to find a solution. This works well for small projects. However, the solver can eventually have too much information and get lost following unproductive rabbitholes while searching for a solution. This causes long verification times or timeouts, which bogs down development. Worse, the verification of Dafny code depends on a measure of randomness, which causes Z3 to follow changing paths when searching for proofs, even for semantically unchanged Dafny code. This causes variability in verification times, possibly spanning multiple orders of magnitude. This is known as "verification brittleness".

Since recently, Dafny has been adding functionalities to help the user control what information reaches Z3. The problem then is that one rarely knows what allowed Z3 find a proof in the past, or what is confusing it now. Indeed, at the Dafny level, the only thing we get back is the result (verification successful, failure or timeout), plus how costly was it for the solver to reach that result.

At Consensys, we found that the distribution of verification costs in bigger codebases turns heavily multimodal. Hence, the standard statistics (average, covariance) typically used to report costs obscure important information; namely, that cost variation is not smooth, but can change drastically. And because of the inherent randomness, these variations happen for no reason discernible to the user.

**Darum analyzes the solver costs, to find what parts of the code show the highest variability. It's at these points where applying Dafny's brittleness control functionalities will have the biggest effect.**

## What exactly is Darum?

Darum consists of 3 loosely coupled tools:
* `dafny_measure`: a wrapper around `dafny measure-complexity` for easier management of the iterative verification process. It captures the logs and augments them with transient information for easier reference:
  - The timestamp and arguments used in this verification run
  - Dafny's stdout and stderr
  - The input file's contents and a hash, to avoid confusion when comparing successive versions of the code

  Additionally, `dafny_measure` warns when some Dafny toolchain bugs are detected, like when it gets stuck ([#5316](https://github.com/dafny-lang/dafny/issues/5316)) or z3 processes are leaked (#[5616](https://github.com/dafny-lang/dafny/issues/5616)).

* `plot_distribution`: a tool to analyze and present the logs generated by `dafny_measure`.
  - It runs some tests on the verification results contained in the log
  - Scores the results heuristically for their potential for improvement
  - Presents the results in summary tables
  - Plots the most interesting results for further analysis.

* `compare_distribution`: a tool to compare verification runs at different Assertion Batch granularity, i.e. with and without "Isolate Assertions" mode.

## Installation

Darum's tools are written in Python and available in Pypi.

Probably the easiest way to install DARUM is using `pipx`, which should be available in all common package managers, like `brew` in macOS.

```
$ brew install pipx
...
$ pipx install darum
```
This will make DARUM's tools available as common CLI commands.

Eventually you can upgrade DARUM with
```
$ pipx upgrade darum
```

## Usage

Each of the tools has a `--help` argument that lists the available options.

In general, the workflow will be:
1. Run `dafny_measure`
2. Plot the logfile with `plot_distribution` (happens automatically when running `dafny_measure`)
3. If some member looks interesting/suspicious, run `dafny_measure` again in IA mode, possibly with `--filter-symbol` to focus only on that member
4. Plot the new logfile with `plot_distribution`
5. And/or compare both logfiles with `compare_distribution`.

```bash
$ dafny_measure myfile.dfy
...
Generated logfile XYZ.log

$ plot_distribution XYZ.log
...

$ dafny_measure myfile.dfy --isolate-assertions --extra "--filter-symbol expensiveFunction"
...

$ compare_distribution XYZ.log -i XYZ_IA.log
...
```

For further details about how Darum works and usage strategies, please see the file [Details.md](<Details.md>).

#### How many iterations to run with `dafny_measure`? (`-i` argument)

The default is 10, which in practice seem to work well. Bigger numbers (100 iterations or more) might be interesting to get more detail on how the distribution really looks like in badly behaved code: what are its modes, and how extreme it can get.

Note that a higher number of iterations can trigger bugs in Dafny, and fail midway without producing a log (Dafny issue [#5316](https://github.com/dafny-lang/dafny/issues/5316)). Plan accordingly.
To work around this, there's some functionality in Darum to analyze multiple small logfiles, which can be more reliable than trying to generate a big logfile at once. (Darum issue [#1](https://github.com/hmijail/darum/issues/1))

## Interpreting the results

Take a look at the [walkthrough](docs/Walkthrough.md)

### The plots

#### Plain plots

##### What is an acceptable span?
Badly behaving members seem to blow up their span rather abruptly, so probably a single run will quickly give an idea of which members need work. However, as a rule of thumb: in IA mode, spans seem to grow abruptly once they go over 3-5%. In default mode, this happens over 10%.


##### Worst offenders

ABs are scored according to their characteristics, including the fact that a non-successful AB makes subsequent ABs in the same member unreliable.

The top N ABs are plotted. For plots with failures/OoRs, these are plotted aside for attention.

The plot starts in transparent mode to make it easier to see where bars overlap. Clicking on the legend makes the corresponding plot easier to see.

Verification results that happen rarely are specially important. Hence, the Y axis is logarithmic to better show those single, rare results.


#### Comparative plots

This type of plot compares RUs of members verified in default mode (plotted with X symbols) against the RU *total* of those same members verified in IA mode (plotted as spikes). The comparisons are always member-wise: member-wide single AB vs sum of a member's ABs. Note the Log X axis.

Things to notice in each member:
* Ideal behavior happens when a member verifies robustly and, ideally, cheaply. In the plot, this translates to Xs clustered together (meaning stability) and much farther to the left (meaning cheaper) than the spikes, which will also be clustered together (meaning that even the DAs verify stably for this member). This is the typical situation for short, simple members.
* Wide X dispersion means brittleness in default mode. If the spikes are still clustered, this means that the DAs are still stable.
  * Maybe Z3 is finding a proof following different paths, or the search space is just too big. Reducing it to force Z3 into the fast path would help.
  * Notably, if some of the X show higher cost than the spikes, this is a clear indication that plainly breaking down the AB into smaller batches would help by trading happenstance speed for stability.
* If even the spikes are dispersed, this means that even the individual DAs show brittleness.
  * It's likely that some DA can't verify by itself; the member-level AB verified successfully because it discovered extra context while proving previous DAs. Hence, it'd be helpful to make the discovered context explicit. Plotting individually the IA mode log will show what DA is failing.
* Robust member verification (clustered X) with brittle IA verification (disperse spikes) might point to brittleness waiting to happen!
  * Lucky, side-effect discoveries while proving early DAs in the AB are allowing subsequent DAs to be proven. If the lucky discovery stops happening, suddenly there are failures in the AB.


### The table/s

Default mode distribution plots contain a table analyzing the logs. For convenience, the column "diag" in the table summarizes the situation for each row with a series of icons:
* 📊 : This element was plotted because was among the top scores.
* ⌛️ : This element finished as Out of Resources.
* ❌ : This element explicitly failed verification.
* ❗️ : This element had *both* successful and failed/OoR results. Note that purely successful results are not highlighted because they're the hoped default; but fliflopping results merit extra attention. [According to the Dafny developers](https://github.com/dafny-lang/dafny/issues/5615#issuecomment-2223290919), as long as there is a successful result, the goal should be to stabilize it to avoid the failures – as opposed to discarding the success because of the failures.
* ❓ : This element had a single success across all iterations. It's probably prioritary to stabilize its success.
* ⛔️ : This element was excluded from the plot on request.

IA mode distribution plots contain 2 tables. The first one is equivalent to the one just described, only applied to the individual ABs. The second table shows the total costs at the member level, but still in IA mode.


### Comments

If interesting/atypical situations were detected while preparing the plots, they will be listed in a Comments section at the bottom of the plot page.


## Pitfalls

### Start in standard mode, dig into IA mode once a problem is apparent

As mentioned, a member can verify stably in default mode, but in IA mode present ABs that are brittle or even fail. The significance of this situation is still unclear, therefore:
1. Probably there's no immediate harm in leaving the member as it is. However, you might still want to keep an eye on it in case that any small change in the member triggers brittleness.
2. More importantly, It's probably best to **focus on fixing problems that can be first be found at the member level with standard verification mode**. On the contrary, starting by looking for problems at the IA mode might cause you to spend effort on trouble that maybe isn't really there. Dafny's `--filter-symbol` is a great way to avoid the temptation of fishing for unnecessary trouble in IA mode.

### Keep in mind that, in IA mode, ABs are "chained". If one fails, the rest are removed.

To minimize noise, once an AB fails in IA mode, we ignore subsequent ABs in the same iteration. In such a case, the tables will show that some ABs have a smaller number of results than previous ones. In the extreme case of only 1 result remaining for a given AB, it will be tagged with the ❗️ icon.

## Some remedies to keep in mind

### Dafny standard library

Since version 4.4, Dafny includes a standard library that provides pre-verified code. This library includes helper lemmas for things like bitwise manipulation and non-linear arithmetic, which are typical verification pain points. Using this library instead of implementing one's own version might save much work; or, if a reimplementation is needed, can at least offer examples of how things were implemented by Dafny's own developers.

### Section on Verification debugging in Ref Manual

See [here](https://dafny.org/dafny/DafnyRef/DafnyRef.html#sec-brittle-verification).


# Hacking

Clone the repo to your system and install it in editable mode with pipx, poetry or similar tools.

```
cd darum
pipx install -e .
```

DARUM's plots use HoloViz, a Python library that configures and deploys Javascript libraries into a standalone, interactive HTML page. This complexity combined with Python's own packaging complexities mean that if you try to run DARUM's scripts directly, the generated HTML pages will fail to work.

Hence, **remember to use the user-facing scripts installed by tools like pipx and poetry instead**.
