Metadata-Version: 2.1
Name: flardl
Version: 0.0.8
Summary: Flardl
Home-page: https://github.com/hydrationdynamics/flardl
License: BSD-3-Clause
Keywords: downloads,asynchronous,high-performance,multi-dispatching,queueing,adaptive,elastic,adaptilastic,federated
Author: Joel Berendzen
Author-email: joel@generisbio.com
Requires-Python: >=3.9,<3.13
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Dist: anyio (>=3.7.0)
Requires-Dist: attrs (>=22.2.0,<23.0.0)
Requires-Dist: httpx[http2] (>=0.23.3)
Requires-Dist: loguru (>=0.6.0)
Requires-Dist: tqdm (>=4.64.1)
Requires-Dist: trio (>=0.22.0,<0.23.0)
Requires-Dist: uvloop (>=0.17.0) ; sys_platform != "win32"
Project-URL: Changelog, https://github.com/hydrationdynamics/flardl/releases
Project-URL: Documentation, https://flardl.readthedocs.io
Project-URL: Repository, https://github.com/hydrationdynamics/flardl
Description-Content-Type: text/markdown

# Flardl - Adaptive Multi-Site Downloading of Lists

[![PyPI](https://img.shields.io/pypi/v/flardl.svg)][pypi status]
[![Python Version](https://img.shields.io/pypi/pyversions/flardl)][pypi status]
[![Docs](https://img.shields.io/readthedocs/flardl/latest.svg?label=Read%20the%20Docs)][read the docs]
[![Tests](https://github.com/hydrationdynamics/flardl/workflows/Tests/badge.svg)][tests]
[![Codecov](https://codecov.io/gh/hydrationdynamics/flardl/branch/main/graph/badge.svg)][codecov]
[![Repo](https://img.shields.io/github/last-commit/hydrationdynamics/flardl)][repo]
[![Downloads](https://pepy.tech/badge/flardl)][downloads]
[![Dlrate](https://img.shields.io/pypi/dm/flardl)][dlrate]
[![Codacy](https://app.codacy.com/project/badge/Grade/5d86ff69c31d4f8d98ace806a21270dd)][codacy]
[![Snyk Health](https://snyk.io/advisor/python/flardl/badge.svg)][snyk]

[pypi status]: https://pypi.org/project/flardl/
[read the docs]: https://flardl.readthedocs.io/
[tests]: https://github.com/hydrationdynamics/flardl/actions?workflow=Tests
[codecov]: https://app.codecov.io/gh/hydrationdynamics/flardl
[repo]: https://github.com/hydrationdynamics/flardl
[downloads]: https://pepy.tech/project/flardl
[dlrate]: https://github.com/hydrationdynamics/flardl
[codacy]: https://www.codacy.com/gh/hydrationdynamics/flardl?utm_source=github.com&utm_medium=referral&utm_content=hydrationdynamics/zeigen&utm_campaign=Badge_Grade
[snyk]: https://snyk.io/advisor/python/flardl

> Who would flardls bear?

[![logo](https://raw.githubusercontent.com/hydrationdynamics/flardl/main/docs/_static/flardl_bear.png)][logo license]

[logo license]: https://raw.githubusercontent.com/hydrationdynamics/flardl/main/LICENSE.logo.txt

## Features

_Flardl_ downloads lists of files from one or more servers
using a novel adaptive asynchronous approach. Download rates
are **typically more than 300X higher** than synchronous
utilities such as _curl_, while use of multiple servers
provides better robustness in the face of varying network
and server loads. Download rates depend on network bandwidth,
latencies, list length, file sizes, and HTTP protocol used,
but even a single server on another continent can usually
saturate a gigabit connection after about 50 files using
_flardl_.

## Fishing Theory

Collections of files generated by natural or human activity such
as natural-language writing, protein structure determination,
or genome sequencing tend to have **size distributions with
long tails**. For collections with long-tail distributions, one
finds many more examples of big files than of small files at
a given additive distance above or below the peak (modal) value.
Examples of analytical forms of long-tail distributions include
Zipf, power-law, and log-normal distributions. A real-world example
of a long-tail distribution is shown in the figure below, which
plots the file-size histogram for 1000 randomly-sampled
CIF structure files from the [Protein Data Bank](https://rcsb.org)
along with a kernel-density estimate and fits to log-normal and
normal distributions.

![sizedist](https://raw.githubusercontent.com/hydrationdynamics/flardl/main/docs/_static/file_size_distribution.png)
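
The asymmetry around the peak can be checked numerically. The sketch below
is only an illustration: it draws synthetic sizes from a log-normal
distribution with made-up parameters (`MU` and `SIGMA` are assumptions, not
fits to the Protein Data Bank sample), and the helper names
`lognormal_landmarks` and `tail_counts` are hypothetical.

```python
import math
import random


def lognormal_landmarks(mu, sigma):
    """Return the analytic (mode, median, mean) of a log-normal distribution."""
    return math.exp(mu - sigma**2), math.exp(mu), math.exp(mu + sigma**2 / 2)


def tail_counts(sizes, peak, distance):
    """Count samples lying more than `distance` below and above `peak`."""
    below = sum(1 for s in sizes if s < peak - distance)
    above = sum(1 for s in sizes if s > peak + distance)
    return below, above


MU, SIGMA = 0.0, 1.0  # illustrative parameters, not fitted to real data
rng = random.Random(42)  # fixed seed so that the run is repeatable
sizes = [rng.lognormvariate(MU, SIGMA) for _ in range(1000)]

mode, median, mean = lognormal_landmarks(MU, SIGMA)
below, above = tail_counts(sizes, mode, distance=0.3)
print(f"mode={mode:.2f} median={median:.2f} mean={mean:.2f}")
print(f"{below} samples lie more than 0.3 below the peak, {above} above it")
```

For any log-normal distribution the mode lies below the median, which lies
below the mean, so the mean is pulled toward the tail rather than sitting
at the peak.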

The big files in the long tail have large effects on overall
statistics, effects that are frequently ignored in the queueing
literature and by many queueing algorithms, which treat collections
as normal-ish. The biggest single issue, which can be seen in
the difference between normal-distribution fits to a
randomly-selected 5% and the full 1000 points in the figure above,
is that **mean values are neither stable nor characteristic of the
distribution**. Unlike means of runs drawn from normal distributions,
means of runs drawn from long-tail distributions grow larger with
the size of the run. Because of the appreciable
likelihood of drawing a really large file to be downloaded, the total
download time $t_{\rm tot}$ and therefore the mean per-file download rate
$\overline{k_{\rm file}}$ both depend strongly on how many big-file outliers
are included in your sample. If you are downloading multiple files
simultaneously, the overall download time may also depend strongly on
where in the list the large files happen to occur, because those at
the end can cause an "overhang" of a single stream waiting for that file.

While the mean per-file download rate varies a lot between runs, the
_most-common_ per-file download rate $\tilde{k}_{\rm file}$ can be more
consistent, at least on the timescale of days. If you are downloading a
long list of files at the same time that someone else on your LAN
is watching a video, then you may not achieve the same saturation
bit rate $b_{\rm sat}$ as when you're the only network user. The modal
file size of a collection can be quite stable over time, so we have hope
that if we formulate download times in terms of the modal file size
and that day's estimated server latencies and achievable download
bit rate, the situation might be more tractable still.

Even more than maximizing download rates, the highest priority must
be to **avoid black-listing by a server**. Most public-facing servers
have policies to recognize and defend against Denial-of-Service (DoS)
attacks. The response to a DoS event, at the very least, causes the
server to dump your latest request, which is usually a minor nuisance
as it can be retried later. Far worse is if the server responds by
severely throttling further requests from your IP address for hours
or sometimes days. Worst of all, your IP address can get the "death
penalty" and be put on a permanent blacklist that may require manual
intervention for removal. You generally don't know the trigger
levels for these policies. Blacklisting
might not even be your personal fault, but a collective problem.
I have seen a practical class of 20 students brought to a complete
halt by a server's 24-hour black-listing of the institution's
public IP address.

An analogy might help us here. Let's say you are a person who
enjoys keeping track of statistics, and you decide to try
fishing. At first, you have a single fishing rod and you go
fishing at a series of local lakes where your catch consists
of small bony fishes called "crappies". Your records reveal
that while the rate of catching fishes can vary from day to
day--fish might be hungry or not--the average size of your
catch is pretty stable. Bigger ponds tend to have bigger fish
in them, and it might take slightly longer to reel in a bigger
crappie than a small one, but big and small average out for
a given pond.

Then one day you decide you love fishing so much, you drive
to the coast and charter a fishing boat. On that boat,
you can set out as many lines as you want (up to some limit)
and fish in parallel. At first, you seem to be catching the
ocean-going equivalent of crappies, small bony fishes. But
then you hook a small shark, which not only takes a lot of
your time and attention to reel in, but which totally skews
your estimate of average weight of your catch. You know that
if you can catch a small shark, then maybe if you fish for
long enough you might catch a big shark, or even a small whale.
But you and your crew can only effectively reel in so
many hooked lines at once. Putting out more lines than
that effective limit of hooked- plus waiting-to-be-hooked
lines only results in fishes waiting on the line, when they
may break the line or get partly eaten before you can reel
them in.

Here I propose and implement a method called **adaptilastic
queueing** that gives robust performance in real situations
while being simple enough to be easily understood and coded.
The basis of adaptilastic queueing is keeping the total
request-queue depth just high enough to achieve saturation.
The method launches a large number of requests at the most-likely
per-file rate at saturation, up to some maximum permissible
per-server queue depth $D_{i,{\rm max}}$ (either by guess or
by previous knowledge of individual servers) during the period
before any transfers have completed. As transfers are completed,
the method estimates the total-over-all-servers depth at which
saturation was achieved, and updates its estimate of the
achievable line bit rate and the most-likely per-file return
rate on a per-server basis as the bases for managing future
requests. Servers that return modal-length files (crappies)
more quickly thus are given a better chance at nabbing an
open queue slot without penalizing a server that happened
to draw a big download (whale).
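
The slot-assignment idea in the last sentence can be sketched as follows.
This is not _flardl_'s implementation or API: `ServerState` and `pick_server`
are hypothetical names, the median of per-file times is used as a simple
stand-in for the most-likely (modal) value, and the total depth limit is
passed in rather than estimated from observed saturation.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List, Optional


@dataclass
class ServerState:
    """Hypothetical per-server bookkeeping; not part of flardl's API."""

    name: str
    max_depth: int  # assumed per-server limit on in-flight requests
    in_flight: int = 0
    file_times: List[float] = field(default_factory=list)  # seconds per file

    def typical_time(self) -> float:
        # The median stands in for the modal per-file time: a single whale
        # barely moves it. Servers with no history sort first.
        return median(self.file_times) if self.file_times else 0.0


def pick_server(
    servers: List[ServerState], total_depth: int
) -> Optional[ServerState]:
    """Choose the server that gets the next open queue slot, or None if full."""
    if sum(s.in_flight for s in servers) >= total_depth:
        return None  # already at the assumed saturation depth
    candidates = [s for s in servers if s.in_flight < s.max_depth]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.typical_time())


# "fast" returns typical files quickly but happened to draw one whale.
fast = ServerState("fast", max_depth=4, file_times=[0.5, 0.5, 0.5, 30.0])
slow = ServerState("slow", max_depth=4, file_times=[1.0, 1.0, 1.0, 1.0])
chosen = pick_server([fast, slow], total_depth=6)
print(chosen.name)
```

In this example `fast` still gets the slot: its mean per-file time is
7.875 s because of the whale, but its median is 0.5 s, which is better than
the 1.0 s typical time of `slow`. Ranking by the mean would have penalized
`fast` for a file size it had no control over.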

## Requirements

_Flardl_ is tested under Python 3.11 on Linux, macOS, and
Windows, and under 3.9 and 3.10 on Linux. Under the hood,
_flardl_ relies on [httpx](https://www.python-httpx.org/) and is supported
on whatever platforms that library works under, for both HTTP/1.1 and HTTP/2.
HTTP/3 support could easily be added via
[aioquic](https://github.com/aiortc/aioquic) once enough servers are
running HTTP/3 to make that worthwhile.

## Installation

You can install _Flardl_ via [pip] from [PyPI]:

```console
$ pip install flardl
```

## Usage

_Flardl_ has no CLI and does no I/O other than downloading and writing
files. See test examples for usage.

## Contributing

Contributions are very welcome.
To learn more, see the [Contributor Guide].

## License

Distributed under the terms of the [BSD 3-clause license][license],
_Flardl_ is free and open source software.

## Issues

If you encounter any problems,
please [file an issue] along with a detailed description.

## Credits

_Flardl_ was written by Joel Berendzen.

[pypi]: https://pypi.org/
[file an issue]: https://github.com/hydrationdynamics/flardl/issues
[pip]: https://pip.pypa.io/

<!-- github-only -->

[license]: https://github.com/hydrationdynamics/flardl/blob/main/LICENSE
[contributor guide]: https://github.com/hydrationdynamics/flardl/blob/main/CONTRIBUTING.md

