Metadata-Version: 2.1
Name: flardl
Version: 0.0.7
Summary: Flardl
Home-page: https://github.com/hydrationdynamics/flardl
License: BSD-3-Clause
Keywords: downloads,asynchronous,high-performance,multi-dispatching,queueing,adaptive,federated
Author: Joel Berendzen
Author-email: joel@generisbio.com
Requires-Python: >=3.9,<3.13
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Dist: anyio (>=3.7.0)
Requires-Dist: attrs (>=22.2.0,<23.0.0)
Requires-Dist: httpx[http2] (>=0.23.3)
Requires-Dist: loguru (>=0.6.0)
Requires-Dist: tqdm (>=4.64.1)
Requires-Dist: trio (>=0.22.0,<0.23.0)
Requires-Dist: uvloop (>=0.17.0) ; sys_platform != "win32"
Project-URL: Changelog, https://github.com/hydrationdynamics/flardl/releases
Project-URL: Documentation, https://flardl.readthedocs.io
Project-URL: Repository, https://github.com/hydrationdynamics/flardl
Description-Content-Type: text/markdown

# Flardl - Adaptive Multi-Site Downloading of Lists

[![PyPI](https://img.shields.io/pypi/v/flardl.svg)][pypi status]
[![Python Version](https://img.shields.io/pypi/pyversions/flardl)][pypi status]
[![Docs](https://img.shields.io/readthedocs/flardl/latest.svg?label=Read%20the%20Docs)][read the docs]
[![Tests](https://github.com/hydrationdynamics/flardl/workflows/Tests/badge.svg)][tests]
[![Codecov](https://codecov.io/gh/hydrationdynamics/flardl/branch/main/graph/badge.svg)][codecov]
[![Repo](https://img.shields.io/github/last-commit/hydrationdynamics/flardl)][repo]
[![Downloads](https://pepy.tech/badge/flardl)][downloads]
[![Dlrate](https://img.shields.io/pypi/dm/flardl)][dlrate]
[![Codacy](https://app.codacy.com/project/badge/Grade/5d86ff69c31d4f8d98ace806a21270dd)][codacy]
[![Snyk Health](https://snyk.io/advisor/python/flardl/badge.svg)][snyk]

[pypi status]: https://pypi.org/project/flardl/
[read the docs]: https://flardl.readthedocs.io/
[tests]: https://github.com/hydrationdynamics/flardl/actions?workflow=Tests
[codecov]: https://app.codecov.io/gh/hydrationdynamics/flardl
[repo]: https://github.com/hydrationdynamics/flardl
[downloads]: https://pepy.tech/project/flardl
[dlrate]: https://github.com/hydrationdynamics/flardl
[codacy]: https://www.codacy.com/gh/hydrationdynamics/flardl?utm_source=github.com&utm_medium=referral&utm_content=hydrationdynamics/zeigen&utm_campaign=Badge_Grade
[snyk]: https://snyk.io/advisor/python/flardl

> Who would flardls bear?

[![logo](https://raw.githubusercontent.com/hydrationdynamics/flardl/main/docs/_static/flardl_bear.png)][logo license]

[logo license]: https://raw.githubusercontent.com/hydrationdynamics/flardl/main/LICENSE.logo.txt

## Features

_Flardl_ downloads lists of files from one or more servers
using a novel adaptive asynchronous approach. Download rates
are **typically more than 300X higher** than synchronous
utilities such as _curl_, while use of multiple servers
provides better robustness in the face of varying network
and server loads. Download rates depend on network bandwidth,
latencies, list length, file sizes, and HTTP protocol used,
but even a single server on another continent can usually
saturate a gigabit connection after about 50 files using
_flardl_.

## Fishing Theory

Collections of files generated by natural or human activity such
as natural-language writing, protein structure determination,
or genome sequencing tend to have **size distributions with
long tails**. For collections with long-tail distributions, one
finds many more examples of big files than of small files at
a given additive distance above or below the peak (modal) value.
Examples of analytical forms of long-tail distributions include
Zipf, power-law, and log-normal distributions. A real-world example
of a long-tail distribution is shown in the figure below, which
plots the file-size histogram for 1000 randomly sampled
CIF structure files from the [Protein Data Bank](https://rcsb.org) along with
a kernel-density estimate and fits to log-normal and normal
distributions.

![sizedist](https://raw.githubusercontent.com/hydrationdynamics/flardl/main/docs/_static/file_size_distribution.png)

The effects of
the big files in the long tail are frequently ignored in queuing
algorithms.

The nature of long-tail distributions is such that **mean values are nearly
worthless** because, unlike for normal distributions, means of runs drawn
from them grow larger with the size of the run. Because of the appreciable
likelihood of drawing a really large file to be downloaded from a long-tail
distribution, the total download time, and therefore the mean downloading rate,
depends strongly on how many large-size outliers are included in your sample.
If you are downloading multiple files simultaneously, the overall download
time may also depend strongly on whether a large file happens to occur at
the end of the list, causing an "overhang" of waiting for a single file.
Theories and algorithms based on overall times or mean rates won't
work very well on the long-tail distributions that often characterize
real collections.
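The growth of run means with run length can be seen in a small simulation. The sketch below is only an illustration, not part of _flardl_: it assumes an extreme heavy tail (a Pareto distribution with shape parameter 1, for which the mean does not converge), and the function names `run_mean` and `typical_run_mean` are hypothetical. Real collections may have milder tails, where the effect is weaker but the run-to-run scatter of the mean is still large.

```python
import random
import statistics


def run_mean(rng: random.Random, run_length: int, alpha: float = 1.0) -> float:
    """Mean of one run of Pareto-distributed "file sizes" (minimum size 1)."""
    return statistics.fmean(rng.paretovariate(alpha) for _ in range(run_length))


def typical_run_mean(run_length: int, n_runs: int = 101, seed: int = 0) -> float:
    """Median over many runs of the per-run mean, for a given run length."""
    rng = random.Random(seed)
    return statistics.median(run_mean(rng, run_length) for _ in range(n_runs))


if __name__ == "__main__":
    # Longer runs are more likely to include a huge outlier,
    # so the typical per-run mean keeps climbing with run length.
    for run_length in (10, 100, 1000):
        print(run_length, round(typical_run_mean(run_length), 2))
```

The median over runs is used to summarize the per-run means because the means themselves are too erratic to average.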

**Modal values are a good statistic for power-law distributions**, unlike
means. To put that another way, the average download time $\overline{t_{dl}}$
varies a lot between runs, but the _most-common_ download time
$\tilde{t}_{dl}$ can be pretty consistent. The mode of file lengths and
the mode of download bit rate are both quantities that are easy to
estimate for a collection and rarely change. If one happens to select
the biggest files for downloading, or if one happens to try downloading
a long collection at the same time that someone is watching a high-bit-rate
video on the same shared connection, then it's easy to adjust a bit
for just that time.
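One way to estimate a mode from a sample without choosing histogram bins is the half-sample mode, which repeatedly keeps the most tightly packed half of the sorted data. The sketch below is an assumption about how one might do this, not necessarily the estimator _flardl_ uses, and the name `half_sample_mode` is hypothetical.

```python
import math
import random
import statistics


def half_sample_mode(values: list[float]) -> float:
    """Estimate the mode by repeatedly keeping the densest half of the data."""
    data = sorted(values)
    if not data:
        raise ValueError("need at least one value")
    while len(data) > 2:
        half = math.ceil(len(data) / 2)
        # Find the window of `half` consecutive points with the smallest spread.
        start = min(
            range(len(data) - half + 1),
            key=lambda i: data[i + half - 1] - data[i],
        )
        data = data[start : start + half]
    # One or two points remain; report their midpoint.
    return statistics.fmean(data)


if __name__ == "__main__":
    # Synthetic log-normal "file sizes"; the parameters are arbitrary.
    rng = random.Random(0)
    sizes = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]
    print("mean:", statistics.fmean(sizes))
    print("mode estimate:", half_sample_mode(sizes))
```

For a long-tailed sample the mode estimate sits well below the mean, and it is not pulled around by a few very large values.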

Here I propose a heuristic called **adaptive-depth queuing**
that gives robust performance in real situations while being simple
enough to be easily understood and coded.

Even more than maximizing download rates, the highest priority must
be to **avoid black-listing by a server**. Most public-facing servers
have policies to recognize and defend against Denial-Of-Service (DOS)
attacks. The response to a DOS event, at the very least, causes the server to
dump your latest request, which is usually a minor nuisance
as it can be retried later. Far worse is
if the server responds by severely throttling further requests from your
IP address for hours or sometimes days.
Worst of all, your IP address can get the "death penalty" and be put
on a permanent blacklist that may require manual intervention for
removal. You generally don't know the trigger levels for these policies.
Worse still, it might not even be you. I have seen a practical class
of 20 students brought to a complete halt
by a server's 24-hour black-listing of the institution's IP address.

Simply launching a large number of requests and letting the
servers sort it out is a strategy that maximizes the chance
of black-listing for two reasons. First, this strategy results in
equal division of transfers without regard to varying transfer sizes or
server latencies. Second,

Since a single server can saturate a gigabit
connection given enough simultaneous downloads, a better
strategy is to **keep the total request-queue depth just high enough to
achieve saturation**. This goal can be achieved by launching a large
number of requests, up to some maximum permissible queue depth
$Q_{\rm max}$ (either by guess or by previous knowledge of individual
servers), during the server latency period when no transfers have been
completed. As transfers are completed, one can then calculate the
saturation bandwidth $B$ and
the total-over-all-servers depth at which saturation was achieved,
$Q_{\rm sat}$.

For those who are lucky enough to be on
a multi-gigabit connection, it's a good idea to limit the bandwidth
to something you know the set of servers you are using won't complain
about. It would be nice if one could query a server for an acceptable
request-queue depth that would guarantee no DOS response or other
server throttling, but I have not seen such a mechanism implemented.
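As a rough illustration of estimating $B$ and $Q_{\rm sat}$ from observations, the sketch below takes samples of aggregate throughput measured at different in-flight request depths. It is not _flardl_'s implementation: the names `DepthSample`, `estimate_saturation`, and `next_depth`, the 90% saturation threshold, and the one-request headroom are all assumptions made for the example.

```python
from typing import NamedTuple


class DepthSample(NamedTuple):
    """Aggregate throughput observed while `depth` requests were in flight."""

    depth: int
    bytes_per_second: float


def estimate_saturation(
    samples: list[DepthSample], fraction: float = 0.9
) -> tuple[float, int]:
    """Return (B, Q_sat) from throughput samples.

    B is taken as the highest throughput seen; Q_sat is the smallest depth
    whose throughput reached `fraction` of B (an assumed threshold).
    """
    if not samples:
        raise ValueError("need at least one sample")
    bandwidth = max(s.bytes_per_second for s in samples)
    q_sat = min(
        s.depth for s in samples if s.bytes_per_second >= fraction * bandwidth
    )
    return bandwidth, q_sat


def next_depth(q_sat: int, q_max: int, headroom: int = 1) -> int:
    """Queue depth to use next: just above saturation, never above Q_max."""
    return max(1, min(q_max, q_sat + headroom))


if __name__ == "__main__":
    # Made-up measurements: throughput levels off around 16 requests in flight.
    observed = [
        DepthSample(1, 10e6), DepthSample(2, 19e6), DepthSample(4, 37e6),
        DepthSample(8, 70e6), DepthSample(16, 110e6), DepthSample(32, 118e6),
    ]
    bandwidth, q_sat = estimate_saturation(observed)
    print(bandwidth, q_sat, next_depth(q_sat, q_max=100))
```

Holding the depth just above the estimated $Q_{\rm sat}$ keeps the connection busy without piling further requests onto the servers.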

## Requirements

_Flardl_ is tested under Python 3.11 on Linux, macOS, and
Windows, and under 3.9 and 3.10 on Linux. Under the hood,
_flardl_ relies on [httpx](https://www.python-httpx.org/) and is supported
on whatever platforms that library supports, for both HTTP/1.1 and HTTP/2.
HTTP/3 support could easily be added via
[aioquic](https://github.com/aiortc/aioquic) once enough servers are
running HTTP/3 to make that worthwhile.

## Installation

You can install _Flardl_ via [pip] from [PyPI]:

```console
$ pip install flardl
```

## Usage

_Flardl_ has no CLI and does no I/O other than downloading and writing
files. See test examples for usage.

## Contributing

Contributions are very welcome.
To learn more, see the [Contributor Guide].

## License

Distributed under the terms of the [BSD 3-clause license][license],
_Flardl_ is free and open source software.

## Issues

If you encounter any problems,
please [file an issue] along with a detailed description.

## Credits

_Flardl_ was written by Joel Berendzen.

[pypi]: https://pypi.org/
[file an issue]: https://github.com/hydrationdynamics/flardl/issues
[pip]: https://pip.pypa.io/

<!-- github-only -->

[license]: https://github.com/hydrationdynamics/flardl/blob/main/LICENSE
[contributor guide]: https://github.com/hydrationdynamics/flardl/blob/main/CONTRIBUTING.md

