What's new in 1.6.1 (in development)
====================================

* Added specialized POPCNT implementations for all 8-byte-multiple
fingerprint lengths up to 1024 bytes, plus a faster implementation for
8-byte-multiple lengths beyond that. Previously there were only
specialized implementations for 24-, 64-, 112-, 128-, and 256-byte
fingerprints, which are the most common in cheminformatics.

In one benchmark, small fingerprints (<256 bits) are about 20% faster,
medium fingerprints (256 to 1024 bits) are about 10% faster, and
larger fingerprints are a few percent faster.

* Added two new FingerprintArena methods. ``sample()`` randomly selects
a subset of the fingerprints and returns them in a new arena.
``train_test_split()`` returns two randomly selected and disjoint
subsets of the arena, typically used as a training set and a test set.
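
The idea behind the two methods can be sketched in plain Python on lists of
record indices. This only illustrates "a random subset" and "two disjoint
random subsets". The helper names and parameters are hypothetical and are
not the chemfp API.

```python
import random

def sample_indices(n, num_samples, rng):
    # Pick num_samples distinct positions out of range(n).
    return sorted(rng.sample(range(n), num_samples))

def train_test_split_indices(n, train_size, test_size, rng):
    # Draw train_size + test_size distinct positions in one go, then cut
    # the draw in two so the subsets cannot overlap.
    chosen = rng.sample(range(n), train_size + test_size)
    return sorted(chosen[:train_size]), sorted(chosen[train_size:])

rng = random.Random(42)  # fixed seed so the example is repeatable
subset = sample_indices(1000, 100, rng)
train, test = train_test_split_indices(1000, 80, 20, rng)
```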

BUG FIX: Fixed bug in fpcat where using ``--reorder`` would write the
FPS header twice.


What's new in 1.6 (24 June 2020)
=================================

* Added specialized fixed-length POPCNT implementations for
512-, 881-, and 1024-bit fingerprints, plus an implementation for n*1024
bits (2<=n<=8). (Backported from chemfp 3.)

* Replaced the fast integer-based rejection test with a faster
and exact popcount exclusion test. (Backported from chemfp 3.)

The overall performance is about 10-20% faster for common fingerprint
sizes. (166-bit is about 15% faster, 881-bit is about 20% faster,
1024-bit is about 15% faster, and 2048-bit is about 10% faster.)

* Implemented ``--query`` support for simsearch so you can search for a
SMILES (or other format) on the command-line. It uses the target
fingerprint type to convert the record into a fingerprint.

* Added Tversky functions to chemfp.bitops. Chemfp 1.6 does not
support Tversky search, but you may use the Tversky implementation to
validate your own code.
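
If you want an independent cross-check, here is a minimal pure-Python
sketch of the Tversky index on byte-string fingerprints. It does not use
chemfp.bitops, the function names are my own, and returning 0.0 when the
denominator is zero is an assumption rather than documented chemfp
behavior.

```python
def popcount(fp):
    # Number of on-bits in a byte string.
    return sum(bin(byte).count("1") for byte in bytearray(fp))

def tversky(fp1, fp2, alpha=1.0, beta=1.0):
    a = popcount(fp1)
    b = popcount(fp2)
    # c = number of bits set in both fingerprints
    c = sum(bin(x & y).count("1")
            for x, y in zip(bytearray(fp1), bytearray(fp2)))
    denominator = alpha * (a - c) + beta * (b - c) + c
    if denominator == 0:
        return 0.0  # assumption: undefined scores are reported as 0.0
    return float(c) / denominator

fp1 = b"\x0f"  # bits 0-3
fp2 = b"\x3c"  # bits 2-5
tanimoto_score = tversky(fp1, fp2)        # alpha = beta = 1 is Tanimoto
dice_score = tversky(fp1, fp2, 0.5, 0.5)  # alpha = beta = 0.5 is Dice
```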

* Improved error handling for oe2fps, ob2fps, and rdkit2fps when the
underlying toolkit is not installed.

* Increased the FPS reader block size to 100,000 bytes, which is about
10x larger. This speeds up simsearch by about 15%. (The size was
chosen in 2010 to optimize search times for hardware from that era.)

* BUG FIX: Fixed bug which prevented reading FPS files using the
Windows newline convention.

What's new in 1.5 (16 August 2018)
==================================

BUG FIX: the k-nearest symmetric Tanimoto search code contained a flaw
when there was more than one fingerprint with no bits set and the
threshold was 0.0. Since all of the scores will be 0.0, the code uses
the first k fingerprints as the matches. However, it put all of the
hits into the first search result (item 0), rather than the
corresponding result for each given query. This also opened up a race
condition for the OpenMP implementation, which could cause chemfp to
crash.

* The threshold search used a fast integer-based rejection test before
computing the exact score. The rejection test is now included in the
count and k-nearest algorithms, making them about 10% faster.

* Unindexed search (which occurs when the fingerprints are not in
popcount order) now uses the fast popcount implementations rather than
the generic byte-based one. The result is about 5x faster.

* Changed the simsearch --times option for more fine-grained
reporting. The output (sent to stderr) now looks like:

  open 0.01 read 0.08 search 0.10 output 0.27 total 0.46

where 'open' is the time to open the file and read the metadata,
'read' is the time spent reading the file, 'search' is the time for
the actual search, 'output' is the time to write the search results,
and 'total' is the total time from when the file is opened to when the
last output is written.
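
If you want to collect these timings from a script, a small parser is
enough. This is a sketch that assumes the line keeps the alternating
"name value" layout shown above. The function name is hypothetical.

```python
def parse_times(line):
    # "open 0.01 read 0.08 ..." -> {"open": 0.01, "read": 0.08, ...}
    fields = line.split()
    return dict((name, float(value))
                for name, value in zip(fields[0::2], fields[1::2]))

times = parse_times("open 0.01 read 0.08 search 0.10 output 0.27 total 0.46")
# In this sample line the four phases add up to the reported total.
accounted = sum(times[name] for name in ("open", "read", "search", "output"))
```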

* Added SearchResult.format_ids_and_scores_as_bytes() to improve the
simsearch output performance when there are many hits. Turns out the
limiting factor in that case is not the search time but output
formatting. The old code uses Python calls to convert each score to a
double. The new code pushes that code into C. My benchmark used a
k=all --NxN search of ~2,000 PubChem fingerprints to generate about 4M
scores. The output time went from 15.60s to 5.62s. (The search time
was only 0.11s on my laptop.)

* There is a new option, "report-algorithm" with the corresponding
environment variable CHEMFP_REPORT_ALGORITHM. The default does
nothing. Set it to "1" to have chemfp print a description of the
search algorithm used, including any specialization, and the number of
threads. For example:

  chemfp search using threshold Tanimoto arena, index, single threaded (generic)
  chemfp search using knearest Tanimoto arena symmetric, OpenMP (generic), 8 threads

This feature is only available if chemfp is compiled with OpenMP
support.

* Better error handling in simsearch so that an I/O error prints an
error message and exits rather than giving a full stack trace.

* Chemfp 3.3 added the options "use-specialized-algorithms" and
"num-column-threads", and the corresponding environment variables
CHEMFP_USE_SPECIALIZED_ALGORITHMS and CHEMFP_NUM_COLUMN_THREADS. These
are supported for future-compatibility, but will always be 0 and 1,
respectively.

* Don't warn about the CHEMFP_LICENSE or CHEMFP_LICENSE_MANAGER
variables. These are used by chemfp versions which require a license key.

* Fixed bugs in bitops.get_option(). The C API returned an error value
and raised an exception on error, and the Python API forgot to return
the value.

* The setup code now recognizes if you are using clang and will set
the OpenMP compiler flags.

What's new in 1.4 (19 March 2018)
=================================

This version mostly contains bug fixes and internal improvements. The
biggest additions are the fpcat command-line program, support for Dave
Cosgrove's 'flush' fingerprint file format, and support for
'fromAtoms' in some of the RDKit fingerprints.

The configuration has changed to use setuptools.

Previously the command-line programs were installed as small
scripts. Now they are created and installed using the
"console_scripts" entry_point as part of the install process. This is
more in line with the modern way of installing command-line tools for
Python.

If these scripts are no longer installed correctly, please let me
know.

The :ref:`fpcat <fpcat>` command-line tool was back-ported from
chemfp 3.1. It can be used to merge a set of FPS files together, and
to convert to/from the flush file format. This version does not
support the FPB file format.

If you have installed the `chemfp_converters package
<https://pypi.python.org/pypi/chemfp-converters/>`_ then chemfp will
use it to read and write fingerprint files in flush format. It can be
used as output from the *2fps programs, as input and output to fpcat,
and as query input to simsearch.

Added "fromAtoms" support for the RDKit hash, torsion, Morgan, and
pair fingerprints. This is primarily useful if you want to generate
the circular environment around specific atoms of a single molecule,
and you know the atom indices. If you pass in multiple molecules then
the same indices will be used for all of them. Out-of-range values are
ignored.

The command-line option is "--from-atoms", which takes a
comma-separated list of non-negative integer atom indices. For
example:

        --from-atoms 0
        --from-atoms 29,30

The corresponding fingerprint type strings have also been updated. If
fromAtoms is specified then the string "fromAtoms=i,j,k,..." is added
to the string. If it is not specified then the fromAtoms term is not
present, in order to maintain compatibility with older type
strings. (The philosophy is that two fingerprint types are equivalent
if and only if their type strings are equivalent.)

The --from-atoms option is only useful when there's a single query and
when you have some other mechanism to determine which subset of the
atoms to use. For example, you might parse a SMILES, use a SMARTS
pattern to find the subset, get the indices of the SMARTS match, and
pass the SMILES and indices to rdkit2fps to generate the fingerprint for
that substructure.

Be aware that the union of the fingerprint for --from-atoms X and the
fingerprint for --from-atoms Y might not be equal to the fingerprint
for --from-atoms X,Y. However, if a bit is present in the union of the
X and Y fingerprints then it will be present in the X,Y fingerprint.

Why?  The fingerprint implementation first generates a sparse count
fingerprint, then converts that to a bitstring fingerprint. The
conversion is affected by the feature count. If a feature is present
in both X and Y then the X,Y fingerprint may have additional bits set
over the individual fingerprints.
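
A toy model makes the effect easier to see. This is not RDKit's algorithm.
The atom features, the cap, and the rule "a feature seen n times sets one
bit per level 1..n" are all made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: the feature seen around each atom index.
ATOM_FEATURES = {0: "ring", 1: "ring", 2: "amine"}
MAX_LEVEL = 2  # toy cap on how many count levels get their own bit

def toy_fingerprint(from_atoms):
    # Step 1: sparse count fingerprint (feature -> count).
    # Out-of-range atom indices are ignored.
    counts = Counter(ATOM_FEATURES[i] for i in from_atoms
                     if i in ATOM_FEATURES)
    # Step 2: a feature seen n times sets one bit per level 1..n.
    bits = set()
    for feature, count in counts.items():
        for level in range(1, min(count, MAX_LEVEL) + 1):
            bits.add((feature, level))
    return bits

fp_x = toy_fingerprint([0])
fp_y = toy_fingerprint([1])
fp_xy = toy_fingerprint([0, 1])
union = fp_x | fp_y
```

Here the union holds only the level-1 "ring" bit, while the combined
fingerprint also has the level-2 bit, so the union is a strict subset.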

The ob2fps, rdkit2fps, and oe2fps programs now also include the chemfp
version information on the software line of the metadata. This
improves data provenance because the fingerprint output might be
affected by a bug in chemfp.

The Metadata 'date' is now always a datetime instance, and not a
string. If you pass a string into the Metadata constructor, like
Metadata(date="datestr"), then the date will be converted to a
datetime instance. Use "metadata.datestamp" to get the ISO string
representation of the Metadata date.

Fixed a bug where a k=0 similarity search using an FPS file as the
targets caused a segfault. The code assumed that k would be at least
1. With the fix, a k=0 search will read the entire file, checking for
format errors, and return no hits.

Fixed a bug where only the first ~100 queries against an FPS
target search would return the correct ids. (Forgot to include the
block offset when extracting the ids.)

Fixed a bug where if the query fingerprint had 1 bit set and the
threshold was 0.0 then the sublinear bounds for the Tanimoto searches
(used when there is a popcount index) failed to check targets with 0
bits set.
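
For reference, here is a sketch of the usual popcount bounds for a Tanimoto
threshold search. It relies on the fact that the Tanimoto score can be at
most min(a, b) / max(a, b) for popcounts a and b. The function name is
hypothetical and this is not chemfp's C code. The threshold is passed as an
exact fraction to avoid rounding at the boundary.

```python
from fractions import Fraction

def target_popcount_range(query_popcount, threshold, num_bits):
    # A target with popcount b can only reach threshold t if
    # a*t <= b <= a/t, where a is the query popcount.
    t = Fraction(threshold)
    if t <= 0:
        # Every target scores >= 0.0, including those with no bits set.
        return 0, num_bits
    p, q = t.numerator, t.denominator
    low = (query_popcount * p + q - 1) // q          # ceil(a * t)
    high = min(num_bits, (query_popcount * q) // p)  # floor(a / t)
    return low, high

zero_threshold_range = target_popcount_range(1, 0, 166)
typical_range = target_popcount_range(50, Fraction(7, 10), 166)
```

With a threshold of 0.0 the range must start at popcount 0, which is the
case the bug missed.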

What's new in 1.3 (30 Aug 2017)
=================================

Added a help description for FP2, FP3, FP4, and MACCS in ob2fps.

Updated the #software line to include "chemfp/1.3" in addition to the
toolkit information. 

Backported search.contains_fp() and search.contains_arena() from
chemfp 2.1.

Dropped support for the old OE Binary format.

Added --version to the command-line tools. (Suggested by Noel
O'Boyle.)

Removed chemfp.Watcher. It was added because the C compiler for one of
my customers was a couple of years old and didn't handle OpenMP. I
wrote a clustering version for them which used threads instead, and
used the Watcher code to handle ^C. Now everyone has OpenMP so
this isn't needed.



What's new in 1.3a1 (30 Aug 2017)
=================================

This version of chemfp only supports Python 2.7. It may work on Python
2.6 though that is not supported. Chemfp will not work on Python 2.5.
For Python 3.5+ support, contact me to buy a copy of chemfp-3.1.

WARNING: Changed the default nBitsPerHash for RDKitFingerprint from 4
to 2 to match the RDKit default. 

RDKit changed its hash and MACCS fingerprint implementation a few
years ago. Updated chemfp to identify newer implementations as
"RDKit-Fingerprint/2" and "RDKit-MACCS166/2".

Added support for RDKit-Pattern and RDKit-Avalon fingerprints. The new
rdkit2fps command-line options are "--pattern" and "--avalon".

RDKit-Pattern/1 is from very old versions of RDKit. RDKit-Pattern/2 is
up to 2016, RDKit-Pattern/3 is from 2017.3 and RDKit-Pattern/4 will be
in 2017.9.

Added a definition for key 44 to the 'RDMACCS'. It was missing in
version 1. Chemfp supports both definitions. The rdkit2fps option
"--rdmaccs" uses the most recent version. To be specific, specify
either "--rdmaccs/1" or "--rdmaccs/2".

Removed support for OEGraphSim v1.0, which OpenEye replaced in 2010.

New OpenEye-MACCS166/3 fingerprint type, to match OEGraphSim v2.2.0.

Improved the FPS reader performance. Simsearch in '--scan' mode is
about 40% faster and '--memory' load time is about 10%
faster. chemfp.load_fingerprints() is about 15% faster. (Measured as
(old_time-new_time)/old_time.)
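
As a worked example of that measure, with made-up timings: a run that
drops from 10.0s to 6.0s is "40% faster", which corresponds to roughly a
1.67x speedup ratio.

```python
def fraction_faster(old_time, new_time):
    # The measure used above: (old_time - new_time) / old_time
    return (old_time - new_time) / float(old_time)

def speedup_ratio(old_time, new_time):
    # The other common way to report the same change.
    return old_time / float(new_time)

faster = fraction_faster(10.0, 6.0)
ratio = speedup_ratio(10.0, 6.0)
```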

Improved the similarity search performance of the 166-bit MACCS keys
by about 40%.

The k-nearest arena search (used in NxM searches) is now parallelized.

Added chemfp.search.contains_fp() and chemfp.search.contains_arena()
for fingerprint screening. The first finds the target fingerprints
which contain all of the on-bits of the query fingerprint, and the
second does the same for a query arena.
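
The screening condition itself is a bitwise subset test. Here is a minimal
pure-Python sketch of that test for a single query/target pair. The helper
name is hypothetical and it is not the chemfp.search API, which works on
arenas.

```python
def is_contained_in(query_fp, target_fp):
    # True when every on-bit of the query is also set in the target,
    # that is, (query & target) == query for every byte.
    return all((q & t) == q
               for q, t in zip(bytearray(query_fp), bytearray(target_fp)))

query = b"\x05"                          # bits 0 and 2
targets = [b"\x07", b"\x04", b"\x0d"]
matches = [i for i, target in enumerate(targets)
           if is_contained_in(query, target)]
```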

SearchResults now implements the ".to_csr()" method, which returns a
SciPy sparse row matrix that can be passed to scikit-learn for
clustering. This method requires both SciPy and NumPy. It has also
gained a '.shape' attribute, a 2-element tuple where shape[0] is the
number of rows (i.e. the number of queries) and shape[1] is the number
of targets.
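
If the CSR layout is unfamiliar, this sketch builds the three underlying
arrays from per-query hit lists in plain Python. The input layout and the
function name are assumptions for illustration and do not show how chemfp
stores results internally.

```python
def to_csr_arrays(rows):
    # rows[i] is the list of (target_index, score) hits for query i.
    data, indices, indptr = [], [], [0]
    for hits in rows:
        for target_index, score in hits:
            indices.append(target_index)
            data.append(score)
        # indptr[i]:indptr[i+1] is the slice of hits for row i.
        indptr.append(len(indices))
    return data, indices, indptr

rows = [[(1, 0.9), (3, 0.7)], [], [(0, 0.8)]]
data, indices, indptr = to_csr_arrays(rows)
```

These are the same three arrays that SciPy's
csr_matrix((data, indices, indptr), shape=...) constructor accepts.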

Backported the FPS reader and writer code from chemfp-3.0 as well
as support for io.Location.

Renamed chemfp.read_structure_fingerprints() to
chemfp.read_molecule_fingerprints(). The old API is still valid, but
the first call to it will generate a warning message.

Fix: Some of the Tanimoto calculations stored intermediate values as a
double. Some of the values, like 0.6, cannot be represented exactly as
a double. As a result, some Tanimoto scores were off by 1 ulp (the
last bit in the double). They are now exactly correct.
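
The representation problem is easy to demonstrate. The sketch below also
shows one way to compare a score against a rational threshold without any
rounding, using integer cross-multiplication. It illustrates the issue and
is not necessarily how chemfp fixed it. The function name is hypothetical.

```python
from fractions import Fraction

# The double nearest to 0.6 is not exactly 3/5.
stored = Fraction(0.6)
exact = Fraction(3, 5)

def tanimoto_at_least(intersect, popcount_a, popcount_b, num, den):
    # Tests c / (a + b - c) >= num / den using only integers.
    union = popcount_a + popcount_b - intersect
    return intersect * den >= num * union

# 3 shared bits, 4 bits in each fingerprint: the score is exactly 3/5.
on_threshold = tanimoto_at_least(3, 4, 4, 3, 5)
```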

Fix: if the query fingerprint had 1 bit set and the threshold was 0.0
then the sublinear bounds for the Tanimoto searches (used when there
is a popcount index) failed to check targets with 0 bits set.

Fix: If a query had 0 bits then the k-nearest code for a symmetric
arena returned 0 matches, even when the threshold was 0.0. It now
returns the first k targets.

Fix: There was a bug in the sublinear range checks. It should only
occur in the symmetric searches when the batch_size is larger than the
number of records with a popcount just outside of the expected range.

Changed rdkit2fps, ob2fps, and oe2fps so the default --errors is
'ignore' instead of 'strict'. This is based on a lot of feedback
asking how to make those tools ignore errors. I decided that silent
errors (at the chemfp level, but toolkits still send warnings and
errors to stderr) were simply not the right thing for those tools.

Missing identifiers in oe2fps, rdkit2fps or ob2fps will always be
logged to stderr, even if --errors is ignore. If --errors is strict
then missing identifiers will cause processing to exit.

The --with-* and --without-* configuration options (for OpenMP and
SSSE3 support) can now be specified via environment variables. In
the following, the value "0" means disable (same as "--without-*") and
"1" means enable (same as "--with-*"):
  CHEMFP_OPENMP -  compile for OpenMP (default: "1")
  CHEMFP_SSSE3  -  compile SSSE3 popcount support (default: "1")
  CHEMFP_AVX2   -  compile AVX2 popcount support (default: "0")

This makes it easier to do a "pip install" directly on the tar.gz file
or use chemfp under an automated testing system like tox, even when
the default options are not appropriate. For example, the default C
compiler on Mac OS X doesn't support OpenMP. If you want OpenMP
support then install gcc and specify it with the "CC" environment
variable. If you don't want OpenMP support then you can do:

  CHEMFP_OPENMP=0 pip install chemfp-1.3a1.tar.gz

Backported bitops functions from chemfp-3.0. The new functions are:
  hex_contains, hex_contains_bit, hex_intersect, hex_union, hex_difference,
  byte_hex_tanimoto, byte_contains_bit,
  byte_to_bitlist, byte_from_bitlist,
  hex_to_bitlist, hex_from_bitlist,
  hex_encode, hex_encode_as_bytes, hex_decode

The new hex encode/decode functions are important if you want to write
code which is forward compatible for Python 3, where s.encode("hex")
is no longer supported.
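
For comparison, the standard library's binascii module, which exists in
both Python 2.7 and Python 3, gives the same kind of portable hex
conversion. This sketch uses only the standard library and hypothetical
helper names. It does not show the signatures of the chemfp.bitops
functions.

```python
import binascii

def to_hex(fp_bytes):
    # Byte string -> lowercase hex text.
    return binascii.hexlify(fp_bytes).decode("ascii")

def from_hex(hex_text):
    # Hex text -> byte string.
    return binascii.unhexlify(hex_text.encode("ascii"))

hex_fp = to_hex(b"\x01\xff")
round_trip = from_hex(hex_fp)
```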

What's new in 1.1p1 (12 Feb 2013)
=================================

Fixed memory leaks caused by using Py_BuildValue with an "O" instead
of an "N". This caused the reference count on the return arena strings
to be too high, so they were never garbage collected. This should only
affect people who made and destroyed many arenas.

Removed unneeded lock in threshold arena searches. This should give
better parallelism when there are many hits (eg, with a low threshold)
when there are multiple threads.

What's new in 1.1 (5 Feb 2013)
==============================

New methods to look up a record, record index, or fingerprint given
the record identifier. These are:

  arena.get_by_id(id)
  arena.get_index_by_id(id)
  arena.get_fingerprint_by_id(id)

Added or updated all of the docstrings for the public API.

Documented that the search methods on the FingerprintArena instance
are deprecated - use chemfp.search instead. These will generate
a warning message in the next release and after that will be removed.

Renamed arena.copy_subset() to arena.copy().

Changed the arena.copy() method so that by default it reorders the
fingerprints if indices are specified; otherwise the (sub)arena
ordering is preserved.

Added a cache for getattr(subarena, "ids"). Otherwise subarena.ids[i]
took O(len(subarena.ids)) time instead of O(1) time.

Renamed chemfp.readers to chemfp.fps_io and decoders.py to
encodings.py. These were not part of the public API but may be in
upcoming versions, so it's best to change them now.

Detect and raise an exception if the metadata size doesn't match the
fingerprint size passed to the arena builder. Thanks to Greg Landrum
for spotting this bug!

What's new in 1.1b7 (patch release)
===================================

Fixed a problem when the code is compiled with an old compiler which
doesn't understand the POPCNT inline assembly and is then run on a
machine which implements POPCNT.

What's new in 1.1b6 (5 Dec 2012)
================================

Added methods to count the number of hits in the search results which
are within a given score range, and to compute the cumulative score
(also called the "raw score") of those hits. These are:

   SearchResults.count_all(min_score=None, max_score=None, interval="[]")
   SearchResults.cumulative_score_all(min_score=None, max_score=None, interval="[]")
   SearchResult.count(min_score=None, max_score=None, interval="[]")
   SearchResult.cumulative_score(min_score=None, max_score=None, interval="[]")

Arenas now have a "copy_subset(indices, reorder=True)" method. This
selects a subset of the entries in the arena and makes a new arena.
Here's how to select a random subset of 100 entries from an arena:

  import random
  subset_indices = random.sample(xrange(len(arena)), 100)
  new_arena = arena.copy_subset(subset_indices)

(NOTE: 'copy_subset' was renamed 'copy' for the final 1.1 release.)

Fixed a bug in the Open Babel patterns FPS output: the 'software' line
needed a space between the Open Babel and chemfp versions.


What's new in 1.1b5 (23 April 2012)
===================================

The command-line search tools support an --NxN option for when the
queries and targets are the same. (The search results do not include
the diagonal term.)  The implementation takes advantage of the symmetry
to get almost a two-fold performance increase. This option assumes
that everything will fit into memory.
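
The symmetry trick can be sketched in a few lines of pure Python: score
only the upper triangle and record each hit for both rows. This is an
illustration with hypothetical function names, not the chemfp
implementation.

```python
def tanimoto(fp1, fp2):
    pairs = list(zip(bytearray(fp1), bytearray(fp2)))
    intersect = sum(bin(x & y).count("1") for x, y in pairs)
    union = sum(bin(x | y).count("1") for x, y in pairs)
    return float(intersect) / union if union else 0.0

def nxn_threshold_search(fps, threshold):
    results = [[] for _ in fps]
    for i in range(len(fps)):
        # Start at i + 1: this skips the diagonal and scores each
        # pair only once.
        for j in range(i + 1, len(fps)):
            score = tanimoto(fps[i], fps[j])
            if score >= threshold:
                results[i].append((j, score))
                results[j].append((i, score))
    return results

hits = nxn_threshold_search([b"\x0f", b"\x07", b"\xf0"], 0.7)
```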

Added public APIs for the symmetric searches.

New popcount algorithms:
  - Lauradoux and POPCNT versions contributed by Kim Walisch
      These are 2x and 3x faster than the original method.
  - SSSE3 version by Imran Haque, Stanford University
      This is about 2.5x faster than the original method.
      Use --without-ssse3 to disable support for that method.
  - Gilles method, which can be better than the original method.

The timings depend very much on the compiler, CPU features, and choice
of 32- vs 64- bit architecture. For example, Lauradoux is slower than
the lookup tables for 32 bit systems. chemfp selects the best method
at import run-time. Use chemfp.bitops.set_alignment_method to force a
specific method.
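
For readers who haven't seen a lookup-table popcount, here is a minimal
Python sketch using a 256-entry byte table. The table size and names are
my own choices for illustration. The actual C implementations in chemfp
are not shown here and may differ.

```python
# 256-entry table: number of on-bits in each possible byte value.
POPCOUNT_TABLE = [bin(value).count("1") for value in range(256)]

def popcount_lookup(fp):
    # One table lookup per byte of the fingerprint.
    return sum(POPCOUNT_TABLE[byte] for byte in bytearray(fp))

num_bits_set = popcount_lookup(b"\xff\x01\x00")
```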

The new popcount algorithms require a specific fingerprint alignment
and padding. Use the new "alignment" option in load_fingerprints() to
specify an alignment. The default uses an alignment based on the
available methods and the fingerprint size. (It will be 8 or less
unless you have SSSE3 hardware but not SSE4.2, and your fingerprint is
larger than 224 bits, in which case it's 64 bytes.)

Optional OpenMP support. This is used when the query is an arena. If
your compiler does not support OpenMP then use "--without-openmp" to
disable it.

Support for RDKit's Morgan fingerprints.

Support for Daylight's Circular and Tree fingerprints (if you have
OEGraphSim 2.0.0 installed.)

New decoder for Daylight's "binary2ascii" encoding.

Fixed a memory overflow bug which caused crashes on some Windows and
Linux machines.

Changed the API so that "arena.ids" or "subarena.ids" refers to the
identifiers for that arena/subarena, and "arena.arena_ids" and
"subarena.arena_ids" refers to the complete list of identifiers for
the underlying arena. This is what my code expected, only I got the
implementation backwards. Two of the test cases should have failed
with swapped attributes but it looks like I assumed the computer was
right and made the tests agree with the wrong values. Also added more
tests to highlight other places where I could make a mistake between
'ids' and 'arena_ids.' This fix resolves a serious error identified by
Brian McClain of Vertex.

Moved most memory management code from Python to C. The speedup is
most noticeable when there is a high hit density (eg, when the threshold is
below 0.5).

Created a new 'Results' return object, which lets you sort the hits in
different ways, and request only the score, or only the ids, or both
from the hitlist.  The arena search results specifically are stored in
a C data structure. This new API greatly simplifies implementing some
types of clustering algorithms, reduces memory overhead, and improves
performance.

Added Alex Grönholm's 'futures' package as a submodule. It greatly
simplifies making a thread- or process pool. It is a backport of the
code in Python 3.2.

Added Nilton Volpato's 'progressbar' package as a submodule. Use it to
show a text-based progress bar in chemfp-based search tools.

Added an experimental "Watcher" module by Allen Downey. Use it to
handle ^C events, which otherwise get sent to an arbitrary thread. It
works by spawning a child process. The main process listens for a ^C
and forwards that as an os.kill() to the child process. This will
likely only work on Unix systems.

What's new in 1.0 (20 Sept 2011)
================================

The chemfp format is now a tab-delimited format. I talked with two
people who have spaces in their ids: one in their corporate ids and
the other wants to use IUPAC names. In discussion with others, having
a pure tab-delimited format would not be a problem with the primary
audience.

The simsearch output format is also tab delimited.

Completely redeveloped the in-memory search interface. The core data
structure is a "FingerprintArena", which can optionally hold
population count information.

The similarity searches use a compressed row representation,
which is a more efficient use of memory and reduces the number of
Python-to-C calls I need to make.

The FPS knearest search is push oriented, and keeps track of the
identifiers at the C level.

Major restructuring of the API so that public functions are at the top
of the "chemfp" package. Made high-level functions for the expected
common tasks of searching an FPSReader and a FingerprintArena.

The oe2fps, ob2fps, and rdkit2fps readers now support multiple
structure filenames. Each filename is listed on its own "source" line.

New --id-tag to use one of the SD tag fields rather than the title
line. This is needed for ChEBI where you should use --id-tag "ChEBI ID"
to get ids like "CHEBI:776".

New --aromaticity option for oe2fps, and a corresponding "aromaticity"
field in the FPS header.

Improved docstring comments.

Improved error reporting.

Added error handling options "strict", "report", and "ignore."

More comprehensive test suite (which, yes, caught several errors).


What's new in 0.95
==================

Cross-platform pattern-based fingerprint generation, and specific
implementations of a CACTVS/PubChem-like substructure fingerprint and
of RDKit's MACCS patterns.


What's new in 0.9.1
===================

Support for Python 2.5.

What's new in 0.9
=================

Major update from 0.5. Changes to the API, code
cleanup, new search API, and more. Since there are
no earlier users, I won't go into the details. :)
