# Changelog

The format of this changelog is based on [Keep A Changelog][keep-a-changelog].
Hlink adheres to semantic versioning as much as possible.

## v4.0.0 (Unreleased)

### Added

* Added support for randomized parameter search to model exploration. [PR #168][pr168]
* Created an `hlink.linking.core.model_metrics` module with functions for computing
metrics on model confusion matrices. Added the F-measure model metric to model
exploration. [PR #180][pr180]
* Added this changelog! [PR #189][pr189]
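
For readers unfamiliar with the F-measure mentioned above: it is the harmonic mean of precision and recall, both of which can be read off a confusion matrix. The sketch below is a standalone illustration. The function name `f_measure` and its signature are assumptions made for this example and may not match the functions in `hlink.linking.core.model_metrics`.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return the F-measure (F1), the harmonic mean of precision and recall.

    Hypothetical helper for illustration only; this is not hlink's API.
    """
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        # No positive predictions and no positive labels: the metric is undefined.
        return float("nan")
    return 2 * true_pos / denominator


# Example counts taken from an imaginary confusion matrix.
print(f_measure(true_pos=8, false_pos=2, false_neg=2))
```

True negatives do not appear in the formula, which is one reason the F-measure is often used when non-matches greatly outnumber matches.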

### Changed

* Overhauled the model exploration task to use a nested cross-validation approach.
[PR #169][pr169]
* Changed `hlink.linking.core.classifier` functions to not interact with `threshold`
and `threshold_ratio`. Please ensure that the parameter dictionaries passed to these
functions only contain parameters for the chosen model. [PR #175][pr175]
* Simplified the parameters required for `hlink.linking.core.threshold.predict_using_thresholds`.
Instead of passing the entire `training` configuration section to this function,
you now need only pass `training.decision`. [PR #175][pr175]
* Added a new required `checkpoint_dir` argument to `SparkConnection`, which lets hlink use
separate tmp and checkpoint directories. [PR #182][pr182]
* Swapped to using `tomli` as the default TOML parser. This should fix several issues
with how hlink parses TOML files. `load_conf_file()` provides the `use_legacy_toml_parser`
argument for backwards compatibility if necessary. [PR #185][pr185]
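
Since the `hlink.linking.core.classifier` functions no longer handle `threshold` and `threshold_ratio`, code that keeps those settings in the same dictionary as the model hyperparameters may need to separate them before calling into hlink. Here is a minimal sketch of one way to do that, assuming a plain Python dictionary. The helper `split_threshold_params` and the example hyperparameters are hypothetical and not part of hlink.

```python
THRESHOLD_KEYS = ("threshold", "threshold_ratio")


def split_threshold_params(params: dict) -> tuple[dict, dict]:
    """Separate threshold settings from model hyperparameters.

    Hypothetical helper for illustration; it is not part of hlink.
    """
    thresholds = {key: params[key] for key in THRESHOLD_KEYS if key in params}
    model_params = {
        key: value for key, value in params.items() if key not in THRESHOLD_KEYS
    }
    return model_params, thresholds


# Example values only; use the hyperparameters of your chosen model.
combined = {"maxDepth": 5, "numTrees": 50, "threshold": 0.8, "threshold_ratio": 1.2}
model_params, thresholds = split_threshold_params(combined)
print(model_params)
print(thresholds)
```

After the split, only `model_params` would be passed to the classifier functions, and the threshold values would be handled separately.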

### Deprecated

* Deprecated the `training.param_grid` attribute in favor of the new, more flexible
`training.model_parameter_search` table. This is part of supporting the new randomized
parameter search. [PR #168][pr168]
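
As a general illustration of the two strategies: a grid search tries every combination of the listed values, while a randomized search draws a fixed number of combinations at random. The sketch below uses only the Python standard library. The parameter names and both helper functions are assumptions for this example, and it does not show the syntax of `training.model_parameter_search`.

```python
import itertools
import random

# Example search space; the parameter names and values are illustrative.
search_space = {"maxDepth": [3, 5, 7], "numTrees": [50, 100, 200]}


def grid_combinations(space: dict) -> list[dict]:
    """Return every combination of the listed values (exhaustive grid search)."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in itertools.product(*space.values())]


def sample_random_combinations(space: dict, num_samples: int, seed: int = 0) -> list[dict]:
    """Draw num_samples combinations, picking each value independently at random."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(num_samples)
    ]


print(len(grid_combinations(search_space)))
print(sample_random_combinations(search_space, num_samples=4))
```

The grid grows multiplicatively with each added parameter, whereas the number of random draws stays under the caller's control.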

### Removed

* Removed functionality for outputting "suspicious" training data from model exploration.
We determined that this is out of the scope of model exploration step 2. This change
greatly simplifies the model exploration code. [PR #178][pr178]
* Removed the deprecated `hlink.linking.transformers.interaction_transformer` module.
This module was deprecated in v3.5.0. Please use
[`pyspark.ml.feature.Interaction`][pyspark-interaction-docs] instead. [PR #184][pr184]
* Removed some alternate configuration syntax which has been deprecated since
v3.0.0. [PR #184][pr184]
* Removed `hlink.scripts.main.load_conf` in favor of a much simpler approach to
finding the configuration file and configuring spark. Please call
`hlink.configs.load_config.load_conf_file` directly instead. `load_conf_file` now
returns both the path to the configuration file and its contents as a mapping. [PR #182][pr182]

## v3.8.0 (2024-12-04)

### Added

* Added optional support for the XGBoost and LightGBM gradient boosting
machine learning libraries. You can find documentation on how to use these libraries
[here][gradient-descent-ml-docs]. [PR #165][pr165]
* Added a new `hlink.linking.transformers.RenameVectorAttributes` transformer which
can rename the attributes or "slots" of Spark vector columns. [PR #165][pr165]

### Fixed

* Corrected misleading documentation for comparisons, which are not the same thing
as comparison features. You can find the new documentation [here][comparison-docs]. [PR #159][pr159]
* Corrected the documentation for substitution files, which had the meaning of the
columns backwards. [PR #166][pr166]

## v3.7.0 (2024-10-10)

### Added

* Added an optional argument to `SparkConnection` to allow setting a custom Spark
app name. The default is still to set the app name to "linking". [PR #156][pr156]

### Changed

* Improved model exploration step 2's terminal output, logging, and documentation
to make the step easier to work with. [PR #155][pr155]

### Fixed

* Updated all modules to log to module-level loggers instead of the root logger. This gives
users of the library more control over filtering logs from hlink. [PR #152][pr152]
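
With module-level loggers, log filtering can be done through Python's standard `logging` hierarchy. The sketch below assumes that hlink's loggers are named after their modules, so that they all sit under the `hlink` namespace. The specific levels are only examples.

```python
import logging

# Assumption: hlink's module-level loggers are named after their modules,
# so they all fall under the "hlink" namespace in the logger hierarchy.
logging.basicConfig(level=logging.INFO)

# Quiet hlink as a whole while leaving other loggers at INFO.
logging.getLogger("hlink").setLevel(logging.WARNING)

# Turn detailed output back on for one subpackage (used here as an example).
logging.getLogger("hlink.linking").setLevel(logging.DEBUG)
```

Child loggers inherit the level of their nearest configured ancestor, so a setting on `hlink.linking` applies to every module beneath it.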


## v3.6.1 (2024-08-14)

### Fixed

* Fixed a crash in matching step 0 triggered when there were multiple exploded columns
in the blocking section. Multiple exploded columns are now supported. [PR #143][pr143]

## v3.6.0 (2024-06-18)

### Added

* Added OR conditions in blocking. This new feature supports connecting some or
all blocking conditions together with ORs instead of ANDs. Note that using many ORs in blocking
may hurt performance on large datasets, since it increases
the size of the blocks and makes each block more expensive to compute. You can find
documentation on OR blocking conditions under the `or_group` bullet point [here][or-groups-docs].
[PR #138][pr138]
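
To see why ORs enlarge the blocks, consider the toy sketch below. It is plain Python over two small lists, not hlink's Spark-based implementation, and the field names and records are made up for illustration.

```python
from itertools import product

# Toy records from two datasets; the field names are illustrative.
dataset_a = [
    {"id": 1, "surname": "smith", "birthyr": 1900},
    {"id": 2, "surname": "jones", "birthyr": 1901},
]
dataset_b = [
    {"id": 10, "surname": "smith", "birthyr": 1900},
    {"id": 11, "surname": "smith", "birthyr": 1901},
    {"id": 12, "surname": "brown", "birthyr": 1901},
]


def candidate_pairs(combine) -> list[tuple[int, int]]:
    """Return the ID pairs kept when blocking conditions are joined by `combine`."""
    pairs = []
    for a, b in product(dataset_a, dataset_b):
        conditions = [a["surname"] == b["surname"], a["birthyr"] == b["birthyr"]]
        if combine(conditions):
            pairs.append((a["id"], b["id"]))
    return pairs


and_pairs = candidate_pairs(all)  # every condition must hold
or_pairs = candidate_pairs(any)  # at least one condition must hold
print(len(and_pairs), len(or_pairs))
```

Every pair that passes the AND version also passes the OR version, so OR blocking can only keep the same number of candidate pairs or more.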

## v3.5.5 (2024-05-31)

### Added

* Added support for a variable number of columns in the array feature selection
transform, instead of forcing it to use exactly 2 columns. [PR #135][pr135]

## v3.5.4 (2024-02-20)

### Added

* Documented the `concat_two_cols` column mappings transform. You can see the
documentation [here][concat-two-cols-docs]. [PR #126][pr126]
* Documented column mapping overrides, which let you read two columns with
different names in the input files into a single hlink column. The documentation for
this feature is [here][column-mapping-overrides-docs]. [PR #129][pr129]

### Fixed

* Fixed a bug where config validation checks did not respect column mapping overrides.
[PR #131][pr131]

## v3.5.3 (2023-11-02)

### Added

* Added config validation checks for duplicate comparison features, feature selections,
and column mappings. [PR #113][pr113]
* Added support for Python 3.12. [PR #119][pr119]
* Put the config file name in the script prompt. [PR #123][pr123]

### Fixed

* Reverted to keeping invalid categories in training data instead of erroring out.
Invalid categories do occasionally occur in practice, so hlink should not error out
on them. This reverts a change made in [PR #109][pr109], released in v3.5.2. [PR #121][pr121]

## v3.5.2 (2023-10-26)

### Changed

* Made some minor updates to the format of training step 3's output. There are now
3 columns: `feature_name`, `category`, and `coefficient_or_importance`. Feature
names are not suffixed with the category value anymore. [PR #112][pr112]
* BUG reverted in v3.5.3: Started erroring out on invalid categories in training
data instead of creating a new category for them. [PR #109][pr109]

### Fixed

* Fixed a bug with categorical features in training step 3. Each categorical feature
was getting a single coefficient when each *category* should get its own coefficient
instead. [PR #104][pr104], [PR #107][pr107]


## v3.5.1 (2023-10-23)

### Added

* Made a new training step 3 to replace model exploration step 3, which was buggy.
Training step 3 saves model feature importances or coefficients when `training.feature_importances`
is set to true. [PR #101][pr101]

### Removed

* Removed the buggy implementation of model exploration step 3. Training step 3 replaces
this. [PR #101][pr101]

## v3.5.0 (2023-10-16)

### Added

* Added support for Python 3.11. [PR #94][pr94]
* Created a new `multi_jaro_winkler_search` comparison feature. This is a complex
comparison feature which supports conditional Jaro-Winkler comparisons between
lists of columns with similar names. You can read more in the documentation [here][multi-jaro-winkler-search-docs].
[PR #99][pr99]

### Changed

* Upgraded from PySpark 3.3 to 3.5. [PR #94][pr94]

### Deprecated

* Deprecated the `hlink.linking.transformers.interaction_transformer` module.
Please use PySpark 3's [`pyspark.ml.feature.Interaction`][pyspark-interaction-docs]
class instead. Hlink's `interaction_transformer` module is scheduled for removal
in version 4. [PR #97][pr97]

### Fixed

* Fixed a bug where the hlink script's autocomplete feature sometimes did not work
correctly. [PR #96][pr96]

## v3.4.0 (2023-08-09)

### Added

* Created a new `convert_ints_to_longs` configuration setting for working with CSV
files. Documentation for this setting is available [here][ints-to-longs-docs]. [PR #87][pr87]
* Improved the link tasks documentation by adding more detail. This page is available
[here][link-tasks-docs]. [PR #86][pr86]

### Removed

* Dropped the `comment` column from the script's `desc` command. This column was
always full of nulls and cluttered up the screen. [PR #88][pr88]

## v3.3.1 (2023-06-02)

### Changed

* Updated documentation for column mapping transforms. [PR #77][pr77]
* Updated documentation for the `present_both_years` and `neither_are_null` comparison
types, clarifying how they are different. [PR #79][pr79]

### Fixed

* Fixed a bug where comparison features were marked as categorical whenever the
`categorical` key was present, even if it was set to false. [PR #82][pr82]
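
The class of bug described above can be sketched as the difference between testing for a key and reading its value. The two functions below are hypothetical illustrations, not hlink's actual code, and the example feature dictionary is made up.

```python
def is_categorical_buggy(feature: dict) -> bool:
    # Treats the mere presence of the key as true, even for `categorical = false`.
    return "categorical" in feature


def is_categorical(feature: dict) -> bool:
    # Reads the value of the key and defaults to False when it is absent.
    return bool(feature.get("categorical", False))


# Illustrative comparison feature with the key explicitly set to false.
feature = {"alias": "example_feature", "categorical": False}
print(is_categorical_buggy(feature), is_categorical(feature))
```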

## v3.3.0 (2022-12-13)

### Added

* Added logging for user input to the script. This is extremely helpful for diagnosing
errors. [PR #64][pr64]
* Added and improved documentation for several comparison types. [PR #47][pr47]

### Changed

* Started writing to a unique log file for each script run. [PR #55][pr55]
* Updated and improved the tutorial in examples/tutorial. [PR #63][pr63]
* Changed to pyproject.toml instead of setup.py and setup.cfg. [PR #71][pr71]

### Fixed

* Fixed a bug which caused Jaro-Winkler scores to be 1.0 for two empty strings. The
scores are now 0.0 on two empty strings. [PR #59][pr59]

## v3.2.7 (2022-09-14)

### Added

* Added a configuration validation that checks that both data sources contain the id column. [PR #13][pr13]
* Added driver memory options to `SparkConnection`. [PR #40][pr40]

### Changed

* Upgraded from PySpark 3.2 to 3.3. [PR #11][pr11]
* Capped the number of partitions requested at 10,000. [PR #40][pr40]

### Fixed

* Fixed a bug where `feature_selections` was always required in the config file.
It now defaults to an empty list as intended. [PR #15][pr15]
* Fixed a bug where an error message in `conf_validations` was not formatted correctly. [PR #13][pr13]

## v3.2.6 (2022-07-18)

### Added

* Made hlink installable with `pip` via PyPI.org.

## v3.2.1 (2022-05-24)

### Added

* Improved logging during startup and for the `LinkTask.run_all_steps()` method.
[PR #7][pr7]

### Changed

* Added code to adjust the number of Spark partitions based on the size of the input
datasets for some link steps. This should help these steps scale better with large
datasets. [PR #10][pr10]

### Fixed

* Fixed a bug where model exploration's step 3 would run into a `TypeError` due to
trying to manually build up a file path. [PR #8][pr8]

## v3.2.0 (2022-05-16)

### Changed

* Upgraded from Python 3.6 to 3.10. [PR #5][pr5]
* Upgraded from PySpark 2 to PySpark 3. [PR #5][pr5]
* Upgraded from Java 8 to Java 11. [PR #5][pr5]
* Upgraded from Scala 2.11 to Scala 2.12. [PR #5][pr5]
* Upgraded from Scala Commons Text 1.4 to 1.9. This includes some bug fixes which
may slightly change Jaro-Winkler scores. [PR #5][pr5]

## v3.1.0 (2022-05-04)

### Added

* Started exporting true positive and true negative data along with false positive
and false negative data in model exploration. [PR #1][pr1]

### Fixed

* Fixed a bug where `exact_all_mult` was not handled correctly in config validation.
[PR #2][pr2]

## v3.0.0 (2022-04-27)

### Added

* This is the initial open-source version of hlink.


[pr1]: https://github.com/ipums/hlink/pull/1
[pr2]: https://github.com/ipums/hlink/pull/2
[pr5]: https://github.com/ipums/hlink/pull/5
[pr7]: https://github.com/ipums/hlink/pull/7
[pr8]: https://github.com/ipums/hlink/pull/8
[pr10]: https://github.com/ipums/hlink/pull/10
[pr11]: https://github.com/ipums/hlink/pull/11
[pr13]: https://github.com/ipums/hlink/pull/13
[pr15]: https://github.com/ipums/hlink/pull/15
[pr40]: https://github.com/ipums/hlink/pull/40
[pr47]: https://github.com/ipums/hlink/pull/47
[pr55]: https://github.com/ipums/hlink/pull/55
[pr59]: https://github.com/ipums/hlink/pull/59
[pr63]: https://github.com/ipums/hlink/pull/63
[pr64]: https://github.com/ipums/hlink/pull/64
[pr71]: https://github.com/ipums/hlink/pull/71
[pr77]: https://github.com/ipums/hlink/pull/77
[pr79]: https://github.com/ipums/hlink/pull/79
[pr82]: https://github.com/ipums/hlink/pull/82
[pr86]: https://github.com/ipums/hlink/pull/86
[pr87]: https://github.com/ipums/hlink/pull/87
[pr88]: https://github.com/ipums/hlink/pull/88
[pr94]: https://github.com/ipums/hlink/pull/94
[pr96]: https://github.com/ipums/hlink/pull/96
[pr97]: https://github.com/ipums/hlink/pull/97
[pr99]: https://github.com/ipums/hlink/pull/99
[pr101]: https://github.com/ipums/hlink/pull/101
[pr104]: https://github.com/ipums/hlink/pull/104
[pr107]: https://github.com/ipums/hlink/pull/107
[pr109]: https://github.com/ipums/hlink/pull/109
[pr112]: https://github.com/ipums/hlink/pull/112
[pr113]: https://github.com/ipums/hlink/pull/113
[pr119]: https://github.com/ipums/hlink/pull/119
[pr121]: https://github.com/ipums/hlink/pull/121
[pr123]: https://github.com/ipums/hlink/pull/123
[pr126]: https://github.com/ipums/hlink/pull/126
[pr129]: https://github.com/ipums/hlink/pull/129
[pr131]: https://github.com/ipums/hlink/pull/131
[pr135]: https://github.com/ipums/hlink/pull/135
[pr138]: https://github.com/ipums/hlink/pull/138
[pr143]: https://github.com/ipums/hlink/pull/143
[pr152]: https://github.com/ipums/hlink/pull/152
[pr155]: https://github.com/ipums/hlink/pull/155
[pr156]: https://github.com/ipums/hlink/pull/156
[pr159]: https://github.com/ipums/hlink/pull/159
[pr165]: https://github.com/ipums/hlink/pull/165
[pr166]: https://github.com/ipums/hlink/pull/166
[pr168]: https://github.com/ipums/hlink/pull/168
[pr169]: https://github.com/ipums/hlink/pull/169
[pr175]: https://github.com/ipums/hlink/pull/175
[pr178]: https://github.com/ipums/hlink/pull/178
[pr180]: https://github.com/ipums/hlink/pull/180
[pr182]: https://github.com/ipums/hlink/pull/182
[pr184]: https://github.com/ipums/hlink/pull/184
[pr185]: https://github.com/ipums/hlink/pull/185
[pr189]: https://github.com/ipums/hlink/pull/189

[ints-to-longs-docs]: config.html#data-sources
[link-tasks-docs]: link_tasks
[pyspark-interaction-docs]: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Interaction.html
[multi-jaro-winkler-search-docs]: comparison_features.html#multi-jaro-winkler-search
[concat-two-cols-docs]: column_mappings.html#concat-two-cols
[column-mapping-overrides-docs]: column_mappings.html#advanced-usage
[or-groups-docs]: config.html#blocking
[gradient-descent-ml-docs]: models
[comparison-docs]: comparisons
[keep-a-changelog]: https://keepachangelog.com/en/1.0.0/
