# Comparison types, transform add-ons, aggregate features, and household aggregate features

This page has information on the different comparison types available for the `[[comparison_features]]`
section, along with some attributes available to all of the comparison types and some aggregate features
that are not configurable.

## Comparison types
Each header below represents a comparison type.  Transforms are used in the context of `comparison_features`.

```
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
```

### maximum_jaro_winkler
Finds the greatest Jaro-Winkler value among the cartesian product of multiple columns.  For example, given an input of `column_names = ['namefrst', 'namelast']`, it would return the maximum Jaro-Winkler name comparison value among the following four comparisons: 
```
[('namefrst_a', 'namefrst_b'),
 ('namefrst_a', 'namelast_b'),
 ('namelast_a', 'namefrst_b'),
 ('namelast_a', 'namelast_b')]
 ```
* Attributes:
  * `column_names` -- Type: list of strings.  Required.  The list of columns used as input for the set of comparisons generated by taking the cartesian product.
  
 ```
[[comparison_features]]
alias = "maximum_jw"
column_names = ["namelast", "namefrst"]
comparison_type = "maximum_jaro_winkler"
```
 

### jaro_winkler

Returns the Jaro-Winkler comparison score for a given column.
* Attributes:
  * `column_name` -- Type: `string`. Required.  The column to compare using the Jaro-Winkler score.
```
[[comparison_features]]
alias = "namefrst_jw"
column_name = "namefrst"
comparison_type = "jaro_winkler
```

### jaro_winkler_street
Uses an additional geographic column value to filter for major location changes before comparing street names.  If boundary column A is not equal to boundary column B, a Jaro-Winkler score of zero is returned.  If boundary column A and B are equal, the Jaro-Winkler comparison score of the street columns is returned.
* Attributes:
  * `column_name` -- Type: `string`. Required. The input street column.
  * `boundary` -- Type: `string`. Required.  An input column to match on before comparing street name values.  
```
[[comparison_features]]
alias = "jw_street"
column_name = "street"
boundary = "enum_dist"
comparison_type = "jaro_winkler_street"
```

### max_jaro_winkler

Returns the greatest Jaro-Winkler value from the comparisons of a list of names.
* Attributes:
  * `column_name` -- Type: `string`. Required.  Input column containing a list of names to compare (such as related household members, or neighborhood surnames).
```
[[comparison_features]]
alias = "related_individual_max_jw"
column_name= "namefrst_related"
comparison_type = "max_jaro_winkler"
```

### equals

Asserts that values are the same for both compared columns using SQL: `a.{column_name} IS NOT DISTINCT FROM b.{column_name}`

```
[[comparison_features]]
alias = "relatematch"
column_name = "relate_div_100"
comparison_type = "equals"
categorical = true
```

### f1_match
Evaluates if the first name initial A matches either the first name first initial B or either the first or second middle initial of B.  If so, returns 1.  Otherwise, returns 2.

1 = First initial of first first name A matches first initial of any of potential match first names B

2 = mismatch

Uses the following SQL query: 
```
"CASE WHEN (
    (a.{fi} IS NOT DISTINCT FROM b.{fi}) OR 
    (a.{fi} IS NOT DISTINCT FROM b.{mi0}) OR
    (a.{fi} IS NOT DISTINCT FROM b.{mi1})
) THEN 1 ELSE 2 END"
```

```
[[comparison_features]]
alias = "f1_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f1_match"
categorical = true
```

### f2_match
Evaluates if first middle initial A is empty/null.  If so, return 0. 
Otherwise, if either first or second middle initial A is not null and matches first name initial B, or first or second middle initial B, return 1.
Otherwise, return 2. 

1 = First initial of A second first name matches first initial of any of potential match first names B

2 = mismatch

0 = no second first name A

Uses the following SQL:
```
CASE WHEN ((a.{mi0} == '') OR (a.{mi0} IS NULL)) THEN 0 WHEN (
    (a.{mi0} IS NOT DISTINCT FROM b.{fi}) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{fi})) OR
    (a.{mi0} IS NOT DISTINCT FROM b.{mi0}) OR
    (a.{mi0} IS NOT DISTINCT FROM b.{mi1}) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi0})) OR
    ((a.{mi1} IS NOT NULL) AND (a.{mi1} IS NOT DISTINCT FROM b.{mi1}))
) THEN 1 ELSE 2 END
```
* Attributes:
   * `first_init_col` -- Type: `string`. Required.  First name initial input column.
   * `mid_init_cols` -- Type: list of strings.  Required.  List of first and second middle initial input columns.
```
[[comparison_features]]
alias = "f2_match"
first_init_col = "namefrst_init"
mid_init_cols = ["namefrst_mid_init", "namefrst_mid_init_2"]
comparison_type = "f2_match"
categorical = true
```

### not_equals
Asserts that values are distinct between compared individuals using SQL: `a.{column_name} IS DISTINCT FROM b.{column_name}`.  Used mainly in caution flag features (f_caution, m_caution, sp_caution).
* Attributes:
  * `column_name` -- Type: `string`. Required. Input column to compare.

```
[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"
```

### equals_as_int
Checks for equality using equals sign and returns boolean result in integer form.  Uses SQL: `CAST(a.{col} = b.{col} as INT)`
* Attributes:
  * `column_name` -- Type: `string`. Required. Input column to compare.

```
[[comparison_features]]
alias = "namelast_equal_as_int"
column_name = "namelast_clean"
comparison_type  = "equals_as_int"
```
### all_equals
Asserts whether the values in all given columns match.  Uses a SQL expression generated by joining `a.{col} = b.{col}` and `AND` for each given column.
* Attributes:
  * `column_names` -- Type: list of strings.  Required. List of the columns to evaluate if all are equal across records being compared.
```
[[comparison_features]]
alias = "exact"
column_names = ["namefrst_unstd", "namelast_clean"]
comparison_type = "all_equals"
```

### not_zero_and_not_equals

Checks that both values are present (not null) and nonzero and that they are not equal to one another. Evaluates
to a boolean. This is primarily useful when a value of 0 indicates some kind of incomparibility akin to the
value being missing.

See also [present_and_equal_categorical_in_universe](#present-and-equal-categorical-in-universe), which is a similar
but more general comparison type.

* Attributes:
  * `column_name` -- Type: `string`. Required. Input column to compare.

```
[[comparison_features]]
alias = "fbpl_nomatch"
column_name = "fbpl"
comparison_type = "not_zero_and_not_equals"
```

### or
Allows for the concatenation of up to four comparison features into one feature using a SQL `OR` between the generated clause for each sub-comparison.
* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b` -- Type: Object. Required.  Sub-comparison using any of the comparison feature types documented in this section.
  * `comp_c`, `comp_d` -- Type: Object. Optional.  Sub-comparison using any of the comparison feature types documented in this section.
```
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr"]
comparison_type = "or"
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
lower_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
```

### and

Allows for the concatenation of up to four comparison features into one feature using a SQL `AND` between the generated clause for each sub-comparison.
* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b` -- Type: Object. Required.  Sub-comparison using any of the comparison feature types documented in this section.
  * `comp_c`, `comp_d` -- Type: Object. Optional.  Sub-comparison using any of the comparison feature types documented in this section.

In this example, the `and` comparison appears in `[comparison_features.comp_b]`.

```
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
```

### times
Takes the output of two sub-comparisons and multiplies them together after casting as floats.
* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b` -- Type: Object. Required.  Sub-comparison using any of the comparison feature types documented in this section.  `comp_a` and `comp_b` can also have sub-comparisons, as in the given example.
```
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
```

### caution_comp_3
Generates an SQL expression in the form `(comparison A OR comparison B) AND comparison C`.

* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b`, `comp_c` -- Type: Object. Required.  Sub-comparisons using any of the comparison feature types documented in this section.

```
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_3"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
```

### caution_comp_3_012

Similar to `caution_comp_3`, but first checks the value of comparison C. If comparison C evaluates to false,
then `caution_comp_3_012` evaluates to 2. Otherwise, it evaluates to the result of `caution_comp_3`, so 0 or 1.

* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b`, `comp_c` -- Type: Object. Required. Sub-comparison using any of the comparison feature types documented in this section.

### caution_comp_4
Generates an SQL expression in the form `(comparison A OR comparison B OR comparison C) AND comparison D`.

* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b`, `comp_c`, `comp_d` -- Type: Object. Required.  Sub-comparisons using any of the comparison feature types documented in this section.

```
[[comparison_features]]
alias = "m_caution"
column_names = ["mbpl", "mother_birthyr", "stepmom", "momloc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "mbpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "mother_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "stepmom"
comparison_type = "parent_step_change"
[comparison_features.comp_d]
column_name = "momloc"
comparison_type = "present_both_years"
```

### caution_comp_4_012

Similar to `caution_comp_4`, but first checks the value of comparison D. If comparison D evaluates to false,
then `caution_comp_4_012` evaluates to 2. Otherwise, it evaluates to the result of `caution_comp_4`, so 0 or 1.

* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all input columns used by sub-comparisons.
  * `comp_a`, `comp_b`, `comp_c`, `comp_d` -- Type: Object. Required. Sub-comparisons using any of the comparison feature types documented in this section.

### any_equals
Used to compare middle initials and first names under specific circumstances.  
If middle initial A is not empty/null and is the same as either middle initial B or first name B, 
OR if first name A is not empty/null and is the same as middle initial B.
* Attributes:
  * `column_names` -- Type: list of strings. Required. The first input column should be the middle initial column, and the second input column should be the first name column.
```
[[comparison_features]]
alias = "mid_init_match"
column_names = ["namefrst_mid_init", "namefrst_unstd"]
comparison_type = "any_equals"
```

### either_are_1
Checks if the column value for either A or B is equal to 1.
* Attributes:
  * `column_name` -- Type: `string`. Required. Input column to compare to 1.
  
```
[[comparison_features]]
alias = "either_1"
column_name = "nativity"
comparison_type = "either_are_1"
categorical = true
```

### either_are_0
Checks if the column value for either A or B is equal to 0.
* Attributes:
  * `column_name` -- Type: `string`. Required. Input column to compare to 0.
  
```
[[comparison_features]]
alias = "either_0"
column_name = "nativity"
comparison_type = "either_are_0"
categorical = true
```

### second_gen_imm
Checks if individual A is a second-generation immigrant by looking for `nativity` value of 2, 3, or 4 (one or both parents foreign-born). 
* Attributes:
  * `column_name` -- Type: `string`. Required. Input should be the name of the nativity column.
```
[[comparison_features]]
alias =  "sgen"
column_name = "nativity"
comparison_type = "second_gen_imm"
categorical = true
```

### rel_jaro_winkler
Uses a Scala function to determine the number of people in the input column with a name similarity score (Jaro-Winkler) greater than or equal to the given `jw_threshold`, an age difference less than or equal to the given `age_threshold`, and matching sex for the sample A individual and the sample B potential match.  Takes a column generated with the feature selection transform `related_individual_rows` as input (list of person data objects to compare).  Can be used for related or unrelated individuals, depending on the input column specified.

* Attributes:
  * `column_name` -- Type: `string`.  The input column with data in the form of a list of person data objects.
  * `name_col` -- Type: `string`. 
  * `birthyr_col` -- Type: `string`.
  * `jw_threshold` -- Type: `float`. 
  * `age_threshold` -- Type: `int`.
  
```
[[comparison_features]]
alias = "rel"
column_name = "namefrst_related_rows"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
comparison_type = "rel_jaro_winkler"
jw_threshold = 0.9
age_threshold = 5
```

### extra_children
Using a Scala function, checks to see if there are children present in sample B who are not present in sample A, but based on relate codes, age, sex, and name, we would have expected to be present in A.  Returns a count of suspected "extra" children.  Takes a column generated with the feature selection transform `related_individual_rows` as input (list of person data objects to compare).
* Attributes:
  * `column_name` -- Type: `string`.  The input column with data in the form of a list of person data objects.
  * `relate_col` -- Type: `string`. The name of the column with the `relate` code.
  * `histid_col` -- Type: `string`. The name of the id column.
  * `name_col` -- Type: `string`.  The name of the column containing the first name for comparison.
  * `birthyr_col` -- Type: `string`. The name of the column containing the birth year.
  * `year_b` -- Type: `int`. The year that sample B was taken.
  * `jw_threshold` -- Type: `float`.  The minimum acceptable Jaro-Winkler score to consider a match.
  * `age_threshold` -- Type: `int`. The maximum acceptable age difference to consider a match.
  
```
[[comparison_features]]
alias = "extra_children"
column_name = "namefrst_related_rows"
relate_col = "relate"
histid_col = "histid"
name_col = "namefrst_unstd"
birthyr_col = "replaced_birthyr"
year_b = 1910
comparison_type = "extra_children"
jw_threshold = 0.8
age_threshold = 2
```

### jaro_winkler_rate
Uses a Scala function to calculate the percentage of individuals who have a Jaro-Winkler score greater than or equal to the given threshold.  Rate returned as a percentage as a float data type.
* Attributes:
  * `column_name` -- Type: `string`.  The input column with data in the form of a list of person data objects.  The input column seen below ("namelast_neighbors")was generated using a "neighbor_aggregate" feature selection.
  * `jw_threshold` -- Type: `float`. The minimum Jaro-Winkler threshold to consider an acceptable match.

In the following example, a `lower_threshold` feature add-on is used to convert the returned rate to a boolean asserting whether it meets the given minimum threshold. (>= 5% of neighbors have a Jaro-Winkler score >= 0.95)
```
[[comparison_features]]
alias = "nbors"
comparison_type = "times"
column_names = ["namelast_neighbors", "county", "statefip"]
[comparison_features.comp_a]
column_name = "namelast_neighbors"
comparison_type = "jaro_winkler_rate"
jw_threshold = 0.95
lower_threshold = 0.05
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
```

### sum
Adds the column values for A and B together (takes the sum).
* Attributes:
  * `column_name` -- Type: `string`. The input column to be added.
 
```
[[comparison_features]]
alias = "namelast_popularity_sum"
column_name = "namelast_popularity"
comparison_type = "sum"
```

### length_b
Returns the length of the column value in record B using the SQL `size()` function.
* Attributes:
  * `column_name` -- Type: `string`. The name of the input column to take the length of in dataset B.

### abs_diff
Takes the absolute value of the difference between the values of the given column in datasets A and B.
* Attributes:
  * `column_name` -- Type: `string`.  The input column to evaluate.
  * `not_equals` -- Type: `int`. OPTIONAL.  You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference.  For example, if you are evaluating the difference in marriage duration values, and "99" is a placeholder value for "unknown" in the data, you can exclude those values from consideration using this attribute.

```
[[comparison_features]]
alias = "byrdiff"
column_name = "replaced_birthyr"
comparison_type = "abs_diff"

[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "abs_diff"
btwn_threshold = [9, 14]
categorical = True
```

### b_minus_a
Returns the value of subtracting the value of column A from the value of column B.
* Attributes:
  * `column_name` -- Type: `string`.  The input column to evaluate.
  * `not_equals` -- Type: `int`. OPTIONAL.  You can specify a value for the column to be considered invalid input, in which case the expression would return the value -1 instead of an absolute difference.  For example, if you are evaluating the difference in marriage duration values, and "99" is a placeholder value for "unknown" in the data, you can exclude those values from consideration using this attribute.
```
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
```

### geo_distance
Uses a lookup table to find the geographic distance between locations.  The SQL expression is generated by `hlink/linking/core/dist_table.py`. There are several ways to configure this feature. You can look up distances in the given file using one or two keys (specified with the `key_count` attribute).  You can also optionally have a secondary look-up table that serves as a back-up value in the case that the primary look-up does not contain a value for the locations given.  This is particularly useful for county distance, as you can set the primary join to be across counties, but set up a secondary join on state, which has much fewer combinations and thus less risk of nulls, to fill in when the counties specified aren't in the look-up.

* Attributes:
  * `key_count` -- Type: `int`. The number of keys used to join on the primary (or only) look-up table. Acceptable values are 1 or 2. Ex: for state and county, key_count = 2.  For just state, key_count = 1 even though there is county_a and county_b.  
  * `distances_file` -- Type: `string` of path.  Path to the distances look-up file.
  * `table_name` -- Type: `string`.  What to name the table that will be generated from the distances file.  If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.
  
  * Attributes for `key_count = 1`:
    * `column_name` -- Type: `string`. The column in the input data that you want to use as a key to look up the geographic distance.
    * `loc_a` -- Type: `string`.  First column to join on in the look-up table (where to find the value coming from the `column_name` column A).
    * `loc_b` -- Type: `string`.  Second column to join on in the look-up table (where to find the value coming from the `column_name` column B).
    * `distance_col` -- Type: `string`. Name of the column containing the geographic distance in the look-up table.
  
  * Attributes for `key_count = 2`:
    * `column_names` -- Type: list of strings.  The two columns you want to use as keys to look up the geographic distance.
    * `source_column_a` -- Type: `string`. First column to join on in the source data.
    * `source_column_b` -- Type: `string`. Second column to join on in the source data.
    * `loc_a_0` -- Type: `string`.  First column to join on in the look-up table.
    * `loc_a_1` -- Type: `string`.  First column to join on in the look-up table.
    * `loc_b_0` -- Type: `string`.  Second column to join on in the look-up table.
    * `loc_b_1` -- Type: `string`.  Second column to join on in the look-up table.
    * `distance_col` -- Type: `string`. Name of the column containing the geographic distance in the look-up table.
   
  * Attributes if using a secondary join:
    * `secondary_key_count` -- Type: `int`. The number of keys used to join on the secondary (backup) look-up table. Acceptable values are 1 or 2.
    * `secondary_table_name` -- Type: `string`.  What to name the table that will be generated from the `secondary_distances_file`.  If you want to do multiple look-ups, if the table_name is the same across all feature specifications, it will only be read in once.
    * `secondary_distances_file` -- Type: `string` of path.  Path to the secondary distances look-up file.
    * `secondary_source_column` -- Type: `string`. The column in the input data that you want to use as a key in the secondary geographic distance look-up.
    * `secondary_loc_a` -- Type: `string`. First column to join on in the secondary look-up table.
    * `secondary_loc_b` -- Type: `string`. Second column to join on in the secondary look-up table.
    * `secondary_distance_col` -- Type: `string`. Name of the column containing the geographic distance in the secondary look-up table.
  
```
[[comparison_features]]
alias = "state_distance"
comparison_type = "geo_distance"
key_count = 1
table_name = "state_distance_lookup"
distances_file = "/path/to/county_state_distance.csv"
column_name = "bpl"
loc_a = "statecode1"
loc_b = "statecode2"
distance_col = "dist"


[[comparison_features]]
alias = "county_distance"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
key_count = 2
table_name = "county_distance_lookup"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"

# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"

# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
```

### fetch_a

Gets the value of column A.

* Attributes:
  * `column_name` -- Type: `string`. Required. The column to get the value from.

```
[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_a"
categorical = true
```


### fetch_b

Gets the value of column B.

* Attributes:
  * `column_name` -- Type: `string`. The column to get the value from.

```
[[comparison_features]]
alias = "race"
column_name = "race"
comparison_type = "fetch_b"
categorical = true
```

### present_both_years

Checks whether column A and column B are both present and both positive (> 0).
Evaluates to 1 if both are present and positive and 0 otherwise.

* Attributes:
  * `column_name` -- Type: `string`. The column to check. Must be a column with a numerical type.

```
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
```

### neither_are_null

Checks that neither column A nor column B is null or the empty string `''`.
Evaluates to 1 if neither column is null or `''` and evaluates to 0 otherwise.

* Attributes:
  * `column_name` -- Type: `string`. The column of type string to check.

```
[[comparison_features]]
alias = "mpres"
column_name = "m_namefrst"
comparison_type = "neither_are_null"
categorical = true
```

### present_and_matching_categorical

Checks that both column A and column B are present and that they match according to SQL's `IS DISTINCT FROM`. Evaluates to 0, 1, or 2:

0 → columns are both present and match

1 → columns are both present but are distinct

2 → one or both columns are missing

* Attributes:
  * `column_name` -- Type: `string`. Required. The column to check.

### present_and_not_equal

Checks that column A and column B are both present but are not equal.

* Attributes:
  * `column_name` -- Type: `string`. The column to check.

### present_and_equal_categorical_in_universe

Checks that column A and column B are both present, are not equal to the not-in-universe value NIU,
and are equal to each other according to SQL's `IS DISTINCT FROM`. Evaluates to 0 if either column is missing or
if either column is the NIU value. Otherwise, evaluates to 0 if the columns are distinct or 1 if
the columns are equal.

* Attributes:
  * `column_name` -- Type: `string`. Required. The column to check.
  * `NIU` -- Type: same as the type of the input column. Required. The not-in-universe value to use in the check.

```
[[comparison_features]]
alias = "mfbplmatch"
column_name = "nativity"
comparison_type = "present_and_equal_categorical_in_universe"
NIU = "0"
categorical = true
```

### sql_condition

This is a flexible comparison type that allows users to write their own SQL expressions to be
evaluated. Favor using a different comparison type if that's a reasonable option. If there are
no other comparison types that work for a particular use case, this one is a good fallback.

* Attributes:
  * `column_names` -- Type: list of strings. Required. A list of all columns used in the SQL expression.
  * `condition` -- Type: `string`. The SQL expression to evaluate.

In this example, we make use of the hlink-defined `jw` function, which computes
the Jaro-Winkler similarity of two strings. `nvl` is a Spark builtin function
which returns its second argument if the first is null, and the first argument
otherwise.

```
[[comparison_features]]
alias = "namelast_jw_max"
comparison_type = "sql_condition"
column_names = ["namelast1", "namelast2", "namelast3"]
condition = "GREATEST(jw(nvl(a.namelast1, ''), nvl(b.namelast1, '')), jw(nvl(a.namelast2, ''), nvl(b.namelast2, '')), jw(nvl(a.namelast3, ''), nvl(b.namelast3, '')))"
```

## Feature add-ons
These attributes can be added to most comparison feature types above to extend the type of output returned beyond the standard comparison feature.

### alias
* Attributes:
  * `alias`: Type: `string`.  Should be used at the top level comparison of every comparison feature.  The name for the output column.
```
[[comparison_features]]
alias = "jw_f"
column_name = "father_namefrst"
comparison_type = "jaro_winkler"
```

### power
Raises a comparison feature to a given exponential power.
* Attributes:
  * `power` -- Type: `int`. The power to raise the comparison output to.  For example, `power = 2` will square the output.
```
[[comparison_features]]
alias = "county_distance_squared"
comparison_type = "geo_distance"
column_names = ["county", "statefip"]
# PRIMARY JOIN
# key count: the number of keys used for the join per source file.  Ex: for state and county, key_count = 2.  For just state, key_count = 1 even though there is county_a and county_b
key_count = 2
table_name = "county_distance_lookup"
#distances_file = "/path/to/county_state_distance.csv"
distances_file = "/path/to/county_1900_1910_distances_km.csv"
# columns to join on in the data
source_column_a = "county"
source_column_b = "statefip"
# column names from the csv lookup file
loc_a_0 = "from_icpsrctyi"
loc_a_1 = "to_icpsrctyi"
loc_b_0 = "from_statefip"
loc_b_1 = "to_statefip"
distance_col = "distance_km"
# SECONDARY JOIN
secondary_key_count = 1
secondary_table_name = "state_distance_lookup"
secondary_distances_file = "/path/to/state_1900_1910_distances_km.csv"
secondary_source_column = "statefip"
secondary_loc_a = "from_statefip"
secondary_loc_b = "to_statefip"
secondary_distance_col = "distance_km"
power = 2
```

### threshold
* Attributes:
  * `threshold` -- Type: numeric types. Asserts if the comparison feature output is not null and is greater than or equal to (`>=`) the given threshold value.
```
[[comparison_features]]
alias = "imm"
column_name = "nativity"
comparison_type = "fetch_a"
threshold = 5
categorical = true
```

### lower_threshold
* Attributes: 
  * `lower_threshold` -- Type: numeric types.  Asserts if the comparison feature output is not null and is greater than or equal to (`>=`) the given threshold value.
```
[[comparison_features]]
alias = "street_jw"
comparison_type = "times"
column_names = ["street","county", "statefip"]
[comparison_features.comp_a]
column_name = "street"
comparison_type = "jaro_winkler"
lower_threshold = 0.9
[comparison_features.comp_b]
comparison_type = "and"
column_names = ["county", "statefip"]
[comparison_features.comp_b.comp_a]
column_name = "county"
comparison_type = "equals"
[comparison_features.comp_b.comp_b]
column_name = "statefip"
comparison_type = "equals"
```

### upper_threshold
* Attributes: 
  * `upper_threshold` -- Type: numeric types.  Asserts if the comparison feature output is not null and is less than or equal to (`<=`) the given threshold value.
```
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
```

### gt_threshold
* Attributes: 
  * `gt_threshold` -- Type: numeric types.  Asserts if the comparison feature output is not null and is greater than (`>`) the given threshold value.
```
[[comparison_features]]
alias = "sp_caution"
column_names = ["spouse_bpl", "spouse_birthyr", "durmarr", "sploc"]
comparison_type = "caution_comp_4"
categorical = true
[comparison_features.comp_a]
column_name = "spouse_bpl"
comparison_type = "not_equals"
[comparison_features.comp_b]
column_name = "spouse_birthyr"
comparison_type = "abs_diff"
gt_threshold = 5
[comparison_features.comp_c]
column_name = "durmarr"
comparison_type = "new_marr"
upper_threshold = 7
[comparison_features.comp_d]
column_name = "sploc"
comparison_type = "present_both_years"
```

### btwn_threshold
* Attributes: 
  * `btwn_threshold` -- Type: List of numeric type.  Asserts if the comparison feature output is greater than or equal to (`>=`) the first threshold value, and less than or equal to (`<=`) the second threshold value.
```
[[comparison_features]]
alias = "mardurmatch"
column_name = "durmarr"
not_equals = 99
comparison_type = "b_minus_a"
btwn_threshold = [5,14]
categorical = true
```

### look_at_addl_var
* Attributes:
  * `look_at_addl_var` -- Type: boolean. Flags the program to consider an additional column value before reporting the comparison feature value.
  * `addl_var` -- Type: `string`. The additional column to consider.
  * `check_val_expr` -- Type: expression.  The expression to use to evaluate the additional column.  For example, `check_val_expr = "= 5"`.
  * `else_val` -- Type: same type as comparison feature output.  If the additional volumn value does not meet the `check_val_expr` specification, the value to return instead of the comparison feature value.
  
In the following example, the generated SQL expression for the column would be: `CASE WHEN {datasource}.nativity = 5 then {yrimmig abs_diff value} else -1 END`.
```
[[comparison_features]]
alias = "immyear_diff"
column_name = "yrimmig"
comparison_type = "abs_diff"
look_at_addl_var = true
addl_var = "nativity"
datasource = "a"
check_val_expr = "= 5"
else_val = -1
```

## Aggregate Features
These features are not configurable.  To include them in the generated comparison features, they just need to be included in the `[training][independent_vars]` section of the config.  They are generated using the "aggregate_features" SQL template.

### hits
The number of potential matches generated for the given individual (counted by aggregating on `{id_column}_a`).

### hits2
`hits` squared.

### exact_mult
Indicator for the existence of multiple potential matches with the exact same first and last name as the A sample individual within the B data.  Returns numeric boolean (0 or 1).

## Household Aggregate Features
These features are not configurable.  To include them in the generated comparison features, they just need to be included in the `[hh_training][independent_vars]` section of the config.  They are generated using the "hh_aggregate_features" SQL template.

### jw_max_a
The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where birth year difference is less than or equal to ten, excluding the individual A in the current potential match.  Returns `0` if no other individuals are in the household for comparison.

### jw_max_b
The highest Jaro-Winkler score for any of the first names in linked household A against the first name in linked household B where sex matches and birth year difference is less than or equal to ten, excluding the individual A in the current potential match.  Returns `0` if no other individuals are in the household for comparison.
