bioneuralnet.utils.preprocess

Functions

clean_inf_nan(df)

Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.

get_logger(name)

Retrieves a global logger configured to write to 'bioneuralnet.log' at the project root.

multipletests(pvals[, alpha, method, ...])

Test results and p-value correction for multiple tests

network_remove_high_zero_fraction(network[, ...])

Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.

network_remove_low_variance(network[, threshold])

Remove rows and columns from adjacency matrix where the variance is below a threshold.

preprocess_clinical(X, y[, top_k, scale, ...])

Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.

prune_network(adjacency_matrix[, ...])

Prune a network based on a weight threshold, removing nodes with weak connections.

prune_network_by_quantile(adjacency_matrix)

Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.

select_top_k_correlation(X[, y, top_k])

Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).

select_top_k_variance(df[, k, ddof])

Select the top k features with the highest variance.

select_top_randomforest(X, y[, top_k, seed])

Select the top k features using RandomForest feature importances.

top_anova_f_features(X, y, max_features[, ...])

Select top features based on ANOVA F-test (with false discovery rate correction).

bioneuralnet.utils.preprocess.clean_inf_nan(df: DataFrame) DataFrame[source]

Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.

Parameters:

df (pd.DataFrame) – Input DataFrame containing numeric columns.

Returns:

Cleaned DataFrame with no infinite or NaN values and no zero-variance columns.

Return type:

  • pd.DataFrame
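
The three steps described above can be sketched in plain pandas. This is only an illustration of the documented behavior, not the library's implementation; the helper name `clean_inf_nan_sketch` is hypothetical, and the use of population variance (`ddof=0`) to detect constant columns is an assumption.

```python
import numpy as np
import pandas as pd

def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the documented cleaning steps."""
    # Step 1: treat +inf / -inf as missing values.
    out = df.replace([np.inf, -np.inf], np.nan)
    # Step 2: fill each column's NaNs with that column's median.
    out = out.fillna(out.median())
    # Step 3: keep only columns whose variance is strictly positive
    # (assumption: ddof=0; constant columns have variance 0 either way).
    return out.loc[:, out.var(ddof=0) > 0]

df = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],   # inf becomes NaN, then the median 2.0
    "b": [2.0, 2.0, np.nan],   # becomes constant after imputation
    "c": [0.5, 1.5, 2.5],
})
cleaned = clean_inf_nan_sketch(df)
print(cleaned)
```

In this toy input, column `b` is constant after imputation and is dropped, so only `a` and `c` remain.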

bioneuralnet.utils.preprocess.network_remove_high_zero_fraction(network: DataFrame, threshold: float = 0.95) DataFrame[source]

Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.

Parameters:
  • network (pd.DataFrame) – Adjacency matrix.

  • threshold (float) – Zero-fraction threshold.

Returns:

Filtered adjacency matrix.

Return type:

pd.DataFrame
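
A minimal sketch of this kind of filter, assuming a square adjacency matrix with identical row and column labels and a zero fraction computed per row. The helper name `remove_high_zero_fraction_sketch` is hypothetical, and how the library handles a fraction exactly equal to the threshold is not documented; this sketch keeps such nodes.

```python
import pandas as pd

def remove_high_zero_fraction_sketch(network: pd.DataFrame,
                                     threshold: float = 0.95) -> pd.DataFrame:
    """Hypothetical filter: drop nodes whose row is mostly zeros."""
    # Fraction of zero entries in each row of the adjacency matrix.
    zero_fraction = (network == 0).mean(axis=1)
    # Keep nodes whose zero fraction does not exceed the threshold.
    keep = zero_fraction[zero_fraction <= threshold].index
    # Drop the same labels from rows and columns so the matrix stays square.
    return network.loc[keep, keep]

nodes = ["g1", "g2", "g3"]
network = pd.DataFrame(
    [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.0, 0.0]],
    index=nodes, columns=nodes,
)
# g3 is all zeros (fraction 1.0); g1 and g2 have fraction 2/3.
filtered = remove_high_zero_fraction_sketch(network, threshold=0.7)
print(filtered)
```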

bioneuralnet.utils.preprocess.network_remove_low_variance(network: DataFrame, threshold: float = 1e-06) DataFrame[source]

Remove rows and columns from adjacency matrix where the variance is below a threshold.

Parameters:
  • network (pd.DataFrame) – Adjacency matrix.

  • threshold (float) – Variance threshold.

Returns:

Filtered adjacency matrix.

Return type:

pd.DataFrame
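
The variance filter can be sketched the same way. The helper name `remove_low_variance_sketch` is hypothetical; computing the variance per row with the pandas default (`ddof=1`) and assuming matching row and column labels are assumptions, since the entry above does not specify them.

```python
import pandas as pd

def remove_low_variance_sketch(network: pd.DataFrame,
                               threshold: float = 1e-6) -> pd.DataFrame:
    """Hypothetical filter: drop nodes whose row is nearly constant."""
    # Variance of each node's row of edge weights (pandas default ddof=1).
    variances = network.var(axis=1)
    # Keep nodes whose variance is not below the threshold.
    keep = variances[variances >= threshold].index
    # Apply the same selection to rows and columns.
    return network.loc[keep, keep]

nodes = ["a", "b", "c"]
network = pd.DataFrame(
    [[0.0, 0.5, 0.1],
     [0.5, 0.0, 0.1],
     [0.1, 0.1, 0.1]],   # node c has (near-)zero variance
    index=nodes, columns=nodes,
)
filtered = remove_low_variance_sketch(network)
print(filtered)
```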

bioneuralnet.utils.preprocess.preprocess_clinical(X: DataFrame, y: Series, top_k: int = 10, scale: bool = False, ignore_columns=None) DataFrame[source]

Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.

Parameters:
  • X (pd.DataFrame) – Clinical feature matrix (samples x features) including numeric and categorical columns.

  • y (pd.Series) – Target values; single-column DataFrame or Series of length n_samples.

  • top_k (int) – Number of features to select based on importance; default is 10.

  • scale (bool) – If True, scale numeric features using RobustScaler; default is False.

  • ignore_columns (list, optional) – List of columns to ignore during preprocessing; default is None.

Returns:

Subset of the original features with the selected top_k features plus ignored columns.

Return type:

  • pd.DataFrame

bioneuralnet.utils.preprocess.prune_network(adjacency_matrix, weight_threshold=0.0)[source]

Prune a network based on a weight threshold, removing nodes with weak connections.

Parameters:
  • adjacency_matrix – The adjacency matrix of the network.

  • weight_threshold (float) – Minimum weight to keep an edge (default: 0.0).

Return type:

  • pd.DataFrame
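
One plausible reading of threshold-based pruning is sketched below. It is not the library's implementation: the helper name `prune_network_sketch` is hypothetical, and the sketch assumes a symmetric pandas adjacency matrix with non-negative weights, where zero means "no edge" and an edge survives when its weight is at least the threshold.

```python
import pandas as pd

def prune_network_sketch(adjacency_matrix: pd.DataFrame,
                         weight_threshold: float = 0.0) -> pd.DataFrame:
    """Hypothetical pruning: remove weak edges, then isolated nodes."""
    # Zero out edges lighter than the threshold.
    pruned = adjacency_matrix.where(adjacency_matrix >= weight_threshold, 0.0)
    # A node is isolated when its row has no non-zero entries left.
    connected = pruned.index[(pruned != 0).any(axis=1)]
    return pruned.loc[connected, connected]

nodes = ["n1", "n2", "n3"]
adjacency = pd.DataFrame(
    [[0.0, 0.9, 0.2],
     [0.9, 0.0, 0.1],
     [0.2, 0.1, 0.0]],
    index=nodes, columns=nodes,
)
# Only the n1-n2 edge (0.9) survives a threshold of 0.5, so n3 is dropped.
pruned = prune_network_sketch(adjacency, weight_threshold=0.5)
print(pruned)
```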

bioneuralnet.utils.preprocess.prune_network_by_quantile(adjacency_matrix, quantile=0.5)[source]

Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.

Parameters:
  • adjacency_matrix – Weighted adjacency matrix (nodes x nodes).

  • quantile (float) – Quantile in [0,1] to compute weight threshold; default is 0.5.

Returns:

Pruned adjacency matrix with edges below the quantile threshold removed.

Return type:

  • pd.DataFrame
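
The quantile variant differs only in how the threshold is chosen. In the sketch below the threshold is the requested quantile of the positive edge weights; whether the library computes the quantile over all entries or only non-zero ones is not stated above, so that choice is an assumption, as is the hypothetical helper name `prune_by_quantile_sketch`.

```python
import numpy as np
import pandas as pd

def prune_by_quantile_sketch(adjacency_matrix: pd.DataFrame,
                             quantile: float = 0.5) -> pd.DataFrame:
    """Hypothetical pruning with a data-driven weight threshold."""
    values = adjacency_matrix.to_numpy()
    # Assumption: only existing (positive) edges define the quantile.
    threshold = np.quantile(values[values > 0], quantile)
    # Remove edges below the threshold, then drop isolated nodes.
    pruned = adjacency_matrix.where(adjacency_matrix >= threshold, 0.0)
    connected = pruned.index[(pruned != 0).any(axis=1)]
    return pruned.loc[connected, connected]

nodes = ["a", "b", "c", "d"]
adj = pd.DataFrame(0.0, index=nodes, columns=nodes)
edges = [("a", "b", 0.9), ("a", "c", 0.6), ("b", "c", 0.3), ("c", "d", 0.1)]
for u, v, w in edges:
    adj.loc[u, v] = adj.loc[v, u] = w   # symmetric, undirected network

# The median positive weight is 0.45, so b-c and c-d are removed
# and node d, left without edges, is dropped.
pruned = prune_by_quantile_sketch(adj, quantile=0.5)
print(pruned)
```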

bioneuralnet.utils.preprocess.select_top_k_correlation(X: DataFrame, y: Series = None, top_k: int = 1000) DataFrame[source]

Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).

Parameters:
  • X (pd.DataFrame) – Numeric feature matrix (samples x features).

  • y (pd.Series, optional) – Target values for supervised selection; if None, performs unsupervised selection.

  • top_k (int) – Number of features to select; default is 1000.

Returns:

Subset of X containing the selected features.

Return type:

  • pd.DataFrame

Note

  • Correlation computation can be expensive for large datasets.
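
For the supervised case, a common approach is to rank features by the absolute Pearson correlation with the target. The sketch below shows only that branch, under that assumption; the helper name `top_k_by_target_correlation` is hypothetical, and the unsupervised (redundancy minimization) branch is not covered because its procedure is not described here.

```python
import pandas as pd

def top_k_by_target_correlation(X: pd.DataFrame, y: pd.Series,
                                top_k: int) -> pd.DataFrame:
    """Hypothetical supervised selection by |Pearson correlation| with y."""
    # Correlation of every column of X with y, aligned on the index.
    scores = X.corrwith(y).abs()
    # Column labels of the top_k highest scores, strongest first.
    selected = scores.nlargest(top_k).index
    return X[selected]

y = pd.Series([1, 2, 3, 4, 5])
X = pd.DataFrame({
    "weak":    [1, -1, 1, -1, 1],    # uncorrelated with y
    "inverse": [10, 8, 7, 3, 1],     # strongly negatively correlated
    "strong":  [2, 4, 6, 8, 10],     # perfectly correlated
})
selected = top_k_by_target_correlation(X, y, top_k=2)
print(list(selected.columns))
```

Taking the absolute value means strongly negative predictors such as `inverse` are kept alongside positive ones.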

bioneuralnet.utils.preprocess.select_top_k_variance(df: DataFrame, k: int = 1000, ddof: int = 0) DataFrame[source]

Select the top k features with the highest variance.

Parameters:
  • df (pd.DataFrame) – Input DataFrame; non-numeric columns will be ignored.

  • k (int) – Number of top-variance features to select; default is 1000.

  • ddof (int) – Delta degrees of freedom for variance calculation; default is 0.

Returns:

DataFrame containing only the top k features by variance.

Return type:

  • pd.DataFrame
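
A minimal pandas sketch of variance-based selection follows. The helper name `top_k_variance_sketch` is hypothetical, and returning the columns ordered by decreasing variance is an assumption; the library may preserve the original column order instead.

```python
import pandas as pd

def top_k_variance_sketch(df: pd.DataFrame, k: int = 1000,
                          ddof: int = 0) -> pd.DataFrame:
    """Hypothetical selection of the k highest-variance numeric columns."""
    # Ignore non-numeric columns, as the documentation describes.
    numeric = df.select_dtypes(include="number")
    # Per-column variance with the requested degrees-of-freedom correction.
    variances = numeric.var(ddof=ddof)
    # nlargest returns every column if k exceeds the number available.
    top = variances.nlargest(k).index
    return numeric[top]

df = pd.DataFrame({
    "flat":  [1, 1, 1, 1],           # variance 0
    "small": [1, 2, 1, 2],           # variance 0.25 with ddof=0
    "large": [0, 10, 0, 10],         # variance 25 with ddof=0
    "label": ["a", "b", "c", "d"],   # non-numeric, ignored
})
top = top_k_variance_sketch(df, k=2)
print(list(top.columns))
```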

bioneuralnet.utils.preprocess.select_top_randomforest(X: DataFrame, y: Series, top_k: int = 1000, seed: int = 119) DataFrame[source]

Select the top k features using RandomForest feature importances.

Parameters:
  • X (pd.DataFrame) – Numeric feature matrix (samples x features); must contain only numeric columns.

  • y (pd.Series) – Target values; single-column DataFrame or Series.

  • top_k (int) – Number of features to select; default is 1000.

  • seed (int) – Random seed for the RandomForest model; default is 119.

Returns:

Subset of X containing the selected top_k features by importance.

Return type:

  • pd.DataFrame

bioneuralnet.utils.preprocess.top_anova_f_features(X: DataFrame, y: Series, max_features: int, alpha: float = 0.05, task: str = 'classification') DataFrame[source]

Select top features based on ANOVA F-test (with false discovery rate correction). This function is suitable for both classification and regression tasks.

Parameters:
  • X (pd.DataFrame) – Numeric feature matrix (samples x features).

  • y (pd.Series) – Target vector; categorical for classification or continuous for regression.

  • max_features (int) – Maximum number of features to return.

  • alpha (float) – Significance threshold for false discovery rate correction; default is 0.05.

  • task (str) – 'classification' to use f_classif or 'regression' to use f_regression; default is 'classification'.

Returns:

Subset of X with the selected features, padded if necessary.

Return type:

  • pd.DataFrame
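
To illustrate the multiple-testing step, the sketch below applies a Benjamini-Hochberg adjustment to a handful of p-values using NumPy only. The entry above does not say which correction procedure is used, so Benjamini-Hochberg is an assumption, and the helper name `benjamini_hochberg` is hypothetical. In practice the p-values would come from the per-feature F-test rather than being typed in by hand.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Hypothetical helper: BH-adjusted p-values and rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)
    # Put the adjusted values back in the original feature order.
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted <= alpha, adjusted

# Made-up p-values for five features, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
reject, adjusted = benjamini_hochberg(pvals, alpha=0.05)
print(reject)
print(adjusted)
```

Here the third and fourth features have raw p-values below 0.05 but do not survive the correction, which is the reason a significance filter alone can return fewer than `max_features` columns.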