bioneuralnet.utils.preprocess
Functions
|
Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns. |
|
Retrieves a global logger configured to write to 'bioneuralnet.log' at the project root. |
|
Test results and p-value correction for multiple tests |
|
Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold. |
|
Remove rows and columns from adjacency matrix where the variance is below a threshold. |
|
Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance. |
|
Prune a network based on a weight threshold, removing nodes with weak connections. |
|
Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes. |
|
Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization). |
|
Select the top k features with the highest variance. |
|
Select the top k features using RandomForest feature importances. |
|
Select top features based on ANOVA F-test (with false discovery rate correction).
- bioneuralnet.utils.preprocess.clean_inf_nan(df: DataFrame) → DataFrame
Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing numeric columns.
- Returns:
Cleaned DataFrame with no infinite or NaN values and no zero-variance columns.
- Return type:
pd.DataFrame
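The three documented steps can be illustrated with a small pandas sketch. The helper name clean_inf_nan_sketch and the sample data are hypothetical, and the library's own implementation may differ in details such as how non-numeric columns are treated.

```python
import numpy as np
import pandas as pd


def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in that mirrors the documented behaviour."""
    # 1. Replace +/- infinity with NaN.
    out = df.replace([np.inf, -np.inf], np.nan)
    # 2. Impute every NaN with the median of its column.
    out = out.fillna(out.median())
    # 3. Drop columns whose variance is exactly zero.
    variances = out.var(axis=0, ddof=0)
    return out.loc[:, variances > 0]


df = pd.DataFrame(
    {
        "a": [1.0, np.inf, 3.0, 5.0],   # contains an infinite value
        "b": [2.0, 2.0, np.nan, 2.0],   # constant once imputed
        "c": [0.5, 1.5, np.nan, 2.5],   # contains a NaN
    }
)
cleaned = clean_inf_nan_sketch(df)
print(cleaned)
```

In this example column b becomes constant after imputation and is therefore dropped, while a and c are kept with their gaps filled.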
- bioneuralnet.utils.preprocess.network_remove_high_zero_fraction(network: DataFrame, threshold: float = 0.95) → DataFrame
Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.
- Parameters:
network (pd.DataFrame) – Adjacency matrix.
threshold (float) – Zero-fraction threshold.
- Returns:
Filtered adjacency matrix.
- Return type:
pd.DataFrame
- bioneuralnet.utils.preprocess.network_remove_low_variance(network: DataFrame, threshold: float = 1e-06) → DataFrame
Remove rows and columns from adjacency matrix where the variance is below a threshold.
- Parameters:
network (pd.DataFrame) – Adjacency matrix.
threshold (float) – Variance threshold.
- Returns:
Filtered adjacency matrix.
- Return type:
pd.DataFrame
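The two adjacency-matrix filters above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, the matrix is assumed to be square and symmetric with identical row and column labels, and the statistic is computed per row and applied to both the row and the column of each node.

```python
import pandas as pd


def remove_high_zero_fraction_sketch(network: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    # Fraction of zero entries in each node's row.
    zero_fraction = (network == 0).mean(axis=1)
    # Keep nodes whose zero fraction does not exceed the threshold.
    keep = zero_fraction[zero_fraction <= threshold].index
    return network.loc[keep, keep]


def remove_low_variance_sketch(network: pd.DataFrame, threshold: float = 1e-6) -> pd.DataFrame:
    # Population variance of each node's row.
    variance = network.var(axis=1, ddof=0)
    # Keep nodes whose variance reaches the threshold.
    keep = variance[variance >= threshold].index
    return network.loc[keep, keep]


nodes = ["A", "B", "C", "D"]
network = pd.DataFrame(
    [
        [0.0, 0.8, 0.3, 0.0],
        [0.8, 0.0, 0.5, 0.0],
        [0.3, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],  # node D has no connections at all
    ],
    index=nodes,
    columns=nodes,
)

by_zero_fraction = remove_high_zero_fraction_sketch(network, threshold=0.95)
by_variance = remove_low_variance_sketch(network, threshold=1e-6)
print(by_zero_fraction)
print(by_variance)
```

Node D is all zeros, so its zero fraction is 1.0 and its variance is 0; both sketches remove it and return the 3 x 3 matrix over A, B and C.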
- bioneuralnet.utils.preprocess.preprocess_clinical(X: DataFrame, y: Series, top_k: int = 10, scale: bool = False, ignore_columns=None) → DataFrame
Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.
- Parameters:
X (pd.DataFrame) – Clinical feature matrix (samples x features) including numeric and categorical columns.
y (pd.Series) – Target values; single-column DataFrame or Series of length n_samples.
top_k (int) – Number of features to select based on importance.
scale (bool) – If True, scale numeric features using RobustScaler; default is False.
ignore_columns – List of columns to ignore during preprocessing; default is None.
- Returns:
Subset of the original features with the selected top_k features plus ignored columns.
- Return type:
pd.DataFrame
- bioneuralnet.utils.preprocess.prune_network(adjacency_matrix, weight_threshold=0.0)
Prune a network based on a weight threshold, removing nodes with weak connections.
- Parameters:
adjacency_matrix – The adjacency matrix of the network.
weight_threshold – Minimum weight to keep an edge (default: 0.0).
- Return type:
pd.DataFrame
- bioneuralnet.utils.preprocess.prune_network_by_quantile(adjacency_matrix, quantile=0.5)
Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.
- Parameters:
adjacency_matrix – Weighted adjacency matrix (nodes x nodes).
quantile – Quantile in [0,1] to compute weight threshold; default is 0.5.
- Returns:
Pruned adjacency matrix with edges below the quantile threshold removed.
- Return type:
pd.DataFrame
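Threshold-based and quantile-based pruning are closely related, so one sketch can cover both. The helper names below are hypothetical, and two details are assumptions rather than documented behaviour: the graph is treated as undirected with non-negative weights, and the quantile is taken over the non-zero off-diagonal weights.

```python
import numpy as np
import pandas as pd


def prune_by_weight_sketch(adjacency: pd.DataFrame, weight_threshold: float = 0.0) -> pd.DataFrame:
    # Zero out every edge lighter than the threshold.
    pruned = adjacency.where(adjacency >= weight_threshold, 0.0)
    # Drop nodes that are left without any edge.
    degree = (pruned != 0).sum(axis=1)
    keep = degree[degree > 0].index
    return pruned.loc[keep, keep]


def prune_by_quantile_sketch(adjacency: pd.DataFrame, quantile: float = 0.5) -> pd.DataFrame:
    values = adjacency.to_numpy()
    # Each undirected edge appears once in the strict upper triangle.
    upper = values[np.triu_indices_from(values, k=1)]
    weights = upper[upper != 0]
    # Assumed rule: the threshold is the given quantile of the existing edge weights.
    threshold = float(np.quantile(weights, quantile))
    return prune_by_weight_sketch(adjacency, threshold)


nodes = ["A", "B", "C", "D"]
adjacency = pd.DataFrame(
    [
        [0.00, 0.80, 0.30, 0.05],
        [0.80, 0.00, 0.50, 0.00],
        [0.30, 0.50, 0.00, 0.00],
        [0.05, 0.00, 0.00, 0.00],
    ],
    index=nodes,
    columns=nodes,
)

by_weight = prune_by_weight_sketch(adjacency, weight_threshold=0.1)
by_quantile = prune_by_quantile_sketch(adjacency, quantile=0.5)
print(by_weight)
print(by_quantile)
```

With a fixed threshold of 0.1 only the weak A-D edge is removed, which isolates D. With quantile 0.5 the threshold is the median of the four edge weights, so the A-C edge is removed as well.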
- bioneuralnet.utils.preprocess.select_top_k_correlation(X: DataFrame, y: Series = None, top_k: int = 1000) → DataFrame
Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).
- Parameters:
X (pd.DataFrame) – Numeric feature matrix (samples x features).
y (pd.Series, optional) – Target values for supervised selection; if None, performs unsupervised selection.
top_k (int) – Number of features to select.
- Returns:
Subset of X containing the selected features.
- Return type:
pd.DataFrame
Note
Correlation computation can be expensive for large datasets.
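The supervised mode can be sketched as ranking features by the absolute value of their Pearson correlation with the target. The helper name and data are hypothetical, the use of Pearson correlation is an assumption, and the sketch covers only the supervised branch.

```python
import numpy as np
import pandas as pd


def top_k_correlation_sketch(X: pd.DataFrame, y: pd.Series, top_k: int = 2) -> pd.DataFrame:
    # Absolute Pearson correlation of every feature with the target.
    scores = X.corrwith(y).abs()
    # Keep the top_k most strongly correlated features.
    chosen = scores.nlargest(top_k).index
    return X[list(chosen)]


x = np.arange(10.0)
X = pd.DataFrame(
    {
        "f1": x,                      # perfectly linear in the target
        "f2": x ** 2,                 # strongly but not perfectly correlated
        "f3": np.tile([1.0, -1.0], 5),  # alternating signal, weak correlation
    }
)
y = pd.Series(2.0 * x + 1.0)

selected = top_k_correlation_sketch(X, y, top_k=2)
print(list(selected.columns))
```

Because every feature is correlated against the target, the cost of this branch grows linearly with the number of features; a full feature-by-feature correlation matrix grows quadratically, which is the expense the note above warns about.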
- bioneuralnet.utils.preprocess.select_top_k_variance(df: DataFrame, k: int = 1000, ddof: int = 0) → DataFrame
Select the top k features with the highest variance.
- Parameters:
df (pd.DataFrame) – Input DataFrame; non-numeric columns will be ignored.
k (int) – Number of top-variance features to select.
ddof (int) – Delta degrees of freedom for variance calculation; default is 0.
- Returns:
DataFrame containing only the top k features by variance.
- Return type:
pd.DataFrame
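A minimal pandas sketch of variance-based selection follows. The helper name and data are hypothetical; the sketch simply follows the documented description (ignore non-numeric columns, rank by variance with the given ddof).

```python
import pandas as pd


def top_k_variance_sketch(df: pd.DataFrame, k: int = 2, ddof: int = 0) -> pd.DataFrame:
    # Non-numeric columns are ignored.
    numeric = df.select_dtypes(include="number")
    # Variance per column with the requested delta degrees of freedom.
    variances = numeric.var(axis=0, ddof=ddof)
    # Keep the k columns with the largest variance.
    return numeric[variances.nlargest(k).index]


df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],      # variance 1.25 with ddof=0
        "b": [10.0, 10.0, 10.0, 11.0],  # variance 0.1875 with ddof=0
        "c": [0.0, 5.0, 10.0, 15.0],    # variance 31.25 with ddof=0
        "label": ["x", "y", "x", "y"],  # non-numeric, ignored
    }
)
selected = top_k_variance_sketch(df, k=2)
print(list(selected.columns))
```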
- bioneuralnet.utils.preprocess.select_top_randomforest(X: DataFrame, y: Series, top_k: int = 1000, seed: int = 119) → DataFrame
Select the top k features using RandomForest feature importances.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix (samples x features); must contain only numeric columns.
y (pd.Series) – Target values; single-column DataFrame or Series.
top_k (int) – Number of features to select.
seed (int) – Random seed for the RandomForest model; default is 119.
- Returns:
Subset of X containing the selected top_k features by importance.
- Return type:
pd.DataFrame
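Importance-based selection can be sketched with scikit-learn as follows. The helper name and synthetic data are hypothetical, and the choice of RandomForestClassifier with 100 trees is an assumption; the library may pick a classifier or a regressor depending on the target, and may use different hyperparameters.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def top_randomforest_sketch(X: pd.DataFrame, y: pd.Series, top_k: int = 2, seed: int = 119) -> pd.DataFrame:
    # Fit a forest with a fixed seed so the ranking is reproducible.
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    # Rank features by impurity-based importance and keep the top_k.
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return X[importances.nlargest(top_k).index]


rng = np.random.default_rng(0)
y = pd.Series([0] * 50 + [1] * 50)
X = pd.DataFrame(
    rng.normal(size=(100, 5)),
    columns=[f"noise_{i}" for i in range(5)],
)
# One feature that separates the two classes clearly.
X["signal"] = y * 4.0 + rng.normal(size=100)

selected = top_randomforest_sketch(X, y, top_k=2)
print(list(selected.columns))
```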
- bioneuralnet.utils.preprocess.top_anova_f_features(X: DataFrame, y: Series, max_features: int, alpha: float = 0.05, task: str = 'classification') → DataFrame
Select top features based on ANOVA F-test (with false discovery rate correction). This function is suitable for both classification and regression tasks.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix (samples x features).
y (pd.Series) – Target vector; categorical for classification or continuous for regression.
max_features (int) – Maximum number of features to return.
alpha (float) – Significance threshold for false discovery rate correction; default is 0.05.
task (str) – ‘classification’ to use f_classif or ‘regression’ to use f_regression.
- Returns:
Subset of X with the selected features, padded if necessary.
- Return type:
pd.DataFrame
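The following sketch shows one way an F-test with false discovery rate control could be combined with padding, for the classification case only. It is an illustration under assumptions: the helper names are hypothetical, the correction is a hand-written Benjamini-Hochberg procedure, and padding is assumed to mean filling the remaining slots with the non-significant features that have the largest F-statistics.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def benjamini_hochberg(pvals: np.ndarray, alpha: float) -> np.ndarray:
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    below = ranked <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()
        reject[order[: cutoff + 1]] = True
    return reject


def top_anova_sketch(X: pd.DataFrame, y: pd.Series, max_features: int, alpha: float = 0.05) -> pd.DataFrame:
    f_stats, pvals = f_classif(X, y)
    reject = benjamini_hochberg(pvals, alpha)
    order = np.argsort(-f_stats)  # strongest F-statistic first
    significant = [i for i in order if reject[i]]
    remainder = [i for i in order if not reject[i]]
    # Assumed padding rule: fill up with the best non-significant features.
    chosen = (significant + remainder)[:max_features]
    return X.iloc[:, chosen]


rng = np.random.default_rng(0)
y = pd.Series(np.repeat([0, 1], 20))
X = pd.DataFrame(
    rng.normal(size=(40, 5)),
    columns=[f"noise_{i}" for i in range(5)],
)
X["signal"] = 3.0 * y + rng.normal(size=40)

selected = top_anova_sketch(X, y, max_features=3)
print(list(selected.columns))
```

For a regression target, f_regression from the same scikit-learn module returns F-statistics and p-values in the same form and could be swapped in.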