bioneuralnet.utils
Functions
Function | Description
clean_inf_nan | Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.
correlation_summary | Compute summary statistics of the maximum pairwise correlation
explore_data_stats | Print key statistics for an omics DataFrame including variance, zero fraction,
expression_summary | Compute summary statistics for the mean expression of features
gen_correlation_graph | Build a graph based on pairwise Pearson or Spearman correlations.
gen_gaussian_knn_graph | Build a normalized knn similarity graph from feature vectors.
gen_lasso_graph | Infer a sparse network via Graphical Lasso.
gen_mst_graph | Compute the minimum spanning tree (MST) on Euclidean distances.
gen_similarity_graph | Build a normalized knn similarity graph from feature vectors.
gen_snn_graph | Build a shared nearest neighbor (SNN) graph.
gen_threshold_graph | Generate a soft-threshold co-expression network, similar to WGCNA.
get_logger | Retrieves a global logger configured to write to 'bioneuralnet.log' at the project root.
network_remove_high_zero_fraction | Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.
network_remove_low_variance | Remove rows and columns from adjacency matrix where the variance is below a threshold.
preprocess_clinical | Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.
prune_network | Prune a network based on a weight threshold, removing nodes with weak connections.
prune_network_by_quantile | Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.
select_top_k_correlation | Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).
select_top_k_variance | Select the top k features with the highest variance.
select_top_randomforest | Select the top k features using RandomForest feature importances.
top_anova_f_features | Select top features based on ANOVA F-test (with false discovery rate correction).
variance_summary | Compute summary statistics for column variances in the DataFrame
zero_fraction_summary | Compute summary statistics for the fraction of zeros in each column
- bioneuralnet.utils.clean_inf_nan(df: DataFrame) DataFrame[source]
Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing numeric columns.
- Returns:
Cleaned DataFrame with no infinite or NaN values and no zero-variance columns.
- Return type:
pd.DataFrame
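Example
The three documented steps can be sketched with plain pandas. This is a minimal illustration, not the library's source: clean_inf_nan_sketch is a hypothetical name, and the use of population variance (ddof=0) to detect constant columns is an assumption.

```python
import numpy as np
import pandas as pd


def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the three documented steps."""
    out = df.replace([np.inf, -np.inf], np.nan)  # 1. infinite values -> NaN
    out = out.fillna(out.median())               # 2. impute NaNs with column medians
    variances = out.var(ddof=0)                  # assumed: population variance
    keep = variances[variances > 0].index        # 3. drop zero-variance columns
    return out[keep]


df = pd.DataFrame({
    "a": [1.0, np.inf, 3.0, 5.0],
    "b": [2.0, 2.0, np.nan, 2.0],   # becomes constant after imputation
    "c": [0.0, 1.0, np.nan, 4.0],
})
cleaned = clean_inf_nan_sketch(df)
print(cleaned)
```

Column b is dropped because it is constant once its missing value is imputed.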
- bioneuralnet.utils.correlation_summary(df: DataFrame) dict[source]
Compute summary statistics of the maximum pairwise correlation
- bioneuralnet.utils.explore_data_stats(omics_df: DataFrame, name: str = 'Data') None[source]
Print key statistics for an omics DataFrame including variance, zero fraction,
- bioneuralnet.utils.expression_summary(df: DataFrame) dict[source]
Compute summary statistics for the mean expression of features
- bioneuralnet.utils.gen_correlation_graph(X: DataFrame, k: int = 15, method: str = 'pearson', mutual: bool = False, per_node: bool = True, threshold: float = None, self_loops: bool = True) DataFrame[source]
Build a graph based on pairwise Pearson or Spearman correlations.
- Parameters:
X (pd.DataFrame) – Input DataFrame.
k (int) – Number of neighbors to keep per node if per_node is True.
method (str) – ‘pearson’ or ‘spearman’.
mutual (bool) – If True, keep only mutual knn edges.
per_node (bool) – If True, use per-node top-k selection; otherwise apply a global threshold.
threshold (float) – Correlation cutoff when per_node is False.
self_loops (bool) – If True, adds weight 1 to diagonal.
- Returns:
Normalized adjacency matrix (N x N) of the sparse correlation graph.
- Return type:
pandas.DataFrame
Note
Computing all pairwise correlations is expensive, so this function is not recommended for large datasets.
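Example
A rough sketch of per-node top-k sparsification on a correlation matrix. Several details are assumptions made for illustration and are not confirmed by this page: rows of X are treated as nodes, absolute correlations are used as weights, the graph is symmetrized with an element-wise maximum, and "normalized" is taken to mean row-normalized. correlation_knn_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def correlation_knn_sketch(X: pd.DataFrame, k: int = 2,
                           method: str = "pearson") -> pd.DataFrame:
    # Assumed: rows of X are nodes, so correlate the transposed frame.
    W = X.T.corr(method=method).abs().to_numpy().copy()
    np.fill_diagonal(W, 0.0)                     # ignore self-correlation when ranking
    # Per-node top-k: mark the k largest entries in each row.
    keep = np.zeros_like(W, dtype=bool)
    top = np.argsort(-W, axis=1)[:, :k]
    np.put_along_axis(keep, top, True, axis=1)
    A = np.where(keep, W, 0.0)
    A = np.maximum(A, A.T)                       # assumed: symmetrize
    np.fill_diagonal(A, 1.0)                     # self-loops with weight 1
    A = A / A.sum(axis=1, keepdims=True)         # assumed: row normalization
    return pd.DataFrame(A, index=X.index, columns=X.index)


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(5, 20)),
                 index=[f"n{i}" for i in range(5)])
A = correlation_knn_sketch(X, k=2)
print(A.round(3))
```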
- bioneuralnet.utils.gen_gaussian_knn_graph(X: DataFrame, k: int = 15, sigma: float = None, mutual: bool = False, self_loops: bool = True) DataFrame[source]
Build a normalized knn similarity graph from feature vectors. Computes pairwise similarities with a Gaussian kernel on distances, then sparsifies via k-nearest neighbors. Optionally prunes to mutual neighbors and/or adds self-loops.
- Parameters:
X (pd.DataFrame) – Feature matrix where rows are nodes and columns are features.
k (int) – Number of neighbors to keep per node.
sigma (float) – Bandwidth of the Gaussian kernel; default is None.
mutual (bool) – If True, only mutual knn edges.
self_loops (bool) – If True, adds weight 1 to diagonal.
- Returns:
Normalized adjacency matrix (N x N) of the sparse similarity graph.
- Return type:
pandas.DataFrame
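Example
A minimal sketch of a Gaussian kernel on Euclidean distances, the kind of similarity this entry refers to. The parameterization exp(-d^2 / (2 * sigma^2)) is a common convention and an assumption here; the page does not state which form the library uses or how sigma is chosen when it is None. gaussian_kernel_sketch is a hypothetical name.

```python
import numpy as np


def gaussian_kernel_sketch(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    # Squared Euclidean distances between all pairs of rows.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Assumed kernel form: exp(-d^2 / (2 * sigma^2)).
    return np.exp(-d2 / (2.0 * sigma ** 2))


points = np.array([[0.0, 0.0], [3.0, 4.0]])   # the two rows are 5 units apart
K = gaussian_kernel_sketch(points, sigma=5.0)
print(K)
```

A larger sigma makes distant nodes look more similar; a smaller sigma concentrates weight on close neighbors.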
- bioneuralnet.utils.gen_lasso_graph(X: DataFrame, alpha: float = 0.01, self_loops: bool = True) DataFrame[source]
Infer a sparse network via Graphical Lasso.
- Parameters:
X (pd.DataFrame) – Data matrix where rows are nodes and columns are features.
alpha (float) – Regularization parameter for Graphical Lasso.
self_loops (bool) – If True, adds weight 1 to diagonal.
- Returns:
Normalized adjacency matrix (N x N) of the inferred network.
- Return type:
pandas.DataFrame
- bioneuralnet.utils.gen_mst_graph(X: DataFrame, self_loops: bool = True) DataFrame[source]
Compute the minimum spanning tree (MST) on Euclidean distances.
- Parameters:
X (pd.DataFrame) – Feature matrix where rows are nodes and columns are features.
self_loops (bool) – If True, adds weight 1 to diagonal.
- Returns:
Normalized adjacency matrix (N x N) of the MST graph.
- Return type:
pandas.DataFrame
- bioneuralnet.utils.gen_similarity_graph(X: DataFrame, k: int = 15, metric: str = 'cosine', mutual: bool = False, per_node: bool = True, self_loops: bool = True) DataFrame[source]
Build a normalized knn similarity graph from feature vectors. Computes pairwise cosine similarity or Euclidean distance, then sparsifies via knn or a global threshold. Optionally prunes to mutual neighbors and/or adds self-loops.
- Parameters:
X (pd.DataFrame) – Feature matrix of shape (N, D) where rows are nodes and columns are features.
k (int) – Number of neighbors to keep per node.
metric (str) – “cosine” or “euclidean” (uses a Gaussian kernel on distances).
mutual (bool) – If True, retain only mutual edges (i->j and j->i).
per_node (bool) – If True, use per-node top-k selection; otherwise apply a global cutoff.
self_loops (bool) – If True, add self-loop weight of 1.
- Returns:
Normalized adjacency matrix (N x N) of the sparse similarity graph.
- Return type:
pandas.DataFrame
- bioneuralnet.utils.gen_snn_graph(X: DataFrame, k: int = 15, mutual: bool = False, self_loops: bool = True) DataFrame[source]
Build a shared nearest neighbor (SNN) graph.
- Parameters:
X (pd.DataFrame) – Feature matrix where rows are nodes and columns are features.
k (int) – Number of neighbors to keep per node.
mutual (bool) – If True, only mutual knn edges.
self_loops (bool) – If True, adds weight 1 to diagonal.
- Returns:
Normalized adjacency matrix (N x N) of the SNN graph.
- Return type:
pandas.DataFrame
- bioneuralnet.utils.gen_threshold_graph(X: DataFrame, b: float = 6.0, k: int = 15, mutual: bool = False, self_loops: bool = True) DataFrame[source]
Generate a soft-threshold co-expression network, similar to WGCNA.
- Parameters:
X (pd.DataFrame) – Data matrix where rows are nodes and columns are features.
b (float) – Thresholding exponent applied to absolute correlations.
k (int) – Number of neighbors to keep per node.
mutual (bool) – If True, only mutual knn edges.
self_loops (bool) – If True, adds weight 1 to diagonal.
- Returns:
Normalized adjacency matrix (N x N) of the soft-thresholded graph.
- Return type:
pandas.DataFrame
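Example
The soft-thresholding step raises absolute correlations to the power b, which is the unsigned adjacency used by WGCNA. The sketch below shows only that step, under the assumption that rows of X are nodes; it leaves out the knn sparsification, self-loops, and normalization that the function also performs. soft_threshold_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def soft_threshold_sketch(X: pd.DataFrame, b: float = 6.0) -> pd.DataFrame:
    corr = np.corrcoef(X.to_numpy())      # rows are nodes (assumed)
    A = np.abs(corr) ** b                 # soft threshold: |r| ** b
    return pd.DataFrame(A, index=X.index, columns=X.index)


X = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],     # n1
     [2.0, 4.0, 6.0, 8.0],     # n2: perfectly correlated with n1
     [1.0, -1.0, 1.0, -1.0]],  # n3: r = -1/sqrt(5) with n1
    index=["n1", "n2", "n3"],
)
A = soft_threshold_sketch(X, b=6.0)
print(A.round(4))
```

With b = 6, the moderate correlation between n1 and n3 (|r| is about 0.447) shrinks to 0.008, while the perfect correlation between n1 and n2 stays at 1.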
- bioneuralnet.utils.get_logger(name: str) Logger[source]
Retrieves a global logger configured to write to ‘bioneuralnet.log’ at the project root.
- Parameters:
name (str) – Name of the logger.
- Returns:
Configured logger instance.
- Return type:
logging.Logger
- bioneuralnet.utils.network_remove_high_zero_fraction(network: DataFrame, threshold: float = 0.95) DataFrame[source]
Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.
- Parameters:
network (pd.DataFrame) – Adjacency matrix.
threshold (float) – Zero-fraction threshold.
- Returns:
Filtered adjacency matrix.
- Return type:
pd.DataFrame
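Example
A minimal sketch of the filtering rule, assuming the zero fraction is computed per row and that a node failing the test is removed from both rows and columns. remove_high_zero_fraction_sketch is a hypothetical name and may differ from the library's implementation.

```python
import pandas as pd


def remove_high_zero_fraction_sketch(network: pd.DataFrame,
                                     threshold: float = 0.95) -> pd.DataFrame:
    zero_frac = (network == 0).mean(axis=1)          # fraction of zeros per row
    keep = zero_frac[zero_frac <= threshold].index   # drop rows above the threshold
    return network.loc[keep, keep]                   # same nodes on both axes


nodes = ["a", "b", "c", "d"]
network = pd.DataFrame(
    [[0.0, 0.5, 0.2, 0.0],
     [0.5, 0.0, 0.1, 0.0],
     [0.2, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],   # node d has no edges at all
    index=nodes, columns=nodes,
)
filtered = remove_high_zero_fraction_sketch(network, threshold=0.75)
print(filtered)
```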
- bioneuralnet.utils.network_remove_low_variance(network: DataFrame, threshold: float = 1e-06) DataFrame[source]
Remove rows and columns from adjacency matrix where the variance is below a threshold.
- Parameters:
network (pd.DataFrame) – Adjacency matrix.
threshold (float) – Variance threshold.
- Returns:
Filtered adjacency matrix.
- Return type:
pd.DataFrame
- bioneuralnet.utils.preprocess_clinical(X: DataFrame, y: Series, top_k: int = 10, scale: bool = False, ignore_columns=None) DataFrame[source]
Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.
- Parameters:
X (pd.DataFrame) – Clinical feature matrix (samples x features) including numeric and categorical columns.
y (pd.Series) – Target values; single-column DataFrame or Series of length n_samples.
top_k (int) – Number of features to select based on importance.
scale (bool) – If True, scale numeric features using RobustScaler; default is False.
ignore_columns (list) – List of columns to ignore during preprocessing; default is None.
- Returns:
Subset of the original features with the selected top_k features plus ignored columns.
- Return type:
pd.DataFrame
- bioneuralnet.utils.prune_network(adjacency_matrix, weight_threshold=0.0)[source]
Prune a network based on a weight threshold, removing nodes with weak connections.
- Parameters:
adjacency_matrix (pd.DataFrame) – The adjacency matrix of the network.
weight_threshold (float) – Minimum weight to keep an edge (default: 0.0).
- Return type:
pd.DataFrame
- bioneuralnet.utils.prune_network_by_quantile(adjacency_matrix, quantile=0.5)[source]
Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.
- Parameters:
adjacency_matrix (pd.DataFrame) – Weighted adjacency matrix (nodes x nodes).
quantile (float) – Quantile in [0,1] to compute weight threshold; default is 0.5.
- Returns:
Pruned adjacency matrix with edges below the quantile threshold removed.
- Return type:
pd.DataFrame
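Example
A minimal sketch of quantile-based pruning. Two details are assumptions: the quantile is taken over the non-zero weights only, and edges exactly at the cutoff are kept. prune_by_quantile_sketch is a hypothetical name.

```python
import numpy as np
import pandas as pd


def prune_by_quantile_sketch(adj: pd.DataFrame, quantile: float = 0.5) -> pd.DataFrame:
    W = adj.to_numpy(dtype=float)
    cutoff = np.quantile(W[W > 0], quantile)   # assumed: quantile of non-zero weights
    pruned = adj.where(adj >= cutoff, 0.0)     # zero out edges below the cutoff
    connected = (pruned != 0).any(axis=1)      # nodes that still have an edge
    keep = connected[connected].index
    return pruned.loc[keep, keep]              # drop isolated nodes


nodes = ["a", "b", "c", "d"]
adj = pd.DataFrame(
    [[0.0, 0.9, 0.5, 0.0],
     [0.9, 0.0, 0.6, 0.0],
     [0.5, 0.6, 0.0, 0.1],
     [0.0, 0.0, 0.1, 0.0]],
    index=nodes, columns=nodes,
)
pruned = prune_by_quantile_sketch(adj, quantile=0.5)
print(pruned)
```

Here the median non-zero weight is 0.55, so the a-c and c-d edges are removed and node d, left without edges, is dropped.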
- bioneuralnet.utils.select_top_k_correlation(X: DataFrame, y: Series = None, top_k: int = 1000) DataFrame[source]
Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).
- Parameters:
X (pd.DataFrame) – Numeric feature matrix (samples x features).
y (pd.Series) – Target values for supervised selection; if None, performs unsupervised selection.
top_k (int) – Number of features to select.
- Returns:
Subset of X containing the selected features.
- Return type:
pd.DataFrame
Note
Correlation computation can be expensive for large datasets.
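Example
The supervised mode can be sketched as ranking features by their correlation with y. Ranking by absolute Pearson correlation is an assumption; the page does not say how the library scores features, and the unsupervised mode is not covered here. top_k_correlation_sketch is a hypothetical name.

```python
import pandas as pd


def top_k_correlation_sketch(X: pd.DataFrame, y: pd.Series, top_k: int = 2) -> pd.DataFrame:
    scores = X.corrwith(y).abs()                 # assumed: |Pearson r| with the target
    return X[scores.nlargest(top_k).index]       # keep the top_k highest-scoring columns


y = pd.Series([1.0, 2.0, 3.0, 4.0])
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],     # perfectly correlated with y
    "b": [4.0, 3.0, 2.0, 2.0],     # strongly anti-correlated with y
    "c": [1.0, -1.0, 1.0, -1.0],   # weakly correlated with y
})
selected = top_k_correlation_sketch(X, y, top_k=2)
print(list(selected.columns))
```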
- bioneuralnet.utils.select_top_k_variance(df: DataFrame, k: int = 1000, ddof: int = 0) DataFrame[source]
Select the top k features with the highest variance.
- Parameters:
df (pd.DataFrame) – Input DataFrame; non-numeric columns will be ignored.
k (int) – Number of top-variance features to select.
ddof (int) – Delta degrees of freedom for variance calculation; default is 0.
- Returns:
DataFrame containing only the top k features by variance.
- Return type:
pd.DataFrame
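Example
A minimal pandas sketch of variance-based selection as described above: non-numeric columns are ignored and the k columns with the largest variance are kept. top_k_variance_sketch is a hypothetical name, not the library function.

```python
import pandas as pd


def top_k_variance_sketch(df: pd.DataFrame, k: int = 2, ddof: int = 0) -> pd.DataFrame:
    numeric = df.select_dtypes(include="number")   # ignore non-numeric columns
    variances = numeric.var(ddof=ddof)
    return numeric[variances.nlargest(k).index]    # k highest-variance columns


df = pd.DataFrame({
    "a": [1.0, 1.0, 1.0, 1.0],      # variance 0
    "b": [0.0, 10.0, 0.0, 10.0],    # variance 25 (ddof=0)
    "c": [1.0, 2.0, 3.0, 4.0],      # variance 1.25 (ddof=0)
    "label": ["x", "y", "x", "y"],  # non-numeric, ignored
})
top = top_k_variance_sketch(df, k=2)
print(list(top.columns))
```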
- bioneuralnet.utils.select_top_randomforest(X: DataFrame, y: Series, top_k: int = 1000, seed: int = 119) DataFrame[source]
Select the top k features using RandomForest feature importances.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix (samples x features); must contain only numeric columns.
y (pd.Series) – Target values; single-column DataFrame or Series.
top_k (int) – Number of features to select.
seed (int) – Random seed for the RandomForest model; default is 119.
- Returns:
Subset of X containing the selected top_k features by importance.
- Return type:
pd.DataFrame
- bioneuralnet.utils.top_anova_f_features(X: DataFrame, y: Series, max_features: int, alpha: float = 0.05, task: str = 'classification') DataFrame[source]
Select top features based on ANOVA F-test (with false discovery rate correction). This function is suitable for both classification and regression tasks.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix (samples x features).
y (pd.Series) – Target vector; categorical for classification or continuous for regression.
max_features (int) – Maximum number of features to return.
alpha (float) – Significance threshold for false discovery rate correction; default is 0.05.
task (str) – ‘classification’ to use f_classif or ‘regression’ to use f_regression.
- Returns:
Subset of X with the selected features, padded if necessary.
- Return type:
pd.DataFrame
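Example
The page does not state which false discovery rate procedure is applied to the F-test p-values. The sketch below assumes the Benjamini-Hochberg step-up procedure, a common choice, and shows how alpha would decide which features count as significant. benjamini_hochberg_sketch is a hypothetical name and the p-values are made up for illustration.

```python
import numpy as np


def benjamini_hochberg_sketch(pvals, alpha: float = 0.05) -> np.ndarray:
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                             # ranks from smallest p-value
    thresholds = alpha * np.arange(1, m + 1) / m      # alpha * rank / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        last = np.max(np.nonzero(below)[0])           # largest rank passing its threshold
        reject[order[: last + 1]] = True              # reject everything up to that rank
    return reject


pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg_sketch(pvals, alpha=0.05)
print(significant)
```

Only the first two p-values pass, even though five are below 0.05, because the correction compares each ranked p-value against alpha * rank / m.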
- bioneuralnet.utils.variance_summary(df: DataFrame, low_var_threshold: float = None) dict[source]
Compute summary statistics for column variances in the DataFrame
- bioneuralnet.utils.zero_fraction_summary(df: DataFrame, high_zero_threshold: float = None) dict[source]
Compute summary statistics for the fraction of zeros in each column
Modules