bioneuralnet.utils

Functions

clean_inf_nan(df)

Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.

correlation_summary(df)

Compute summary statistics of the maximum pairwise correlation

explore_data_stats(omics_df[, name])

Print key statistics for an omics DataFrame including variance, zero fraction,

expression_summary(df)

Compute summary statistics for the mean expression of features

gen_correlation_graph(X[, k, method, ...])

Build a graph based on pairwise Pearson or Spearman correlations.

gen_gaussian_knn_graph(X[, k, sigma, ...])

Build a normalized knn similarity graph from feature vectors.

gen_lasso_graph(X[, alpha, self_loops])

Infer a sparse network via Graphical Lasso.

gen_mst_graph(X[, self_loops])

Compute the minimum spanning tree (MST) on Euclidean distances.

gen_similarity_graph(X[, k, metric, mutual, ...])

Build a normalized knn similarity graph from feature vectors.

gen_snn_graph(X[, k, mutual, self_loops])

Build a shared nearest neighbor (SNN) graph.

gen_threshold_graph(X[, b, k, mutual, ...])

Generate a soft-threshold co-expression network, similar to the approach used by WGCNA.

get_logger(name)

Retrieves a global logger configured to write to 'bioneuralnet.log' at the project root.

network_remove_high_zero_fraction(network[, ...])

Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.

network_remove_low_variance(network[, threshold])

Remove rows and columns from adjacency matrix where the variance is below a threshold.

preprocess_clinical(X, y[, top_k, scale, ...])

Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.

prune_network(adjacency_matrix[, ...])

Prune a network based on a weight threshold, removing nodes with weak connections.

prune_network_by_quantile(adjacency_matrix[, quantile])

Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.

rdata_to_df(rdata_file, csv_file[, Object])

select_top_k_correlation(X[, y, top_k])

Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).

select_top_k_variance(df[, k, ddof])

Select the top k features with the highest variance.

select_top_randomforest(X, y[, top_k, seed])

Select the top k features using RandomForest feature importances.

top_anova_f_features(X, y, max_features[, ...])

Select top features based on ANOVA F-test (with false discovery rate correction).

variance_summary(df[, low_var_threshold])

Compute summary statistics for column variances in the DataFrame

zero_fraction_summary(df[, high_zero_threshold])

Compute summary statistics for the fraction of zeros in each column

bioneuralnet.utils.clean_inf_nan(df: DataFrame) → DataFrame

Replace infinite values with NaN, impute NaNs with the column median, and drop zero-variance columns.

Parameters:

df (pd.DataFrame) – Input DataFrame containing numeric columns.

Returns:

Cleaned DataFrame with no infinite or NaN values and no zero-variance columns.

Return type:

  • pd.DataFrame
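The three steps in the description can be illustrated with plain pandas. This is a minimal sketch of the described behavior, not the library's implementation; the helper name `clean_inf_nan_sketch` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative stand-in for the three cleaning steps (hypothetical helper)."""
    out = df.replace([np.inf, -np.inf], np.nan)   # 1. infinities become NaN
    out = out.fillna(out.median())                # 2. impute with the column median
    return out.loc[:, out.var(ddof=0) > 0]        # 3. drop zero-variance columns


demo = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],     # inf is imputed with the median of [1, 3]
    "b": [2.0, 2.0, 2.0],        # constant column, removed in step 3
    "c": [np.nan, 4.0, 6.0],     # NaN is imputed with the median of [4, 6]
})
cleaned = clean_inf_nan_sketch(demo)
```

Here `cleaned` keeps columns `a` and `c` only, with every value finite.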

bioneuralnet.utils.correlation_summary(df: DataFrame) → dict

Compute summary statistics of the maximum pairwise correlation

bioneuralnet.utils.explore_data_stats(omics_df: DataFrame, name: str = 'Data') → None

Print key statistics for an omics DataFrame including variance, zero fraction,

bioneuralnet.utils.expression_summary(df: DataFrame) → dict

Compute summary statistics for the mean expression of features

bioneuralnet.utils.gen_correlation_graph(X: DataFrame, k: int = 15, method: str = 'pearson', mutual: bool = False, per_node: bool = True, threshold: float = None, self_loops: bool = True) → DataFrame

Build a graph based on pairwise Pearson or Spearman correlations.

Parameters:
  • X (pd.DataFrame)

  • k (int) – Number of neighbors to keep per node if per_node is True.

  • method (str) – ‘pearson’ or ‘spearman’.

  • mutual (bool) – If True, keep only mutual kNN edges.

  • per_node (bool) – If True, use per-node top-k selection; otherwise apply a global threshold.

  • threshold (float) – Correlation cutoff when per_node is False.

  • self_loops (bool) – If True, adds weight 1 to diagonal.

Returns:

  • pandas.DataFrame. Normalized adjacency matrix (N x N) of the sparse correlation graph.

Note

  • Computing all pairwise correlations scales quadratically with the number of nodes, so this function is not recommended for large datasets.
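The per-node top-k selection can be sketched as follows. This is an illustration under stated assumptions, not the library's code: rows of X are treated as nodes, absolute correlations are used as weights, and "normalized" is taken to mean that each row sums to 1. The helper name `topk_correlation_graph` is hypothetical.

```python
import numpy as np
import pandas as pd


def topk_correlation_graph(X: pd.DataFrame, k: int = 2, method: str = "pearson",
                           self_loops: bool = True) -> pd.DataFrame:
    """Keep each node's k strongest absolute correlations, then row-normalize."""
    # Assumption: rows of X are nodes, so correlate the rows (columns of X.T).
    corr = np.nan_to_num(X.T.corr(method=method).abs().to_numpy(copy=True))
    np.fill_diagonal(corr, 0.0)                    # ignore self-correlation
    adj = np.zeros_like(corr)
    for i in range(corr.shape[0]):
        nbrs = np.argsort(corr[i])[::-1][:k]       # indices of the k largest
        adj[i, nbrs] = corr[i, nbrs]
    if self_loops:
        np.fill_diagonal(adj, 1.0)
    row_sums = adj.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                  # leave empty rows as zeros
    return pd.DataFrame(adj / row_sums, index=X.index, columns=X.index)


rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(5, 20)), index=[f"g{i}" for i in range(5)])
A_corr = topk_correlation_graph(X_demo, k=2)
```

Each row of `A_corr` holds at most k neighbor weights plus the self-loop.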

bioneuralnet.utils.gen_gaussian_knn_graph(X: DataFrame, k: int = 15, sigma: float = None, mutual: bool = False, self_loops: bool = True) → DataFrame

Build a normalized knn similarity graph from feature vectors. Computes pairwise cosine or Euclidean similarities, sparsifies via k-nearest neighbors or a global threshold. Optionally prunes to mutual neighbors and/or adds self-loops.

Parameters:
  • X (pd.DataFrame) – Feature matrix where rows are nodes and columns are features.

  • k (int) – Number of neighbors to keep per node.

  • sigma (float) – Width of the Gaussian kernel; default is None.

  • mutual (bool) – If True, keep only mutual kNN edges.

  • self_loops (bool) – If True, adds weight 1 to diagonal.

Returns:

Normalized adjacency matrix (N x N) of the sparse similarity graph.

Return type:

  • pandas.DataFrame
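A Gaussian kernel turns pairwise distances into similarities before the kNN sparsification. The sketch below shows only that kernel step, assuming the common form exp(-d² / (2σ²)) with an explicitly supplied `sigma`. How the library chooses `sigma` when it is None is not stated here, so that case is not modeled. The helper name `gaussian_affinity` is hypothetical.

```python
import numpy as np
import pandas as pd


def gaussian_affinity(X: pd.DataFrame, sigma: float) -> pd.DataFrame:
    """Map pairwise Euclidean distances to similarities (hypothetical helper)."""
    V = X.to_numpy(dtype=float)
    sq_dist = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dist / (2.0 * sigma ** 2))      # nearby points score near 1
    return pd.DataFrame(K, index=X.index, columns=X.index)


points = pd.DataFrame([[0.0, 0.0], [3.0, 4.0], [30.0, 40.0]], index=["a", "b", "c"])
K_demo = gaussian_affinity(points, sigma=5.0)
```

Points `a` and `b` are 5 units apart, one kernel width, so their similarity is exp(-0.5). Point `c` is ten widths from `a`, so its similarity to `a` is effectively zero.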

bioneuralnet.utils.gen_lasso_graph(X: DataFrame, alpha: float = 0.01, self_loops: bool = True) → DataFrame

Infer a sparse network via Graphical Lasso.

Parameters:
  • X (pd.DataFrame) – Data matrix where rows are nodes and columns are features.

  • alpha (float) – Regularization parameter for Graphical Lasso.

  • self_loops (bool) – If True, adds weight 1 to diagonal.

Returns:

Normalized adjacency matrix (N x N) of the inferred network.

Return type:

  • pandas.DataFrame

bioneuralnet.utils.gen_mst_graph(X: DataFrame, self_loops: bool = True) → DataFrame

Compute the minimum spanning tree (MST) on Euclidean distances.

Parameters:
  • X (pd.DataFrame) – Feature matrix where rows are nodes and columns are features.

  • self_loops (bool) – If True, adds weight 1 to diagonal.

Returns:

Normalized adjacency matrix (N x N) of the MST graph.

Return type:

  • pandas.DataFrame
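To make the idea concrete, the sketch below builds a minimum spanning tree on Euclidean distances with Prim's algorithm in NumPy. It is an illustration only. The choice of Prim's algorithm, the binary edge weights, the row normalization, and the helper name `mst_graph_sketch` are assumptions, since the library's internals are not described here.

```python
import numpy as np
import pandas as pd


def mst_graph_sketch(X: pd.DataFrame, self_loops: bool = True) -> pd.DataFrame:
    """Prim's algorithm on pairwise Euclidean distances (hypothetical helper)."""
    V = X.to_numpy(dtype=float)
    dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                        # grow the tree from the first node
    best = dist[0].copy()                    # cheapest known link into the tree
    parent = np.zeros(n, dtype=int)          # tree node that offers that link
    adj = np.zeros((n, n))
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        adj[j, parent[j]] = adj[parent[j], j] = 1.0   # assumption: binary edges
        in_tree[j] = True
        closer = ~in_tree & (dist[j] < best)          # j offers a cheaper link
        parent[closer] = j
        best[closer] = dist[j][closer]
    if self_loops:
        np.fill_diagonal(adj, 1.0)
    adj = adj / adj.sum(axis=1, keepdims=True)        # assumption: rows sum to 1
    return pd.DataFrame(adj, index=X.index, columns=X.index)


line = pd.DataFrame({"x": [0.0, 1.0, 3.0, 7.0]}, index=["p0", "p1", "p2", "p3"])
tree = mst_graph_sketch(line, self_loops=False)
```

For four points on a line the tree is the chain p0, p1, p2, p3, so the matrix has 2 × (4 − 1) = 6 nonzero entries.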

bioneuralnet.utils.gen_similarity_graph(X: DataFrame, k: int = 15, metric: str = 'cosine', mutual: bool = False, per_node: bool = True, self_loops: bool = True) → DataFrame

Build a normalized kNN similarity graph from feature vectors. Computes pairwise cosine or Euclidean distances, then sparsifies via kNN or a global threshold. Optionally prunes to mutual neighbors and/or adds self-loops.

Parameters:
  • X (pd.DataFrame) – Feature matrix of shape (N, D) (rows = nodes, columns = features).

  • k (int) – Number of neighbors to keep per node.

  • metric (str) – “cosine” or “euclidean” (uses Gaussian kernel on distances).

  • mutual (bool) – If True, retain only mutual edges (i->j and j->i).

  • per_node (bool) – If True, use per-node top-k selection; otherwise apply a global cutoff.

  • self_loops (bool) – If True, add self-loop weight of 1.

Returns:

  • DataFrame of shape (N, N): the normalized adjacency matrix.
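The effect of the `mutual` flag is easiest to see on a small example. The sketch below covers only the cosine metric with per-node selection. It is not the library's implementation: the row normalization, the clipping of negative similarities, and the helper name `cosine_knn_graph` are assumptions.

```python
import numpy as np
import pandas as pd


def cosine_knn_graph(X: pd.DataFrame, k: int = 1, mutual: bool = False,
                     self_loops: bool = True) -> pd.DataFrame:
    """Cosine kNN graph with optional mutual-neighbor pruning (hypothetical)."""
    V = X.to_numpy(dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)     # unit-length rows
    sim = V @ V.T                                        # cosine similarity
    np.fill_diagonal(sim, -np.inf)                       # never pick self as neighbor
    n = sim.shape[0]
    nbrs = np.argsort(sim, axis=1)[:, ::-1][:, :k]       # k most similar per row
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], nbrs] = True
    if mutual:
        mask = mask & mask.T                             # keep i->j only if j->i
    adj = np.where(mask, np.clip(sim, 0.0, None), 0.0)   # drop negative similarities
    if self_loops:
        np.fill_diagonal(adj, 1.0)
    row_sums = adj.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                        # leave empty rows as zeros
    return pd.DataFrame(adj / row_sums, index=X.index, columns=X.index)


vecs = pd.DataFrame([[1.0, 0.0], [1.0, 0.2], [1.0, 1.0]], index=["a", "b", "c"])
A_any = cosine_knn_graph(vecs, k=1)
A_mut = cosine_knn_graph(vecs, k=1, mutual=True)
```

With k=1, `c` selects `b` as its nearest neighbor, but `b` selects `a`. The edge c->b therefore survives in `A_any` and is removed in `A_mut`, leaving `c` with only its self-loop.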

bioneuralnet.utils.gen_snn_graph(X: DataFrame, k: int = 15, mutual: bool = False, self_loops: bool = True) → DataFrame

Build a shared nearest neighbor (SNN) graph.

Parameters:
  • X (pd.DataFrame) – Feature matrix where rows are nodes and columns are features.

  • k (int) – Number of neighbors to keep per node.

  • mutual (bool) – If True, keep only mutual kNN edges.

  • self_loops (bool) – If True, adds weight 1 to diagonal.

Returns:

Normalized adjacency matrix (N x N) of the SNN graph.

Return type:

  • pandas.DataFrame
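In a shared nearest neighbor graph, two nodes are connected according to how many of their k nearest neighbors they have in common. The following is a minimal sketch under assumptions: Euclidean kNN, edge weight equal to the shared-neighbor count divided by k, and row normalization. The library's exact weighting is not described here, and the helper name `snn_graph_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def snn_graph_sketch(X: pd.DataFrame, k: int = 2, self_loops: bool = True) -> pd.DataFrame:
    """Weight edges by the fraction of shared k-nearest neighbors (hypothetical)."""
    V = X.to_numpy(dtype=float)
    dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                  # a node is not its own neighbor
    n = dist.shape[0]
    nbrs = np.argsort(dist, axis=1)[:, :k]          # k nearest neighbors per node
    knn = np.zeros((n, n))
    knn[np.arange(n)[:, None], nbrs] = 1.0          # knn[i, j] = 1 if j is in N_k(i)
    shared = (knn @ knn.T) / k                      # fraction of common neighbors
    np.fill_diagonal(shared, 1.0 if self_loops else 0.0)
    row_sums = shared.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                   # leave empty rows as zeros
    return pd.DataFrame(shared / row_sums, index=X.index, columns=X.index)


clusters = pd.DataFrame({"x": [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]},
                        index=[f"p{i}" for i in range(6)])
A_snn = snn_graph_sketch(clusters, k=2, self_loops=False)
```

The two groups of three points share no neighbors across the gap, so every cross-group weight is zero.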

bioneuralnet.utils.gen_threshold_graph(X: DataFrame, b: float = 6.0, k: int = 15, mutual: bool = False, self_loops: bool = True) → DataFrame

Generate a soft-threshold co-expression network, similar to the approach used by WGCNA.

Parameters:
  • X (pd.DataFrame) – Data matrix where rows are nodes and columns are features.

  • b (float) – Thresholding exponent applied to absolute correlations.

  • k (int) – Number of neighbors to keep per node.

  • mutual (bool) – If True, keep only mutual kNN edges.

  • self_loops (bool) – If True, adds weight 1 to diagonal.

Returns:

Normalized adjacency matrix (N x N) of the soft-thresholded graph.

Return type:

  • pandas.DataFrame
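Soft thresholding raises each absolute correlation to the power `b`, which suppresses weak correlations much more strongly than strong ones. The snippet below shows only that step on a hand-written correlation matrix. The helper name `soft_threshold` is hypothetical, and the subsequent kNN selection and normalization are omitted.

```python
import numpy as np


def soft_threshold(corr: np.ndarray, b: float = 6.0) -> np.ndarray:
    """Unsigned soft-threshold adjacency: |corr| ** b (hypothetical helper)."""
    return np.abs(corr) ** b


corr_demo = np.array([
    [1.0, 0.9, 0.5],
    [0.9, 1.0, -0.5],
    [0.5, -0.5, 1.0],
])
soft = soft_threshold(corr_demo, b=6.0)
```

With b = 6, a correlation of 0.9 becomes about 0.53 while a correlation of 0.5 shrinks to about 0.016, and the sign is discarded.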

bioneuralnet.utils.get_logger(name: str) → Logger

Retrieves a global logger configured to write to ‘bioneuralnet.log’ at the project root.

Parameters:

name (str) – Name of the logger.

Returns:

Configured logger instance.

Return type:

logging.Logger
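A file-backed logger of this kind can be approximated with the standard `logging` module. The sketch below is an assumption about the general pattern, not the library's code. The helper name `get_file_logger`, the format string, and the explicit `log_path` argument are hypothetical, and the demo writes to a temporary directory rather than a project root.

```python
import logging
import tempfile
from pathlib import Path


def get_file_logger(name: str, log_path: Path) -> logging.Logger:
    """Return a named logger that appends to log_path (hypothetical helper)."""
    logger = logging.getLogger(name)
    if not logger.handlers:                      # avoid duplicate handlers on reuse
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log_file = Path(tempfile.mkdtemp()) / "bioneuralnet.log"
logger = get_file_logger("bioneuralnet_demo", log_file)
logger.info("graph built")
```

Calling the helper again with the same name returns the same logger object without attaching a second handler.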

bioneuralnet.utils.network_remove_high_zero_fraction(network: DataFrame, threshold: float = 0.95) → DataFrame

Remove rows and columns from adjacency matrix where the fraction of zero entries is higher than the threshold.

Parameters:
  • network (pd.DataFrame) – Adjacency matrix.

  • threshold (float) – Zero-fraction threshold.

Returns:

Filtered adjacency matrix.

Return type:

pd.DataFrame
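The filter can be sketched in a few lines of pandas. This is an illustration, assuming the zero fraction is measured per row of a symmetric adjacency matrix and that the same nodes are dropped from rows and columns. The helper name `drop_sparse_nodes` is hypothetical.

```python
import pandas as pd


def drop_sparse_nodes(network: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop nodes whose row is mostly zeros (hypothetical helper)."""
    zero_frac = (network == 0).mean(axis=1)        # fraction of zero entries per row
    keep = zero_frac[zero_frac <= threshold].index
    return network.loc[keep, keep]                 # remove the row and the column


nodes = ["a", "b", "c", "d"]
net = pd.DataFrame(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]],
    index=nodes, columns=nodes, dtype=float,
)
filtered = drop_sparse_nodes(net, threshold=0.8)
```

Node `d` has a zero fraction of 1.0, above the 0.8 threshold, so it is removed. The other nodes sit at 0.5 and are kept.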

bioneuralnet.utils.network_remove_low_variance(network: DataFrame, threshold: float = 1e-06) → DataFrame

Remove rows and columns from adjacency matrix where the variance is below a threshold.

Parameters:
  • network (pd.DataFrame) – Adjacency matrix.

  • threshold (float) – Variance threshold.

Returns:

Filtered adjacency matrix.

Return type:

pd.DataFrame
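A comparable sketch for the variance filter follows, under the assumptions that variance is computed per row with pandas' default sample variance and that a node is dropped from both axes. The helper name `drop_low_variance_nodes` is hypothetical.

```python
import pandas as pd


def drop_low_variance_nodes(network: pd.DataFrame, threshold: float = 1e-6) -> pd.DataFrame:
    """Drop nodes whose row variance falls below threshold (hypothetical helper)."""
    row_var = network.var(axis=1)                  # pandas default: ddof=1
    keep = row_var[row_var >= threshold].index
    return network.loc[keep, keep]


labels = ["a", "b", "c"]
flat_net = pd.DataFrame(
    [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.0, 0.0]],
    index=labels, columns=labels,
)
kept = drop_low_variance_nodes(flat_net)
```

Node `c` has a constant row with variance 0, so only `a` and `b` remain.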

bioneuralnet.utils.preprocess_clinical(X: DataFrame, y: Series, top_k: int = 10, scale: bool = False, ignore_columns=None) → DataFrame

Preprocess clinical data, handling numeric and categorical features, cleaning, optional scaling, and selecting top features by RandomForest importance.

Parameters:
  • X (pd.DataFrame) – Clinical feature matrix (samples x features) including numeric and categorical columns.

  • y (pd.Series) – Target values; single-column DataFrame or Series of length n_samples.

  • top_k (int) – Number of features to select based on importance.

  • scale (bool) – If True, scale numeric features using RobustScaler; default is False.

  • ignore_columns (list) – List of columns to ignore during preprocessing; default is None.

Returns:

Subset of the original features with the selected top_k features plus ignored columns.

Return type:

  • pd.DataFrame

bioneuralnet.utils.prune_network(adjacency_matrix, weight_threshold=0.0)

Prune a network based on a weight threshold, removing nodes with weak connections.

Parameters:
  • adjacency_matrix (pd.DataFrame) – The adjacency matrix of the network.

  • weight_threshold (float) – Minimum weight to keep an edge (default: 0.0).

Return type:

  • pd.DataFrame

bioneuralnet.utils.prune_network_by_quantile(adjacency_matrix, quantile=0.5)

Prune a network by removing edges below a quantile-based weight threshold and dropping isolated nodes.

Parameters:
  • adjacency_matrix (pd.DataFrame) – Weighted adjacency matrix (nodes x nodes).

  • quantile (float) – Quantile in [0,1] to compute weight threshold; default is 0.5.

Returns:

Pruned adjacency matrix with edges below the quantile threshold removed.

Return type:

  • pd.DataFrame
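Quantile-based pruning can be sketched as follows. The sketch assumes the quantile is taken over the nonzero edge weights and that edges at or above the cutoff are kept. Both are assumptions about details not stated above, and the helper name `prune_by_quantile` is hypothetical.

```python
import numpy as np
import pandas as pd


def prune_by_quantile(adj: pd.DataFrame, quantile: float = 0.5) -> pd.DataFrame:
    """Zero out weak edges, then drop isolated nodes (hypothetical helper)."""
    w = adj.to_numpy(dtype=float)
    cutoff = np.quantile(w[w > 0], quantile)       # threshold from nonzero weights
    pruned = adj.where(adj >= cutoff, 0.0)         # weak edges become 0
    connected = (pruned != 0).any(axis=1) | (pruned != 0).any(axis=0)
    return pruned.loc[connected, connected]        # remove nodes with no edges left


names = ["a", "b", "c", "d"]
W = pd.DataFrame(0.0, index=names, columns=names)
for u, v, weight in [("a", "b", 0.9), ("b", "c", 0.6), ("c", "d", 0.1), ("a", "d", 0.2)]:
    W.loc[u, v] = W.loc[v, u] = weight
pruned_q = prune_by_quantile(W, quantile=0.5)
```

The median nonzero weight is 0.4, so the edges of weight 0.1 and 0.2 are removed. Node `d` loses both of its edges and is dropped.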

bioneuralnet.utils.rdata_to_df(rdata_file: Path, csv_file: Path, Object=None) → DataFrame

bioneuralnet.utils.select_top_k_correlation(X: DataFrame, y: Series = None, top_k: int = 1000) → DataFrame

Select the top k features by correlation, either supervised (with respect to y) or unsupervised (redundancy minimization).

Parameters:
  • X (pd.DataFrame) – Numeric feature matrix (samples x features).

  • y (pd.Series) – Target values for supervised selection; if None, performs unsupervised selection.

  • top_k (int) – Number of features to select.

Returns:

Subset of X containing the selected features.

Return type:

  • pd.DataFrame

Note

  • Correlation computation can be expensive for large datasets.
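The supervised mode can be illustrated by ranking features on their absolute Pearson correlation with the target. This sketch covers only that mode. The unsupervised redundancy-minimization mode is not shown, and the helper name `top_k_by_target_correlation` is hypothetical.

```python
import pandas as pd


def top_k_by_target_correlation(X: pd.DataFrame, y: pd.Series, top_k: int) -> pd.DataFrame:
    """Keep the top_k columns most correlated with y in absolute value (hypothetical)."""
    scores = X.corrwith(y).abs()                   # one |r| per feature
    return X[scores.nlargest(top_k).index]


y_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
X_feat = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.5],   # rises with y
    "f2": [3.0, 1.0, 4.0, 1.0, 5.0],   # only loosely related to y
    "f3": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as y rises
})
selected = top_k_by_target_correlation(X_feat, y_demo, top_k=2)
```

Because the ranking uses the absolute value, the negatively correlated `f3` is selected together with `f1`, while `f2` is left out.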

bioneuralnet.utils.select_top_k_variance(df: DataFrame, k: int = 1000, ddof: int = 0) → DataFrame

Select the top k features with the highest variance.

Parameters:
  • df (pd.DataFrame) – Input DataFrame; non-numeric columns will be ignored.

  • k (int) – Number of top-variance features to select.

  • ddof (int) – Delta degrees of freedom for variance calculation; default is 0.

Returns:

DataFrame containing only the top k features by variance.

Return type:

  • pd.DataFrame
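A minimal pandas sketch of variance-based selection, written to mirror the description above rather than the library's code. The helper name `top_k_variance` and the demo columns are hypothetical.

```python
import pandas as pd


def top_k_variance(df: pd.DataFrame, k: int, ddof: int = 0) -> pd.DataFrame:
    """Keep the k numeric columns with the largest variance (hypothetical helper)."""
    variances = df.select_dtypes(include="number").var(ddof=ddof)
    return df[variances.nlargest(k).index]         # ordered from highest variance


table = pd.DataFrame({
    "low": [1.0, 1.0, 1.1],
    "mid": [1.0, 2.0, 3.0],
    "high": [0.0, 10.0, 20.0],
    "label": ["x", "y", "z"],                      # non-numeric, ignored
})
top2 = top_k_variance(table, k=2)
```

The result contains `high` and `mid`, in that order. The string column never enters the ranking.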

bioneuralnet.utils.select_top_randomforest(X: DataFrame, y: Series, top_k: int = 1000, seed: int = 119) → DataFrame

Select the top k features using RandomForest feature importances.

Parameters:
  • X (pd.DataFrame) – Numeric feature matrix (samples x features); must contain only numeric columns.

  • y (pd.Series) – Target values; single-column DataFrame or Series.

  • top_k (int) – Number of features to select.

  • seed (int) – Random seed for the RandomForest model; default is 119.

Returns:

Subset of X containing the selected top_k features by importance.

Return type:

  • pd.DataFrame

bioneuralnet.utils.top_anova_f_features(X: DataFrame, y: Series, max_features: int, alpha: float = 0.05, task: str = 'classification') → DataFrame

Select top features based on ANOVA F-test (with false discovery rate correction). This function is suitable for both classification and regression tasks.

Parameters:
  • X (pd.DataFrame) – Numeric feature matrix (samples x features).

  • y (pd.Series) – Target vector; categorical for classification or continuous for regression.

  • max_features (int) – Maximum number of features to return.

  • alpha (float) – Significance threshold for false discovery rate correction; default is 0.05.

  • task (str) – ‘classification’ to use f_classif or ‘regression’ to use f_regression.

Returns:

Subset of X with the selected features, padded if necessary.

Return type:

  • pd.DataFrame
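The multiple-testing correction can be illustrated with the Benjamini-Hochberg procedure, one common way to control the false discovery rate. Whether the library uses this exact procedure is an assumption. The sketch takes already-computed p-values (in practice from the F-test) and the helper name `benjamini_hochberg` is hypothetical.

```python
import numpy as np


def benjamini_hochberg(pvals, alpha: float = 0.05) -> np.ndarray:
    """Boolean mask of p-values that stay significant at FDR level alpha (hypothetical)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                               # ranks, smallest p first
    below = p[order] <= alpha * np.arange(1, m + 1) / m # compare p_(i) with alpha * i / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()             # largest rank passing the test
        reject[order[:cutoff + 1]] = True               # reject all ranks up to it
    return reject


p_demo = [0.001, 0.04, 0.03, 0.2]
reject_demo = benjamini_hochberg(p_demo, alpha=0.05)
```

Three of the four p-values are below 0.05 on their own, but only the first survives the correction: the rank-wise thresholds are 0.0125, 0.025, 0.0375, and 0.05, and 0.03 and 0.04 each exceed the threshold for their rank.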

bioneuralnet.utils.variance_summary(df: DataFrame, low_var_threshold: float = None) → dict

Compute summary statistics for column variances in the DataFrame

bioneuralnet.utils.zero_fraction_summary(df: DataFrame, high_zero_threshold: float = None) → dict

Compute summary statistics for the fraction of zeros in each column

Modules

data

graph

logger

preprocess

rdata_convert