Utils

RData Conversion

Logging

Graph Generation

This section details utility functions for generating networks from omics data matrices. All methods are industry-standard and open in the literature.

Methods and Examples

  1. k-NN Cosine/RBF Similarity Graph

    Computes either cosine similarity:

    \[S_{ij} = \frac{x_i^\top x_j}{\|x_i\|\,\|x_j\|}\]

    or Gaussian RBF kernel:

    \[S_{ij} = \exp\bigl(-\|x_i - x_j\|^2 /(2\sigma^2)\bigr)\]

    Sparsify by keeping top-(k) neighbors per node, optionally mutual.

      from bioneuralnet.utils.graphs import gen_similarity_graph
      A = gen_similarity_graph(X, k=15, metric='cosine', mutual=True)
    
    Reference: Hastie et al., 2009 [Hastie2009_]
    
  2. Pearson/Spearman Co-expression Graph

    Compute correlation:

    \[C_{ij} = \mathrm{corr}(x_i, x_j)\]

    Then sparsify by top-(k) or threshold.

      from bioneuralnet.utils.graphs import gen_correlation_graph
      A = gen_correlation_graph(X, k=15, method='pearson')
    
    Reference: Langfelder & Horvath, 2008 [Langfelder2008_]
    
  3. Soft-Threshold Graph

    Soft-threshold absolute correlation, similar to WGCNA:

    \[W_{ij} = |C_{ij}|^\beta\]

    Then top-(k) selection and normalization.

      from bioneuralnet.utils.graphs import gen_threshold_graph
      A = gen_threshold_graph(X, b=6.0, k=15)
    
    Reference: Langfelder & Horvath, 2008 [Langfelder2008_]
    
  4. Gaussian k-NN Graph

    Gaussian kernel sparsified by k-nearest neighbors:

    \[S_{ij} = \exp\bigl(-\|x_i - x_j\|^2 /(2\sigma^2)\bigr),\quad W = \text{Top}_k(S)\]
    from bioneuralnet.utils.graphs import gen_gaussian_knn_graph
    A = gen_gaussian_knn_graph(X, k=15, sigma=None)
    

    Credit: adapts common practice from spectral clustering (Ng et al., 2002).

  5. Mutual Information Graph

    Estimate pairwise mutual information:

    \[\mathrm{MI}_{ij} = I(x_i; x_j)\]
      from bioneuralnet.utils.graphs import gen_mutual_info_graph
      A = gen_mutual_info_graph(X, k=15)
    
    Reference: Margolin et al., 2006 [Margolin2006_]
    

Preprocessing Utilities

A collection of data-cleaning and feature-selection functions for clinical and omics datasets.

Clinical Preprocessing

  • bioneuralnet.utils.preprocess.preprocess_clinical() Splits numeric and categorical features; replaces Inf/NaN; optionally scales numeric data (RobustScaler); encodes categoricals; drops zero-variance; and selects top-k features by RandomForest importance.

    Example:

    from bioneuralnet.utils.preprocess import preprocess_clinical
    df_top = preprocess_clinical(X, y, top_k=10, scale=True)
    
  • bioneuralnet.utils.preprocess.clean_inf_nan() Replaces Inf with NaN, imputes medians, drops zero-variance columns, and logs counts.

    Example:

    from bioneuralnet.utils.preprocess import clean_inf_nan
    df_clean = clean_inf_nan(df)
    

Variance-Based Selection

Correlation-Based Selection

  • bioneuralnet.utils.preprocess.select_top_k_correlation() - Supervised: if you pass y, selects features by absolute Pearson correlation with y. - Unsupervised: if y=None, picks features with the lowest average inter-feature correlation (redundancy reduction).

    Example:

    from bioneuralnet.utils.preprocess import select_top_k_correlation
    df_sup = select_top_k_correlation(X, y, top_k=100)   # supervised
    df_unsup = select_top_k_correlation(X, top_k=100)    # unsupervised
    

RandomForest Feature Importance

  • bioneuralnet.utils.preprocess.select_top_randomforest() Requires numeric inputs; cleans data; drops zero-variance; fits RandomForest; and returns the top-k features by importance.

    Example:

    from bioneuralnet.utils.preprocess import select_top_randomforest
    df_rf = select_top_randomforest(X, y, top_k=200)
    

ANOVA F-Test Selection

  • bioneuralnet.utils.preprocess.top_anova_f_features() Runs ANOVA F-test (classification or regression), applies FDR correction, selects all significant features, and pads with next-best to reach max_features.

    Example:

    from bioneuralnet.utils.preprocess import top_anova_f_features
    df_anova = top_anova_f_features(X, y, max_features=100, alpha=0.05)
    

Network Pruning

  • bioneuralnet.utils.preprocess.prune_network() Prunes edges below a weight threshold, removes isolates, and logs before/after stats.

    Example:

    from bioneuralnet.utils.preprocess import prune_network
    pruned = prune_network(adj_df, weight_threshold=0.1)
    
  • bioneuralnet.utils.preprocess.prune_network_by_quantile() Uses a quantile cutoff on edge weights, prunes accordingly, removes isolates, and logs stats.

    Example:

    from bioneuralnet.utils.preprocess import prune_network_by_quantile
    pruned_q = prune_network_by_quantile(adj_df, quantile=0.75)
    
  • bioneuralnet.utils.preprocess.network_remove_low_variance() Drops rows/columns in the adjacency matrix whose variance falls below a threshold.

    Example:

    from bioneuralnet.utils.preprocess import network_remove_low_variance
    filtered = network_remove_low_variance(adj_df, threshold=1e-5)
    
  • bioneuralnet.utils.preprocess.network_remove_high_zero_fraction() Removes rows/columns where the fraction of zero weights exceeds a threshold (default 0.95).

    Example:

    from bioneuralnet.utils.preprocess import network_remove_high_zero_fraction
    filtered_z = network_remove_high_zero_fraction(adj_df, threshold=0.95)
    

Data Summary Utilities

  • bioneuralnet.utils.data_summary.variance_summary() - Computes summary statistics for column variances.

    Example:

    from bioneuralnet.utils.data_summary import variance_summary
    stats = variance_summary(df, low_var_threshold=1e-4)
    
  • bioneuralnet.utils.data_summary.zero_fraction_summary() - Computes statistics for the fraction of zeros per column.

    Example:

    from bioneuralnet.utils.data_summary import zero_fraction_summary
    stats = zero_fraction_summary(df, high_zero_threshold=0.5)
    
  • bioneuralnet.utils.data_summary.expression_summary() - Computes summary of mean expression values across features.

    Example:

    from bioneuralnet.utils.data_summary import expression_summary
    stats = expression_summary(df)
    
  • bioneuralnet.utils.data_summary.correlation_summary() - Computes statistics of each feature’s maximum pairwise correlation (excluding self).

    Example:

    from bioneuralnet.utils.data_summary import correlation_summary
    stats = correlation_summary(df)
    
  • bioneuralnet.utils.data_summary.explore_data_stats() - Prints an overall summary (variance, zero fraction, expression, correlation) to stdout.

    Example:

    from bioneuralnet.utils.data_summary import explore_data_stats
    explore_data_stats(df, name="MyOmicsData")
    

References