TF-MoDISco Report

Introduction to TF-MoDISco

What is TF-MoDISco?

TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) is a method for discovering sequence motifs from neural network-derived importance scores. Unlike traditional motif discovery methods that rely solely on sequence enrichment, TF-MoDISco leverages context-aware importance scores to identify patterns.

Preliminary Concepts

Contribution Scores

Seqlets

About Seqlets: Seqlets are short subsequences with high contribution scores. They are the building blocks that TF-MoDISco clusters into motifs. The reported seqlet count covers only the high-confidence instances used for motif construction, so it is a subset of all binding sites. Additional sites likely exist in the data but were filtered out during clustering.

Position Statistics: Each seqlet's position is reported relative to the midpoint of its input region (set at position 0), with negative values indicating upstream positions and positive values downstream. The "median distance from center" statistic is the median of the absolute distances between seqlets and the midpoints of their regions.
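As a rough illustration of the "median distance from center" statistic, the sketch below assumes that a seqlet's position is taken to be its own midpoint and that coordinates are 0-based offsets within the input region. The function name and arguments are hypothetical and are not taken from the report's code.

```python
from statistics import median


def median_abs_distance_from_center(seqlet_starts, seqlet_ends, region_length):
    """Median absolute offset of seqlet midpoints from the region midpoint.

    Assumption: a seqlet's position is its midpoint, (start + end) / 2.
    """
    center = region_length / 2
    offsets = [(s + e) / 2 - center for s, e in zip(seqlet_starts, seqlet_ends)]
    # Negative offsets are upstream of the midpoint, positive offsets downstream.
    return median(abs(o) for o in offsets)
```

For three seqlets spanning 10-20, 60-70, and 90-100 in a 100 bp region, the offsets are -35, 15, and 45, so the statistic is 35.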

Contribution Score Statistics: The "total contribution" of a seqlet is the sum of the absolute contribution scores across all of its positions. This measures the overall strength of the seqlet's influence on model predictions.
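A minimal sketch of the total contribution calculation follows. It assumes each seqlet is stored as a list of positions, each holding four per-base scores in A, C, G, T order. That layout and the function name are assumptions made for illustration.

```python
def total_contribution(seqlet_scores):
    """Sum of absolute contribution scores over every position and base.

    seqlet_scores: list of rows, one per position, each [A, C, G, T].
    """
    return sum(abs(value) for row in seqlet_scores for value in row)
```

Because absolute values are summed, strongly negative contributions increase the total just as strongly positive ones do.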

Seqlet Selection for Visualization: Representative seqlets are selected from different total contribution quantiles (10th, 20th, 30th, etc. percentiles) to show the range of pattern strength.
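One way to pick representatives by quantile is sketched below using a nearest-rank rule. The exact rule used to generate this report is not stated here, so the rounding choice, the function name, and the default percentile list are assumptions.

```python
def pick_by_quantile(totals, percentiles=(10, 20, 30, 40, 50, 60, 70, 80, 90)):
    """Return (percentile, seqlet_index) pairs chosen by nearest rank.

    totals: total contribution of each seqlet, indexed by seqlet.
    """
    # Seqlet indices ordered from weakest to strongest total contribution.
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    picks = []
    for p in percentiles:
        rank = round(p / 100 * (len(order) - 1))
        picks.append((p, order[rank]))
    return picks
```

With totals of 5, 1, 3, 2, and 4 and percentiles 10, 50, and 90, this selects the seqlets at indices 1, 2, and 0, that is, the weakest, the median, and the strongest.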

Matrix Types

Contribution Weight Matrices (CWMs) represent the average contribution scores across aligned seqlets, quantifying each position's contribution to binding predictions.

Position Probability Matrices (PPMs) show sequence composition frequencies. When scaled by information content (IC), they become information-weighted PPMs that emphasize positions with higher sequence consistency.
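The information-content scaling can be sketched as follows. This assumes IC is measured in bits against a uniform background (0.25 per base) and that no pseudocounts are applied. The report's actual computation may differ, and the function name is hypothetical.

```python
import math


def ic_scale(ppm, background=(0.25, 0.25, 0.25, 0.25)):
    """Scale each PPM row by its information content in bits.

    ppm: list of rows, one per position, each [A, C, G, T] summing to 1.
    Assumption: IC is the relative entropy against the background.
    """
    scaled = []
    for row in ppm:
        # Terms with p == 0 contribute nothing to the information content.
        ic = sum(p * math.log2(p / b) for p, b in zip(row, background) if p > 0)
        scaled.append([p * ic for p in row])
    return scaled
```

A position that is always the same base reaches the maximum of 2 bits, while a position with uniform base frequencies is scaled to zero and disappears from the logo.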

Understanding Motif Matches & Pattern Names

Tomtom Database Matches: Each discovered motif is compared against the user-provided database of known motifs using Tomtom. Only the top 3 matches by statistical significance are displayed, so other significant matches may exist that are not shown. The factor responsible for binding does not necessarily correspond to the top-ranked match (Match 0); consider biological context and additional evidence when interpreting results.

Pattern Naming: When database matches are available, patterns are given descriptive names constructed from the first 10 characters of the top matches, separated by semicolons (e.g., "CTCF_HUMAN;SP1_MOUSE;ZNF143"). The original pattern ID in the H5 (e.g., "pos_patterns.pattern_0") is also shown and remains the canonical identifier.
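The naming rule described above can be sketched as a short helper. The function name and the example database identifiers are hypothetical. Only the truncation to 10 characters and the semicolon separator come from the description.

```python
def descriptive_name(matches, max_chars=10):
    """Join the truncated names of the top matches with semicolons.

    matches: match names in rank order; empty or missing entries are skipped.
    """
    return ";".join(name[:max_chars] for name in matches if name)
```

For example, match names "CTCF_HUMAN.H11MO.0.A", "SP1_MOUSE", and "ZNF143" would yield "CTCF_HUMAN;SP1_MOUSE;ZNF143".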

{% if patterns_data %}

Pattern Summary

{% if meme_motif_db %} {% endif %} {% for pattern_tag, data in patterns_data.items() %} {% if meme_motif_db %} {% for i in range(3) %} {% if pattern_tag in tomtom_data and 'match_' + i|string in tomtom_data[pattern_tag] and tomtom_data[pattern_tag]['match_' + i|string] %} {% else %} {% endif %} {% endfor %} {% endif %} {% endfor %}
Pattern Seqlets Avg. Contribution Med. Center Dist. CWM Fwd CWM Rev Match 0 Q-value Logo Match 1 Q-value Logo Match 2 Q-value Logo
{% if descriptive_names and pattern_tag in descriptive_names %} {{ descriptive_names[pattern_tag] }} {% else %} {{ pattern_tag }} {% endif %} {% if descriptive_names and pattern_tag in descriptive_names %} {{ pattern_tag }} {% endif %} {{ data.n_seqlets }} {% if data.std_importance and data.std_importance == data.std_importance %} {{ "%.3f"|format(data.avg_importance) }} ± {{ "%.3f"|format(data.std_importance) }} {% else %} {{ "%.3f"|format(data.avg_importance) }} {% endif %} {% if data.median_abs_distance_from_center and data.median_abs_distance_from_center == data.median_abs_distance_from_center %} {% if data.std_distance_from_center and data.std_distance_from_center == data.std_distance_from_center %} {{ "%.1f"|format(data.median_abs_distance_from_center) }} ± {{ "%.1f"|format(data.std_distance_from_center) }} {% else %} {{ "%.1f"|format(data.median_abs_distance_from_center) }} {% endif %} {% else %} - {% endif %} {% if pattern_tag in logo_paths %} CWM Fwd {% endif %} {% if pattern_tag in logo_paths %} CWM Rev {% endif %} {{ tomtom_data[pattern_tag]['match_' + i|string] }} {{ "%.2e"|format(tomtom_data[pattern_tag]['pval_' + i|string]) }} {% if tomtom_logos and pattern_tag in tomtom_logos and 'match_' + i|string + '_base64' in tomtom_logos[pattern_tag] %} {% endif %}
{% endif %}

Discovered Motifs

{% for pattern_tag, data in patterns_data.items() %}
{% if descriptive_names and pattern_tag in descriptive_names %} {{ descriptive_names[pattern_tag] }} ({{ pattern_tag }}) {% else %} {{ pattern_tag }} {% endif %}

Motif Visualization

{% if pattern_tag in logo_paths %}
CWM Logo

Contribution Weight Matrix: shows actual contribution scores

CWM Logo
hCWM Logo

Hypothetical Contribution Weight Matrix: shows counterfactual contributions

hCWM Logo
IC-scaled PPM Logo

Information-weighted Position Probability Matrix (PPM scaled by information content)

IC-scaled PPM Logo
Trimmed CWM Logo

CWM trimmed to core region

Trimmed CWM Logo
{% endif %}

Motif Statistics

{% if data.median_abs_distance_from_center and data.median_abs_distance_from_center == data.median_abs_distance_from_center %} {% endif %}
Metric Value Std Dev
Number of seqlets {{ data.n_seqlets }} -
Average contribution score {{ "%.3f"|format(data.avg_importance) }} {% if data.std_importance and data.std_importance == data.std_importance %} {{ "%.3f"|format(data.std_importance) }} {% else %} - {% endif %}
GC content {{ "%.3f"|format(data.gc_content) }} -
Median distance from center {{ "%.1f"|format(data.median_abs_distance_from_center) }} {% if data.std_distance_from_center and data.std_distance_from_center == data.std_distance_from_center %} {{ "%.1f"|format(data.std_distance_from_center) }} {% else %} - {% endif %}
{% if meme_motif_db and pattern_tag in tomtom_data %}

Tomtom Matches

Top matches from motif database comparison:

{% for i in range(top_n_matches) %} {% if 'match_' + i|string in tomtom_data[pattern_tag] and tomtom_data[pattern_tag]['match_' + i|string] %} {% endif %} {% endfor %}
Rank Match Logo {% if ttl %}P-value{% else %}Q-value{% endif %}
{{ i + 1 }} {{ tomtom_data[pattern_tag]['match_' + i|string] }} {% if tomtom_logos and pattern_tag in tomtom_logos and 'match_' + i|string + '_base64' in tomtom_logos[pattern_tag] %} {% else %} Logo not available {% endif %} {{ "%.2e"|format(tomtom_data[pattern_tag]['pval_' + i|string]) }}
{% endif %} {% if pattern_tag in distribution_paths %}

Seqlet Distributions

{% if 'importance' in distribution_paths[pattern_tag] %}
Seqlet Contribution Score Distribution

Distribution of total contribution scores across seqlets

Contribution Score Distribution
{% endif %} {% if 'spatial' in distribution_paths[pattern_tag] %}
Seqlet Spatial Distribution

Distribution of seqlet positions within input sequences

Spatial Distribution
{% endif %}
{% endif %} {% if pattern_tag in examples_data and examples_data[pattern_tag] %}

Seqlet Examples

Representative seqlets from contribution score quantiles:

{% for example in examples_data[pattern_tag] %}
{{ example.quantile }}th Percentile (Contribution: {{ "%.3f"|format(example.importance) }})
Seqlet {{ example.quantile }}th Percentile
{% endfor %}
{% endif %}
{% endfor %} {% if not patterns_data %}

Note: No patterns were found in the provided TF-MoDISco results file. This could indicate that no significant motifs were discovered during the analysis, or there may be an issue with the input file format.

{% endif %}