GitLab Repo

amachine

What is A-Machine?

A-Machine is a work in progress library for constructing epsilon-machines1 and other stochastic models for generating structured symbol sequences with ground truth causal structure and information-theoretic complexity. The intended use case is the study of neural network learning dynamics and internal representations.

Quickstart

Installation

# CPU only
pip install a-machine

# With GPU support (requires CUDA 13)
pip install "a-machine[cuda]" --extra-index-url https://pypi.nvidia.com

Example

More examples are available at the gitlab repo.

import amachine as am

# Seeds random and np.random globally
am.srand_global( 42 )

# May have multiple recurrent subgraphs, terminal states, or tranistory states
m = am.random_machine( 
        n_states=23, 
        symbols=[ '0', '1' ],
        connectedness=0.35,
        randomness=0.25 )

# Collapse to the largest recurrent subgraph
m.collapse_to_largest_strongly_connected_subgraph()

# Minimize the machine in case it's not minimal -> epsilon-machine.
m.minimize( retain_names=True )

# Entropy rate, statistical complexity
print( f"h_mu : {m.h_mu()}" )
print( f"C_mu : {m.C_mu()}" )

# Display the epislon machine
m.draw_graph()

AM Graph

HMMs and Epsilon Machines

An $\epsilon$-machine is the minimal, optimally predictive, unifilar, hidden Markov model of a stationary stochastic process. A large body of theory linking them to complex dynamical systems, information theory, and physics, has been developed by James P. Crutchfield and collaborators under the name computational mechanics 12.

A-Machine's HMM implementation is designed primarily for constructing $\epsilon$-machines to generate training data with interesting ground truth theoretical properties. This should be useful for making informed hypotheses about, designing experiments for, and interpreting the learning dynamics and internal representations of neural networks.

More expressive finite generative models, such as non-unifilar HMMs or probabilistic context-free grammars, can represent a broader class of languages than a finite $\epsilon$-machine, but may lack the theoretical tractability that makes $\epsilon$-machines particularly analytically useful. Thus, utilizing $\epsilon$-machines instead of alternatives, may involve a tradeoff between expressiveness and interpretability.

Complexity Measures

Complexity measures implemented within the HMM class include:

Citations, definitions, and interpretations are in their individual documentation sections linked above.

Isomorphic Construction

A few ways to construct $\epsilon$-machines are impemented so far:

  • random_machine, which generates a random HMM that can be reduced to an $\epsilon$-machine,

  • star_join, which joins a list of machines to a central hub,

  • star, which combines random_machine and star_join,

  • isomorphic_to, which creates a copy of an existing machine with new labels,

  • full_isomorphic_rotation, which creates and star joins $n$ isomorphic machines, where $n$ is the number of symbols, and the $i^{th}$ machine's state $j$ emits the $k^{th}$ symbol in the alphabet, then the ${i+1}^{th}$ machine's state $j$ will emit symbol $k+1 \bmod n$, and

iso rotation

  • unique_isomorphic, which creates and star joins n_machines isomorphic machines, where n_machines is a parameter, and each has its own alphabet.

unique iso

Syntagmatics and Paradigmatics

The obvious differences between full_isomorphic_rotation and unique_isomorphic are that the former has $n$ sub-machnes = $n$ symbols, while the latter can have any number of sub-machines but the full alphabet grows with the number of machines. Both introduce learnable structural redundancies, which may be experimentally interesting, but they may be considered somewhat unnatural and structually boring relative to human language.

Imagine a world where every now and then, each word in the dictionary moves up one entry but takes on the meaning of the word it replaced. This is analogous to what full_isomorphic_rotation does. On the other hand, unique_isomorphic is like cloning the whole dictionary, and then exclusively using one dictionary at a time, while switching to a new one every now and then.

Neither are very good anologs for isomorphic substructure in human lanuage, where there are consistent rules, that creates coherent, rich, and varied relational structure.

And this is the type of structure we want to also in our $\epsilon$-machines, so that the insights gleaned may be generalize better to help us understand how artificial intelligence learns from, internally represents, and uses human language.

The work in progress framework for this is conceptually grounded in concepts borrowed from paradigmatic analysis and syntagmatic analysis. Instead studying how words can be chosen in positions within a body of text, we consider how symbols can be chosen for a transtion in a state machine. Within the hard constraints, some stochastic variability among valid choices can be permitted.

There is one complication, and that is an $\epsilon-$machine must be minimal and unifilar. You can't simply change a symbol on an edge at random, as it may break the structural properties of the machine.

Once suitable syntagmatics are defined for a single machine, we can apply the same concepts to joined collections of machines, in a way that is analogous to discourse structure in human language. At this level, we can create $\epsilon$-machines, with hierarchical isomorphic substructure.

A framework and API for generating models with these properties is still being worked out.

Mixed State Presentation

The mixed states presentation construction 34 tracks beliefs about which intrinsic state the $\epsilon$-machicine is in as an observer attempts to syncronize with the $\epsilon$ machines. MSP construction can be useful for analyzing learning dynamics, because this is the task that a neural network learning from the generated data must perform to minimize loss. This has been emipirically demonstrated in transformer models (in at least one instance) by Shai et al. 5.

The MSP can explode in terms of the number of states that are produced, or even be inifite for non-synchronizable processes. The belief states transitions also may produce fractal-like patterns in the probability simplex.

A-Machine implements MSP construction from a given $\epsilon$ machine. Since the MSP may have too many states to compute, a cap on the max number of states is imposed, and when the cap is reached, the MSP is closed by mapping the unresolved frontier of belief states back to the nearest existing belief states. When this happens, the final MSP is only an approximation.

The MSP also yields closed form expressions for a range of complexity measures 6, such as $\mathbf{E}, \mathbf{S}$, and $\mathbf{T}$, which can be solved for using spectral decompositon. A-machine implements this approach to computing these complexity measures, in addition to the alternative, block entropy convergence. When the MSP construction excedes the max states cap, comparing the measures computed from its approximation with those approximated by block entropy convergence is useful for corroboration and gaining additional insight into the synronization dynamics.

Block Entropy Convergence

As mentioned, block entropy convergence supports estimation of a range of complexity measures, and allows you to observe the convergence in uncertainty as a function of $L$ (after seeing all blocks of length $L$). A-machine implements this as a C++ extension.

Data Generation

Data generation generates stochastic sequences from a given machine, optionally along with the sequence of hidden states that were traversed, and also optionally with an additional symbol sequence with what we are calling isomorphic_shift applied. This changes the symbols that were emited by a state, for symbols that would have been emitted by alternative isomorphic states (should they exist).

import amachine as am

am.srand_global(42)

m = am.random_machine( 
        n_states=11, 
        symbols=[ '0', '1', '2' ],
        connectedness=0.65,
        randomness=0.37 )

# Collapse to the largest recurrent subgraph
m.collapse_to_largest_strongly_connected_subgraph()

# Minimize the machine -> epsilon-machine.
m.minimize()

# Create an isomorphic machine with new labels
m_iso = am.isomorphic_to( m, alphabet=[ '3', '4', '5' ] )

m_iso.isoclass = 0
m.isoclass     = 0

for j, state in enumerate( m.states ) : 
    m.states[ j ].add_isomorph( m_iso.states[ j ].name )
    m_iso.states[ j ].add_isomorph( m.states[ j ].name )

# Join them together
m_star = am.star_join(
    exit_symbol='x', 
    enter_symbols=['a','b'],
    machines=[ m, m_iso ],
    mode_residency_factor=0.75 
)

m_star.draw_graph()

m_star.generate_data(
    file_prefix="./data/iso_012-345", # saves as iso_012-345.parqet
    n_gen=2_000_000_000, # 2 billion symbols    
    include_states=True,
    isomorphic_shifts={1}
)

iso machine

The two sequences will be aligned, and with identical local and global statistics. Thus, you can train on one of them, and then analyze activation patterns and residuals on both of them in parallel during inference, to study if and how the neural network learned polysemantic representations that exploit the structural redundancies.

Cleaner and more general methods for constructing graphs with ismorphic substructure are in the works.


  1. Crutchfield, "The calculi of emergence: computation, dynamics and induction.", 1994. https://csc.ucdavis.edu/~cmg/papers/CalcEmerg.pdf 

  2. Shalizi, Rohilla, and Crutchfield. "Computational mechanics: Pattern and prediction, structure and simplicity.", 2001. https://arxiv.org/abs/cond-mat/9907176 

  3. Blackwell, David Harold. "The entropy of functions of finite-state Markov chains.", 1959. 

  4. Jurgens & Crutchfield "Shannon Entropy Rate of Hidden Markov Processes", 2021. https://link.springer.com/article/10.1007/s10955-021-02769-3 

  5. Shai et al., "Transformers Represent Belief State Geometry in their Residual Stream", 2024. https://arxiv.org/abs/2405.15943 

  6. Crutchfield, "Exact complexity: The spectral decomposition of intrinsic computation", 2016. https://csc.ucdavis.edu/~cmg/papers/ec.pdf 

  1"""
  2# What is A-Machine?
  3
  4A-Machine is a work in progress library for constructing epsilon-machines[^1] and other stochastic models for generating structured symbol sequences with ground truth causal structure and information-theoretic complexity. The intended use case is the study of neural network learning dynamics and internal representations.
  5
  6# Quickstart
  7
  8## Installation
  9
 10```bash
 11# CPU only
 12pip install a-machine
 13
 14# With GPU support (requires CUDA 13)
 15pip install "a-machine[cuda]" --extra-index-url https://pypi.nvidia.com
 16```
 17
 18##Example
 19
 20[More examples](https://gitlab.com/tneuroth/a-machine/-/tree/main/examples) are available at the gitlab repo.
 21
 22```python
 23import amachine as am
 24
 25# Seeds random and np.random globally
 26am.srand_global( 42 )
 27
 28# May have multiple recurrent subgraphs, terminal states, or tranistory states
 29m = am.random_machine( 
 30	n_states=23, 
 31	symbols=[ '0', '1' ],
 32	connectedness=0.35,
 33	randomness=0.25 )
 34
 35# Collapse to the largest recurrent subgraph
 36m.collapse_to_largest_strongly_connected_subgraph()
 37
 38# Minimize the machine in case it's not minimal -> epsilon-machine.
 39m.minimize( retain_names=True )
 40
 41# Entropy rate, statistical complexity
 42print( f"h_mu : {m.h_mu()}" )
 43print( f"C_mu : {m.C_mu()}" )
 44
 45# Display the epislon machine
 46m.draw_graph()
 47```
 48
 49<img src="resources/am_graph.png" alt="AM Graph" style="width: 80%; margin-left: 10%;">
 50
 51# HMMs and Epsilon Machines
 52
 53An $\\epsilon$-machine is the minimal, optimally predictive, unifilar, hidden Markov model of a stationary stochastic process. A large body of theory linking them to complex dynamical systems, information theory, and physics, has been developed by James P. Crutchfield and collaborators under the name computational mechanics [^1][^2].
 54
 55A-Machine's [HMM implementation](amachine/am_hmm.html#HMM) is designed primarily for constructing $\\epsilon$-machines to generate training data with interesting ground truth theoretical properties. This should be useful for making informed hypotheses about, designing experiments for, and interpreting the learning dynamics and internal representations of neural networks.
 56
 57More expressive finite generative models, such as non-unifilar HMMs or probabilistic context-free grammars, can represent a broader class of languages than a finite $\\epsilon$-machine, but may lack the theoretical tractability that makes $\\epsilon$-machines particularly analytically useful. Thus, utilizing $\\epsilon$-machines instead of alternatives, may involve a tradeoff between expressiveness and interpretability.
 58
 59## Complexity Measures
 60
 61Complexity measures implemented within the HMM class include:
 62
 63* [statistical complexity $C_\\mu$](amachine/am_hmm.html#HMM.C_mu),
 64* [entropy rate $h_{\\mu}$](amachine/am_hmm.html#HMM.h_mu),
 65* [single symbol entropy H(1)](amachine/am_hmm.html#HMM.H_1), 
 66* [anticipated information $\\rho_{\\mu}$](amachine/am_hmm.html#HMM.rho_mu),
 67* [Excess entropy $\\mathbf{E}$](amachine/am_hmm.html#HMM.E),
 68* [Syncronization information $\\mathbf{S}$](amachine/am_hmm.html#HMM.S),
 69* [Transient information $\\mathbf{T}$](amachine/am_hmm.html#HMM.S),
 70* [Crypticity $\\chi$](amachine/am_hmm.html#HMM.chi), and
 71* [Block convergence measures $\\mathbf{S}(L), \\mathbf{T}(L), \\mathcal{H}(L),$ and $h_{\\mu}(L)$](amachine/am_hmm.html#HMM.block_convergence).
 72
 73Citations, definitions, and interpretations are in their individual documentation sections linked above.
 74
 75## Isomorphic Construction
 76
 77A few ways to construct $\\epsilon$-machines are impemented so far:
 78
 79* [random_machine](amachine/am_create.html#random_machine), which generates a random HMM that can be reduced to an $\\epsilon$-machine,
 80
 81* [star_join](amachine/am_create.html#star_join), which joins a list of machines to a central hub,
 82
 83* [star](amachine/am_create.html#star), which combines random_machine and star_join,
 84
 85* [isomorphic_to](amachine/am_create.html#isomorphic_to), which creates a copy of an existing machine with new labels,
 86
 87* [full_isomorphic_rotation](amachine/am_create.html#full_isomorphic_rotation), which creates and star joins $n$ isomorphic machines, where $n$ is the number of symbols, and the $i^{th}$ machine's state $j$ emits the $k^{th}$ symbol in the alphabet, then the ${i+1}^{th}$ machine's state $j$ will emit symbol $k+1 \\bmod n$, and
 88
 89<img src="resources/iso_rotation.png" alt="iso rotation" style="width: 100%; margin-left: 0%;">
 90
 91* [unique_isomorphic](amachine/am_create.html#unique_isomorphic), which creates and star joins `n_machines` isomorphic machines, where `n_machines` is a parameter, and each has its own alphabet.
 92
 93<img src="resources/unique_iso.png" alt="unique iso" style="width: 100%; margin-left: 0%;">
 94
 95##Syntagmatics and Paradigmatics
 96
 97The obvious differences between `full_isomorphic_rotation` and `unique_isomorphic` are that the former has $n$ sub-machnes = $n$ symbols, while the latter can have any number of sub-machines but the full alphabet grows with the number of machines.  Both introduce learnable structural redundancies, which may be experimentally interesting, but they may be considered somewhat unnatural and structually boring relative to human language.
 98
 99Imagine a world where every now and then, each word in the dictionary moves up one entry but takes on the meaning of the word it replaced. This is analogous to what `full_isomorphic_rotation` does. On the other hand, `unique_isomorphic` is like cloning the whole dictionary, and then exclusively using one dictionary at a time, while switching to a new one every now and then.
100
101Neither are very good anologs for isomorphic substructure in human lanuage, where there are consistent rules, that creates coherent, rich, and varied relational structure. 
102
103And this is the type of structure we want to also in our $\\epsilon$-machines, so that the insights gleaned may be generalize better to help us understand how artificial intelligence learns from, internally represents, and uses human language.
104
105The work in progress framework for this is conceptually grounded in concepts borrowed from [paradigmatic analysis](https://en.wikipedia.org/wiki/Paradigmatic_analysis) and [syntagmatic analysis](https://en.wikipedia.org/wiki/Syntagmatic_analysis). Instead studying how words can be chosen in positions within a body of text, we consider how symbols can be chosen for a transtion in a state machine. Within the hard constraints, some stochastic variability among valid choices can be permitted. 
106
107There is one complication, and that is an $\\epsilon-$machine must be minimal and unifilar. You can't simply change a symbol on an edge at random, as it may break the structural properties of the machine.
108
109Once suitable syntagmatics are defined for a single machine, we can apply the same concepts to joined collections of machines, in a way that is analogous to [discourse structure](https://en.wikipedia.org/wiki/Discourse_relation) in human language. At this level, we can create $\\epsilon$-machines, with hierarchical isomorphic substructure.
110
111A framework and API for generating models with these properties is still being worked out.
112
113## Mixed State Presentation
114
115The mixed states presentation construction [^3][^4] tracks beliefs about which intrinsic state the $\\epsilon$-machicine is in as an observer attempts to syncronize with the $\\epsilon$ machines. MSP construction can be useful for analyzing learning dynamics, because this is the task that a neural network learning from the generated data must perform to minimize loss. This has been emipirically demonstrated in transformer models (in at least one instance) by Shai et al. [^5].
116
117The MSP can explode in terms of the number of states that are produced, or even be inifite for non-synchronizable processes. The belief states transitions also may produce fractal-like patterns in the  probability simplex.
118
119A-Machine implements [MSP construction](amachine/am_machine.html#get_msp) from a given $\\epsilon$ machine. Since the MSP may have too many states to compute, a cap on the max number of states is imposed, and when the cap is reached, the MSP is closed by mapping the unresolved frontier of belief states back to the nearest existing belief states. When this happens, the final MSP is only an approximation.
120
121The MSP also yields closed form expressions for a range of complexity measures [^6], such as $\\mathbf{E}, \\mathbf{S}$, and $\\mathbf{T}$, which can be solved for using spectral decompositon. A-machine implements this approach to computing these complexity measures, in addition to the alternative, block entropy convergence. When the MSP construction excedes the max states cap, comparing the measures computed from its approximation with those approximated by block entropy convergence is useful for corroboration and gaining additional insight into the synronization dynamics.
122
123## Block Entropy Convergence
124
125As mentioned, [block entropy convergence](amachine/am_hmm.html#HMM.block_convergence) supports estimation of a range of complexity measures, and allows you to observe the convergence in uncertainty as a function of $L$ (after seeing all blocks of length $L$). A-machine implements this as a [C++ extension](https://gitlab.com/tneuroth/a-machine/-/blob/main/src/amachine/am_fast/am_fast.cpp).
126
127## Data Generation
128
129[Data generation](amachine/am_hmm.html#HMM.generate_data) generates stochastic sequences from a given machine, optionally along with the sequence of hidden states that were traversed, and also optionally with an additional symbol sequence with what we are calling [isomorphic_shift](amachine/am_hmm.html#HMM.isomorphic_shift) applied. This changes the symbols that were emited by a state, for symbols that would have been emitted by alternative isomorphic states (should they exist).
130
131```python
132import amachine as am
133
134am.srand_global(42)
135
136m = am.random_machine( 
137        n_states=11, 
138        symbols=[ '0', '1', '2' ],
139        connectedness=0.65,
140        randomness=0.37 )
141
142# Collapse to the largest recurrent subgraph
143m.collapse_to_largest_strongly_connected_subgraph()
144
145# Minimize the machine -> epsilon-machine.
146m.minimize()
147
148# Create an isomorphic machine with new labels
149m_iso = am.isomorphic_to( m, alphabet=[ '3', '4', '5' ] )
150
151m_iso.isoclass = 0
152m.isoclass     = 0
153
154for j, state in enumerate( m.states ) : 
155    m.states[ j ].add_isomorph( m_iso.states[ j ].name )
156    m_iso.states[ j ].add_isomorph( m.states[ j ].name )
157    
158# Join them together
159m_star = am.star_join(
160    exit_symbol='x', 
161    enter_symbols=['a','b'],
162    machines=[ m, m_iso ],
163    mode_residency_factor=0.75 
164)
165
166m_star.draw_graph()
167
168m_star.generate_data(
169    file_prefix="./data/iso_012-345", # saves as iso_012-345.parqet
170    n_gen=2_000_000_000, # 2 billion symbols    
171    include_states=True,
172    isomorphic_shifts={1}
173)
174```
175
176<img src="resources/iso.png" alt="iso machine" style="width: 100%; margin-left: 0%;">
177
178The two sequences will be aligned, and with identical local and global statistics. Thus, you can train on one of them, and then analyze activation patterns and residuals on both of them in parallel during inference, to study if and how the neural network learned polysemantic representations that exploit the structural redundancies.
179
180*Cleaner and more general methods for constructing graphs with ismorphic substructure are in the works.*
181
182[^1]: Crutchfield, "The calculi of emergence: computation, dynamics and induction.", 1994.
183    <https://csc.ucdavis.edu/~cmg/papers/CalcEmerg.pdf>
184
185[^2]: Shalizi, Rohilla, and Crutchfield. "Computational mechanics: Pattern and prediction, structure and simplicity.", 2001.
186    <https://arxiv.org/abs/cond-mat/9907176>
187
188[^3]: Blackwell, David Harold. "The entropy of functions of finite-state Markov chains.", 1959.
189
190[^4]: Jurgens & Crutchfield "Shannon Entropy Rate of Hidden Markov Processes", 2021.
191    <https://link.springer.com/article/10.1007/s10955-021-02769-3>
192
193[^5]: Shai et al., "Transformers Represent Belief State Geometry in their Residual Stream", 2024. 
194    <https://arxiv.org/abs/2405.15943>
195
196[^6]: Crutchfield, "Exact complexity: The spectral decomposition of intrinsic computation", 2016.
197    <https://csc.ucdavis.edu/~cmg/papers/ec.pdf>
198
199"""
200__version__ = "0.2.0"
201
202from .am_create import (
203    random_machine,
204    star_join,
205    star,
206    isomorphic_to,
207    full_isomorphic_rotation,
208    unique_isomorphic
209)
210
211from .am_hmm import HMM
212
213from .am_syntagmatics import (
214    Syntagmatics,
215    ExplicitCohesionRule
216)
217
218from .am_substitution_algebra import (
219    SubstitutionAlgebra
220)
221
222from .am_random import srand_global
223from .am_vocabulary import Vocabulary