Implementing CHLU: From Paper to Prototype in One Night
This is a work-in-progress experiment. I implemented a paper to understand it better — this is not original research. Full credit to Pratik Jawahar and Maurizio Pierini for the CHLU architecture.
I came across a paper at ICLR 2026 that caught my attention: "Causal Hamiltonian Learning Units" by Pratik Jawahar and Maurizio Pierini. The idea is simple and elegant — instead of learning an unconstrained mapping from input to output, you learn a Hamiltonian energy function and make predictions by integrating the equations of motion.
I wanted to understand how this actually works in practice, so I built a complete implementation in about 8 hours on a DGX Spark.
What Is CHLU?
In classical physics, a Hamiltonian describes the total energy of a system as the sum of kinetic energy (motion) and potential energy (position). If you know the Hamiltonian, you can predict how the system evolves over time by solving Hamilton's equations.
CHLU applies this idea to neural networks. Instead of learning a direct input→output mapping, it:
- Encodes the input into a position-momentum pair (q, p) — like placing a ball on an energy landscape
- Evolves the state forward using a symplectic integrator (Velocity Verlet) — like letting the ball roll according to the energy landscape
- Decodes the final state back to the output
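The integrator at the heart of step 2 is standard velocity Verlet. A minimal sketch — using a classical kinetic energy T(p) = ½‖p‖² and an analytic toy potential in NumPy, rather than the learned V_θ and the repo's torch code — shows the property that matters: energy stays bounded even over very long rollouts.

```python
import numpy as np

def velocity_verlet(q, p, grad_V, dt=0.01, n_steps=1000):
    # One symplectic update per step: half-kick, drift, half-kick.
    for _ in range(n_steps):
        p = p - 0.5 * dt * grad_V(q)  # half-step momentum update
        q = q + dt * p                # full-step position update (unit mass)
        p = p - 0.5 * dt * grad_V(q)  # second half-step momentum update
    return q, p

# Toy harmonic potential V(q) = 0.5 * ||q||^2, so grad_V(q) = q.
grad_V = lambda q: q
q0, p0 = np.array([1.0]), np.array([0.0])
qT, pT = velocity_verlet(q0, p0, grad_V, dt=0.01, n_steps=10_000)

E0 = 0.5 * p0 @ p0 + 0.5 * q0 @ q0
ET = 0.5 * pT @ pT + 0.5 * qT @ qT
print(abs(ET - E0))  # energy drift stays tiny even after 10k steps
```

A non-symplectic scheme like forward Euler would leak energy monotonically on the same problem; Verlet's drift merely oscillates at O(dt²).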
The Hamiltonian is: H(q, p) = T(p) + V_θ(q) + α‖q‖²
Where T(p) is relativistic kinetic energy (velocity is bounded by a speed-of-light constant c), V_θ(q) is a learnable potential energy function (an MLP), and α‖q‖² is a confinement term that prevents the state from drifting off to infinity.
```
Input x → Encoder → (q₀, p₀) → Velocity Verlet → (q_T, p_T) → Decoder → Output ŷ
                                      ↑
                     H(q,p) = T(p) + V_θ(q) + α‖q‖²
```

Why would you want this? Three reasons from the paper:
- Energy conservation — the symplectic integrator preserves energy by construction, so the model can't predict states that violate physics
- Long-horizon stability — regular Neural ODEs accumulate energy drift over long rollouts; symplectic integration doesn't
- Bounded velocity — the relativistic kinetic energy means the system can never accelerate past c, which makes perturbation response predictable
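The bounded-velocity claim is easy to check numerically. Assuming the standard relativistic form T(p) = c²(√(1 + ‖p‖²/c²) − 1) with unit mass (the paper's exact parameterization may differ), the velocity dq/dt = ∂T/∂p saturates at c no matter how large the momentum grows:

```python
import numpy as np

c = 2.0  # speed-limit constant, matching the c=2.0 I used in my config

def velocity(p, c=c):
    # dq/dt = dT/dp for T(p) = c^2 * (sqrt(1 + p^2/c^2) - 1), unit mass.
    # The denominator grows with |p|, so |velocity| < c for any finite p.
    return p / np.sqrt(1.0 + (p / c) ** 2)

for p in [0.1, 1.0, 100.0, 1e6]:
    assert abs(velocity(np.array(p))) < c
print(float(velocity(np.array(1e6))))  # just under c = 2.0
```

This is exactly why the 5× velocity kick in Experiment B below can't send the state to infinity: the momentum spikes, but the velocity it induces is clamped.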
CHLU is designed as a drop-in replacement for LSTM or Neural ODE layers in temporal tasks.
What I Built
```
p2p-chlu/
├── chlu/
│   ├── core/
│   │   ├── hamiltonian.py      # PotentialMLP + HamiltonianLayer
│   │   ├── integrator.py       # Symplectic (Velocity Verlet) integrator
│   │   └── chlu_unit.py        # Main nn.Module
│   ├── training/
│   │   └── contrastive.py      # Wake-sleep + CD training loop
│   ├── baselines/
│   │   ├── lstm_baseline.py    # LSTM comparison
│   │   └── node_baseline.py    # Neural ODE comparison
│   └── experiments/
│       ├── exp_a_stability.py  # Trajectory stability
│       ├── exp_b_safety.py     # Perturbation safety
│       └── exp_c_generate.py   # MNIST generation
├── tests/                      # 35 unit tests
└── results/                    # Checkpoints + plots
```
I reproduced all three experiments from the paper and compared CHLU against LSTM and Neural ODE baselines.
Experiment Results
Experiment A: Long-Horizon Stability
Train on 3 cycles of a lemniscate (figure-8 trajectory), then roll out 50 cycles. The question: does the model stay on the trajectory or diverge?
| Metric | Result |
|---|---|
| Best MSE | ~0.0000 (epoch 14) |
| Lyapunov term | Stable |
| Training speed | ~46 steps/epoch, ~3s/epoch |
CHLU learned stable trajectories. The symplectic integrator prevented the energy drift that typically kills Neural ODE on long rollouts. This matches the paper's claims.
Experiment B: Perturbation Safety
A trained model gets a 5× velocity kick — a large perturbation. How does it respond?
| Model | Max Kinetic Energy | Max Velocity | Status |
|---|---|---|---|
| CHLU | 168.95 | 18.38 | Handled it |
| LSTM | 26.29 | 7.25 | Lower peak |
| Neural ODE | — | — | Diverged (dt underflow) |
This one is nuanced. LSTM showed lower peak kinetic energy, but that's likely because it doesn't model physics — it compresses the response rather than propagating it accurately. Neural ODE crashed entirely with a dt underflow error during the large perturbation. CHLU handled the perturbation without crashing and its response is physically plausible (the relativistic velocity bound kept things finite), though the higher KE suggests the confinement term could be tuned further.
Experiment C: MNIST Generation
Train a CHLU autoencoder on MNIST images, then generate digits via Langevin dynamics from class centroids with temperature annealing.
| Metric | Result |
|---|---|
| Best MSE | 0.01827 (epoch 14) |
| Samples generated | 100 (10 per class) |
CHLU generated recognizable digit samples across all 10 classes in 20 epochs. Quality varies by class — some digits are cleaner than others. This is a proof of concept, not state-of-the-art generation.
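The sampler here is my toy reconstruction of the idea, not the repo's code: Langevin dynamics on a quadratic stand-in energy, with a geometric temperature schedule so early steps explore and late steps settle near the class centroid.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_E(q, centroid):
    # Gradient of a toy quadratic energy pulling q toward a class centroid.
    return q - centroid

def annealed_langevin(centroid, n_steps=200, step=0.05, t_start=1.0, t_end=0.01):
    # Langevin update: gradient descent on the energy plus noise scaled by
    # the current temperature. Annealing from hot to cold trades early
    # exploration for late mode-seeking.
    q = centroid + rng.normal(size=centroid.shape)
    for T in np.geomspace(t_start, t_end, n_steps):
        noise = rng.normal(size=q.shape)
        q = q - step * grad_E(q, centroid) + np.sqrt(2 * step * T) * noise
    return q

centroid = np.array([1.0, -1.0])
sample = annealed_langevin(centroid)  # ends close to the centroid mode
```

In the real experiment the quadratic energy is replaced by the trained CHLU energy landscape, and the centroids are per-class means in latent space.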
What Broke (and How I Fixed It)
This build hit five distinct failure modes before producing stable results. These are the most useful things I learned:
1. @torch.no_grad() breaks symplectic integration
The symplectic integrator computes forces as F = -∂V/∂q using torch.autograd.grad. I wrapped evaluation in @torch.no_grad() for speed — which silently killed the gradient computation. Autograd is not optional for symplectic integration, even during evaluation.
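The fix that worked for me was to re-enable autograd locally inside the force computation, so the integrator keeps working even when the caller has disabled gradients. A toy quadratic potential stands in for the PotentialMLP here:

```python
import torch

def force(potential, q):
    # F = -dV/dq must come from autograd even at eval time. enable_grad()
    # re-opens graph construction if the caller wrapped us in no_grad().
    with torch.enable_grad():
        q = q.detach().requires_grad_(True)
        V = potential(q).sum()
        (grad_q,) = torch.autograd.grad(V, q)
    return -grad_q

potential = lambda q: 0.5 * (q ** 2).sum(dim=-1)  # toy stand-in for V_theta
q = torch.tensor([[1.0, 2.0]])

with torch.no_grad():  # an eval loop that would otherwise kill the force
    F = force(potential, q)
# F equals -q here: tensor([[-1., -2.]])
```

Without the `enable_grad()` wrapper, `torch.autograd.grad` raises inside a `no_grad()` context because no graph was built — which is exactly the silent-then-loud failure I hit.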
2. Contrastive Divergence without energy normalization causes explosion
CD loss computes H_wake - H_sleep (real state energy minus fantasy state energy). Without constraints on the Hamiltonian, this difference can grow unbounded. I saw loss go from -4,600 to -540,000 in 30 epochs. The model collapsed to outputting constants.
Fix: Add spectral normalization to the PotentialMLP. This bounds the Lipschitz constant of the potential function, preventing unbounded energy landscapes.
```python
# In PotentialMLP.__init__ (nn is torch.nn):
self.layers = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(input_dim, hidden_dim)),
    nn.Tanh(),
    nn.utils.spectral_norm(nn.Linear(hidden_dim, hidden_dim)),
    nn.Tanh(),
    nn.utils.spectral_norm(nn.Linear(hidden_dim, 1)),
)
```
This is the most important implementation detail. It's not mentioned explicitly in the paper, but is essential for stable CD training. If you're implementing any energy-based model with CD training, spectral normalization (or another Lipschitz constraint) is not optional.
3. Removing CD loss causes long-horizon divergence
After the CD explosion, I tried removing CD entirely. Training looked perfect (1-step MSE ≈ 0), but on 10-cycle rollout evaluation, CHLU MSE = 94,337 vs LSTM MSE = 27.
CD shapes the energy landscape so the model generalizes beyond the training distribution. Without it, the model memorizes trajectories but can't extrapolate.
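In toy form — quadratic stand-ins for T and V, made-up numbers — the objective I ended up keeping looks roughly like this. The CD term pushes energy down on encoded real states and up on fantasy states, which is what shapes the landscape away from the training trajectories:

```python
import numpy as np

def H(q, p, alpha=0.1):
    # Toy Hamiltonian: quadratic T(p) and V(q) plus the confinement term.
    return (0.5 * (p ** 2).sum()
            + 0.5 * (q ** 2).sum()
            + alpha * (q ** 2).sum())

def cd_loss(wake, sleep):
    # Contrastive divergence: H(real) - H(fantasy). Unbounded below unless
    # the potential is Lipschitz-constrained (hence spectral norm above).
    (q_w, p_w), (q_s, p_s) = wake, sleep
    return H(q_w, p_w) - H(q_s, p_s)

wake = (np.array([0.1, 0.2]), np.array([0.0, 0.1]))   # encoded real data
sleep = (np.array([1.5, -1.0]), np.array([0.5, 0.3]))  # Langevin fantasies

one_step_mse = 0.02  # placeholder for the reconstruction term
total = one_step_mse + 0.01 * cd_loss(wake, sleep)  # cd_weight = 0.01
```

Dropping the CD term leaves only the reconstruction loss, which is exactly the memorize-but-don't-extrapolate regime I observed.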
4. Neural ODE baseline crashes on large perturbations
The Neural ODE baseline (torchdiffeq) crashes with `AssertionError: underflow in dt 0.0` when the system diverges. A single crash kills the entire evaluation pipeline. I wrapped it in try/except to report divergence gracefully instead.
5. Evaluation bottleneck
The original evaluation code ran the full 10-cycle rollout 6 separate times (3 for metrics, 3 for plots). With autograd required at each step, this was ~6× slower than necessary. A single combined loop computing metrics and plot data simultaneously gave a 5× speedup.
Using CHLU in Your Own Code
```python
import torch
from chlu import CHLUUnit

model = CHLUUnit(
    input_dim=2,    # Feature dimension
    latent_dim=16,  # Phase space dimension
    c=2.0,          # Speed limit
    dt=0.01,        # Integration step size
    n_steps=10,     # Verlet steps per forward pass
)

# Single-step prediction
x = torch.randn(32, 2)
y_hat = model(x)

# Sequence generation
seq = model.evolve_sequence(x, seq_len=100)  # (32, 100, 2)
```
Training with contrastive divergence:
```python
from chlu.training import ContrastiveTrainer

trainer = ContrastiveTrainer(
    model=model,
    lr=1e-3,
    cd_weight=0.01,
    lyap_weight=0.1,
)
trainer.train(dataloader, epochs=100)
```
What's Next
Some improvements I'd like to explore:
- Better checkpoint selection — MSE degraded in later epochs (Exp A: 0.07 → 0.157 from epochs 80-100). Best checkpoint is around epoch 14-20, not the final one. Proper validation-based selection would help.
- Longer training with LR schedule — I only ran 20 epochs due to time. A CosineAnnealingLR schedule over 100+ epochs would likely improve results.
- Better CD sampling — Currently using Langevin dynamics for sleep-phase fantasy particles. Persistent Contrastive Divergence with a replay buffer might produce better negative samples.
More speculative directions: using CHLU as a physics prior in model-based RL, scaling to N-body systems, or adding Bayesian treatment of the potential function for uncertainty quantification.
What I Learned
The main takeaway: implementing a paper is one of the best ways to understand it. The published results tell you what works. The implementation tells you why it works — and more importantly, what the paper doesn't mention (like spectral normalization being essential for CD training).
CHLU is an interesting architecture for problems where physics-inspired constraints matter: energy conservation, bounded dynamics, long-horizon stability. Whether it's practically better than well-tuned LSTMs for real applications is still an open question — my results are from a single night of experimentation, not a rigorous benchmark.
Try It
```bash
git clone https://github.com/sahilmalik27/p2p-chlu
cd p2p-chlu && pip install -e ".[dev]"

# Run all experiments
python -m chlu exp-a --epochs 20
python -m chlu exp-b --epochs 20
python -m chlu exp-c --epochs 20
```
Paper: "Causal Hamiltonian Learning Units" — Pratik Jawahar and Maurizio Pierini (ICLR 2026).
All checkpoints are in the repo. MIT License.
Thanks for reading. Follow me for more.