Implementing CHLU: From Paper to Prototype in One Night
This is a work-in-progress experiment. I implemented a paper to understand it better — this is not original research. Full credit to Pratik Jawahar and Maurizio Pierini for the CHLU architecture.
I came across a paper at ICLR 2026 that caught my attention: "Causal Hamiltonian Learning Units" by Pratik Jawahar and Maurizio Pierini. The idea is simple and elegant — instead of learning an unconstrained mapping from input to output, you learn a Hamiltonian energy function and make predictions by integrating the equations of motion.
I wanted to understand how this actually works in practice, so I built a complete implementation in about 8 hours on a DGX Spark.
What Is CHLU?
In classical physics, a Hamiltonian describes the total energy of a system as the sum of kinetic energy (motion) and potential energy (position). If you know the Hamiltonian, you can predict how the system evolves over time by solving Hamilton's equations.
CHLU applies this idea to neural networks. Instead of learning a direct input→output mapping, it:
- Encodes the input into a position-momentum pair (q, p) — like placing a ball on an energy landscape
- Evolves the state forward using a symplectic integrator (Velocity Verlet) — like letting the ball roll according to the energy landscape
- Decodes the final state back to the output
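The integrator at the heart of step 2 is standard velocity Verlet. A minimal sketch — using a classical kinetic energy T(p) = ½‖p‖² and an analytic toy potential in NumPy, rather than the learned V_θ and the repo's torch code — shows the property that matters: energy stays bounded even over very long rollouts.

```python
import numpy as np

def velocity_verlet(q, p, grad_V, dt=0.01, n_steps=1000):
    # One symplectic update per step: half-kick, drift, half-kick.
    for _ in range(n_steps):
        p = p - 0.5 * dt * grad_V(q)  # half-step momentum update
        q = q + dt * p                # full-step position update (unit mass)
        p = p - 0.5 * dt * grad_V(q)  # second half-step momentum update
    return q, p

# Toy harmonic potential V(q) = 0.5 * ||q||^2, so grad_V(q) = q.
grad_V = lambda q: q
q0, p0 = np.array([1.0]), np.array([0.0])
qT, pT = velocity_verlet(q0, p0, grad_V, dt=0.01, n_steps=10_000)

E0 = 0.5 * p0 @ p0 + 0.5 * q0 @ q0
ET = 0.5 * pT @ pT + 0.5 * qT @ qT
print(abs(ET - E0))  # energy drift stays tiny even after 10k steps
```

A non-symplectic scheme like forward Euler would leak energy monotonically on the same problem; Verlet's drift merely oscillates at O(dt²).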
The Hamiltonian is: H(q, p) = T(p) + V_θ(q) + α‖q‖²
Where T(p) is relativistic kinetic energy (velocity is bounded by a speed-of-light constant c), V_θ(q) is a learnable potential energy function (an MLP), and α‖q‖² is a confinement term that prevents the state from drifting off to infinity.
```
Input x → Encoder → (q₀, p₀) → Velocity Verlet → (q_T, p_T) → Decoder → Output ŷ
                                      ↑
                     H(q,p) = T(p) + V_θ(q) + α‖q‖²
```

Why would you want this? Three reasons from the paper:
- Energy conservation — the symplectic integrator preserves energy by construction, so the model can't predict states that violate physics
- Long-horizon stability — regular Neural ODEs accumulate energy drift over long rollouts; symplectic integration doesn't
- Bounded velocity — the relativistic kinetic energy means the system can never accelerate past c, which makes perturbation response predictable
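The bounded-velocity claim is easy to check numerically. Assuming the standard relativistic form T(p) = c²(√(1 + ‖p‖²/c²) − 1) with unit mass (the paper's exact parameterization may differ), the velocity dq/dt = ∂T/∂p saturates at c no matter how large the momentum grows:

```python
import numpy as np

c = 2.0  # speed-limit constant, matching the c=2.0 I used in my config

def velocity(p, c=c):
    # dq/dt = dT/dp for T(p) = c^2 * (sqrt(1 + p^2/c^2) - 1), unit mass.
    # The denominator grows with |p|, so |velocity| < c for any finite p.
    return p / np.sqrt(1.0 + (p / c) ** 2)

for p in [0.1, 1.0, 100.0, 1e6]:
    assert abs(velocity(np.array(p))) < c
print(float(velocity(np.array(1e6))))  # just under c = 2.0
```

This is exactly why the 5× velocity kick in Experiment B below can't send the state to infinity: the momentum spikes, but the velocity it induces is clamped.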
CHLU is designed as a drop-in replacement for LSTM or Neural ODE layers in temporal tasks.
What I Built
```
p2p-chlu/
├── chlu/
│   ├── core/
│   │   ├── hamiltonian.py      # PotentialMLP + HamiltonianLayer
│   │   ├── integrator.py       # Symplectic (Velocity Verlet) integrator
│   │   └── chlu_unit.py        # Main nn.Module
│   ├── training/
│   │   └── contrastive.py      # Wake-sleep + CD training loop
│   ├── baselines/
│   │   ├── lstm_baseline.py    # LSTM comparison
│   │   └── node_baseline.py    # Neural ODE comparison
│   └── experiments/
│       ├── exp_a_stability.py  # Trajectory stability
│       ├── exp_b_safety.py     # Perturbation safety
│       └── exp_c_generate.py   # MNIST generation
├── tests/                      # 35 unit tests
└── results/                    # Checkpoints + plots
```
I reproduced all three experiments from the paper and compared CHLU against LSTM and Neural ODE baselines.
Experiment Results
Experiment A: Long-Horizon Stability
Train on 3 cycles of a lemniscate (figure-8 trajectory), then roll out 50 cycles. The question: does the model stay on the trajectory or diverge?
| Metric | Result |
|---|---|
| Best MSE | ~0.0000 (epoch 14) |
| Lyapunov term | Stable |
| Training speed | ~46 steps/epoch, ~3s/epoch |
CHLU learned stable trajectories. The symplectic integrator prevented the energy drift that typically kills Neural ODE on long rollouts. This matches the paper's claims.
Experiment B: Perturbation Safety
A trained model gets a 5× velocity kick — a large perturbation. How does it respond?
| Model | Max Kinetic Energy | Max Velocity | Status |
|---|---|---|---|
| CHLU | 168.95 | 18.38 | Handled it |
| LSTM | 26.29 | 7.25 | Lower peak |
| Neural ODE | — | — | Diverged (dt underflow) |
This one is nuanced. LSTM showed lower peak kinetic energy, but that's likely because it doesn't model physics — it compresses the response rather than propagating it accurately. Neural ODE crashed entirely with a dt underflow error during the large perturbation. CHLU handled the perturbation without crashing and its response is physically plausible (the relativistic velocity bound kept things finite), though the higher KE suggests the confinement term could be tuned further.
Experiment C: MNIST Generation
Train a CHLU autoencoder on MNIST images, then generate digits via Langevin dynamics from class centroids with temperature annealing.
| Metric | Result |
|---|---|
| Best MSE | 0.01827 (epoch 14) |
| Samples generated | 100 (10 per class) |
CHLU generated recognizable digit samples across all 10 classes in 20 epochs. Quality varies by class — some digits are cleaner than others. This is a proof of concept, not state-of-the-art generation.
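The sampler here is my toy reconstruction of the idea, not the repo's code: Langevin dynamics on a quadratic stand-in energy, with a geometric temperature schedule so early steps explore and late steps settle near the class centroid.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_E(q, centroid):
    # Gradient of a toy quadratic energy pulling q toward a class centroid.
    return q - centroid

def annealed_langevin(centroid, n_steps=200, step=0.05, t_start=1.0, t_end=0.01):
    # Langevin update: gradient descent on the energy plus noise scaled by
    # the current temperature. Annealing from hot to cold trades early
    # exploration for late mode-seeking.
    q = centroid + rng.normal(size=centroid.shape)
    for T in np.geomspace(t_start, t_end, n_steps):
        noise = rng.normal(size=q.shape)
        q = q - step * grad_E(q, centroid) + np.sqrt(2 * step * T) * noise
    return q

centroid = np.array([1.0, -1.0])
sample = annealed_langevin(centroid)  # ends close to the centroid mode
```

In the real experiment the quadratic energy is replaced by the trained CHLU energy landscape, and the centroids are per-class means in latent space.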
What Broke (and How I Fixed It)
This build hit five distinct failure modes before producing stable results. These are the most useful things I learned:
1. @torch.no_grad() breaks symplectic integration
The symplectic integrator computes forces as F = -∂V/∂q using torch.autograd.grad. I wrapped evaluation in @torch.no_grad() for speed — which silently killed the gradient computation. Autograd is not optional for symplectic integration, even during evaluation.
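The fix that worked for me was to re-enable autograd locally inside the force computation, so the integrator keeps working even when the caller has disabled gradients. A toy quadratic potential stands in for the PotentialMLP here:

```python
import torch

def force(potential, q):
    # F = -dV/dq must come from autograd even at eval time. enable_grad()
    # re-opens graph construction if the caller wrapped us in no_grad().
    with torch.enable_grad():
        q = q.detach().requires_grad_(True)
        V = potential(q).sum()
        (grad_q,) = torch.autograd.grad(V, q)
    return -grad_q

potential = lambda q: 0.5 * (q ** 2).sum(dim=-1)  # toy stand-in for V_theta
q = torch.tensor([[1.0, 2.0]])

with torch.no_grad():  # an eval loop that would otherwise kill the force
    F = force(potential, q)
# F equals -q here: tensor([[-1., -2.]])
```

Without the `enable_grad()` wrapper, `torch.autograd.grad` raises inside a `no_grad()` context because no graph was built — which is exactly the silent-then-loud failure I hit.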
2. Contrastive Divergence without energy normalization causes explosion
CD loss computes H_wake - H_sleep (real state energy minus fantasy state energy). Without constraints on the Hamiltonian, this difference can grow unbounded. I saw loss go from -4,600 to -540,000 in 30 epochs. The model collapsed to outputting constants.
Fix: Add spectral normalization to the PotentialMLP. This bounds the Lipschitz constant of the potential function, preventing unbounded energy landscapes.
```python
# In PotentialMLP.__init__ (nn is torch.nn):
self.layers = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(input_dim, hidden_dim)),
    nn.Tanh(),
    nn.utils.spectral_norm(nn.Linear(hidden_dim, hidden_dim)),
    nn.Tanh(),
    nn.utils.spectral_norm(nn.Linear(hidden_dim, 1)),
)
```
This is the most important implementation detail. It's not mentioned explicitly in the paper, but is essential for stable CD training. If you're implementing any energy-based model with CD training, spectral normalization (or another Lipschitz constraint) is not optional.
3. Removing CD loss causes long-horizon divergence
After the CD explosion, I tried removing CD entirely. Training looked perfect (1-step MSE ≈ 0), but on 10-cycle rollout evaluation, CHLU MSE = 94,337 vs LSTM MSE = 27.
CD shapes the energy landscape so the model generalizes beyond the training distribution. Without it, the model memorizes trajectories but can't extrapolate.
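In toy form — quadratic stand-ins for T and V, made-up numbers — the objective I ended up keeping looks roughly like this. The CD term pushes energy down on encoded real states and up on fantasy states, which is what shapes the landscape away from the training trajectories:

```python
import numpy as np

def H(q, p, alpha=0.1):
    # Toy Hamiltonian: quadratic T(p) and V(q) plus the confinement term.
    return (0.5 * (p ** 2).sum()
            + 0.5 * (q ** 2).sum()
            + alpha * (q ** 2).sum())

def cd_loss(wake, sleep):
    # Contrastive divergence: H(real) - H(fantasy). Unbounded below unless
    # the potential is Lipschitz-constrained (hence spectral norm above).
    (q_w, p_w), (q_s, p_s) = wake, sleep
    return H(q_w, p_w) - H(q_s, p_s)

wake = (np.array([0.1, 0.2]), np.array([0.0, 0.1]))   # encoded real data
sleep = (np.array([1.5, -1.0]), np.array([0.5, 0.3]))  # Langevin fantasies

one_step_mse = 0.02  # placeholder for the reconstruction term
total = one_step_mse + 0.01 * cd_loss(wake, sleep)  # cd_weight = 0.01
```

Dropping the CD term leaves only the reconstruction loss, which is exactly the memorize-but-don't-extrapolate regime I observed.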
4. Neural ODE baseline crashes on large perturbations
The Neural ODE baseline (torchdiffeq) crashes with `AssertionError: underflow in dt 0.0` when the system diverges. A single crash kills the entire evaluation pipeline. I wrapped it in try/except to report divergence gracefully instead.
5. Evaluation bottleneck
The original evaluation code ran the full 10-cycle rollout 6 separate times (3 for metrics, 3 for plots). With autograd required at each step, this was ~6× slower than necessary. A single combined loop computing metrics and plot data simultaneously gave a 5× speedup.
Using CHLU in Your Own Code
```python
import torch
from chlu import CHLUUnit

model = CHLUUnit(
    input_dim=2,    # Feature dimension
    latent_dim=16,  # Phase space dimension
    c=2.0,          # Speed limit
    dt=0.01,        # Integration step size
    n_steps=10,     # Verlet steps per forward pass
)

# Single-step prediction
x = torch.randn(32, 2)
y_hat = model(x)

# Sequence generation
seq = model.evolve_sequence(x, seq_len=100)  # (32, 100, 2)
```
Training with contrastive divergence:
```python
from chlu.training import ContrastiveTrainer

trainer = ContrastiveTrainer(
    model=model,
    lr=1e-3,
    cd_weight=0.01,
    lyap_weight=0.1,
)
trainer.train(dataloader, epochs=100)
```
What's Next
Some improvements I'd like to explore:
- Better checkpoint selection — MSE degraded in later epochs (Exp A: 0.07 → 0.157 from epochs 80-100). Best checkpoint is around epoch 14-20, not the final one. Proper validation-based selection would help.
- Longer training with LR schedule — I only ran 20 epochs due to time. A CosineAnnealingLR schedule over 100+ epochs would likely improve results.
- Better CD sampling — Currently using Langevin dynamics for sleep-phase fantasy particles. Persistent Contrastive Divergence with a replay buffer might produce better negative samples.
More speculative directions: using CHLU as a physics prior in model-based RL, scaling to N-body systems, or adding Bayesian treatment of the potential function for uncertainty quantification.
What I Learned
The main takeaway: implementing a paper is one of the best ways to understand it. The published results tell you what works. The implementation tells you why it works — and more importantly, what the paper doesn't mention (like spectral normalization being essential for CD training).
CHLU is an interesting architecture for problems where physics-inspired constraints matter: energy conservation, bounded dynamics, long-horizon stability. Whether it's practically better than well-tuned LSTMs for real applications is still an open question — my results are from a single night of experimentation, not a rigorous benchmark.
Try It
```bash
git clone https://github.com/sahilmalik27/p2p-chlu
cd p2p-chlu && pip install -e ".[dev]"

# Run all experiments
python -m chlu exp-a --epochs 20
python -m chlu exp-b --epochs 20
python -m chlu exp-c --epochs 20
```
Paper: "Causal Hamiltonian Learning Units" — Pratik Jawahar and Maurizio Pierini (ICLR 2026).
All checkpoints are in the repo. MIT License.
Thanks for reading. Follow me for more.