Stateful Stream Processing Without the Infrastructure Overhead
There's an awkward gap in stream processing tooling. On one end, you have serverless functions (Lambda, Cloud Functions) — fast to start, cheap when idle, but completely stateless. On the other end, you have full stream processing engines (Apache Flink, Faust, Kafka Streams) — stateful and powerful, but always running, even when no data is flowing.
For a large class of real workloads — a 10-second video upload, a 60-second sensor burst, a user checkout session — neither end fits well. You need state within the stream, but the stream is short-lived and unpredictable.
This gap was formally identified in a research paper (arXiv:2603.03089), which proposed stream functions as the right abstraction and showed a 99% overhead reduction over mature stream engines for this class of workload. The paper framed stream functions as a future cloud platform feature.
Velo is an open-source implementation of that idea — a Python library with a Rust core that makes stateful stream processing as simple as writing an async generator.
The Problem: State Management at Scale
Processing a single event in isolation is easy. The moment you need to remember something from a previous event — the last video frame for motion detection, the running total for fraud scoring, the baseline reading for anomaly detection — things get complicated.
The common approaches each have tradeoffs:
External state stores (Redis, DynamoDB): You save state to a database between function calls and load it back on the next call. For a 30fps video, that's 30 round trips per second to Redis just to hold one variable. It works, but you're using a distributed database as a global variable store.
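The per-event round-trip pattern looks like this in practice. This is a runnable sketch where a plain dict stands in for Redis so it executes without a server; `handle_frame` and the motion math are hypothetical, invented for illustration:

```python
import json

fake_redis = {}  # stands in for a real Redis instance

def handle_frame(stream_id, frame):
    # 1. load state from the store (one round trip per event)
    raw = fake_redis.get(stream_id)
    state = json.loads(raw) if raw else {"prev": None, "count": 0}
    # 2. do the actual work
    motion = abs(frame - state["prev"]) if state["prev"] is not None else 0
    state["prev"] = frame
    state["count"] += 1
    # 3. write state back (another round trip per event)
    fake_redis[stream_id] = json.dumps(state)
    return motion

# every event pays two store round trips just to keep one variable alive
scores = [handle_frame("upload-1", f) for f in [10, 12, 9]]
```

At 30fps, steps 1 and 3 each happen 30 times a second, and serialization cost grows with the state, not with the work.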
Always-on pipelines (Flink, Faust): These handle state beautifully, but they run continuously. A minimal Flink deployment costs ~$200-500/month just to exist. For a sensor that bursts for 60 seconds then goes silent for hours, that's a lot of idle cost.
Manual dict management: You start with session_state = {} and it seems simple. Then you need cleanup logic for ended sessions, locks for concurrent access, timeout handling, and suddenly you have 200 lines of lifecycle management instead of business logic.
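The sprawl starts innocently. Here is a sketch of just the first two pieces, state tracking and timeout cleanup, before locks and error handling even enter the picture (all names are hypothetical; the `now` parameter exists only to make the example deterministic):

```python
import time

SESSION_TIMEOUT = 60.0
session_state = {}  # stream_id -> {"prev": ..., "last_seen": ...}

def on_event(stream_id, event, now=None):
    now = now if now is not None else time.monotonic()
    state = session_state.setdefault(stream_id, {"prev": None, "last_seen": now})
    state["last_seen"] = now
    result = (state["prev"], event)  # the business logic: pair with previous
    state["prev"] = event
    return result

def cleanup(now=None):
    # without this, ended sessions leak memory forever
    now = now if now is not None else time.monotonic()
    stale = [sid for sid, st in session_state.items()
             if now - st["last_seen"] > SESSION_TIMEOUT]
    for sid in stale:
        del session_state[sid]
```

Note the business logic is one line; everything else is lifecycle plumbing, and this version still isn't safe under concurrent access.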
What Velo Does
Velo's model: instead of a function that handles one event, you write a function that handles an entire stream — and it stays alive for the duration.
```python
from velo import stream_fn

@stream_fn
async def detect_motion(frames):
    prev_frame = None
    async for frame in frames:
        if prev_frame is not None:
            motion = compare(frame, prev_frame)
            yield {"frame": frame.id, "motion_score": motion}
        prev_frame = frame
```
When a video upload starts, Velo spins up a worker for that stream. The worker runs your function, holds state as plain Python variables, and processes events as they arrive. When the stream ends, the worker is garbage collected.
No Redis. No always-on pipeline. No dict cleanup.
The API is four exports:
```python
from velo import stream_fn      # the decorator
from velo import Stream         # type hint for stream handles
from velo import StreamMetrics  # per-stream metrics
from velo import StreamConfig   # optional config
```
Two calling modes:
```python
# Batch — process a list, get results
results = await detect_motion.run(frame_list)

# Live — open a stream, send events, receive results
async with detect_motion.open() as stream:
    await stream.send(frame)
    result = await stream.recv()
```
And pipe composition:
```python
pipeline = normalize | detect_motion | alert_if_high
results = await pipeline.run(sensor_data)
```
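Operator composition like this is typically built by overloading `__or__` so each stage's output generator feeds the next stage's input. A toy sketch of the idea (`Stage`, `_double`, and `_keep_big` are invented for illustration; this is not Velo's internals):

```python
import asyncio

class Stage:
    """Toy pipeline stage: wraps an async-generator function and
    supports `a | b` composition."""
    def __init__(self, gen_fn):
        self.gen_fn = gen_fn

    def __or__(self, other):
        # chain: my output generator becomes the other stage's input
        async def chained(events):
            async for item in other.gen_fn(self.gen_fn(events)):
                yield item
        return Stage(chained)

    async def run(self, items):
        async def source():
            for item in items:
                yield item
        return [out async for out in self.gen_fn(source())]

async def _double(events):
    async for e in events:
        yield e * 2

async def _keep_big(events):
    async for e in events:
        if e > 4:
            yield e

pipeline = Stage(_double) | Stage(_keep_big)
# asyncio.run(pipeline.run([1, 2, 3])) doubles to 2, 4, 6, then keeps 6
```

The nice property is that state in each stage stays local to that stage's generator frame, so composition doesn't require any shared-state coordination.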
How It Works Under the Hood
Velo's runtime is written in Rust using tokio (the same async runtime behind Discord's backend), lock-free message channels from crossbeam, and PyO3 for the Python binding.
The data path:
```
send(event)
    ↓
Rust crossbeam SPSC channel (lock-free, GIL released)
    ↓
Python async generator (your code)
    ↓
Rust crossbeam SPSC output channel (lock-free, GIL released)
    ↓
recv() → caller
```
Rust handles stream lifecycle, channel buffering, backpressure (bounded channels), and concurrency limits. Python handles your stream function logic.
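Bounded channels are what make backpressure automatic: when the buffer is full, `send` simply waits until the consumer catches up. The same mechanic can be sketched in Python with a bounded `asyncio.Queue` (an illustration of the principle, not the Rust implementation):

```python
import asyncio

async def demo():
    q = asyncio.Queue(maxsize=2)  # bounded buffer, like a bounded channel

    async def producer():
        for i in range(5):
            await q.put(i)  # suspends while the buffer is full: backpressure

    async def slow_consumer():
        out = []
        for _ in range(5):
            await asyncio.sleep(0.01)  # consumer is slower than the producer
            out.append(await q.get())
        return out

    prod = asyncio.create_task(producer())
    result = await slow_consumer()
    await prod
    return result
```

No events are dropped and no unbounded buffer grows; the producer is throttled to the consumer's pace by the channel itself.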
Measured Performance
These numbers are from the benchmarks in the repo, run on real hardware:
| Metric | Result |
|---|---|
| Stream startup | ~350μs |
| Inter-event P99 latency | ~480μs |
| 1,000 concurrent streams | Stable |
| Throughput | ~6K events/sec |
The throughput bottleneck is the asyncio.to_thread bridge at the Python/Rust boundary (~200μs overhead per event). The Rust core handles >500K events/sec — the gap is OS thread scheduling, not the channels. A batch API (on the roadmap) would close this significantly. Full analysis is in docs/throughput-v2-design.md.
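The `asyncio.to_thread` cost is easy to measure yourself. A rough sketch (the no-op function and iteration count are arbitrary; absolute numbers depend heavily on your machine and OS scheduler):

```python
import asyncio
import time

def noop():
    return None

async def measure(n=200):
    start = time.perf_counter()
    for _ in range(n):
        await asyncio.to_thread(noop)  # one thread-pool hop per call
    return (time.perf_counter() - start) / n  # average seconds per hop

per_hop = asyncio.run(measure())
# per_hop is typically tens to hundreds of microseconds: the hop itself,
# not the work, dominates when the per-event work is small
```

That is why a batch API helps: amortizing one hop across many events removes the per-event scheduling cost without touching the channels.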
For comparison with the broader landscape:
| Tool | Startup | Has state | Idle cost |
|---|---|---|---|
| Velo | ~350μs | Local vars | Near zero |
| Redis + functions | ~ms + RTT | External | Redis always running |
| Apache Flink | 2-10 seconds | Yes | Full cluster running |
| Faust | ~seconds | Yes | Always on (Kafka required) |
The 99% overhead reduction claim comes from the original research paper, which benchmarked stream functions against Apache Flink on a video processing workload. That number is specific to short-lived streams — Velo is not trying to compete with Flink for continuous, high-throughput pipelines.
Example: Video Processing Integration
Here's what a real integration looks like — a FastAPI endpoint that processes uploaded video frames with per-upload isolated state:
```python
from fastapi import FastAPI, UploadFile
from velo import stream_fn

# compare, is_scene_change, extract_frames, and MOTION_THRESHOLD are
# application-level helpers, assumed to be defined elsewhere
@stream_fn
async def process_video(frames):
    stats = {"total": 0, "high_motion": 0, "scenes": []}
    prev = None
    async for frame in frames:
        motion = compare(frame, prev) if prev else 0
        if motion > MOTION_THRESHOLD:
            stats["high_motion"] += 1
        if is_scene_change(frame, prev):
            stats["scenes"].append(frame.timestamp)
        stats["total"] += 1
        prev = frame
        yield frame.with_metadata(stats)

app = FastAPI()

@app.post("/upload")
async def upload_video(file: UploadFile):
    processed = []
    async with process_video.open() as stream:
        async for result in stream.feed(extract_frames(file)):
            processed.append(result)
    return {
        "frames": len(processed),
        "high_motion": processed[-1].metadata["high_motion"],
        "scenes": processed[-1].metadata["scenes"],
    }
```
1,000 simultaneous uploads means 1,000 isolated workers, each with its own prev and stats. No worker knows about the others, and each uses on the order of kilobytes of memory.
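The isolation property falls out of how Python generators work: every invocation of an async generator gets its own frame, so its locals are private by construction. A minimal self-contained sketch (`detect` and `run_stream` are hypothetical stand-ins, not Velo API):

```python
import asyncio

async def detect(frames):
    # per-stream state is just a local variable in this generator frame
    prev = None
    async for f in frames:
        yield (prev, f)
        prev = f

async def run_stream(items):
    async def src():
        for i in items:
            yield i
    return [pair async for pair in detect(src())]

async def main():
    # two concurrent streams; each generator frame holds its own `prev`,
    # so the values never bleed across streams
    return await asyncio.gather(run_stream([1, 2]), run_stream([10, 20]))
```

Velo's worker-per-stream model scales this same guarantee out: the runtime multiplexes workers, but each one's locals belong to exactly one stream.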
When Velo Fits (and When It Doesn't)
Good fit:
- Streams are short-lived (seconds to minutes), not continuous
- Stream arrival is unpredictable — you can't afford to run pipelines 24/7
- You need state within a stream, not across streams
- You're already running a service and want to add stateful processing without new infrastructure
Not the right fit:
- Massive, continuous, high-throughput pipelines (use Flink)
- Exactly-once delivery guarantees and replay (use Kafka Streams)
- Aggregating state across millions of streams simultaneously (use an analytics database)
- Truly serverless low-volume workloads (Lambda + a state store is simpler)
What's on the Roadmap
Velo is early. The current roadmap includes: a batch API for higher throughput, dedicated worker threads to eliminate asyncio.to_thread overhead, Kafka and Redis Streams adapters, persistent state (checkpoint to disk), and Prometheus/OpenTelemetry metrics export.
Try It
```bash
pip install velo-stream
```
The source, benchmarks, and examples are at github.com/sahilmalik27/velo. If you're working on a problem that fits this pattern — or if you have feedback on the approach — open an issue. This is early and I'd like to hear how people use it.
Paper: The stream functions concept is from arXiv:2603.03089.
Velo is open source under the Apache 2.0 license. Built on tokio, crossbeam, and PyO3.
Thanks for reading. Follow me for more.