The Signchat pipeline is seven stages, three packages, and zero relay servers. Camera frames stay on the device; tokens and audio go directly from the browser to OpenRouter and ElevenLabs. Target: sign-end to first audible byte at p50 ~0.6 s, p95 ~0.9 s.
Each stage is a small file you can read in one sitting. The whole turn fits inside a single browser tab — there is no Signchat backend on the per-turn path.
The signer's webcam stream is sent to MediaPipe Tasks Vision (face, gesture, pose) and reduced to per-frame landmarks. Loaders live in the small @signchat/sign-pipeline package; the per-frame runner is in @signchat/runtime-browser.
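For flavor, a minimal sketch of wiring one Tasks Vision landmarker (the WASM CDN path and model asset path are placeholders; the face and gesture tasks follow the same createFromOptions pattern):

```ts
import { FilesetResolver, PoseLandmarker } from "@mediapipe/tasks-vision";

// Resolve the WASM backend once, then build each landmarker from it.
const vision = await FilesetResolver.forVisionTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm"
);
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: "/models/pose_landmarker_lite.task" }, // placeholder path
  runningMode: "VIDEO",
  numPoses: 1,
});

// Per frame: one synchronous call yields normalized landmarks for that frame.
function landmarksFor(video: HTMLVideoElement): number[] {
  const result = pose.detectForVideo(video, performance.now());
  return (result.landmarks[0] ?? []).flatMap(p => [p.x, p.y, p.z]);
}
```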
A 48-frame ring of landmarks is fed into the ONNX classifier every ~500 ms. Softmax over 250 classes; two known-noisy labels are zeroed; the top-3 are emitted as ClassifierResults.
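A sketch of that tick under assumed dimensions: the per-frame feature width and the two zeroed label indices are placeholders, while the onnxruntime-web session API is the library's real one.

```ts
import * as ort from "onnxruntime-web";

const FRAMES = 48;
const FEATS = 128;              // per-frame feature width: an assumption
const NOISY_LABELS = [17, 203]; // the two zeroed label indices: placeholders

const ring = new Float32Array(FRAMES * FEATS);
let head = 0;
export function pushFrame(frame: Float32Array) {
  ring.set(frame, head * FEATS);
  head = (head + 1) % FRAMES;
}

// Runs every ~500 ms: softmax over 250 classes, zero the noisy labels, keep top-3.
export async function tick(session: ort.InferenceSession) {
  // Unroll the ring so the tensor reads oldest frame -> newest.
  const x = new Float32Array(FRAMES * FEATS);
  for (let i = 0; i < FRAMES; i++) {
    const j = (head + i) % FRAMES;
    x.set(ring.subarray(j * FEATS, (j + 1) * FEATS), i * FEATS);
  }
  const feeds = { [session.inputNames[0]]: new ort.Tensor("float32", x, [1, FRAMES, FEATS]) };
  const out = await session.run(feeds);
  const logits = out[session.outputNames[0]].data as Float32Array; // length 250

  const max = Math.max(...logits);
  const exps = Array.from(logits, v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e, i) => (NOISY_LABELS.includes(i) ? 0 : e / sum));

  return probs
    .map((p, label) => ({ label, p }))
    .sort((a, b) => b.p - a.p)
    .slice(0, 3); // the top-3 emitted as ClassifierResults
}
```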
A recognized label enters the SignBuffer only if it is top-1 for STABILITY_TICKS consecutive ticks (stable) or is a top-1 backed by a credible top-2 contender (band). This is the dropout filter against jittery single-tick predictions.
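Because admitToken is pure, it can look roughly like the following; the STABILITY_TICKS value and the band thresholds are assumptions, since only the two admission paths are documented:

```ts
type Tick = { label: string; p1: number; p2: number }; // top-1 label, top-1 and top-2 probs

const STABILITY_TICKS = 3; // actual value lives in the package
const BAND_TOP1 = 0.55;    // band thresholds are placeholders
const BAND_TOP2 = 0.25;

// Pure: same inputs, same answer. No hidden state, so it is trivially testable.
export function admitToken(history: readonly Tick[]): string | null {
  if (history.length < STABILITY_TICKS) return null;
  const recent = history.slice(-STABILITY_TICKS);
  const latest = recent[recent.length - 1];

  // Stable path: the same label has been top-1 for STABILITY_TICKS ticks.
  if (recent.every(t => t.label === latest.label)) return latest.label;

  // Band path: a confident top-1 with a credible top-2 contender
  // (the exact rule is assumed here).
  if (latest.p1 >= BAND_TOP1 && latest.p2 >= BAND_TOP2) return latest.label;

  return null;
}
```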
When the signer pauses, the buffered tokens plus recent hearing captions become a structured prompt. The browser POSTs directly to OpenRouter (no Vercel relay) with a JSON-schema response_format that constrains the model to { sentence, confidence, … }.
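A sketch of that POST, using OpenRouter's OpenAI-compatible structured-output shape; the model id, the prompt formatting, and any schema fields beyond sentence and confidence are placeholders:

```ts
const LEAN_OPTIONS_SYSTEM = "..."; // frozen system prompt, shipped in the prompts package

export async function reconstruct(tokens: string[], captions: string[], apiKey: string) {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "openai/gpt-4o-mini", // placeholder model id
      messages: [
        { role: "system", content: LEAN_OPTIONS_SYSTEM },
        { role: "user", content: `tokens: ${tokens.join(" ")}\ncaptions:\n${captions.join("\n")}` },
      ],
      // Constrains the completion to the documented { sentence, confidence, ... } shape.
      response_format: {
        type: "json_schema",
        json_schema: {
          name: "reconstruction",
          strict: true,
          schema: {
            type: "object",
            properties: { sentence: { type: "string" }, confidence: { type: "number" } },
            required: ["sentence", "confidence"],
            additionalProperties: false,
          },
        },
      },
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter ${res.status}`);
  const body = await res.json();
  return body.choices[0].message.content as string; // JSON text, parsed in the next stage
}
```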
The model's response is parsed and validated against a Zod schema. Any malformed payload throws a ReconstructionParseError — there's no "fallback voice"; an error stays loud so the signer knows the turn failed.
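The parse step might look like this, modeling only the two documented fields (extra fields in the real schema are elided):

```ts
import { z } from "zod";

export class ReconstructionParseError extends Error {}

const Reconstruction = z.object({
  sentence: z.string().min(1),
  confidence: z.number().min(0).max(1),
});

export function parseReconstruction(raw: string) {
  let json: unknown;
  try {
    json = JSON.parse(raw);
  } catch {
    throw new ReconstructionParseError("model returned non-JSON");
  }
  const result = Reconstruction.safeParse(json);
  if (!result.success) throw new ReconstructionParseError(result.error.message);
  return result.data; // typed { sentence, confidence }
}
```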
The reconstructed sentence enters the mode controller's preview state. In auto mode it advances to speaking after a configurable silence; in proofread mode the signer must explicitly Approve, Edit, Re-sign, or Discard.
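The documented transitions reduce to a small pure function; the state and event names here are illustrative, not the package's actual identifiers:

```ts
type Mode = "auto" | "proofread";
type State = "idle" | "preview" | "speaking";
type Event =
  | { type: "RECONSTRUCTED" }   // a parsed sentence arrived
  | { type: "SILENCE_ELAPSED" } // auto-mode silence timer fired
  | { type: "APPROVE" }
  | { type: "EDIT" }
  | { type: "RESIGN" }
  | { type: "DISCARD" }
  | { type: "TTS_DONE" };

export function transition(state: State, mode: Mode, ev: Event): State {
  switch (state) {
    case "idle":
      return ev.type === "RECONSTRUCTED" ? "preview" : state;
    case "preview":
      if (mode === "auto" && ev.type === "SILENCE_ELAPSED") return "speaking";
      if (ev.type === "APPROVE" || ev.type === "EDIT") return "speaking";
      if (ev.type === "RESIGN" || ev.type === "DISCARD") return "idle";
      return state;
    case "speaking":
      return ev.type === "TTS_DONE" ? "idle" : state;
  }
}
```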
The approved sentence is streamed sentence-at-a-time to ElevenLabs Flash v2.5 over a WebSocket. The returned 24 kHz PCM is decoded, mixed with the user's mic in Web Audio, and published as a single LiveKit signchat-voice track that the hearing peer subscribes to like any other call audio.
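On the receiving end, decoding and gapless scheduling of the PCM chunks could look like this (16-bit mono at 24 kHz is assumed from the output format; the WebSocket handshake and the LiveKit publish call are elided):

```ts
const ctx = new AudioContext({ sampleRate: 24_000 });
const mixBus = ctx.createMediaStreamDestination(); // the mic source also connects here
let nextStart = 0; // running schedule cursor for gapless playback

// Each ElevenLabs message carries a base64 chunk of 16-bit little-endian PCM.
export function playPcmChunk(b64: string) {
  const bytes = Uint8Array.from(atob(b64), c => c.charCodeAt(0));
  const pcm = new Int16Array(bytes.buffer, 0, bytes.byteLength >> 1);

  const buf = ctx.createBuffer(1, pcm.length, 24_000);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768; // int16 -> [-1, 1)

  const src = ctx.createBufferSource();
  src.buffer = buf;
  src.connect(mixBus);
  const when = Math.max(nextStart, ctx.currentTime);
  src.start(when); // butt chunks end-to-end
  nextStart = when + buf.duration;
}

// mixBus.stream.getAudioTracks()[0] is the audio track that gets published
// to LiveKit as "signchat-voice".
```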
The pipeline is split across three workspace packages so the heavy bits (network, audio, FSM) can be reused by the Bridge Electron app without dragging the web UI along.
@signchat/sign-pipeline (vision + ONNX + admit): a tiny shared package for loading MediaPipe and onnxruntime-web, fetching the vocabulary, and the pure admitToken function used by the production mode controller.
A prompts package holding the frozen LEAN_OPTIONS_SYSTEM prompt, the request builder that formats top-K tokens and dialog history, and the Zod-backed response parser.
@signchat/runtime-browser: the OpenRouter HTTP client, ElevenLabs WSS streaming, the mode-controller finite state machine, sign-classifier orchestration, and the Web Audio graph that publishes the signchat-voice track.
LiveKit Cloud, OpenRouter, and ElevenLabs handle every per-turn data path. Vercel only mints credentials; it is not on the per-turn path.
The latest in-repo sweep (prompt-tester-service/RESULTS.md) covers 10 models × 399 scenarios = 3,990 OpenRouter calls. The numbers below measure the model call only, not the full sign-end → audible-byte path; add roughly 300–500 ms for ElevenLabs TTS and audio mixing.

Each dot is one model across 399 cases; up and to the left is best. Two models (claude-haiku-latest, command-a) have high timeout rates and are excluded from real-time use. Full methodology is in RESULTS.md.
Open a call and toggle the Debug pane to see classifier ticks, admitted tokens, OpenRouter latency, and TTS bytes in real time.
Open a call, or browse the packages →