A ~1.7-million-parameter Conv1D-Transformer hybrid trained on the Google asl-signs Kaggle competition dataset (PopSign 250), exported to ONNX and run on WebAssembly. No model server, no vendor inference API.
The model alternates causal Conv1D blocks and Transformer blocks over MediaPipe Holistic landmarks, and stays under the competition's 40 MB TFLite cap, which turned out to be about the right capacity for landmark-only isolated sign language recognition (ISLR).
The classifier lives at /models/asl-signs/asl-signs.onnx in the public folder. The browser fetches it once, caches it, and runs WASM inference in every meeting.
We load onnxruntime-web@1.20.1 from a CDN and run the WASM execution provider only — no GPU required, no native install. See onnx-session.ts.
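A minimal sketch of that setup, assuming the standard onnxruntime-web API (the constant name MODEL_URL and the memoization are illustrative; the real wiring lives in onnx-session.ts):

```ts
import * as ort from "onnxruntime-web";

// Same path the app serves from its public folder.
const MODEL_URL = "/models/asl-signs/asl-signs.onnx";

let sessionPromise: Promise<ort.InferenceSession> | null = null;

// Create the session once and reuse it across ticks; WASM is the only
// execution provider, so this runs anywhere the browser does.
export function getSession(): Promise<ort.InferenceSession> {
  if (!sessionPromise) {
    sessionPromise = ort.InferenceSession.create(MODEL_URL, {
      executionProviders: ["wasm"],
    });
  }
  return sessionPromise;
}
```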
The classifier ticks every ~500 ms over the most recent 48 MediaPipe frames and emits the top-3 labels with confidences. Defaults live in DEFAULT_CLASSIFIER_CONFIG in mediapipe-onnx-classifier.ts.
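A hypothetical shape for that config, consistent with the numbers in this document (the field names are assumptions; the real fields are in mediapipe-onnx-classifier.ts):

```ts
// Hypothetical field names; check DEFAULT_CLASSIFIER_CONFIG in
// mediapipe-onnx-classifier.ts for the real ones.
export interface ClassifierConfig {
  tickMs: number;       // how often inference runs (~500 ms)
  windowFrames: number; // sliding-window size (48 frames)
  minFrames: number;    // frames buffered before the first tick (8)
  topK: number;         // labels emitted per tick (3)
}

export const DEFAULT_CLASSIFIER_CONFIG: ClassifierConfig = {
  tickMs: 500,
  windowFrames: 48,
  minFrames: 8,
  topK: 3,
};
```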
Both MediaPipe Holistic and the ONNX classifier execute in your browser tab. Only the recognized sign tokens (short label strings paired with confidence floats) ever leave the device on the per-turn data path.
The browser assembles MediaPipe Holistic landmarks into a Float32 tensor, runs the ONNX model, applies softmax, and picks the top-K. Two blocked label indices (giraffe and drop) are zeroed out in production because they over-fire on idle hands.
```
Float32Tensor[T, 543, 3]
  T   = up to 48 frames (sliding window)
  543 = MediaPipe Holistic landmarks
  3   = (x, y, z) per landmark
```
Ticks start once the ring buffer holds at least 8 frames; older frames are dropped as new ones arrive.
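A sketch of the tensor assembly under those dimensions, assuming frames arrive as arrays of {x, y, z} landmarks (the Frame type and function name are illustrative, not the app's actual code):

```ts
import * as ort from "onnxruntime-web";

const NUM_LANDMARKS = 543; // MediaPipe Holistic
const DIMS = 3;            // (x, y, z)

// One frame = 543 landmarks, each with normalized x/y/z coordinates.
type Frame = { x: number; y: number; z: number }[];

// Flatten the sliding window into a Float32 tensor of shape [T, 543, 3].
export function framesToTensor(frames: Frame[]): ort.Tensor {
  const T = frames.length; // up to 48
  const data = new Float32Array(T * NUM_LANDMARKS * DIMS);
  frames.forEach((frame, t) => {
    frame.forEach((lm, i) => {
      const base = (t * NUM_LANDMARKS + i) * DIMS;
      data[base] = lm.x;
      data[base + 1] = lm.y;
      data[base + 2] = lm.z;
    });
  });
  return new ort.Tensor("float32", data, [T, NUM_LANDMARKS, DIMS]);
}
```

Running the session on that tensor yields the logits described next.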
```
Float32[250]   // raw logits
  -> softmax
  -> blocked labels zeroed
  -> top-3 -> ClassifierResult { label: string, score: number /* 0..1 */ }
```

Downstream, the mode controller in @signchat/runtime-browser admits a label as a stable or band token when it stays consistent across ticks.
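The post-processing chain is small enough to sketch in full. This is a hedged reconstruction rather than the shipped code: postprocess and BLOCKED_LABEL_INDICES are illustrative names, and the real implementation lives in mediapipe-onnx-classifier.ts:

```ts
interface ClassifierResult {
  label: string;
  score: number; // 0..1
}

// Indices for the blocked labels ("giraffe" and "drop"); the actual values
// come from the label map shipped with the model.
const BLOCKED_LABEL_INDICES = new Set<number>([/* ... */]);

export function postprocess(
  logits: Float32Array, // Float32[250]
  labels: string[],     // 250 gloss strings
): ClassifierResult[] {
  // Numerically stable softmax over the raw logits.
  const max = Math.max(...logits);
  const exps = Array.from(logits, (v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((v) => v / sum);

  // Zero out the labels that over-fire on idle hands.
  for (const idx of BLOCKED_LABEL_INDICES) probs[idx] = 0;

  // Top-3 by probability.
  return probs
    .map((score, i) => ({ label: labels[i], score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);
}
```

Per tick, the browser feeds the assembled tensor to session.run(...) and passes the resulting logits through this chain; the exact input and output tensor names depend on the exported graph.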
Glosses follow Kaggle's canonical lowercase form. Aliases in src/data/gloss_aliases.expand_aliases map each one to ASL Citizen / WLASL conventions for cross-dataset pretraining.
Trained on the Kaggle Isolated Sign Language Recognition competition dataset — PopSign 250 — for 200 epochs on a single H200 with bf16 + XLA, AdamW, OneCycleLR cosine, and Adversarial Weight Perturbation from epoch 15. Splits are signer-disjoint (13 train / 4 val / 4 held-out) so eval numbers reflect performance on signers the model has never seen.
Nothing about the classifier is hidden. Read the README, audit the configs, run make eval — or train your own checkpoint and swap it into apps/web.
asl-classifier-model/ contains the model, preprocessing, augmentations, and eval:

- configs/base.yaml + configs/pretrain_phase1_kaggle.yaml (200 epochs, AdamW, OneCycleLR cosine, AWP from epoch 15).
- data/splits/kaggle_islr.json (13 train / 4 val / 4 held-out), so eval doesn't leak signers.
- Makefile targets for smoke (5 epochs), full train, and eval; RunPod scripts under scripts/ provision an H200 end-to-end.
- experiments.csv: top-1 / top-5 accuracy, parameter count, inference latency.
- tests/ exercised by make test.

Open the app, allow your camera, and start signing. The classifier loads in the background while you're in the lobby; by the time you join, it's warm.
Try the classifier, or read the training README.