I started ThunderTalk because every voice-to-text app I tried fell into one of two camps: cloud-only (great accuracy, but my microphone audio is now a third party's training set), or "local" but quietly phoning home for the model. I wanted something I could read the source of, run offline on a plane, and trust with a confidential meeting.
Why local ASR is finally tractable
Two years ago, running a real ASR model locally on a laptop meant either Whisper-tiny (fast, mediocre) or Whisper-large (good, but a 10-second clip took 8 seconds of CPU). Apple Silicon and the MLX framework changed the arithmetic:
- Qwen3-ASR-0.6B (MLX fp16) runs at RTF ≈ 0.05 on an M3 — 20× faster than the audio you spoke.
- The model is 1.2 GB (0.6B parameters at 2 bytes each in fp16). It fits comfortably in the unified memory of any modern Mac, and the first-token latency after warm-up is ~200 ms.
- No GPU compromise. MLX targets the unified memory architecture directly, so I'm not fighting CUDA wrappers or Metal shims.
The practical consequence: a hotkey-triggered "press to dictate" workflow that feels indistinguishable from a cloud service, except it works on a plane and it's free.
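The numbers above are simple arithmetic, and it can help to see them worked out. This is a minimal sketch: the function names are mine, and the 0.5-second processing time is not a measurement, just the value implied by RTF ≈ 0.05 on a 10-second clip.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (10^9 bytes); fp16 stores 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9


# Whisper-large on CPU, from the text: a 10-second clip took 8 seconds.
whisper_rtf = real_time_factor(8.0, 10.0)

# Illustrative: a 10-second clip at RTF 0.05 would take about 0.5 seconds.
qwen_rtf = real_time_factor(0.5, 10.0)
speedup = 1 / qwen_rtf  # how many times faster than real time

# 0.6 billion parameters stored as fp16.
size_gb = weights_size_gb(0.6e9)

print(f"Whisper-large RTF: {whisper_rtf:.2f}")
print(f"Qwen3-ASR RTF: {qwen_rtf:.2f} ({speedup:.0f}x real time)")
print(f"fp16 weights: {size_gb:.1f} GB")
```

Anything with an RTF well below 1 can keep up with live dictation; the gap between 0.8 and 0.05 is the difference between waiting on the transcript and not noticing it.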
What I underestimated
The model wasn't the hard part. The hard parts were:
- macOS permissions theater. Microphone, Accessibility, and Input Monitoring are three separate consent flows, and the app gets re-quarantined every time you move it between folders.
- Global hotkey capture. PySide6 doesn't expose system-wide hotkeys; I ended up writing a thin Objective-C bridge over CGEventTap.
- Hotwords. Out-of-the-box ASR mangles every product name and acronym in your team's vocabulary. The Qwen3 hotwords interface is well-designed but undocumented; I wrote an integration test suite just to characterize what it would and wouldn't accept.
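To give a sense of what "characterizing" an undocumented hotword interface can look like, here is a rough sketch of a probe harness. It is not the ThunderTalk test suite and it does not call Qwen3-ASR: the `transcribe(audio_path, hotwords)` signature, `ProbeResult`, `probe_hotwords`, and the `fake_transcribe` stand-in are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical recognizer signature: (audio_path, hotwords) -> transcript.
Transcriber = Callable[[str, Sequence[str]], str]


@dataclass
class ProbeResult:
    hotwords: tuple
    accepted: bool    # False if the recognizer raised on this input
    recovered: tuple  # hotwords that appear (case-insensitively) in the transcript
    error: str = ""


def probe_hotwords(transcribe: Transcriber, audio_path: str,
                   cases: Sequence[Sequence[str]]) -> list:
    """Run each hotword list through the recognizer and record what happened."""
    results = []
    for case in cases:
        words = tuple(case)
        try:
            text = transcribe(audio_path, words)
        except Exception as exc:  # record the rejection instead of aborting the sweep
            results.append(ProbeResult(words, False, (), repr(exc)))
            continue
        found = tuple(w for w in words if w.lower() in text.lower())
        results.append(ProbeResult(words, True, found))
    return results


def fake_transcribe(audio_path: str, hotwords: Sequence[str]) -> str:
    """Stand-in recognizer for the demo; NOT the Qwen3-ASR interface."""
    if any(not w.strip() for w in hotwords):
        raise ValueError("empty hotword")
    if "ThunderTalk" in hotwords:
        return "we shipped ThunderTalk today"
    return "we shipped thunder talk today"


results = probe_hotwords(
    fake_transcribe,
    "clip.wav",  # placeholder path; the fake recognizer never opens it
    [[], ["ThunderTalk"], ["ThunderTalk", ""]],
)
for r in results:
    print(r)
```

Swapping `fake_transcribe` for a real recognizer turns the same loop into a table of which inputs are accepted, which are rejected, and which actually change the transcript.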
If you try it, let me know what breaks. The repo's at realAllenSong/ThunderTalk.