Clippy – screen-aware voice AI in the browser

A friend and I built a browser prototype that answers questions about whatever’s on your screen using getDisplayMedia, client-side wake-word detection, and server-side multimodal inference.

Hard parts:
– Getting the model to point to specific UI elements
– Keeping it coherent across multi-step workflows (“Help me create a sword in Tinkercad”)
– Preventing the infinite mirror effect and confusion between window vs full-screen sharing
– Keeping voice → screenshot → inference → voice latency low enough to feel conversational

We packaged it as “Clippy” for fun, but the real experiment is letting a model tool-call fresh screenshots to help it gather more context.

One practical use case is remote tech support: I'm sending this to my mom next time she calls instead of screen sharing. Curious what breaks.
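To make the "tool-call fresh screenshots" idea concrete, here is a minimal sketch of what such a loop might look like. The post does not describe the real protocol, so the `ModelReply` shape, the `take_screenshot` tool name, the `runTurn` function, and the `maxShots` cap are all assumptions made for illustration. Capture and inference are injected so the loop can run outside a browser.

```typescript
// Hypothetical shapes; the real prototype's message format is not documented.
type Screenshot = { dataUrl: string; takenAt: number };

type ModelReply =
  | { kind: "tool_call"; tool: "take_screenshot" } // model asks for a fresh frame
  | { kind: "answer"; text: string };              // model is ready to speak

interface TurnDeps {
  // In a browser this might draw a getDisplayMedia video frame onto a canvas.
  capture: () => Promise<Screenshot>;
  // Stand-in for server-side multimodal inference over all screenshots so far.
  infer: (question: string, shots: Screenshot[]) => Promise<ModelReply>;
}

// Runs one voice turn: start with one screenshot, then let the model request
// more until it answers or the screenshot budget (an assumed cap) is spent.
async function runTurn(
  question: string,
  deps: TurnDeps,
  maxShots = 3
): Promise<string> {
  const shots: Screenshot[] = [await deps.capture()];
  for (;;) {
    const reply = await deps.infer(question, shots);
    if (reply.kind === "answer") return reply.text;
    if (shots.length >= maxShots) {
      // Budget exhausted: stop looping so latency stays bounded.
      return "I could not get enough context from the screen.";
    }
    shots.push(await deps.capture());
  }
}
```

Capping the number of screenshots per turn is one way to bound the voice → screenshot → inference → voice round trip. Whether the real prototype does this is not stated.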

Best for

Developers, product teams, and technical founders.

Key features

  • Answers questions about whatever’s on your screen, using getDisplayMedia, client-side wake-word detection, and server-side multimodal inference.
  • Lets the model tool-call fresh screenshots to gather more context.
  • Hard parts: pointing to specific UI elements; staying coherent across multi-step workflows (“Help me create a sword in Tinkercad”); preventing the infinite mirror effect and confusion between window vs full-screen sharing; keeping voice → screenshot → inference → voice latency low enough to feel conversational.
  • Practical use case: remote tech support in place of screen sharing.
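One way the window vs full-screen confusion might be handled is to inspect the `displaySurface` setting that browsers report on a captured video track (`"monitor"`, `"window"`, or `"browser"`) and adjust behavior accordingly. The sketch below is an assumption, not the prototype's actual approach: the `sharePolicy` function, the `SharePolicy` shape, and the prompt notes are hypothetical. The settings type is declared locally so the code runs outside a browser.

```typescript
// Local subset of MediaTrackSettings so this sketch needs no DOM typings.
type SurfaceSettings = { displaySurface?: string };

type SharePolicy = {
  surface: "monitor" | "window" | "browser" | "unknown";
  showPreview: boolean; // hypothetical: render a live preview of the capture?
  note: string;         // hypothetical hint to include in the model prompt
};

// Maps the reported capture surface to a conservative policy. Hiding the
// preview for full-screen shares avoids the app capturing its own preview.
function sharePolicy(settings: SurfaceSettings): SharePolicy {
  switch (settings.displaySurface) {
    case "monitor":
      return {
        surface: "monitor",
        showPreview: false,
        note: "Full screen shared; the assistant's own tab may be visible.",
      };
    case "window":
      return {
        surface: "window",
        showPreview: true,
        note: "Single window shared; other apps are not visible.",
      };
    case "browser":
      return {
        surface: "browser",
        showPreview: true,
        note: "Single browser tab shared.",
      };
    default:
      // Not every browser reports displaySurface; assume the riskiest case.
      return {
        surface: "unknown",
        showPreview: false,
        note: "Share type unknown; treat as full screen.",
      };
  }
}

// Browser usage (not executed here):
//   const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
//   const policy = sharePolicy(stream.getVideoTracks()[0].getSettings());
```

Passing the note to the model is a guess at how the window vs full-screen distinction could be surfaced. Chromium-based browsers also accept a `selfBrowserSurface: "exclude"` hint in the `getDisplayMedia` options, which may help keep the app's own tab out of the picker.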

Use cases

  • Review original launch sources before making adoption decisions.
  • Track community momentum from Product Hunt, GitHub, and Hacker News.

Original sources