Clippy – screen-aware voice AI in the browser

✨ AI summary

A friend and I built a browser prototype that answers questions about whatever’s on your screen using getDisplayMedia, client-side wake-word detection, and server-side multimodal inference.

Hard parts:
- Getting the model to point to specific UI elements
- Keeping it coherent across multi-step workflows (“Help me create a sword in Tinkercad”)
- Preventing the infinite mirror effect and confusion between window vs full-screen sharing
- Keeping voice → screenshot → inference → voice latency low enough to feel conversational

We packaged it as “Clippy” for fun, but the real experiment is letting a model tool-call fresh screenshots to help it gather more context. One practical use case is remote tech support: I'm sending this to my mom next time she calls instead of screen sharing. Curious what breaks.
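The post does not show how the screenshot tool call is wired up, so the following is only a minimal sketch of what such a loop might look like. The message shapes, the `take_screenshot` tool name, the `Deps` interface, and the `maxScreenshots` cap are all assumptions made for illustration; none of them come from the project or from a specific model SDK.

```typescript
// Hypothetical shapes for this sketch; not taken from any real SDK.
type ModelTurn =
  | { kind: "answer"; text: string }
  | { kind: "tool_call"; name: "take_screenshot" };

type Message =
  | { role: "user"; text: string }
  | { role: "tool"; imageDataUrl: string };

// Capture and inference are injected, so the loop does not depend on a
// particular browser API or inference backend.
interface Deps {
  captureScreenshot: () => Promise<string>; // resolves to an image data URL
  callModel: (history: Message[]) => Promise<ModelTurn>;
}

async function answerWithFreshScreenshots(
  question: string,
  deps: Deps,
  maxScreenshots = 3, // assumed cap so a confused model cannot loop forever
): Promise<string> {
  const history: Message[] = [{ role: "user", text: question }];
  let used = 0;
  while (true) {
    const turn = await deps.callModel(history);
    if (turn.kind === "answer") return turn.text;
    // The model asked for another look at the screen.
    if (used >= maxScreenshots) {
      return "I could not gather enough context from the screen.";
    }
    const imageDataUrl = await deps.captureScreenshot();
    history.push({ role: "tool", imageDataUrl });
    used++;
  }
}
```

In a browser, `captureScreenshot` would typically draw a frame from the getDisplayMedia stream onto a canvas, and `callModel` would send the history to the server. Capping the number of screenshots per question is one way to keep the voice → screenshot → inference → voice round trip bounded.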

Best suited for

Developers, product teams, and technical founders.

Key features

- Screen capture in the browser via getDisplayMedia
- Client-side wake-word detection
- Server-side multimodal inference
- Model-initiated tool calls for fresh screenshots when it needs more context
- A voice → screenshot → inference → voice loop meant to feel conversational
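The summary names the infinite mirror effect and the window vs full-screen confusion as hard parts without saying how they are handled. One possible approach, sketched here under assumptions, is to read the `displaySurface` setting that browsers report on the captured video track and adjust the assistant's behavior accordingly. The `adviseOnSurface` function, the `ShareAdvice` shape, and the hint texts are hypothetical.

```typescript
// Values a browser may report in MediaTrackSettings.displaySurface.
type DisplaySurface = "monitor" | "window" | "browser";

// Hypothetical result shape for this sketch.
interface ShareAdvice {
  mirrorRisk: boolean; // true when the assistant's own tab is likely in frame
  hint: string; // short message the assistant could speak or display
}

function adviseOnSurface(surface: DisplaySurface | undefined): ShareAdvice {
  switch (surface) {
    case "monitor":
      // A whole-screen share usually includes the tab running the assistant,
      // which produces the recursive "mirror" image.
      return {
        mirrorRisk: true,
        hint: "Whole screen shared: keep the assistant's tab out of view.",
      };
    case "window":
    case "browser":
      // Heuristic assumption: the user picked a window or tab other than
      // the assistant's own. The setting alone cannot confirm that.
      return {
        mirrorRisk: false,
        hint: "Only the shared window or tab is visible to the assistant.",
      };
    default:
      // Some browsers do not report the setting at all.
      return {
        mirrorRisk: false,
        hint: "Unknown share type: ask the user what they shared.",
      };
  }
}

// In the browser (not executed in this sketch), the value would come from:
//   const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
//   const surface = stream.getVideoTracks()[0].getSettings().displaySurface;
```

The hint could also be passed to the model as context, so it knows whether content outside the shared window is invisible to it.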

Use cases

Review original launch sources before making adoption decisions.
Track community momentum from Product Hunt, GitHub, and Hacker News.

Original sources

Hacker News discussion