We examine architectures for mobile speech applications. These use speech engines for synthesizing audio output and for recognizing audio input; a key architec-tural decision is whether to embed these speech engines on the mobile device or to locate them in the network. While both approaches have advantages, our focus here is on net-worked speech application architectures. Because user experience with speech is greatly improved when the speech modality is coupled with a visual modality, mobile speech applications will increasingly tend to be multimodal, so speech architectures therefore must support multimodal user interaction. Good architectures must reflect commercial reality and be economical, efficient, robust, reliable, and scalable. They must leverage existing commercial ecosystems if possible, and we contend that speech and multimodal applications must build on both the web model of application development and deploy-ment, and the large ecosystem that has grown up around the W3C’s web speech standards.
More...