Tuenti Voice Control is a proof of concept that lets users browse Tuenti with their voice instead of a mouse or keyboard. It was created for HackMeUp 15, a 24-hour coding competition held among Tuenti engineers every quarter, and uses the Experimental Speech API available in Google Chrome since 2011. Tuenti featured this article on their developer blog.
Ismael Gonzalez and I recorded a video that demonstrates browsing Tuenti tabs, going to specific profiles and starting chats.
After creating a Chrome extension that communicates speech-to-text data to the website, we spent the remaining three hours adding commands related to Tuenti. By the deadline we could:
- Access top-level pages on the site like “mensajes” and “salir”
- Target specific friends with “chat” or “perfil” followed by their name: users can go directly to Jose’s profile by saying “perfil jose”, or be more specific with “perfil jose manuel”
- Dictate text directly into a chat conversation and send the message: saying “chat natalia” opens a chat with Natalia, and any speech that follows, such as “hola natalia como estas”, is written to the conversation
Plugin architecture
Chrome’s Experimental Speech API implements a subset of the features detailed in the W3C Recommendation for Speech Grammar (March 2004) and allows extensions to start speech recognition and retrieve the captured text. To use experimental extension APIs, you must start Chrome with the command line option --enable-experimental-extension-apis.
Google Chrome extensions are composed of HTML pages with specific functions. We use a single content script to capture events from the browser and send requests to a background page:
window.addEventListener("speechstart", function(e) {
    chrome.extension.sendRequest('speechstart', function(response) {
        triggerSimpleEvent('speechstarted');
    });
});
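The snippets in this post call a `triggerSimpleEvent` helper that is not shown. A minimal sketch of what such a helper might look like follows. The optional `target` parameter and the use of the `Event` constructor are assumptions made here for illustration; code written in 2011 would more likely have used `document.createEvent`.

```javascript
// Hypothetical helper: dispatch a payload-free DOM event by name.
// `target` is an assumed optional parameter; in the content script it
// would default to `window`.
function triggerSimpleEvent(name, target) {
    var eventTarget = target || window;
    // The event carries no data; listeners only need to know that it fired.
    eventTarget.dispatchEvent(new Event(name));
}
```

Because the content script and the page share the same DOM, an event dispatched this way can be observed by listeners registered on either side.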
This background page is able to access the experimental API and start speech recognition:
chrome.experimental.speechInput.start({
    // Language tag for Spanish (Spain)
    language: 'es-ES'
}, function () {
    if (chrome.extension.lastError) {
        console.debug("Couldn't start speech input: "
            + chrome.extension.lastError.message);
    }
});
The background page then communicates the result to the content script via an asynchronous request.
// Target active tab
chrome.tabs.getSelected(null, function (tab) {
    chrome.tabs.sendRequest(tab.id, {
        success: true,
        result: result
    }, function (response) {
        // Handle request callback
    });
});
If recognition has been successful, the content script appends a JSON-serialized version of the speech data array to the DOM and fires a ‘speechresult’ event.
chrome.extension.onRequest.addListener(
    function(request, sender, sendResponse) {
        var voice = document.getElementById('voice');
        voice.setAttribute('success', request.success ? 'true' : '');
        voice.setAttribute('data', JSON.stringify(request.success
            ? request.result.hypotheses : []));
        triggerSimpleEvent('speechresult');
    }
);
Serialization is required because the content script and the underlying website run in different JavaScript contexts, so objects cannot be shared between them.
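On the website side, the serialized result has to be parsed back into objects. The following is a minimal sketch, assuming the page listens for the `speechresult` event and reads the same `voice` element. The function name `readSpeechResult` and the shape of the object it returns are hypothetical.

```javascript
// Hypothetical page-side reader for the attributes written by the content script.
// `voiceElement` is anything with a DOM-style getAttribute method.
function readSpeechResult(voiceElement) {
    // The content script writes 'true' on success and an empty string otherwise.
    var success = voiceElement.getAttribute('success') === 'true';
    var hypotheses = [];
    try {
        // The attribute is a plain string, so it is parsed back into an array.
        hypotheses = JSON.parse(voiceElement.getAttribute('data')) || [];
    } catch (e) {
        // Missing or malformed data is treated as "nothing recognized".
        success = false;
    }
    return { success: success, hypotheses: hypotheses };
}

// Usage in the page (assumes the element with id "voice" exists):
// window.addEventListener('speechresult', function () {
//     var result = readSpeechResult(document.getElementById('voice'));
// });
```

Taking the element as a parameter keeps the parsing logic independent of the DOM, which makes it easy to exercise with a stub object.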
Processing the speech result
The W3C recommendation includes a method for specifying a grammar. This is crucial for achieving high accuracy in speech recognition systems, because error rates fall as the vocabulary shrinks: the digits 0-9 can be recognized without error, but vocabularies of 200, 5,000 or 100,000 words can have error rates of 3%, 7% or 45% respectively. After experimenting, we found that custom grammars were not implemented in Chrome as of December 2011, and that Google returns any sequence of words from its dictionary.
We solved this issue by converting the recognized text to a bag of words and estimating how likely it is that the user wants to perform an action on a friend, based on the number of occurrences of words related to that action/friend pair:
- Words were normalized and used as keys into a hash map that maps each word to the friends who have that name or the commands referenced by that word
- The score of each matching friend or command was then increased by a value weighted by the confidence factor returned by Google and the index of the result
- The action/friend pair with the highest cumulative weight was then performed
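The scoring steps above can be sketched roughly as follows. This is not the code written during the hackathon: the function names, the accent-stripping normalization, the `utterance` and `confidence` field names, and the weighting formula (confidence divided by the result's rank) are all assumptions chosen for illustration.

```javascript
// Lowercase and strip accents; the exact normalization used is an assumption.
function normalizeWord(word) {
    return word.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Build a word -> [target ids] map from a hypothetical list of commands or friends.
function buildIndex(targets) {
    var index = Object.create(null); // no inherited keys such as "constructor"
    targets.forEach(function (target) {
        target.words.forEach(function (word) {
            var key = normalizeWord(word);
            (index[key] = index[key] || []).push(target.id);
        });
    });
    return index;
}

// Accumulate a weight per target across all hypotheses and return the best id,
// or null when no word matched.
function pickBestTarget(hypotheses, index) {
    var scores = Object.create(null);
    hypotheses.forEach(function (hypothesis, position) {
        // Assumed weighting: confidence, discounted for lower-ranked results.
        var weight = hypothesis.confidence / (position + 1);
        hypothesis.utterance.split(/\s+/).forEach(function (word) {
            (index[normalizeWord(word)] || []).forEach(function (id) {
                scores[id] = (scores[id] || 0) + weight;
            });
        });
    });
    var best = null;
    Object.keys(scores).forEach(function (id) {
        if (best === null || scores[id] > scores[best]) { best = id; }
    });
    return best;
}

// Usage with made-up data: commands and friends are scored separately.
var commands = buildIndex([
    { id: 'chat', words: ['chat'] },
    { id: 'profile', words: ['perfil'] }
]);
var friends = buildIndex([
    { id: 'jose', words: ['Jos\u00e9'] },
    { id: 'jose-manuel', words: ['Jos\u00e9', 'Manuel'] }
]);
var hypotheses = [
    { utterance: 'perfil jose manuel', confidence: 0.8 },
    { utterance: 'perfil jose', confidence: 0.4 }
];
pickBestTarget(hypotheses, commands); // 'profile'
pickBestTarget(hypotheses, friends);  // 'jose-manuel'
```

Summing weights over every hypothesis, rather than trusting only the top one, means a name that appears in several lower-ranked results can still contribute to the final choice.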
This approach worked flawlessly when the words in the text returned by Google corresponded to a valid action/friend pair. Speaking clearly and using a high-quality noise-cancelling microphone (Apple MacBook Pro) helps the speech recognizer detect the beginning and end of the command.
Conclusions
Before starting the project we did not know whether it would be possible to start speech recognition, especially with a single key, let alone recognize commands. It was, and we think that such techniques can provide a better web experience. For this to happen, Google and other browser makers must work with the W3C to provide a stable API that any website can use without an extension.

