Voice and natural language serve up the UI of the future. Here's how to incorporate them into your applications, without relying on someone else's API
Voice and natural language systems are an important step toward making our digital servants serve us on our terms. We went from punch cards to green screens to GUIs and eventually to touch-based, palm-sized, location- and context-sensitive computers in the form of smartphones (not to mention those annoying smart-car panels). Now we have Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google's Assistant answering our needs.
To build voice and natural language capabilities into your own applications, you have several cloud options. For Alexa, you can tap into an open API at no apparent cost beyond AWS charges; the same goes for Google, although the Google Cloud site is as clear as mud on this point. Microsoft even lets you reuse your Alexa skills package with Cortana. For Apple, there's an API, along with the $99 cost of becoming an Apple Developer and publishing an iOS app.
But why lock yourself into Amazon's or Apple's or anyone else's platform to get these capabilities? Anybody can build their own system to voice-enable their devices, websites, or gadgets today. It's a matter of speech to text, a query parser, a pipeline, a rules engine, and a pluggable architecture with open APIs. (Full disclosure: I work for Lucidworks, a search technology company with a product that covers most of these tasks.)
Speech to text
I remember when I first saw IBM's Windows 95 voice-enabled Aptiva desktop computer that let you control your computer with voice commands. The voice interface was a bit clunky because Windows 95 wasn't really designed with voice commands in mind, but it made a hell of a demo!
These days you have your pick of speech recognition libraries or cloud solutions. You can (and have been able to for a while) embed them into anything. Some packages are even accurate.
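To make that concrete, here is a minimal sketch using the open source SpeechRecognition package with the offline PocketSphinx engine, so nothing leaves your machine. The audio file name is just a placeholder, and you would need to install both packages first.

```python
# Minimal offline speech-to-text sketch (pip install SpeechRecognition pocketsphinx).
# "command.wav" is a placeholder for any 16-bit PCM WAV recording.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("command.wav") as source:
    audio = recognizer.record(source)            # read the entire file

try:
    text = recognizer.recognize_sphinx(audio)    # runs locally, no cloud call
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
```

Swapping the recognizer method lets you trade the local engine for a hosted one later without touching the rest of your code.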
Text to speech
Speech synthesis has existed since we first had sound cards. Heck, I vaguely remember DOS libraries that could do horrible things with the onboard speaker that claimed to be speech. Most modern operating systems from Android to Windows to OS X have built-in APIs for speech synthesis.
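As a quick illustration, the pyttsx3 package wraps those built-in synthesizers (SAPI5 on Windows, NSSpeechSynthesizer on macOS, eSpeak on Linux); the spoken sentence and rate below are arbitrary.

```python
# Text-to-speech sketch using the OS's built-in voices (pip install pyttsx3).
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)            # words per minute; tweak to taste
engine.say("Your query returned twelve results.")
engine.runAndWait()                        # blocks until speech finishes
```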
Query parser
Once speech has become text, most of the real work is done by the query parser. It reduces words to their root forms ("stemming") and groups words into phrases. Query parsers (such as Extended DisMax) have come a long way from even a few years ago.
In the old days, asking even Google a question meant either doing a pure keyword/term search or learning a somewhat byzantine syntax and composing queries like (+"this phrase in the document" AND -"this phrase in the document") OR ("something that may be in the document" AND -"this shouldn't be there"). Now you search for stuff in something as close to "plain English" as possible.
To a large degree, the new query parsers moved the smarts out of the developerโs UI and into the search engine itself.
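For example, handing a plain-English question to Solr's Extended DisMax parser is little more than an HTTP call. In this hedged sketch the host, collection name, and field names are assumptions about your own index.

```python
# Querying Solr's Extended DisMax (edismax) parser. Host, collection ("docs"),
# and field names (title, body) are placeholders for your own setup.
import requests

def search(question):
    params = {
        "q": question,                 # the user's words, as spoken
        "defType": "edismax",          # let eDisMax do the parsing work
        "qf": "title^2 body",          # boost matches in the title
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/docs/select", params=params)
    return resp.json()["response"]["docs"]

print(search("speech recognition libraries in C"))
```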
Pipeline
A lot of work may need to be done on a query before we either pass it to custom plugged-in commands ("skills") or execute a search against our index. Moreover, special results (such as "restaurants in my area") need different processing than run-of-the-mill search results before we return them to the user. To do this appropriately, you need some kind of pipeline for queries coming in and results coming out.
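A pipeline does not need to be fancy. In this minimal sketch each stage is a plain function that takes and returns a context dictionary; the stage names and the "geo" filter are purely illustrative.

```python
# A toy query pipeline: stages are functions applied in order to a context dict.
def lowercase(ctx):
    ctx["query"] = ctx["query"].lower()
    return ctx

def detect_location_intent(ctx):
    if "in my area" in ctx["query"]:
        ctx["filters"]["geo"] = "user_location"   # placeholder filter
    return ctx

PIPELINE = [lowercase, detect_location_intent]

def run_pipeline(query):
    ctx = {"query": query, "filters": {}}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx

print(run_pipeline("Restaurants in my area"))
```

A matching pipeline on the way out can reformat, re-rank, or summarize results before they reach the user.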
Rules and/or domain-specific language
Some items really are a series of if-then-else statements. When someone asks for the "about page," send them to /about.html. When a query contains "weather," call the weather service.
Other items are a sort of "domain," or a combination of a rule and a domain, such as "recipes for tarts containing cherries" or "speech recognition libraries in C." For these, you might map them as searches where title="* tart *", document-type=recipe, ingredients=cherry.
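Here is an illustrative sketch of both ideas: a small rule table for the if-then cases and a hard-coded domain mapping that mirrors the tart example above. The trigger phrases, handler names, and field names are assumptions, not any product's API.

```python
# A toy rule table: trigger phrase -> action. Handlers are placeholders.
RULES = [
    ("about page", lambda q: {"redirect": "/about.html"}),
    ("weather",    lambda q: {"call": "weather_service", "args": {"query": q}}),
]

def apply_rules(query):
    for trigger, action in RULES:
        if trigger in query.lower():
            return action(query)
    return None    # no rule fired; fall through to the normal search path

# A domain-style mapping, hard-coded to mirror the cherry-tart example.
def recipe_domain_query():
    return {"title": "* tart *", "document-type": "recipe", "ingredients": "cherry"}

print(apply_rules("What's the weather today?"))
```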
Tagging/natural language processing
For truly flexible search, you need software that can map unstructured data into reasonable, searchable structures. This means when data is indexed, it should know that when this linked document is parsed, the "entity" mentioned is Google or Alphabet and the document type is an SEC filing of the subtype 10-K. This requires recognizing these items and "tagging" them.
For a human-friendly search, the system needs to recognize parts of speech. "10-K reports about Google" and "10-K reports mentioning Google" are two different matters. This requires parts-of-speech tagging, potentially at index time, but may also require natural language processing at query time.
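A sketch of what that looks like with the spaCy library (assuming the small English model is installed): the same call gives you named entities for tagging at index time and part-of-speech labels for interpreting a query.

```python
# Entity and part-of-speech tagging with spaCy
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("10-K reports mentioning Google")

for ent in doc.ents:
    print(ent.text, ent.label_)        # e.g. Google tagged as an organization

for token in doc:
    print(token.text, token.pos_)      # "mentioning" comes back as a verb
```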
Pluggable architecture
In general, "build a modular architecture" is another way of saying "don't make software that sucks." Change is the only absolute constant. All major vendors have a way to plug new functions into their Alexa-like creation.
With modules you usually get some way of "discovering" the new functionality. This is nothing new. It means having a decent API with a descriptor or metadata explaining how to plug the functionality in and what it does.
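One way to picture it is a toy registry where each "skill" announces itself with a descriptor the host can inspect. The descriptor fields and the weather example below are assumptions for illustration only.

```python
# A toy plug-in registry: each skill registers a descriptor plus a handler.
SKILLS = {}

def register_skill(name, triggers, description):
    def wrapper(func):
        SKILLS[name] = {
            "triggers": triggers,          # phrases that route to this skill
            "description": description,    # human-readable summary
            "handler": func,
        }
        return func
    return wrapper

@register_skill("weather", triggers=["weather", "forecast"],
                description="Answers weather questions via an external service")
def weather_skill(query):
    return f"Looking up the forecast for: {query}"

print(SKILLS["weather"]["description"])
```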
Open APIs
If you're coding in today's world, an "open API" should mean a REST API. You should be prepared to receive JSON over HTTPS and emit the same. You don't know what new stuff the future holds, so build for resiliency.
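In practice that can be as small as the Flask sketch below, which accepts a JSON payload and returns JSON. The route and payload shape are illustrative, and in production you would put TLS termination in front of it for the HTTPS part.

```python
# A bare-bones JSON-in, JSON-out endpoint (pip install flask).
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/query", methods=["POST"])
def query():
    payload = request.get_json(force=True)
    question = payload.get("q", "")
    # ...hand the question to the pipeline and rules sketched above...
    return jsonify({"query": question, "results": []})

if __name__ == "__main__":
    app.run(port=8080)
```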
Why roll your own?
Maybe your homegrown Alexa is all behind your firewall. Maybe itโs a limited-function, site-specific system to provide a new way for e-commerce customers to find what they need on your site or at an in-store kiosk. Maybe your assistant is more of a shop floor device to locate manufacturing equipment. One can always name a reason to roll oneโs own.
Whether you're doing it yourself or plugging into the new world of cloud-based personal assistants, you have decades of libraries and expertise to build on. Alexa and her ilk were inevitable, and building your own version is well within reach.


