Building an Alexa Skill
“Hey Alexa! How does an Alexa skill work?”
Alexa, Amazon’s cloud-based voice service, powers voice experiences on millions of devices, including Amazon Echo, Echo Dot, Amazon Tap, and Fire TV devices. Alexa provides capabilities, or Skills, that enable customers to interact with devices in a more intuitive way using voice. Examples of Skills include the ability to play music, answer general questions, set an alarm or timer, and more. The Alexa Skills Kit (ASK) is a collection of self-service APIs, tools, documentation, and code samples that makes it fast and easy for you to add Skills to Alexa.
Alexa Skill:
An Alexa Skill consists of two main components: a Skill service and a Skill interface. As developers, we code the Skill service and configure the Skill interface via Amazon’s Alexa Skills developer portal. The interaction between the code in the Skill service and the configuration in the Skill interface results in a working Skill. Let’s try to understand each module in depth.
source : https://image.slidesharecdn.com/introductiontobuildingalexaskillsandputtingyouramazonechotowork-160912022624/95/introduction-to-building-alexa-skills-and-putting-your-amazon-echo-to-work-10-638.jpg?cb=1473647463
The Skill service is the first component in creating a Skill. The Skill service lives in the cloud and hosts the code we write, which receives JSON payloads from Alexa. The Skill service acts as the business logic for the Skill: it determines what actions to take in response to a user’s speech. The Skill service layer manages HTTP requests, user accounts, information processing, sessions, and database access. All of these are configured in the Skill service.
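As a sketch of what such a JSON payload looks like, here is a trimmed-down intent request of the kind Alexa sends to the Skill service (most fields are omitted for brevity, and the identifiers are placeholders; the intent name matches the Greeter example used later):

```json
{
  "version": "1.0",
  "session": {
    "new": false,
    "sessionId": "amzn1.echo-api.session.example"
  },
  "request": {
    "type": "IntentRequest",
    "requestId": "amzn1.echo-api.request.example",
    "intent": {
      "name": "HelloAlexaIntent",
      "slots": {}
    }
  }
}
```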
Let us now look at how to use the Alexa Skills Kit (ASK) to build an Alexa Skill, a voice-driven application for Alexa. This article explains how to build and deploy a basic Skill to Amazon Web Services (AWS). This Skill, called the Greeter Skill, says “Hello” to users when they invoke it using the words that we specify. It responds to users’ words with a greeting on any Amazon Echo or Alexa-enabled device.
In the Greeter Skill we are building, this is where the response “Hello” is generated and returned to the Alexa-enabled device. A Skill service can be implemented in any language that can be hosted on an HTTPS server and return JSON responses. We will implement the Skill in Node.js running on AWS Lambda, Amazon’s serverless compute platform.
For the HTTPS server, AWS Lambda is a good option because it can be a trusted event source, allowing the Alexa service to communicate securely with AWS Lambda automatically. It is possible to use your own HTTPS server, but doing so requires additional configuration to enable SSL and a signed digital certificate. No such configuration is required with AWS Lambda.
Node.js is a great option because it uses JavaScript, which is a supported language on AWS Lambda, and both Node.js and JavaScript have very active developer communities. They are also convenient to develop in and debug.
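To make this concrete, here is a minimal sketch of what a Lambda-hosted Skill service could look like in plain Node.js, without any SDK. The helper name `buildResponse` and the fallback message are our own illustrative choices; the request and response JSON shapes follow the Alexa Skills Kit request/response format:

```javascript
// Minimal sketch of a Lambda-style handler for the Greeter skill.
// No SDK is used: we inspect the raw Alexa request JSON and return
// a raw Alexa response JSON via the Lambda callback.

var SPEECH_OUTPUT = 'Hello';

function buildResponse(speechText) {
    // Alexa expects this JSON envelope back from the skill service
    return {
        version: '1.0',
        response: {
            outputSpeech: { type: 'PlainText', text: speechText },
            shouldEndSession: true
        }
    };
}

function handler(event, context, callback) {
    var requestType = event.request.type;
    // Greet on launch ("Alexa, open Greeter") and on the hello intent
    if (requestType === 'LaunchRequest' ||
        (requestType === 'IntentRequest' &&
         event.request.intent.name === 'HelloAlexaIntent')) {
        callback(null, buildResponse(SPEECH_OUTPUT));
    } else {
        callback(null, buildResponse('Sorry, I did not understand.'));
    }
}
```

Deployed to Lambda, `handler` would be set as the function's entry point; the Alexa service invokes it with the request JSON as `event`.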
Source: https://image.slidesharecdn.com/recap-creatingvoiceexperienceswithalexaskills-hk-revised-170214135009/95/creating-iot-solutions-with-serverless-architecture-alexa-16-638.jpg?cb=1487080388
Skill Service:
A Skill service implements event handlers. These event handler methods define how the Skill behaves when the user triggers an event by speaking to an Alexa-enabled device.
We define event handlers on the Skill service to handle particular events, such as the onLaunch event.
var helloAlexaResponseFunction = function (intent, session, response) {
    // Speak the greeting and end the session
    response.tell(SPEECH_OUTPUT);
};

GreeterService.prototype.eventHandlers.onLaunch = helloAlexaResponseFunction;
The onLaunch event is sent to the Greeter Skill service when the Skill is first launched by the user. Users trigger this by saying “Alexa, open Greeter” or “Alexa, start Greeter”. Another type of handler a Skill service can implement is an intent handler.
var helloAlexaResponseFunction = function (intent, session, response) {
    // Speak the greeting and end the session
    response.tell(SPEECH_OUTPUT);
};

GreeterService.prototype.intentHandlers = {
    "HelloAlexaIntent": helloAlexaResponseFunction
};
An intent is a type of event; it indicates something the user would like to do. In the basic Greeter Skill, we have just one intent, saying “Hello”, which we call HelloAlexaIntent.
Intent handlers map to the features or interactions a Skill offers. A Skill service can have many intent handlers, each reacting to a different intent triggered by different spoken words that we, the developers, specify.
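Here is a self-contained sketch of how a Skill service might register and dispatch among several intent handlers. The GoodbyeAlexaIntent and the dispatch helper `onIntent` are illustrative additions of ours, not part of the Greeter Skill as described:

```javascript
// Sketch: a skill service with several intent handlers, one per feature.
function GreeterService() {}

GreeterService.prototype.intentHandlers = {
    HelloAlexaIntent: function (intent, session, response) {
        response.tell('Hello');
    },
    // Hypothetical second intent, to show the one-handler-per-intent pattern
    GoodbyeAlexaIntent: function (intent, session, response) {
        response.tell('Goodbye');
    }
};

// Dispatch: look up the handler by the intent name carried in the event
GreeterService.prototype.onIntent = function (intent, session, response) {
    var handler = this.intentHandlers[intent.name];
    if (handler) {
        handler(intent, session, response);
    } else {
        response.tell('Sorry, I do not know that one.');
    }
};
```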
Skill Interface:
The Skill interface configuration is the second part of creating a Skill. It is where we specify the words that trigger the intents of the Skill service defined above.
The Skill interface is responsible for processing users’ spoken words. It handles the translation from the user’s audio to events the Skill service can handle, and it sends those events so the Skill service can do its work. The Skill interface is also where we specify what a Skill is called, so users can invoke it by name when talking to an Alexa-enabled device,
e.g. “Alexa, ask Greeter to say Hello”.
This name, which users address the Skill by, is called the Skill invocation name. For example, we are naming our Skill Greeter.
Within the Skill interface, we define the Skill’s interaction model. The interaction model is what trains the Skill interface so that it knows how to listen to the user’s spoken words and resolve them into specific intent events. You define the words that should map to particular intent names by providing a list of sample utterances in the interaction model. A sample utterance is a string that represents a possible way a user may talk to the Skill. These utterances are used to generate a natural language understanding model that resolves users’ voice input to the Skill’s intents.
We also declare an intent schema in the interaction model. The intent schema tells the Skill interface what intents the Skill service implements. Once we provide the sample utterances, the Skill interface can resolve the user’s spoken words to the specific events the Skill service can handle; an example is the HelloAlexaIntent event in the Skill we are going to build.
We provide both the sample utterances and the intent schema in the Alexa Skill interface. When defining the sample utterances, consider the variety of ways a user might ask for an intent. A user might say, “Alexa, ask Greeter to say Hello”, or “Alexa, ask Greeter to say Hi”. Providing a comprehensive list of sample utterances to the interaction model is important for a smooth user experience, because it increases the chances of a match.
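For the Greeter Skill, the intent schema and sample utterances entered in the Skill interface might look like the following (the exact utterance wordings are illustrative):

```json
{
  "intents": [
    { "intent": "HelloAlexaIntent" }
  ]
}
```

Each line of the sample utterances pairs an intent name with one way of phrasing it:

```
HelloAlexaIntent say hello
HelloAlexaIntent say hi
HelloAlexaIntent to say hello
```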
Having set up the Skill interface with sample utterances that recognize voice patterns matching our Skill service’s intents, the full journey of a request between the Skill interface and the Skill service can take place.
User Interaction Flow (source: https://developer.amazon.com/public/solutions/alexa/alexa-skills-kit/overviews/understanding-custom-skills)
Here is how a user’s spoken words are processed by the Greeter Skill.
The user says, “Alexa, ask Greeter to say Hello”. The Skill interface resolves the audio to an intent event using the interaction model: we set up the invocation name as Greeter and provided the interaction model with sample utterances in the Skill interface, and the sample utterance list includes “Say Hello”. So the Skill interface is able to match the user’s spoken words to the intent name. Once the event is recognized by the Skill interface, it is sent to the Skill service, and the matching intent handler is triggered. The intent handler returns an output speech response of “Hello” to the Skill interface, which then sends it to the Alexa device. Finally, the device speaks the response.
source: https://github.com/bignerdranch/developing-alexa-skills-solutions/blob/master/coursebook/haIntro_Chapter.pdf
Summary:
An Alexa Skill is made up of a Skill interface and a Skill service. The Skill interface configuration defines how users’ verbal commands are resolved to events, which are then routed to the Skill service. This is how you create an Alexa Skill: write a Skill service, configure a Skill interface, then test and deploy the Skill.