© 2017-2019 VoiceFirst.FM, a division of Score Publishing

Feedback? Email VoiceFirstFM@gmail.com.

The Alexa Podcast - Episode 2

Co-hosts: Bradley Metrock (CEO, Score Publishing) and Kevin Old (software developer, LifeWay)

Guest: John Kelvie, Bespoken

Duration: 27 minutes, 13 seconds

Transcript

[intro music]

 

Bradley Metrock: [00:00:07] Hi. And welcome to the second episode of The Alexa Podcast. Our guest today is John Kelvie of Bespoken. John, say hello.

 

John Kelvie: [00:00:16] Hi! Nice to meet everyone.

 

Bradley Metrock: [00:00:18] Yeah. Thank you for sharing some of your time with us, and we'll get into what you do and learn about Bespoken in just a minute. Before we do that: a shout out to our two sponsors. The Alexa Conference is the annual gathering of Alexa developers and enthusiasts. You can learn more and get registered at AlexaConference.com.

 

Bradley Metrock: [00:00:42] And our other sponsor is Fourthcast. F o u r t h c a s t. Fourthcast turns your podcast into a custom Alexa skill. Get started at Fourthcast.com.

 

Bradley Metrock: [00:00:56] John, thank you very much for joining us. We greatly appreciate you and you being willing to share some of your time with us. Let's get started by you simply explaining to us, and the audience, a little bit about your background and how you got into working with Bespoken.

 

John Kelvie: [00:01:19] Yeah. Absolutely. Thanks a lot for having me on; it's a pleasure. So I've actually been working on interactive audio since 2013. I was one of the co-founders of Xappmedia. We were doing interactive audio ads for mobile. And so you know the simple example I'd always give was if you're listening to a music app, or a podcasting app, our ads would come on and they'd say you can get 2 pizzas for $10 if you just say call now. And you, as the listener, could turn around and say call now, and then you get your pizza. A really neat application of speech recognition. We were building that all on the mobile side, and we were in this voice-first world.

 

John Kelvie: [00:02:12] Even a few years ago, because if you're working with music apps or some type of other audio app, people typically were actually just listening to it in their pocket. You know they have their phone in their pocket; they were listening to it in the background and you know the only mechanism for interaction was that audio. So we really became accustomed to working with that sort of user experience. And when the Alexa device came out, or the Echo device came out with Alexa, it was immediately really interesting to us. We started building skills for it, and as we were building them, we saw that the development tools - I quickly saw the development tools really were just not there yet.

 

John Kelvie: [00:03:03] And that was the genesis for Bespoken is that we saw for this new paradigm: voice-first development, AI-based development - and also the fact that it's just a slightly different whole deployment model it uses - that you needed a new set of tools to manage it. And we created Bespoken based on that.

 

Bradley Metrock: [00:03:28] So you've got the one tool. (And by the way, the website is Bespoken.tools - B E S P O K E N dot tools - for those listening who want to check this out while you're listening).

 

Bradley Metrock: [00:03:42] So I'm not a developer. Kevin will be asking you the questions about that. But where are you going with this? How are you going to monetize these tools?

 

John Kelvie: [00:03:55] So, I mean, we started out the first things that we put out there were some command line tools that are really helpful. We built this thing that we called a proxy that basically allows you to directly interact with Alexa on your laptop. It really shortened the development and debugging lifecycle. And that stuff is open source and it's free.

 

John Kelvie: [00:04:17] But then what we've built out is we had a logging capability. That should give you access now. We're adding monitoring to that, to ensure the quality of your skills. And then we're adding these other features sort of related to monitoring that we refer to as 'validation.' We do even a deeper level, and it's really tailored to voice. And for those features, you know, we charge people - basically it's free up to a certain number of transactions, and then if you get to a very high number, then we start to charge you, so a modest fee for that.

 

Bradley Metrock: [00:04:54] So you're doing that already?

 

John Kelvie: [00:04:56] We've added those features, we haven't started charging anyone yet. It's still all free at this point.

 

Bradley Metrock: [00:05:03] OK. I'm here to ask you the dumb questions, so I'm happy to do that. Kevin, what questions you got?

 

Kevin Old: [00:05:13] Yes, so I'm interested in the tools you have. I see that, as you mentioned, you know, the proxy. Can you talk a little bit more about that? And for those that aren't really familiar with what that could be doing, could you explain in detail how you're able to, you know, proxy requests to Alexa?

 

John Kelvie: [00:05:50] The proxy tool is really neat, and I think, you know, just in sort of introducing this I said you know we saw that it was a new paradigm for development. And so, you know, I came from this mobile background, and with mobile, you had an immense amount of flexibility in how you build your apps. And that was great, but that was also a real problem, because you're running on the device. It's a highly complex API. There's lots of things that can go wrong.

 

John Kelvie: [00:06:21] Stuff can go wrong with your code, with the configuration of the phone, with something with the operating system, you know, much less if you are building for both iOS and Android. You know just within each of their universes, there's this immense complexity. There's the switch now, right, with Alexa where you're not running on the device. You're not running on the Echo if you're building skills. You're not even running like inside of Alexa. Instead you're a server that's just sitting somewhere out there in the cloud and the Alexa service sends you requests and then you know it sends you a request saying a user said such and such, basically. And you respond and say, well tell them this or play this audio and show them this card. You know you can do some other things as well, but that's basically what it comes down to. And it's a really neat development model but it also introduces this basic challenge which is that if you want to be able to test your skills you need to have a server that Alexa can send requests to. And people don't develop on servers. They develop on their laptops. So what we do with the proxy is we actually build a tunnel from your laptop to a public server that we've created at Bespoken and the requests from Alexa go to our server, and then we in turn send them down to your laptop. So it's kind of a neat networking trick. It sort of feels like magic.
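
A minimal sketch of the request/response model described above, in Python. The Alexa service sends a JSON envelope to the skill's endpoint and the skill answers with JSON. The field names follow Amazon's published request and response format as best understood here; `handle_request`, `speak_response`, and `HelloIntent` are hypothetical names chosen for illustration and are not part of any Bespoken tool.

```python
import json


def speak_response(text: str, end_session: bool) -> dict:
    """Build a response envelope with plain-text speech and a simple card."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "card": {"type": "Simple", "title": "Demo Skill", "content": text},
            "shouldEndSession": end_session,
        },
    }


def handle_request(event: dict) -> dict:
    """Turn one incoming request envelope into a response envelope."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        return speak_response("Welcome. What would you like to do?", False)
    if request["type"] == "IntentRequest":
        # "A user said such and such", already resolved to an intent name.
        name = request["intent"]["name"]
        return speak_response(f"You asked for {name}.", True)
    # Other request types (for example SessionEndedRequest) get no speech.
    return {"version": "1.0", "response": {}}


if __name__ == "__main__":
    sample = {
        "version": "1.0",
        "request": {"type": "IntentRequest", "intent": {"name": "HelloIntent"}},
    }
    print(json.dumps(handle_request(sample), indent=2))
```

Because the handler is just a function from JSON to JSON, the same payload can be delivered to a copy running on a laptop, which is what a tunnel of the kind described here relies on.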

 

John Kelvie: [00:08:01] But it makes your life as a developer much easier because you're not going through these cycles where you make a little change to your code, and then you deploy it back out to your server, and you see that something's still not working quite right. And you make another change, and you deploy it again. You know those cycles really slow you down. And so that tool is immensely helpful in terms of accelerating people's development.

 

Kevin Old: [00:08:23] Yeah absolutely. It's certainly something that I've run into. And I did have kind of some flashbacks to the local tunnel solutions that we saw for general web development or mobile development to your mobile device or something like that, that's on the same network. So that's awesome that you guys have that for Alexa. I see also that you have speak and an intend command line tool that appears to be a fit for updating your intents in the Alexa development, or I'm sorry, in the Amazon developer portal. Is that correct?

 

John Kelvie: [00:09:07] Yeah. So that's...what we're doing with those, that's actually based on an emulator that we have. So we built an emulator of the Alexa that mimics its internals and when you say "speak," it actually takes whatever you say, and it creates a proper JSON payload - very similar to what Alexa would send to you - and then it sends it to your skill. And so it's a really neat tool for testing it. On top of it, that emulator that we use for those commands you can actually use that programmatically to do unit tests, to hook it into a continuous integration process...you know, it's meant to be something that you can start off with a little bit of debugging. But then if you want to build some really industrial-grade development processes, you can leverage it to do that.
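
The emulator idea can be sketched roughly as follows, assuming a skill that is a plain Python function. This is not Bespoken's actual API: `speak`, `intend`, `build_intent_request`, the sample utterance table, and the placeholder IDs are all hypothetical, and a real emulator would resolve utterances against the skill's full interaction model.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical utterance table standing in for a skill's interaction model.
SAMPLE_UTTERANCES = {"play the latest episode": "PlayLatestIntent"}


def build_intent_request(intent_name, slots=None):
    """Build an IntentRequest envelope shaped like the ones Alexa sends."""
    slots = slots or {}
    return {
        "version": "1.0",
        "session": {
            "new": True,
            "sessionId": f"SessionId.{uuid.uuid4()}",
            "application": {"applicationId": "amzn1.ask.skill.example"},
            "user": {"userId": "amzn1.ask.account.TEST"},
        },
        "request": {
            "type": "IntentRequest",
            "requestId": f"EdwRequestId.{uuid.uuid4()}",
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "locale": "en-US",
            "intent": {
                "name": intent_name,
                "slots": {n: {"name": n, "value": v} for n, v in slots.items()},
            },
        },
    }


def intend(skill, intent_name, slots=None):
    """Send an intent straight to the skill, skipping utterance matching."""
    return skill(build_intent_request(intent_name, slots))


def speak(skill, utterance):
    """Map an utterance to an intent, then send it to the skill."""
    return intend(skill, SAMPLE_UTTERANCES[utterance.lower()])


def demo_skill(event):
    """Tiny stand-in skill used to exercise the emulator."""
    name = event["request"]["intent"]["name"]
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": f"Handling {name}"},
            "shouldEndSession": True,
        },
    }
```

In a unit test or a continuous integration job, a call such as `speak(demo_skill, "play the latest episode")` returns the skill's response, which can then be checked with ordinary assertions.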

 

Kevin Old: [00:10:08] That's really interesting. Can you talk a little bit about how close that speak technology is to what's actually running on Alexa. And if so, like how are you accomplishing that? Because Alexa really isn't open source, so I'm interested to, I guess, to know that if I'm using that tool as I develop my skill, that once I put it into production in Alexa, that I'm not going to have any unexpected issues.

 

John Kelvie: [00:10:45] Yeah, that's a great question. So we develop it based on the API that Amazon publishes. So we're conforming to the JSON payloads that they have. And that part of it is pretty straightforward. You know I mean that's well documented and we know what value should go into each field. You know sometimes people will put a bug into GitHub and they'll say hey you know this feels not quite right, or did you know that they added this extra field? And we make small tweaks for that. And you know that's really pretty straightforward, and I think you can use the payloads that come from that with a high degree of confidence. Now I will say one additional thing we did do with it was we actually mimicked the behavior of the audio player, which is probably...I mean, it works great. That was an ambitious thing to take on. And, in building that, we were doing some reverse engineering, because the audio player, if you use that API from Amazon, it's just there is a bit more going on. You know it's firing off events at different points. And there's some issues around timing. So I was really studying that very carefully, putting together emulating that and it works great, but at the same time, I do fear like, though, there'll be some updates and it'll be harder for us to keep that in line. But so far it has worked great for us and it's really been invaluable for unit testing, you know, podcasting skills, music skills that we have.
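
To illustrate the audio player events mentioned here, the following is a simplified sketch of a skill that queues the next track when the current one is nearly finished, plus a loop that fires the events for one track in order. The event and directive names follow Amazon's AudioPlayer interface as best understood here; the playlist URLs, function names, and the fixed zero offsets are assumptions for illustration, and a real device's timing is more involved.

```python
# Hypothetical playlist; tokens are simply the list indexes as strings.
PLAYLIST = ["https://example.com/ep1.mp3", "https://example.com/ep2.mp3"]


def play_directive(index, behavior, previous_token=None):
    """Build an AudioPlayer.Play directive for one playlist entry."""
    stream = {"url": PLAYLIST[index], "token": str(index), "offsetInMilliseconds": 0}
    if previous_token is not None:
        # Ties the queued track to the one currently playing.
        stream["expectedPreviousToken"] = previous_token
    return {
        "type": "AudioPlayer.Play",
        "playBehavior": behavior,
        "audioItem": {"stream": stream},
    }


def handle_audio_event(request):
    """Respond to one AudioPlayer event; only 'nearly finished' queues audio."""
    directives = []
    if request["type"] == "AudioPlayer.PlaybackNearlyFinished":
        current = int(request["token"])
        if current + 1 < len(PLAYLIST):
            directives.append(
                play_directive(current + 1, "ENQUEUE", previous_token=request["token"])
            )
    return {"version": "1.0", "response": {"directives": directives}}


def emulate_playback(token):
    """Fire the events sent for one track that plays through to the end."""
    responses = []
    for event in ("PlaybackStarted", "PlaybackNearlyFinished", "PlaybackFinished"):
        request = {
            "type": f"AudioPlayer.{event}",
            "token": token,
            "offsetInMilliseconds": 0,  # simplified; real offsets vary
        }
        responses.append(handle_audio_event(request))
    return responses
```

Calling `emulate_playback("0")` returns three responses, and only the second one carries a directive, which is the kind of ordering a unit test for a podcasting or music skill would check.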

 

Kevin Old: [00:12:35] Certainly, yeah. I can see how that, I guess what I was thinking before was that you may have taken, you may have been able to duplicate what the Alexa voice service portion of Alexa is doing. But just matching JSON payloads is a huge leap forward from a developer standpoint. So major kudos on delivering that.

 

John Kelvie: [00:13:01] Well, thank you. And I should also just mention this is something that we're just putting out there is that you know where we're starting to build testing tools around AVS. And the way in which we've manifested this initially was - and this is just kind of a fun thing you can check out - it's called SilentEcho.Bespoken.io. And right now it's just like a fun web interface: you type something in, and we basically send it. We turn it into speech, based on what you type. We send it to Alexa. Alexa does whatever it would normally do with that speech. And then it sends back a reply, which we then in turn, we turn into text. And that's like...why did we build that as a web interface? I just thought it was kind of a fun thing. But we plan on using that to augment our testing tools as well. And that's really I think going to be super valuable.

 

Kevin Old: [00:14:02] Yeah. Can you talk a bit about the testing tools? That's something that, as soon as I've looked at what you guys have started building, that really caught my eye because I'm a developer and my tests are extremely valuable. And tests that others have written. Can you talk a bit about the, it looks like it's the BST Alexa library?

 

John Kelvie: [00:14:30] Yeah. So that's the emulator that I was referring to. And that's the piece that actually, you know, you can say speak, you can say intend, and that's going to generate the proper JSON, send it to your skill, you know, and then even send events, if you're using the audio player interface. So it's really pretty nifty. And we do see what we're pushing towards next where there is an emulator which has its place in things. We want to augment that, additionally, by really tying into the actual AVS. So that we're calling, and not everybody's probably familiar with what AVS is, that's listening to this. That's actually the low level, that's the API for devices. So that's going to interact with the actual Alexa service. And so there's no emulation. What your skill is going to receive, we're using that as the exact payload. So it's like an extra level of assurance, if you're using that for testing.

 

Kevin Old: [00:15:39] That's awesome. I can see as I'm developing how that would be extremely valuable to have that feedback really quick, because I've developed some very simple skills that do not require the volley of conversation that you find yourself in once you start building something bigger. You naturally gravitate to that because Alexa receives such small bits of audio from the user and I can see that this emulator is just super valuable for that.

 

Kevin Old: [00:16:17] Can you talk about the BST encode?

 

John Kelvie: [00:16:22] Yeah, so our encoder is pretty neat. That's been less well-used. I'm glad you asked about that. I think I'm the main one that's using that. But if you wanted to know...one of the things that we're big fans of, right, is we think people should use produced audio. So that's a user experience perspective that we have. I mean the Alexa, the text to speech there, and the Alexa voice is great. You know, I mean, it's definitely improved leaps and bounds over where text to speech was, you know, five years ago, or even three years ago. But it still sounds better if you record things in a real voice. I mean you get all the pronunciations right. You don't run into those sort of, some of the weird phrasing you can sometimes get with text to speech. In addition, we think produced audio is neat just because it gives you a chance to give things a personality. And so it's just a nice way to differentiate your skills. And so what the encoder does is it makes it easier to do that, because it'll not just...it encodes any audio file to the correct format, so that it can be used by Alexa, and then puts it out on S3, which is Amazon's large file hosting service, so that it's accessible for use in an Alexa skill. And it can be used programmatically, so you can do it on the fly.

 

John Kelvie: [00:17:49] So to give you an example where we use something like that: you know we had an Alexa skill, a podcasting skill, where we wanted to basically take a short snippet from every podcast - you know, like an introduction - and then just automatically take that part and put it into something that you would play as an introduction. That allowed us...you didn't have to do anything ahead of time. You didn't have to go and do it yourself, and then upload it. Instead, it would just know to look at a certain URL, encode it in a flash, and then make it available. So that you can be dynamically updating the audio as you went along. It's really helpful. At the same time, you know, I think most people are still using the Alexa voice, and they're sort of catching on to the utility of using this sort of recorded audio.
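
The encode-and-host workflow can be sketched as below. This is not the Bespoken encoder itself: the function names and the bucket name are hypothetical, the upload step is left out, and the encoder settings (MP3 at 48 kbps and 16 kHz) follow Amazon's published guidance for SSML audio as best understood here, so they should be checked against the current documentation before use.

```python
from xml.sax.saxutils import quoteattr


def ffmpeg_command(source: str, target: str) -> list:
    """Return an ffmpeg argument list that converts audio to Alexa-friendly MP3."""
    return [
        "ffmpeg", "-i", source,
        "-ac", "2",                # two audio channels
        "-codec:a", "libmp3lame",  # MP3 encoder
        "-b:a", "48k",             # assumed required bit rate
        "-ar", "16000",            # assumed accepted sample rate
        target,
    ]


def public_url(bucket: str, key: str) -> str:
    """Return the HTTPS URL of an object in a publicly readable S3 bucket."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"


def ssml_audio(url: str) -> str:
    """Wrap a hosted audio file in SSML so a skill can play it in a response."""
    return f"<speak><audio src={quoteattr(url)}/></speak>"


if __name__ == "__main__":
    # Running the conversion would be subprocess.run(cmd, check=True),
    # followed by an upload of intro.mp3 to the bucket (not shown here).
    cmd = ffmpeg_command("intro.wav", "intro.mp3")
    print(" ".join(cmd))
    print(ssml_audio(public_url("my-skill-audio", "intro.mp3")))
```

The resulting SSML string goes into the response's `outputSpeech` with type `SSML` in place of plain text, which is how recorded audio replaces the synthesized voice.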

 

Kevin Old: [00:18:48] Yeah, I'm going to agree with you on the fact of prerecorded audio. It's night and day when you have the Alexa voices read things and you have something that's been professionally produced, or even, it doesn't even have to be professionally produced. A human can read it. And the, you know, my interest in the skill goes up 100 fold just by having that human touch.

 

Kevin Old: [00:19:15] But the Encode library that you've got, it's actually extremely valuable, because like you said you know bundling that stuff and getting it out on a server...it's just cumbersome to, you know, home-roll that every single time you have to do something. It looks like... I mean, these tools...is that tool a command line tool? Or is it a library you can build around?

 

John Kelvie: [00:19:43] It's a library you can build around. It's meant to be used programmatically.

 

Kevin Old: [00:19:46] That's really, really awesome.

 

Bradley Metrock: [00:19:50] All right John, I've got a question for you, and I'll take this a little bit of a different direction, and this will probably take us to the end. So, I'm curious - in a podcast we recorded earlier this week, the second episode of The VoiceFirst Roundtable, it was discussed whether the future would be 'one assistant to rule them all' or a universe of a lot of different assistants. And I'm wondering your thoughts on that. And just to make the, you know, just to come from the standpoint of a business person you know I run a business called Score Publishing and you know we're going to want to develop some Alexa skills, or we're going to want to develop some voice skills. Let's say I come to you and I say "alright John, I know you're developing these tools that are used for voice technology, and you're very knowledgeable in this space." How would you advise me, if I came to you? What would you tell me to do? Develop for one particular platform, or develop for them all and just budget whatever is appropriate? Or how do you, how do you look at that, and how do you think it's going to reconcile itself?

 

John Kelvie: [00:21:03] I mean, first of all, when you ask that question, it makes me think of all those times people were asking me that, on the mobile side. And I always thought I knew the right answer. You know, I'd be like "oh, you should go HTML5," or try out this cross-platform thing, or you have to be native, for each one... The answer for myself, and from our industry, sort of kept changing. And so, you know, take what I say here with a grain of salt. But I do think right now, where it's early on, we don't know who the winners are going to be.

 

John Kelvie: [00:21:44] You know Amazon has an early lead, seemingly. But Google is putting out compelling products. Microsoft is moving into the industry. Apple, at some point, might do something.

 

John Kelvie: [00:22:02] So, I think, you know, depending on your audience and what you're trying to achieve, I would build for at least Amazon and Google. And you know there's not good cross-platform tooling at this point. But I'm optimistic that we'll see some soon. So that you know essentially you can take one code base, and deploy it effectively to both platforms. Just recently some people were showing me I think what looked like a very promising cross-platform toolkit. And you know the nature of, you know, apps for voice at this point is such that they're not that complicated. And I think there is a good opportunity for people to basically give you a layer that allows you to build it once and run it on multiple platforms. And that's going to be great for folks that want to quickly build skills and don't want to have to worry about learning, you know, multiple platforms really in depth.
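
The "build it once and run it on multiple platforms" layer might look something like this sketch. No specific toolkit is implied: `Reply`, `greet`, and the renderer names are hypothetical, the Alexa shape follows Amazon's published response format, and the second shape is an assumed, simplified stand-in loosely modeled on a Dialogflow-style webhook reply.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    """Platform-neutral description of what the app wants to say."""
    text: str
    end_conversation: bool = True


def to_alexa(reply: Reply) -> dict:
    """Render a reply as an Alexa skill response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": reply.text},
            "shouldEndSession": reply.end_conversation,
        },
    }


def to_dialogflow_style(reply: Reply) -> dict:
    """Render a reply in an assumed, simplified second-platform shape."""
    return {"fulfillmentText": reply.text}


RENDERERS = {"alexa": to_alexa, "google": to_dialogflow_style}


def greet(name: str) -> Reply:
    """Shared business logic, written once with no platform details."""
    return Reply(text=f"Hello, {name}.")


def respond(platform: str, name: str) -> dict:
    """Run the shared logic, then render it for the requesting platform."""
    return RENDERERS[platform](greet(name))
```

The design choice is that `greet` never sees platform JSON, so supporting another assistant means adding one renderer rather than rewriting the app.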

 

Bradley Metrock: [00:23:08] I have this very vivid memory - and thank you for that answer - I have this very vivid memory of reading an article years ago in which the author, who was in the entertainment space, was arguing vehemently that there wasn't going to be a winner between Blu-ray and HD DVD. That they were going to always both co-exist. And, for whatever reason, that's what is in the back of my mind when I look at Amazon and Google and Microsoft and Apple. I mean, I think the upside is - part of the upside is pretty obvious. These are four massive companies that have significant trust established in the marketplace, which is absolutely necessary for this technology and what you're asking people to do. You know, put devices in their home, that could potentially listen all the time, and all this sort of thing.

 

Bradley Metrock: [00:24:07] But the downside clearly is that they're all thousand-pound gorillas that are used to bossing everybody else around. And so you know it'll be interesting to see how it plays out. It's always good to ask somebody who is knee-deep in all of it how they think it will work out.

 

John Kelvie: [00:24:27] Yeah, I think it's fascinating. I think the rate of innovation is amazing. Two things that I would highlight: You know, Amazon is putting out all these incredible products. It's kind of hard to keep up with them. But you know the Echo Show just looks amazing. I can't wait to get my hands on that. You know, the ability to not just have the voice, but also the video, I think is going to be incredible.

 

John Kelvie: [00:24:56] Meanwhile, Google, at the same time...they're really pushing down this path where they're bringing together assistants and assisted apps on a single platform. You know, for both voice, and texting. Which is...that's pretty amazing, right? These are big changes for developers, and for users, and for everybody that...I think it's going to take, you know, I mean they're just all basically kicking up a lot of dust, and it's going to take time for it to settle, and to really see where it falls out.

 

Bradley Metrock: [00:25:33] For people who have heard this podcast who are listening now or just in the process of listening and taking in this discussion, and they want to reach out to you, John, and continue the conversation, learn more about Bespoken...what's the best way for them to do that?

 

John Kelvie: [00:25:53] So I'm on Twitter, @JPKBST - that's an easy way to reach me. My email is JPK at Bespoken dot tools.

 

John Kelvie: [00:26:05] I would also just mention, you know, there's a very vibrant Slack community for voice development. And the Alexa one, I believe, is Alexa Slack dot com. If I have that wrong, just Google Alexa Slack. And that's if you're interested in developing for voice, I mean you go in there, you can really learn a lot. I'm in there all the time. You know, it's easy to get hold of me there. And also, other folks that I would say probably even more expert than myself. You know, there's just a lot of great resources out there.

 

Bradley Metrock: [00:26:46] Thank you very much for setting the time aside. Thank you for sharing your time and your perspective with us. We appreciate it.

 

Bradley Metrock: [00:26:54] And for the second episode of The Alexa Podcast: thank you for listening, and until next time.

[exit music]