Speech-to-Text & audio transcription: which solution to choose?


This article is brought to you by the Eden AI team. We allow you to test and use in production a large number of AI engines from different providers directly through our API and platform. In this article, we test several pre-trained Speech-to-Text APIs on various relevant use cases.


Intro:

In recent years, speech recognition has become one of the most popular applications of Artificial Intelligence. This popularity is due to the huge diversity of applications and needs: call centers, broadcasting, translation, health care, banking, voice assistants, etc.

Speech recognition includes various functionalities:

  • Speech-to-text: allows you to transcribe audio into text

  • Text-to-speech: allows you to convert text into audio

  • Speech analysis: allows you to analyze speech audio in order to extract information such as the gender, age or emotions of the speaker

  • Speech diarization: allows you to identify and differentiate the speakers talking in the same audio (by accent, other distinctive traits, etc.)

  • Speech translation: allows you to translate speech in one language into speech in another language


This list is not exhaustive. Many solutions combine several of these functionalities.

This article gives a brief overview of pre-trained Speech-to-Text APIs. The aim is to show which problems can be solved with this kind of API, who the main providers on the market are, and what the optimal process is when using pre-trained APIs.


Providers:

For our study of pre-trained Speech-to-Text APIs, we selected 6 provider APIs that deliver high performance according to many blog articles and rankings:

  • Google Cloud Platform Speech-to-Text API

  • AWS Transcribe API

  • Microsoft Azure Speech Services

  • IBM Watson Speech-to-Text

  • Rev.ai

  • Assembly AI

Eden AI: Speech to Text providers

This is the pool of provider APIs we are going to test. Note that other solutions exist, including open source ones.


Use cases:

As mentioned above, Speech-to-Text APIs are used in hundreds of fields, for a wide variety of use cases. In this article, we test different Speech-to-Text APIs on different types of audio representing common use cases.

We chose 3 use cases with different speakers and types of speech. For each use case, we tested the Speech-to-Text APIs of the 6 providers on one audio file. Of course, for a real project you will need to test on a representative sample of your data (not just one audio file) to get an accurate view of each provider's performance.


Eden AI:

For GCP, AWS, Azure and Watson, we do not need to call their APIs directly: the Eden AI Speech-to-Text API returns the results of these 4 providers with a single request, in a few lines of code. Rev.ai and Assembly AI are not yet available on Eden AI, so we use their APIs directly.
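To illustrate the idea of sending one audio file to several engines and collecting the answers side by side, here is a minimal sketch. It is not the Eden AI API: the two provider functions are hypothetical placeholders that you would replace with real calls (to Eden AI, Rev.ai, Assembly AI, etc.), and the function names are our own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# A transcriber takes the path of an audio file and returns the transcript.
Transcriber = Callable[[str], str]


def transcribe_with_all(audio_path: str, providers: Dict[str, Transcriber]) -> Dict[str, str]:
    """Send the same audio file to every provider and collect the transcripts."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        # Submit all requests first so the providers are queried in parallel.
        futures = {name: pool.submit(fn, audio_path) for name, fn in providers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # one failing provider should not stop the benchmark
                results[name] = f"ERROR: {exc}"
    return results


# Hypothetical stand-ins: replace each one with a real call to a provider's API.
def fake_provider_a(audio_path: str) -> str:
    return "hello world"


def fake_provider_b(audio_path: str) -> str:
    raise RuntimeError("quota exceeded")


if __name__ == "__main__":
    providers = {"provider_a": fake_provider_a, "provider_b": fake_provider_b}
    for name, text in transcribe_with_all("sample.wav", providers).items():
        print(name, "->", text)
```

Keeping every provider behind the same "path in, text out" signature makes it easy to add or remove an engine without touching the comparison code.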


Tests:

The API response is only text. This response (often in JSON format) is what you will use to build applications. For our example, the way to proceed is:


1- Benchmark Speech-to-Text APIs available on the market

  • Search for providers

  • Test solutions with some samples according to the project

  • Analyze prices


2- Choose the API provider that best fits your project


3- Integrate the chosen API in your project / software

  • Look at how to manage the API in production

  • Add pre-processing and post-processing according to your project
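As an example of the post-processing step, here is a minimal sketch that cleans a raw transcript before it reaches your application. It assumes two things that you should check against your own data: that filler markers look like the `%HESITATION` tokens visible in the IBM responses in this article, and that a repeated transcript is an exact word-for-word duplicate. The function names are ours, not part of any provider's SDK.

```python
import re

# Assumption: filler markers look like "%HESITATION"; extend the pattern for other engines.
FILLER_PATTERN = re.compile(r"%HESITATION\b", flags=re.IGNORECASE)


def remove_fillers(text: str) -> str:
    """Drop filler markers and collapse the extra whitespace they leave behind."""
    cleaned = FILLER_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def collapse_exact_repeat(text: str) -> str:
    """If the transcript is the same word sequence twice in a row, keep one copy."""
    words = text.split()
    half = len(words) // 2
    if half > 0 and len(words) % 2 == 0 and words[:half] == words[half:]:
        return " ".join(words[:half])
    return text


def postprocess(text: str) -> str:
    """Apply both clean-up steps: fillers first, then exact repetition."""
    return collapse_exact_repeat(remove_fillers(text))


if __name__ == "__main__":
    raw = "%HESITATION hi it's Paul again %HESITATION hi it's Paul again"
    print(postprocess(raw))
```

Partial or slightly different repetitions are not handled here; they would need a fuzzier comparison.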


Benchmarking is the best and fastest way to compare the performance of different solutions and see which one best fits the type of audio you have. The outcome depends on many parameters such as language, type of voice, punctuation, processing speed, speech rate, length of the audio, etc.

Google, IBM, AWS, Azure, Rev.ai and Assembly AI all provide high-performing Speech-to-Text APIs. Each exposes its own specific parameters, and it is worth looking at their performance on different audio files to quickly identify the strong and weak points of each API.
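If you want a number to go with the visual comparison, a common metric is the word error rate (WER): the word-level edit distance between the exact speech and the API response, divided by the number of words in the exact speech. The reviews in this article are qualitative and do not use this metric; the sketch below is only one possible way to score transcripts, and its normalization (lowercase, punctuation ignored) is an assumption suited to English text.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation (keeping apostrophes) and split into words."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # English-only assumption
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic programming over words, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    reference = "In England, we use cars a lot to travel."
    hypothesis = "In England we use cars allowed to travel."
    # "a lot" -> "allowed" counts as 2 word errors over 9 reference words.
    print(round(word_error_rate(reference, hypothesis), 3))
```

Because punctuation and casing are ignored, this score says nothing about punctuation quality, which you may want to judge separately.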


Use case n°1:

The first audio file to transcribe is an interview with a young man. Here is the exact speech:

“I am not sure the exact date. It's for Comic relief, a big televised event, where a lot of comedians come together and try to do something funny for money, which is the slogan. And people also go around wearing red noses and trying to raise money like that. It can genuinely be anyone yeah. It is usually students and school children mainly but it can be anyone.”


The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:


Eden AI: Speech to Text

Google Cloud response:


“I’m not sure the exact date. It’s for comic relief a big televised event where a lot of comedians come together and try to do something funny for money which is the second and people sick around wearing red noses and trying to raise money like that students in school children mainly but it can be anyI’m not sure the exact date for comic relief a big televised event where lots of comedians come together and try to do something funny for money which is the second and people sick around wearing red noses I’m trying to raise money like that usually students in school children mainly but they have me”


AWS response:


“I’m not sure the exact date. It’s for Comic relief, a big televised event where a lot of comedians come together on and try to do something funny for money, which is the slogan Andi. People also go around wearing red noses and try and raise. Money like that can generally be anyone. It’s usually students and schoolchildren, mainly, but it can be anyone.”


Microsoft Azure response:


“I’m not sure the exact date it’s for Comic Relief abig televised event where a lot of comedians come together and try to do something funny for money, which is the slogan and people also go around wearing red noses and try and raise money like that. Can generally be anyone. Yeah, it’s usually students and schoolchildren mainly, but it can be anyone.”


IBM response:


“%HESITATION I’m not sure the exact date it’s %HESITATION for comic relief a big televised event %HESITATION relative comedians come together and I try to do something funny for money which is the second %HESITATION and people to go around wearing red noses and trying to raise money like that can generally be anyway it’s usually students and school children mainly but it can be anyone”


Rev.ai response:


“Um, I’m not sure of the exact date it’s for comic relief, a big televised event, um, where a lot of comedians come together and try to do something funny for money, which is the slogan. Um, and people also go around wearing red noses and try and raise money like that. It can genuinely be anyone. Yeah. It’s usually students in school, children mainly, but it can be anyone.”


Assembly AI response:


“I’m not sure the exact date. It’s for comic release. I’m not sure the exact date. It’s for comic relief. A big televised event. A big televised event. Where a lot of comedians come together and try to do something funny for money, which is the slogan. Where a lot of comedians come together and try to do something funny for money, which is the slogan and people ought to go around wearing red noses and try and raise money like that. I can generally read anyone. it’s usually students in school children mainly, but it can be anyone. And people also go round wearing red noses and try and raise money like that can generally be anyone? Yeah it’s usually students in school children mainly, but it can be anyone.”


Use case n°1 review:

For this use case, some difficulties in the speech lead to errors for every provider, but Rev.ai clearly provides the best performance. It is worth noting that Assembly AI's punctuation handling is impressive. However, with Google and Assembly AI we ran into text repetition, which can be a nuisance when integrating the API into a project. By combining the results of different APIs according to their respective strong points, it is possible to reach very high performance.
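There are many ways to combine results. One very simple option, sketched below, is to keep the transcript that agrees most with all the others, on the assumption that errors differ from one provider to the next. This is our own illustration rather than the method used for this article, and it selects a whole transcript instead of merging providers word by word (for example, it will not take the punctuation of one provider and the words of another).

```python
from difflib import SequenceMatcher
from typing import Dict


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two transcripts, between 0 and 1."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split(), autojunk=False).ratio()


def pick_consensus(transcripts: Dict[str, str]) -> str:
    """Return the name of the provider whose transcript agrees most with the others."""
    if len(transcripts) < 2:
        raise ValueError("need at least two transcripts to compare")
    scores = {}
    for name, text in transcripts.items():
        others = [t for other, t in transcripts.items() if other != name]
        # Average agreement of this transcript with every other transcript.
        scores[name] = sum(similarity(text, t) for t in others) / len(others)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Short excerpts adapted from the responses in this article, punctuation removed.
    transcripts = {
        "provider_1": "in england we use cars a lot to travel",
        "provider_2": "in england we use cars a lot to travel",
        "provider_3": "in england we use because a lot to travel",
    }
    print(pick_consensus(transcripts))
```

This only works when at least a few providers are reasonably accurate; if most of them make the same mistake, the consensus will carry that mistake too.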


Use case n°2:


The second audio file is a 27-second speech by a woman about her means of transport:

“In England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. To go on holiday, I go by plane or by boat. However, I do not like flying because I’m scared of heights. And I do not like going by boat because I feel seasick.”


The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:

Eden AI: Speech to Text

Google response:


“in England we use cause a lot to travel I go to school on foot or by bike however to go further I would go in the car or on the bus to go on holiday I go by plane go by boat however I do not like flying because I’m scared of heights and I do not like going by boat because I feel seasick in England we use cause a lot to travel I go to school on foot or by bike however to go further I would go in the car or on the bus to go on holiday I go by plane go by boat however I do not like flying because I’m scared of heights and I do not like going by boat because I feel seasick”


AWS response:


“in England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday. I go by plane or by boat. However, I do not like flying because I’m scared of heights on. And I do not like going by boat because I feel seasick.”


Microsoft Azure response:


“In England we use cars allowed to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus . to go on holiday. I go by plane or by boat. However, I do not like flying because I’m scared of Heights and I do not like going by boat because I feel seasick.”


IBM response:


“in England we use because a lot to travel I go to school on foot all bye bye however it to go fed that I would go in the call or on the bus to go on holiday I go by plane or by boat however I do not like flying because I’m scared of heights and I do not like going by both because I feel seasick”


Rev.ai response:


“In England, we cause a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday. I go by plane or by boat. However, I do not like flying because I am scared of Heights. And I do not like going by boat because I feel seasick.”


Assembly AI response:


“In England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday, I go by plane or by boat. However, I do not like flying because I am scared of heights and I do not like going by boat because I feel seasick.”


Use case n°2 review:

For this second use case, we can see a huge performance gap between providers. Assembly AI provides a very high level of performance, followed by Rev.ai, which is slightly less accurate but still performs very well. AWS comes next, still ahead of Microsoft, Google and IBM, whose results are weak compared to Assembly AI and Rev.ai.
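To see where each provider goes wrong rather than only how often, you can list the word-level differences between the exact speech and a response. The sketch below uses Python's standard `difflib` module; the function name is ours, and punctuation is left attached to the words, so strip it first if you only care about the words themselves.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_differences(reference: str, hypothesis: str) -> List[Tuple[str, str, str]]:
    """List (operation, reference words, hypothesis words) for every mismatch."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    differences = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # op is "replace", "delete" or "insert"
            differences.append((op, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return differences


if __name__ == "__main__":
    reference = "we use cars a lot to travel"
    hypothesis = "we use cars allowed to travel"
    for op, expected, got in word_differences(reference, hypothesis):
        print(f"{op}: expected '{expected}', got '{got}'")
```

Run over all the responses for one audio file, this kind of listing makes recurring confusions (such as "cars" heard as "cause") easy to spot.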


Use case n°3:


The third use case is a phone message left by a man talking about his new phone. It gives a brief look at performance on a phone-quality audio file. Here is the speech:

“Hi it’s Paul again, I’m very excited I went and got my new IPhone today with the new software. It’s a very very good phone, everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my web browsing. It’s a phone very very neat. Talk to you soon. Bye !”


The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:

Eden AI: Speech to Text


Google response:


“hi it’s Paul again I’m very excited I went and got my new iPhone today with the new software. to very very good phone everyone should get one I love it it does many wonderful things it allows me to do my email on my web browsing it’s a phone very very neat talk to you soon bye”


AWS response:


“Hi It’s Paul again. I’m very excited. I went and got my new iPhone today with the new software. It’s a very, very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my Web browsing. It’s a phone. Very, very neat. Talk to you soon bye.”


Microsoft Azure response:


“Hi it’s Paul again. I’m very excited. I would went and got my new iPhone today with the new software. It’s a very very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my web browsing. It’s a phone. Very very neat. Talk to you soon bye.”


IBM response:


“hi it’s Paul again %HESITATION I’m very excited I went and got my new iPhone today with the new software it’s a very very good phone everyone should get one I love it it does many wonderful things it allows me to do my email on my web browsing it’s a phone very very needs talk to you soon bye”


Rev.ai response: