Speech-to-Text & audio transcription: which solution to choose?

This article is brought to you by the Eden AI team. We allow you to test and use in production a large number of AI engines from different providers directly through our API and platform. In this article, we test several pre-trained Speech-to-Text APIs on various relevant use cases.


Intro :

In recent years, speech recognition has become one of the most popular applications of Artificial Intelligence. This popularity is due to the huge diversity of applications and needs: call centers, broadcasting, translation, health care, banking, voice assistants, etc.

Speech recognition includes various functionalities:

  • Speech-to-text: transcribes audio into text
  • Text-to-speech: converts text into audio
  • Speech analysis: analyzes speech audio to extract information such as the gender, age or emotions of the speaker
  • Speaker diarization: identifies and differentiates the speakers talking in the same audio (by accent, vocal characteristics, etc.)
  • Speech translation: translates speech audio from one language into speech audio in another language


This list is not exhaustive. Many solutions combine several of these functionalities.

This article briefly covers pre-trained Speech-to-Text APIs. The aim is to show which problems can be solved with this kind of API, who the main providers on the market are, and what the optimal process is when using pre-trained APIs.


Providers :

For our study of pre-trained Speech-to-Text APIs, we chose 6 provider APIs that deliver high performance according to many blog articles and rankings.

  • Google Cloud Platform Speech-to-Text API
  • AWS Transcribe API    
  • Microsoft Azure Speech Services
  • IBM Watson Speech-to-Text
  • Rev.ai
  • Assembly AI

Logos of different Speech to Text providers
Eden AI: Speech to Text providers

This is the pool of provider APIs we are going to test. Note that other solutions, including open source ones, also exist.


Use cases :

As said previously, Speech-to-Text APIs are used in hundreds of fields, for a wide variety of use cases. In this article, we test different Speech-to-Text APIs with different types of audio representing common use cases.

We chose 3 use cases with different speakers and speeches. For each use case, we tested the Speech-to-Text API from the 6 providers, with one audio file per use case. Of course, for a real project you will need to test on a representative part of your data (not only one audio file) to get an accurate view of each provider's performance.


Eden AI :

For GCP, AWS, Azure and Watson, we do not need to use their APIs directly: the Eden AI Speech-to-Text API returns the results of these 4 providers with a single request. With a few lines of code, we get access to the results from all 4 providers. Rev.ai and Assembly AI are not implemented yet on Eden AI, so we use their APIs directly.
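To give an idea of what "one request, several providers" can look like, here is a minimal sketch. The field names (`providers`, `file_url`, `language`) and the response shape are assumptions made for illustration, not the documented Eden AI API, so check the official documentation before relying on them. No network call is made.

```python
import json

# Hypothetical field names: check the Eden AI documentation for the real ones.
def build_request(audio_url, providers, language="en-US"):
    """Build one JSON payload that asks for several providers at once."""
    return json.dumps({
        "providers": sorted(providers),
        "file_url": audio_url,
        "language": language,
    })

def texts_by_provider(response_body):
    """Extract {provider: transcript} from a (hypothetical) JSON response."""
    data = json.loads(response_body)
    return {name: result.get("text", "") for name, result in data.items()}

# Fake response used only to exercise the parsing step.
fake_response = json.dumps({
    "google": {"text": "hi it's Paul again"},
    "amazon": {"text": "Hi It's Paul again."},
})
print(texts_by_provider(fake_response))
```

Keeping the transcripts in a dictionary keyed by provider makes it easy to compare them side by side afterwards.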


Tests :

The API response is only a text response. This response (often in JSON format) will be used to develop applications. For our example, the way to proceed is:


1- Benchmark Speech-to-Text APIs available on the market

  • Search for providers
  • Test solutions with some samples according to the project
  • Analyze prices


2- Choose the API provider that best fits your project


3- Integrate the final API into your project / software

  • Look at how to manage the API in production
  • Add pre-processing and post-processing according to your project
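As an illustration of the post-processing step, the sketch below normalizes a transcript before it is compared or stored. It removes the `%HESITATION` markers that appear in some responses, drops a small set of filler words and strips punctuation. The filler list and the function name are assumptions to adapt to your own data.

```python
import re

FILLERS = {"um", "uh"}  # assumption: adjust this list to your own data

def normalize(transcript):
    """Return a list of lowercase words without markers, fillers or punctuation."""
    text = transcript.replace("%HESITATION", " ")
    # Use a plain apostrophe so that "I’m" and "I'm" compare as equal.
    text = text.replace("’", "'").lower()
    words = re.findall(r"[a-z0-9']+", text)
    return [word for word in words if word not in FILLERS]

print(normalize("%HESITATION I’m not sure the exact date"))
print(normalize("Um, I’m not sure of the exact date"))
```

Whether fillers should be removed depends on the project: a subtitle tool may want them gone, while a verbatim legal transcript may need them kept.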


The benchmark is the best and fastest way to compare the performance of different solutions and see which one best fits the type of audio you have. Performance depends on many parameters like language, type of voice, punctuation, processing speed, speed of speech, length of audio, etc.
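In this article the transcripts are compared by reading them, but on a larger dataset a numeric score is more practical. A common choice (our suggestion here, not something measured in this study) is the word error rate: the word-level edit distance between the reference and the transcript, divided by the number of reference words. A minimal sketch, assuming both texts are split on whitespace after lowercasing:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

reference = "In England, we use cars a lot to travel."
hypothesis = "In England, we cause a lot to travel."
print(word_error_rate(reference, hypothesis))
```

Punctuation and casing differences count as errors unless both texts are normalized first, so decide beforehand whether punctuation quality is part of what you want to measure.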

Google, IBM, AWS, Azure, Rev.ai and Assembly AI all provide capable Speech-to-Text APIs. They expose different specific parameters, and it is interesting to look at their performance on different audio files to quickly identify the weak and strong points of each API.


Use case n°1:

The first audio file to transcribe is an interview of a young man. Here is the exact speech:



“I am not sure the exact date. It's for Comic relief, a big televised event, where a lot of comedians come together and try to do something funny for money, which is the slogan. And people also go around wearing red noses and trying to raise money like that. It can genuinely be anyone yeah. It is usually students and school children mainly but it can be anyone.”

The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:


Eden AI Speech to Text responses for different providers
Eden AI: Speech to Text

Google Cloud response:


“I’m  not sure the exact date. It’s for comic relief a big televised event  where a lot of comedians come together and try to do something funny for  money which is the second and people sick around wearing red noses and trying to raise money like that students  in school children mainly but it can be anyI’m not sure the exact date  for comic relief a big televised event where lots of comedians come  together and try to do something funny for money which is the second and  people sick around wearing red noses I’m trying to raise money like  that usually students in school children mainly but they have me”


AWS response:


“I’m not sure the exact date. It’s for Comic relief, a big televised event where a lot of comedians come together on and try to do something funny for money, which is the slogan Andi. People also go around wearing red noses and try and raise. Money like that can generally be anyone. It’s usually students and schoolchildren, mainly, but it can be anyone.”


Microsoft Azure response:


“I’m not sure the exact date it’s for Comic Relief abig  televised event where a lot of comedians come together and try to do  something funny for money, which is the slogan and people also go around  wearing red noses and try and raise money like that. Can generally be anyone. Yeah, it’s usually students and schoolchildren mainly, but it can be anyone.”


IBM response:


“%HESITATION I’m not sure the exact date it’s %HESITATION for comic relief a big televised event %HESITATION relative comedians come together and I try to do something funny for money which is the second %HESITATION and people to go around wearing red noses and trying to raise money like that can generally be anyway it’s usually students and school children mainly but it can be anyone”


Rev.ai response:


“Um, I’m not sure of  the exact date it’s for comic relief, a big televised event, um, where a  lot of comedians come together and try to do something funny for money,  which is the slogan. Um, and people also go around wearing red noses  and try and raise money like that. It can genuinely be anyone. Yeah. It’s usually students in school, children mainly, but it can be anyone.”


Assembly AI response:


“I’m not sure the exact date. It’s for comic release. I’m not sure the exact date. It’s for comic relief. A big televised event. A big televised event. Where a lot of comedians come together and try to do something funny for money, which is the slogan. Where a lot of comedians come together and try to do something funny for money, which is the slogan and people ought to go around wearing red noses and try and raise money like that. I can generally read anyone. it’s usually students in school children mainly, but it can be anyone. And people also go round wearing red noses and try and raise money like that can generally be anyone? Yeah it’s usually students in school children mainly, but it can be anyone.”


Use case n°1 review:

For this use case, some difficulties in the speech lead to errors for every provider, but Rev.ai clearly provides the best performance. Assembly AI's punctuation handling is also impressive. However, the Google and Assembly AI responses contain repeated text, which can be a problem when integrating the API into a project. By combining results from different APIs according to their strong points, it is possible to reach very high performance.
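The repetition problem can be flagged automatically before a transcript reaches the rest of the application. The sketch below is one possible heuristic, not a method used in this study: it measures the share of word n-grams that occur more than once. The n-gram size of 5 is an arbitrary assumption to tune on your own data.

```python
from collections import Counter

def repeated_ngram_ratio(text, n=5):
    """Share of word n-grams that appear more than once in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)

clean = "I go to school on foot or by bike"
doubled = clean + " " + clean
print(repeated_ngram_ratio(clean))
print(repeated_ngram_ratio(doubled))
```

A ratio close to zero suggests a clean transcript, while a high ratio suggests that whole passages were emitted twice and need deduplication.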


Use case n°2:


This second audio file is a 27-second speech by a woman about her means of transport:

“In  England, we use cars a lot to travel. I go to school on foot or by  bike. However, to go further, I would go in the car or on the bus. To go  on holiday, I go by plane or by boat. However, I do not like flying  because I’m scared of heights. And I do not like going by boat because I  feel seasick.”


The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:

Eden AI Speech to Text responses for different providers
Eden AI: Speech to Text

Google response:


“in England we use cause a lot to travel I go to school on foot or by bike however to go further I would go in the car or on the bus to go on holiday I go by plane go by boat however I do not like flying because I’m scared of heights and I do not like going by boat because I feel seasick in England we use cause a lot to travel I go to school on foot or by bike however to go further I would go in the car or on the bus to go on holiday I go by plane go by boat however I do not like flying because I’m scared of heights and I do not like going by boat because I feel seasick”


AWS response:


“In England, we use cars a lot to travel. I go to school on foot or by  bike. However, to go further, I would go in the car or on the bus. to go on holiday. I go by plane or by boat. However, I do not like flying because I’m scared of heights on. And I do not like going by boat because I feel seasick.”


Microsoft Azure response:


“In England we use cars allowed to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus . to go on holiday. I go by plane or by boat. However, I do not like flying because I’m scared of Heights and I do not like going by boat because I feel seasick.”


IBM response:


“in England we use because a lot to travel I go to school on foot all bye bye however it to go fed that I would go in the call or on the bus to go on holiday I go by plane or by boat however I do not like flying because I’m scared of heights and I do not like going by both because I feel seasick”


Rev.ai response:


“In England, we cause a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday. I go by plane or by boat. However, I do not like flying because I am  scared of Heights. And I do not like going by boat because I feel seasick.”


Assembly AI response:


“In  England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to  go on holiday, I go by plane or by boat. However, I do not like flying  because I am scared of heights and I do not like going by boat because I  feel seasick.”


Use case n°2 review:

For this second use case, we can see a huge performance gap between providers. Assembly AI provides a very high level of performance, followed by Rev.ai, which is slightly less accurate but still very good. AWS comes next, still closer to the leaders than Microsoft, Google and IBM, which give weak results compared to Assembly AI and Rev.ai.


Use case n°3:


This third use case is a phone message left by a man who is talking about his new phone. We will briefly look at performance on a phone-quality audio file. Here is the speech:

“Hi  it’s Paul again, I’m very excited I went and got my new IPhone today  with the new software. It’s a very very good phone, everyone should get  one. I love it. It does many wonderful things. It allows me to do my  email, my web browsing. It’s a phone very very neat. Talk to you soon.  Bye !”


The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:


Eden AI Speech to Text responses for different providers
Eden AI: Speech to Text

Google response:


“Hi it’s Paul again I’m very excited I went and got my new iPhone today with the new software. to very very good phone everyone should get one I love it it does many wonderful things it allows me to do my email on my web browsing it’s a phone very very neat talk to you soon bye”


AWS response:


“Hi It’s Paul again. I’m very excited. I went and got my new iPhone today with the new software. It’s a very, very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my Web browsing. It’s a phone. Very, very neat. Talk to you soon bye.”


Microsoft Azure response:


“Hi it’s Paul again. I’m very excited. I would went and got my new iPhone today with the new software. It’s a very very good phone. Everyone should get one. I love it. It does many  wonderful things. It allows me to do my email, my web browsing. It’s a  phone. Very very neat. Talk to you soon bye.”


IBM response:


“hi it’s Paul again %HESITATION I’m very excited I went and got my new iPhone today with the new software  it’s a very very good phone everyone should get one I love it it does many wonderful things it allows me to do my email on my web browsing it’s a phone very very needs talk to you soon bye”


Rev.ai response:


“Hi, it’s Paul. Again, I’m very excited. I went and got my new iPhone today with the  new software. It’s a very, very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my web browsing. It’s a phone. It’s very, very neat. Talk to you soon. Bye.”


Assembly AI response:


“Hi it’s Paul again I’m very excited. I went and got my new iPhone today with a new software it’s a very, very good phone. Everyone should get one I love it, it does many wonderful things. It allows me to do my email. My web browsing it’s a phone it’s very, very neat. Talk to you soon. Bye.”


Use case n°3 review:

For this third use case, all the providers perform well. It is interesting to note that some providers handle certain difficulties and fail on others, and vice versa for other providers. For this kind of case, the API choice is often made on processing speed or pricing.


Pricing

The costs of the APIs are defined by duration thresholds, with prices decreasing as volume grows:


A table of Speech to Text providers' pricing
Eden AI: Speech to Text providers' pricing

Prices are displayed in dollars per second. There are significant price differences between the providers, and 3 price ranges stand out. Google and Rev.ai are the most expensive: for volumes higher than 1M minutes, Google is 360% more expensive than IBM and Rev.ai 350%. Next come Microsoft and AWS with similar prices. IBM and Assembly AI are the least expensive of the panel. Moreover, the pricing presented in this table corresponds to standard offers and may change for particular requests with specific parameters. For example, Google charges higher prices for models dedicated to videos and phone calls but, conversely, lower prices when users agree to share their data in order to improve Google’s models.
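To estimate a bill under threshold-based pricing, the arithmetic can be sketched as follows. The tiers below are invented for illustration and are not any provider's real prices. The sketch also assumes that each tier's price applies only to the minutes falling inside that tier, which should be checked against each provider's pricing page.

```python
# Hypothetical tiers: (upper bound in minutes, price per minute in dollars).
EXAMPLE_TIERS = [(1000, 0.02), (float("inf"), 0.01)]

def tiered_cost(minutes, tiers):
    """Total cost when each tier's price applies to the minutes inside it."""
    cost = 0.0
    lower = 0
    for upper, price in tiers:
        if minutes <= lower:
            break
        cost += (min(minutes, upper) - lower) * price
        lower = upper
    return cost

def percent_more_expensive(price_a, price_b):
    """How much more expensive price_a is than price_b, in percent."""
    return (price_a - price_b) / price_b * 100

# 1000 minutes at 0.02 plus 500 minutes at 0.01
print(tiered_cost(1500, EXAMPLE_TIERS))
print(percent_more_expensive(3.0, 2.0))
```

Running the same volume through each provider's tiers gives comparable totals, which is more informative than comparing headline unit prices.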

Please note that the prices displayed in this table are those at the time of writing and may have been changed by the providers since.


Conclusion

We chose 3 random use cases. They show that the way to manage a project can differ for each kind of data:

  • One API with very high performance
  • Combination of multiple APIs results
  • All APIs perform very well, so the choice is made on other criteria: processing speed, price, etc.

Depending on the use case, the best way to obtain the highest performance differs. It is important to note that Google, AWS, IBM and Microsoft support speech-to-text for many languages. In comparison, Assembly AI and Rev.ai currently support only English (in several regional variants), but they are working on launching models for other languages. Another important point: unlike IBM and Google, Amazon, Microsoft, Rev.ai and Assembly AI handle punctuation, which is a very important feature. Of course, other specific features of each provider can make the difference depending on your project. We highly recommend checking for any specific optional parameters, as they can change your choice!

With Eden AI, you can quickly get results from various providers, which gives you a better idea of which solution best fits your needs. Other providers will be added to Eden AI in the future.

The decision-making process is as follows:

First you run your data on Eden AI to benchmark the solutions available on the market. Then you have 3 options:

  1. One result leads you to choose a single API that meets your expected performance.
  2. Different providers give fairly good results, but not good enough. You then combine their results to build on each provider's strengths and get a combined result better than any single provider's. This operation can be tedious for speech-to-text.
  3. Multiple providers give very high performance, so you can base your choice on other aspects like pricing or processing speed.
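The three options above can be written as a small decision rule. This is only a sketch: it assumes you have already computed an error score per provider on your own data (lower is better) and chosen a target score, and the scores used in the example are made up.

```python
def choose_strategy(error_by_provider, target_error):
    """Map benchmark scores to one of the three options described above."""
    good_enough = sorted(
        provider
        for provider, error in error_by_provider.items()
        if error <= target_error
    )
    if len(good_enough) == 1:
        # Option 1: a single API meets the expected performance.
        return ("single", good_enough)
    if not good_enough:
        # Option 2: nobody is good enough alone, so combine the results.
        return ("combine", sorted(error_by_provider))
    # Option 3: several APIs qualify, so compare price and speed.
    return ("compare_price_and_speed", good_enough)

# Made-up scores, for illustration only.
print(choose_strategy({"provider_a": 0.05, "provider_b": 0.30}, target_error=0.10))
```

The target score is the important input: it should come from what the final application can tolerate, not from the benchmark itself.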

This process helps ensure that you make the right choice for your project. Eden AI is only a tool that allows you to run a benchmark very easily and quickly. Finally, it is possible to use the Eden AI API for the entire project, avoiding accounts and billing with many providers while keeping the flexibility of not being tied to a single provider.

In the case of Speech-to-Text solutions, pricing is an important element for decision making because large differences exist between the providers. This is especially true when considering large volumes.
