We often get asked questions like: "What's Rev's accuracy?" or "How does Rev.ai compare to <enter competitor name here>?"

The short answer is that, on one of our own test sets, we get these results:

That said, this is on one particular data set. There are many factors that affect accuracy, including but not limited to:

  • Input audio quality
    - Background noise
    - Audio equipment quality
    - Compression
    - Sampling rate
    - Distant microphones and reverberation
    - Noisy channels
  • Speaker qualities
    - Diction, pronunciation, clarity, loudness, etc.
    - Accent
    - Dialect
  • Environmental qualities / speech recognition "in the wild"
    - Multiple speakers
    - Non-stationary noises (e.g. passing sirens)
    - Unexpected events
    - Many different topic domains (law, medicine, education, news, etc.)
  • ASR engine characteristics
    - Depth of training data: vocabulary coverage within each industry/subject
    - Breadth of training data: coverage across industries, accents, dialects, etc.
    - AI model depth and effectiveness
    - Separate or combined language packs for different dialects (e.g. US vs. UK English)

Therefore, we're uncomfortable declaring our accuracy as a single number or percentage: accuracy depends heavily on the first three bullets above, which are largely out of our control beyond building an engine smart enough to cope with deficiencies in those areas. For the same reason, we don't believe the accuracy claims any speech-to-text company makes unless they're stated in very broad terms. Instead, we'll focus on the fourth bullet and explain what makes us superior to most any engine available today.

Rev.com has been around for 10 years, and our core business is human transcription. Human transcripts and timing data, paired with the associated audio, are exactly what is needed to tune an engine. We have 10 years' worth of such data, spanning a multitude of industries (broadcast/media, legal, medical, education, etc.) and accents/dialects from all over the world. Therefore, we can handpick the data we need to create the best possible model, and we have done so with millions of minutes of audio.

In addition, we decided to put all of our English into a single model, so you get the best results out of the box regardless of whether the speaker is American, British, Australian, German, or anything else. Many companies make you swap models in and out of memory depending on who is speaking, which is totally unrealistic in this global world of ours.

We train our models on noisy audio, which makes us more resilient to it than other engines. ASR is our core business, not one of many things we do, so we are laser focused on continuous improvement in accuracy, performance, and features. On high-quality audio with native English speakers speaking clearly, we achieve accuracy in the low-to-mid 90s (percent).
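For context on what an accuracy figure like this means: ASR accuracy is conventionally reported as 1 minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the engine's output, divided by the number of reference words. The sketch below is a generic illustration of that metric, not Rev's internal scoring code, and it skips the text normalization (casing, punctuation, number formatting) that real benchmarks apply first:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the quick brown fox", "the quick brown box")
print(f"WER: {wer:.2f}, accuracy: {1 - wer:.0%}")  # WER: 0.25, accuracy: 75%
```

One substituted word out of four reference words gives a WER of 0.25, i.e. 75% accuracy, which is why the factors listed above (noise, accents, overlapping speakers) can swing reported accuracy so dramatically between test sets.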

We are the ONLY speech-to-text company that offers end-to-end options, from human to fully automatic transcriptions and captions, with a few options in between. This enables us to meet you where you are and provide APIs for exactly what you need with regard to accuracy, turnaround time, editing capabilities, required output formats, and cost. Tell us your needs on each of these criteria and we'll tell you which combination of Rev products would best meet them.
