
Amazon recently launched a private preview of Amazon Transcribe, an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capabilities to their applications. As bandwidth and connectivity improve, more and more of the world’s data is stored in video and audio formats, and people are creating and consuming it faster than ever before. Businesses need some means of deriving value from all of that rich multimedia content. With Amazon Transcribe you can replace the costly process of manual transcription with an efficient and scalable API.
The service analyzes audio and video files stored in many common formats (WAV, MP3, MP4, FLAC, etc.) and returns detailed and accurate transcriptions with timestamps for each word, as well as inferred punctuation.
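As a rough illustration of what submitting a file might look like, here is a minimal sketch using the boto3 `transcribe` client and its `start_transcription_job` call as they exist in the generally available SDK; the preview API may have differed. The helper names `build_job_request` and `submit_job`, the bucket URI, and the job name are hypothetical, and the format list only mirrors the formats named above.

```python
from urllib.parse import urlparse

# Formats named in the text above; the service may accept others.
SUPPORTED_FORMATS = {"wav", "mp3", "mp4", "flac"}


def build_job_request(job_name, media_uri, language_code="en-US"):
    """Build keyword arguments for start_transcription_job from a media URI."""
    path = urlparse(media_uri).path
    # Take the file extension as the media format, e.g. "meeting.MP4" -> "mp4".
    extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported media format: {extension!r}")
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": extension,
        "Media": {"MediaFileUri": media_uri},
    }


def submit_job(request):
    """Submit the job. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so build_job_request works without it

    client = boto3.client("transcribe")
    return client.start_transcription_job(**request)


# Hypothetical bucket and job name, for illustration only.
request = build_job_request("demo-job", "s3://example-bucket/recordings/meeting.mp4")
print(request)
```

Calling `submit_job(request)` would start the job asynchronously; the transcript becomes available once the job status reaches COMPLETED.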
Here Engineering has prior experience with speech-to-text transcription and the challenges of automating it. During one of our development engagements, we were asked to research solutions to automate the transcription of self-recorded video files. At the time, commercial and open-source speech-to-text products struggled with lower-quality voice tracks, producing transcripts that were mostly gibberish. Without a viable speech-to-text engine, we were forced to defer the solution until the technology was more mature.
We were excited when Amazon announced the Transcribe service and immediately submitted a request to participate in the private preview. Because Amazon is one of the leading technology companies in voice recognition, text to speech, artificial intelligence and machine learning, we had high expectations for the service and were not disappointed.
Although not 100% accurate, the transcript produced by Amazon Transcribe was significantly better than anything in our prior experience. The document required some manual review and cleanup, but that effort was trivial compared to the time required to transcribe the audio track manually. We expect transcription accuracy to keep improving as the underlying machine learning models improve.
We see applicability in several business areas:
Corporate business – Recordings of meetings, teleconferences and web conferences can be transcribed to ensure an accurate record of the meeting and provide a searchable index of the meeting conversation.
Medical industry – Physicians regularly record notes on an audio device which must be transcribed before they can be entered into a patient medical record.
Attorneys – Legal depositions captured as either audio or video recordings must be transcribed.
Insurance – Insurance companies are obligated by law to record their telephone conversations with clients, and the audio recording must be made available as text upon the client’s request.
Deaf community – Under accessibility guidelines, companies are required to provide written versions of discussions and meetings.
Board meetings – Accurate meeting minutes must be produced and posted for all board meetings. The most efficient way to produce the meeting minutes is to transcribe the audio or video recording of the meeting.
Proof of concept:
Below is the sample video used for our initial evaluation of the service. As you can hear, the audio quality is not great, but it is what you would expect from a self-recorded video.
The Transcribe service produces an output document containing the transcribed text and meta information about each word transcribed:
- Recording start time for the word
- Recording end time for the word
- Confidence value for the word
{
  "jobName": "here-engineering-poc",
  "accountId": "351282490771",
  "results": {
    "transcripts": [{
      "transcript": "The school more with the administration with the leadership team and we talk with community members and look at uh very statistics and so forth in order to look at what the uh ability to pay is could we get this up just one second okay way something this is what's known as a pregnant pause...."
    }],
    "items": [{
      "start_time": "0.250",
      "end_time": "0.330",
      "alternatives": [{
        "confidence": "0.7962",
        "content": "The"
      }],
      "type": "pronunciation"
    },
    {
      "start_time": "0.330",
      "end_time": "0.650",
      "alternatives": [{
        "confidence": "0.9978",
        "content": "school"
      }],
      "type": "pronunciation"
    }]
  },
  "status": "COMPLETED"
}
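The output document above can be read with a few lines of Python. The following is a minimal sketch, not our production code: the helper name `extract_words` is hypothetical, the embedded sample is a trimmed copy of the output shown above, and skipping items whose type is not "pronunciation" assumes that other item types may lack timing fields.

```python
import json

# Trimmed copy of the output document shown above (first two words only).
SAMPLE = """
{
  "jobName": "here-engineering-poc",
  "results": {
    "transcripts": [{"transcript": "The school"}],
    "items": [
      {"start_time": "0.250", "end_time": "0.330",
       "alternatives": [{"confidence": "0.7962", "content": "The"}],
       "type": "pronunciation"},
      {"start_time": "0.330", "end_time": "0.650",
       "alternatives": [{"confidence": "0.9978", "content": "school"}],
       "type": "pronunciation"}
    ]
  },
  "status": "COMPLETED"
}
"""


def extract_words(document):
    """Return one dict per spoken word with numeric timing and confidence."""
    words = []
    for item in document["results"]["items"]:
        # Assumption: only "pronunciation" items carry timing information.
        if item.get("type") != "pronunciation":
            continue
        best = item["alternatives"][0]  # first alternative for the word
        words.append({
            "word": best["content"],
            "start": float(item["start_time"]),
            "end": float(item["end_time"]),
            "confidence": float(best["confidence"]),
        })
    return words


words = extract_words(json.loads(SAMPLE))
for w in words:
    print(f'{w["start"]:6.2f}s  {w["word"]:<10} confidence {w["confidence"]:.4f}')
```

All numeric fields arrive as strings in the document, so they are converted to floats before any comparison or sorting.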
The meta information can be used to create a visual reference that color-codes the words needing review based on their confidence value and shows where in the recording each word appears. Mousing over a word displays its recording start time.
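One way such a visual reference could be produced is sketched below. The thresholds (0.6 and 0.9), the three colors, and the function names are illustrative assumptions rather than the values used in our proof of concept; the start time is placed in the HTML `title` attribute so that a browser shows it on mouse-over.

```python
from html import escape


def confidence_color(confidence, low=0.6, high=0.9):
    """Map a confidence value to a review color (thresholds are assumptions)."""
    if confidence >= high:
        return "green"
    if confidence >= low:
        return "orange"
    return "red"


def render_word(word, start, confidence):
    """Render one word as a colored span whose tooltip shows its start time."""
    color = confidence_color(confidence)
    return f'<span style="color:{color}" title="{start:.2f}s">{escape(word)}</span>'


def render_transcript(words):
    """Join rendered words into a single HTML fragment."""
    return " ".join(render_word(word, start, conf) for word, start, conf in words)


# (word, start time in seconds, confidence) taken from the sample output above.
words = [("The", 0.25, 0.7962), ("school", 0.33, 0.9978)]
html_fragment = render_transcript(words)
print(html_fragment)
```

The colors only direct a reviewer's attention; they do not replace listening to the recording at the indicated time.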
The confidence score alone cannot drive the document review. In many cases a word is correct even though its confidence score is low, and there are many cases where the confidence score is high but the transcription engine got the word wrong.
Let Here Engineering build the solution to your transcription problem.