Artificial Intelligence Has Revolutionized Transcription

When it comes to blurring the line between human and machine, Daft Punk is the granddaddy of them all.
@courtesy Daft Punk.
In the last few years, algorithmic speech-to-text (and vice versa) has gone from “not terrible” to “quite awesome”. Companies like Google have created bots that are essentially indistinguishable from a human for short snippets of time. And the algorithms’ ability to turn audio into written text is very strong, at least in good recording conditions. While AI’s ability to masquerade as human is still limited (it can’t nail our speech patterns over longer stretches), this leap in speech-to-text has produced a burgeoning new industry: fast transcriptions, created by machine learning algorithms.
To understand why these are “fast”, it helps to know that the transcription market has, since the invention of audio recording, traditionally been very slow, and a bottleneck for many industries. A human can only transcribe an audio file as fast as they can listen to it and type the words they hear. Add any kind of quality assurance, meaning that multiple people listen to the audio, and the process becomes several times longer. It is time-consuming, and therefore not cheap, to produce transcriptions with human transcribers. The standard minimum turnaround time for most transcription companies is 24 hours, usually for “one pass” transcriptions that do not have a second person performing QA.
Some people have their assistants transcribe audio files for them — something I’ve heard frequently attempted by qualitative researchers who do lots of interviews, by broadcasters, and by lawyers (unfortunately for law interns). However, an hour-long interview, transcribed by an assistant who needs to listen to it three or four times to get it right at $30+ an hour, can get quite expensive. Plus, you’ve taken up half your assistant’s day. Most people who have tried it say it’s simply not worth it.
But why does anyone want a transcription more quickly anyway? What’s the benefit?
One big audience is people who conduct interviews and later need to take notes on them. For example, a qualitative researcher might interview four different people in a single day about Bryan Cranston’s new tequila, then need to prepare notes on all four by the following day for a report to stakeholders. Currently, that researcher must listen to the tape again (and again and again) in order to pull out choice quotes for the report. Researchers like this consistently say they want transcripts, because even at a sped-up playback rate, it is many times faster to read a transcript of a one-hour interview than to listen to the whole thing.
A focus group currently in session. @courtesy Thomson Dawson.
Lawyers are also interested, but probably not in the way you’re thinking. For legal reasons, it will probably be a long time before court reporters are replaced with AI (a mistake would be a big deal, with no obvious person to hold responsible). However, lawyers want to use this technology when interviewing their own clients, in private. Much like the researchers above, lawyers spend lots of time asking their clients questions, then using the answers to help build their case. If all of this conversation were (confidentially) recorded and transcribed, they could potentially save hundreds of hours per year.
Another factor, often overlooked, is that we understand something better when we read it than when we hear it spoken. There are two main reasons for this. The first is that we miss hardly anything when we read. If you’re reading a novel and get distracted by a spider crawling across your ceiling, you will pick the book back up at the place where you left off. If you’re listening to that book on Audible and get distracted, you will more than likely just keep listening without rewinding, and miss that passage entirely. The second is that the written word gives you greater information density. When you listen to an audio recording you remember only the sounds, whereas when you read that same text you have both an approximation of the sound via your inner voice and a visual memory of seeing the words on the page in front of you. The latter forms two sensory memories instead of one, which makes for a stronger memory in most people.
This fast transcription technology may bleed into all kinds of industries that have not even thought to use transcripts. The technology will make this especially easy: since all of us are carrying around increasingly sophisticated microphones in and on our pockets, ears, and wrists, anyone who wants to record and transcribe a conversation will be able to do so with almost no effort at all.
- Jack Connor