Podcasting has become a popular and powerful medium for storytelling, information, and entertainment. Without transcripts, podcasts may be inaccessible to people who are hard of hearing, deaf, or deaf-blind. However, ensuring that auto-generated podcast transcripts are readable and accurate is a challenge: the text must accurately reflect the meaning of what was spoken and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To evaluate the quality of our ASR output, we compare a small number of human-generated, or reference, transcripts to the corresponding ASR transcripts.
The industry standard for measuring transcript accuracy, word error rate (WER), lacks nuance. It penalizes all errors in the ASR text (insertions, deletions, and substitutions) equally, regardless of their impact on readability. Moreover, the reference text is itself subjective: it is based on what the human transcriber discerns as they listen to the audio.
Building on recent research into better readability metrics, we set ourselves the challenge of developing a more nuanced quantitative assessment of the readability of ASR passages. As shown in Figure 1, our solution is the human evaluation word error rate (HEWER) metric. HEWER focuses on major errors, those that adversely affect readability, such as misspelled proper nouns, capitalization errors, and certain punctuation errors. HEWER ignores minor errors, such as filler words ("um," "yeah," "like") or alternate spellings ("OK" vs. "okay"). We found that for an American English test set of 800 segments sampled from 61 podcast episodes, with an average ASR transcript WER of 9.2%, the HEWER was just 1.4%, indicating that the ASR transcripts were of higher quality and more readable than WER might suggest.
Our findings provide data-driven insights that we hope lay the groundwork for improving the accessibility of Apple Podcasts for millions of users. In addition, Apple engineering and product teams can use these insights to help connect audiences with more of the content they seek.
Selecting Sample Podcast Segments
We worked with human annotators to identify and classify errors in 800 segments of American English podcasts pulled from manually transcribed episodes with a WER of less than 15%. We chose this WER maximum to ensure the ASR transcripts in our evaluation samples:
Met the threshold of quality we expect for any transcript shown to an Apple Podcasts audience
Required our annotators to spend no more than five minutes to classify errors as major or minor
Of the 66 podcast episodes in our preliminary dataset, 61 met this criterion, representing 32 unique podcast shows. Figure 2 shows the selection process.
For example, one episode in the preliminary dataset from the podcast show Yo, Is This Racist? titled "Cody's Marvel dot Ziglar (with Cody Ziglar)" had a WER of 19.2% and was excluded from our evaluation. However, we included an episode titled "I'm Not Trying to Put the Plantation on Blast, But…" from the same show, which had a WER of 14.5%.
Segments with a relatively higher episode WER were weighted more heavily in the selection process, because such episodes can provide more insight than episodes whose ASR transcripts are nearly flawless. The mean episode WER across all segments was 7.5%, while the average WER of the selected segments was 9.2%. Each audio segment was roughly 30 seconds long, providing enough context for annotators to understand the segment without making the task too taxing. We also aimed to select segments that started and ended at a word boundary, such as a sentence break or long pause.
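The exact weighting scheme is not spelled out here, so the following is only a minimal sketch, assuming each segment's selection probability is proportional to its episode's WER and that segments are drawn without replacement; the function name and inputs are hypothetical.

```python
import numpy as np


def sample_segments(segment_ids, episode_wers, k=800, seed=0):
    """Draw k segments, favoring segments from higher-WER episodes.

    segment_ids:  list of segment identifiers (hypothetical format)
    episode_wers: per-segment WER of the episode each segment came from
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(episode_wers, dtype=float)
    probabilities = weights / weights.sum()  # weight proportional to episode WER (assumption)
    chosen = rng.choice(len(segment_ids), size=k, replace=False, p=probabilities)
    return [segment_ids[i] for i in chosen]
```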
Evaluating Major and Minor Errors in Transcript Samples
WER is a widely used measure of the performance of speech recognition and machine translation systems. It divides the total number of errors in the auto-generated text by the total number of words in the human-generated (reference) text. Unfortunately, WER scoring gives equal weight to all ASR errors (insertions, substitutions, and deletions), which can be misleading. For example, a passage with a high WER may still be readable, or even indistinguishable in semantic content from the reference transcript, depending on the types of errors.
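As a concrete illustration of the calculation just described, here is a minimal word-level WER implementation using the standard edit-distance recurrence; the function name and normalization choices (lowercasing, whitespace tokenization) are our own, not a description of Apple's pipeline.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()

    # Word-level edit distance via the standard dynamic-programming recurrence.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: one insertion and one substitution over five reference words -> WER = 0.4
print(word_error_rate("the herd began to quarantine", "uh the herd began to quarantining"))
```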
Previous research on readability has focused on subjective and imprecise metrics. For example, in their paper "A Metric for Evaluating Speech Recognizer Output Based on Human-Perception Model," Nobuyasu Itoh and team devised a scoring rubric on a scale of 0 to 5, with 0 being the highest quality. Participants in their experiment were first presented with auto-generated text without the corresponding audio and were asked to judge each transcript based on how easy it was to understand. They then listened to the audio and scored the transcript based on perceived accuracy.
Other readability research, such as "The Future of Word Error Rate," has, to our knowledge, not been conducted across datasets at scale. To address these limitations, our researchers developed a new metric for measuring readability, HEWER, that builds on the WER scoring system.
The HEWER score provides human-centric insights that account for readability nuances. Figure 3 shows three versions of a 30-second sample segment from transcripts of the April 23, 2021, episode, "The Herd," of the podcast show This American Life.
Our dataset comprised 30-second audio segments from a superset of 66 podcast episodes, along with each segment's corresponding reference and model-generated transcripts. Human annotators began by identifying errors in wording, punctuation, or capitalization in the transcripts, and classifying as "major errors" only those errors that:
Changed the meaning of the text
Affected the readability of the text
Misspelled proper nouns
WER and HEWER are both calculated from an alignment of the reference and model-generated text. Figure 3 shows each metric's scoring of the same output. WER counts as errors all words that differ between the reference and model-generated text, but it ignores case and punctuation. HEWER, on the other hand, takes both case and punctuation into account; as a result, the total number of tokens, shown in the denominator, is larger, because each punctuation mark counts as a token.
Unlike WER, HEWER ignores minor errors, such as a filler word like "uh" that is present only in the reference transcript, or alternate renderings such as "'til" versus "until." HEWER also ignores differences in comma placement that do not affect readability or meaning, as well as missing hyphens. The only major errors in the Figure 3 HEWER sample are "quarantine" in place of "quarantining" and "Antibirals" in place of "Antivirals."
In this case, the WER is fairly high, at 9.4%. However, that value gives a misleading impression of the quality of the model-generated transcript, which is actually quite readable. The HEWER value of 2.2% appears to be a better reflection of the human experience of reading the transcript.
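To make the token-level bookkeeping concrete, here is a rough Python sketch of a HEWER-style calculation: words and punctuation marks are both tokens, case is preserved, and a few heuristics stand in for the annotators' major-versus-minor judgment. The tokenizer, the minor-error lists, and the use of difflib for alignment are all our assumptions; the metric described above relies on human annotation, not these heuristics.

```python
import difflib
import re
from typing import List

# Assumed stand-ins for the annotators' judgment; the article treats fillers,
# alternate spellings, and comma/hyphen placement as minor.
FILLER_WORDS = {"um", "uh", "yeah", "like"}
ALTERNATE_SPELLINGS = {("ok", "okay"), ("'til", "until")}
IGNORED_PUNCT = {",", "-"}


def tokenize(text: str) -> List[str]:
    """Split into word and punctuation tokens, preserving case."""
    return re.findall(r"[\w']+|[^\w\s]", text)


def is_minor(ref_tok: str, hyp_tok: str) -> bool:
    """Return True for differences assumed not to hurt readability."""
    if not ref_tok or not hyp_tok:
        # One-sided difference: ignore fillers, stray commas, missing hyphens.
        tok = ref_tok or hyp_tok
        return tok.lower() in FILLER_WORDS or tok in IGNORED_PUNCT
    pair = (ref_tok.lower(), hyp_tok.lower())
    return pair in ALTERNATE_SPELLINGS or pair[::-1] in ALTERNATE_SPELLINGS


def hewer(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    major = 0
    ops = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            continue
        ref_span, hyp_span = ref[i1:i2], hyp[j1:j2]
        # Pair tokens position-by-position within each differing span.
        for k in range(max(len(ref_span), len(hyp_span))):
            r = ref_span[k] if k < len(ref_span) else ""
            h = hyp_span[k] if k < len(hyp_span) else ""
            if not is_minor(r, h):
                major += 1
    # Denominator counts every reference token, punctuation included.
    return major / max(len(ref), 1)
```

A real evaluation would replace is_minor with human judgment; the sketch only shows where case sensitivity, punctuation tokens, and minor-error filtering enter the computation.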
Conclusion
Given the rigidity and limitations of WER, the established industry standard for measuring ASR accuracy, we built on existing research to create HEWER, a more nuanced quantitative assessment of the readability of ASR passages. We applied this new metric to a dataset of sample segments from auto-generated transcripts of podcast episodes to glean insights into transcript readability and to help ensure the highest accessibility and best possible experience for all Apple Podcasts audiences and creators.
Acknowledgments
Many people contributed to this research, including Nilab Hessabi, Sol Kim, Filipe Minho, Issey Masuda Mora, Samir Patel, Alejandro Woodward Riquelme, João Pinto Carrilho Do Rosario, Clara Bonnin Rossello, Tal Singer, Eda Wang, Anne Wootton, Regan Xu, and Phil Zepeda.
Apple Resources
Apple Newsroom. 2024. “Apple Introduces Transcripts for Apple Podcasts.” [link.]
Apple Podcasts. n.d. "Endless Topics. Endlessly Engaging." [link.]
External References
Glass, Ira, host. 2021. “The Herd.” This American Life. Podcast 736, April 23, 58:56. [link.]
Hughes, John. 2022. "The Future of Word Error Rate (WER)." Speechmatics. [link.]
Itoh, Nobuyasu, Gakuto Kurata, Ryuki Tachibana, and Masafumi Nishimura. 2015. "A Metric for Evaluating Speech Recognizer Output Based on Human-Perception Model." In 16th Annual Conference of the International Speech Communication Association (Interspeech 2015): Speech Beyond Speech: Towards a Better Understanding of the Most Important Biosignal, 1285–88. [link.]