ZOO Digital provides end-to-end localization and media services to adapt original TV and movie content to different languages, regions, and cultures. It makes globalization easier for the world's best content creators. Trusted by the biggest names in entertainment, ZOO Digital delivers high-quality localization and media services at scale, including dubbing, subtitling, scripting, and compliance.
Typical localization workflows require manual speaker diarization, wherein an audio stream is segmented based on the identity of the speaker. This time-consuming process must be completed before content can be dubbed into another language. With manual methods, a 30-minute episode can take between 1–3 hours to localize. Through automation, ZOO Digital aims to achieve localization in under 30 minutes.
In this post, we discuss deploying scalable machine learning (ML) models for diarizing media content using Amazon SageMaker, with a focus on the WhisperX model.
Background
ZOO Digital's vision is to provide a faster turnaround of localized content. This goal is bottlenecked by the manually intensive nature of the exercise, compounded by the small workforce of skilled people who can localize content manually. ZOO Digital works with over 11,000 freelancers and localized over 600 million words in 2022 alone. However, the supply of skilled people is being outstripped by the increasing demand for content, requiring automation to assist with localization workflows.
With an aim to accelerate the localization of content workflows through machine learning, ZOO Digital engaged AWS Prototyping, an investment program by AWS to co-build workloads with customers. The engagement focused on delivering a functional solution for the localization process, while providing hands-on training to ZOO Digital developers on SageMaker, Amazon Transcribe, and Amazon Translate.
Customer challenge
After a title (a movie or an episode of a TV series) has been transcribed, speakers must be assigned to each segment of speech so that they can be correctly assigned to the voice artists who are cast to play the characters. This process is called speaker diarization. ZOO Digital faces the challenge of diarizing content at scale while remaining economically viable.
Solution overview
In this prototype, we stored the original media files in a specified Amazon Simple Storage Service (Amazon S3) bucket. This S3 bucket was configured to emit an event when new files are detected within it, triggering an AWS Lambda function. For instructions on configuring this trigger, refer to the tutorial Using an Amazon S3 trigger to invoke a Lambda function. Subsequently, the Lambda function invoked the SageMaker endpoint for inference using the Boto3 SageMaker Runtime client.
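As a sketch of this wiring, assuming the asynchronous invocation API described later in this post (the endpoint name and handler structure are illustrative, not the prototype's exact code), such a Lambda function could look like the following:

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # Extract the S3 location of the newly uploaded media file from the S3 event notification
    record = event["Records"][0]["s3"]
    input_location = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Invoke the asynchronous endpoint; the payload stays in S3 instead of being sent inline
    response = sagemaker_runtime.invoke_endpoint_async(
        EndpointName="whisperx-async-endpoint",  # placeholder endpoint name
        InputLocation=input_location,
        ContentType="application/octet-stream",
    )
    # OutputLocation is where SageMaker will write the diarized transcript
    return {"outputLocation": response["OutputLocation"]}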
The WhisperX model, based on OpenAI's Whisper, performs transcription and diarization for media assets. It's built upon the Faster Whisper reimplementation, offering up to four times faster transcription with improved word-level timestamp alignment compared to Whisper. Additionally, it introduces speaker diarization, not present in the original Whisper model. WhisperX uses the Whisper model for transcription, the Wav2Vec2 model to enhance timestamp alignment (ensuring synchronization of transcribed text with audio timestamps), and the pyannote model for diarization. FFmpeg is used for loading audio from source media, supporting various media formats. The clean and modular model architecture allows flexibility, because each component of the model can be swapped out as needed in the future. However, it's important to note that WhisperX lacks full management features and isn't an enterprise-level product. Without maintenance and support, it may not be suitable for production deployment.
In this collaboration, we deployed and evaluated WhisperX on SageMaker, using an asynchronous inference endpoint to host the model. SageMaker asynchronous endpoints support upload sizes up to 1 GB and incorporate auto scaling features that efficiently mitigate traffic spikes and save costs during off-peak times. Asynchronous endpoints are particularly well-suited for processing large files, such as movies and TV series in our use case.
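To illustrate those auto scaling features, the following sketch registers a target-tracking policy on the endpoint's request backlog with Application Auto Scaling; the endpoint name, policy name, capacity bounds, and target value are assumptions, not the values used in the prototype:

import boto3

autoscaling = boto3.client("application-autoscaling")
# Placeholder endpoint and variant names
resource_id = "endpoint/whisperx-async-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,  # asynchronous endpoints can scale down to zero when idle
    MaxCapacity=2,
)

autoscaling.put_scaling_policy(
    PolicyName="whisperx-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Aim for roughly five queued requests per instance
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": "whisperx-async-endpoint"}],
            "Statistic": "Average",
        },
    },
)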
The following diagram illustrates the core components of the experiments we conducted in this collaboration.
In the following sections, we delve into the details of deploying the WhisperX model on SageMaker, and evaluate the diarization performance.
Download the model and its components
WhisperX is a system that includes multiple models for transcription, forced alignment, and diarization. For smooth SageMaker operation without the need to fetch model artifacts during inference, it's essential to pre-download all model artifacts. These artifacts are then loaded into the SageMaker serving container during initiation. Because these models aren't directly accessible, we offer descriptions and sample code from the WhisperX source, providing instructions on downloading the model and its components.
WhisperX uses six models:
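As reflected in the download code that follows, these are:
- the Faster Whisper large-v2 transcription model
- WhisperX's voice activity detection (VAD) segmentation model
- the Wav2Vec2 forced alignment model
- the pyannote speaker diarization pipeline
- the segmentation model referenced in the diarization pipeline's config
- the speaker embedding model referenced in the diarization pipeline's config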
Most of these models can be obtained from Hugging Face using the huggingface_hub library. We use the following download_hf_model() function to retrieve these model artifacts. An access token from Hugging Face, generated after accepting the user agreements for the pyannote speaker diarization and segmentation models, is required:
import huggingface_hub
import yaml
import torchaudio
import urllib.request
import os

CONTAINER_MODEL_DIR = "/opt/ml/model"
WHISPERX_MODEL = "guillaumekln/faster-whisper-large-v2"
VAD_MODEL_URL = "https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin"
WAV2VEC2_MODEL = "WAV2VEC2_ASR_BASE_960H"
DIARIZATION_MODEL = "pyannote/speaker-diarization"


def download_hf_model(model_name: str, hf_token: str, local_model_dir: str) -> str:
    """
    Fetches the provided model from HuggingFace and returns the subdirectory it is downloaded to
    :param model_name: HuggingFace model name (and an optional version, appended with @[version])
    :param hf_token: HuggingFace access token authorized to access the requested model
    :param local_model_dir: The local directory to download the model to
    :return: The subdirectory within local_model_dir that the model is downloaded to
    """
    model_subdir = model_name.split('@')[0]
    huggingface_hub.snapshot_download(model_subdir, token=hf_token, local_dir=f"{local_model_dir}/{model_subdir}", local_dir_use_symlinks=False)
    return model_subdir
The VAD model is fetched from Amazon S3, and the Wav2Vec2 model is retrieved from the torchaudio.pipelines module. Based on the following code, we can retrieve all the models' artifacts, including those from Hugging Face, and save them to the specified local model directory:
def fetch_models(hf_token: str, local_model_dir="./models"):
    """
    Fetches all required models to run WhisperX locally without downloading models every time
    :param hf_token: A huggingface access token to download the models
    :param local_model_dir: The directory to download the models to
    """
    # Fetch Faster Whisper's Large V2 model from HuggingFace
    download_hf_model(model_name=WHISPERX_MODEL, hf_token=hf_token, local_model_dir=local_model_dir)

    # Fetch WhisperX's VAD Segmentation model from S3
    vad_model_dir = "whisperx/vad"
    if not os.path.exists(f"{local_model_dir}/{vad_model_dir}"):
        os.makedirs(f"{local_model_dir}/{vad_model_dir}")
    urllib.request.urlretrieve(VAD_MODEL_URL, f"{local_model_dir}/{vad_model_dir}/pytorch_model.bin")

    # Fetch the Wav2Vec2 alignment model
    torchaudio.pipelines.__dict__[WAV2VEC2_MODEL].get_model(dl_kwargs={"model_dir": f"{local_model_dir}/wav2vec2/"})

    # Fetch pyannote's Speaker Diarization model from HuggingFace
    download_hf_model(model_name=DIARIZATION_MODEL,
                      hf_token=hf_token,
                      local_model_dir=local_model_dir)

    # Read in the Speaker Diarization model config to fetch models and update with their local paths
    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'r') as file:
        diarization_config = yaml.safe_load(file)

    embedding_model = diarization_config['pipeline']['params']['embedding']
    embedding_model_dir = download_hf_model(model_name=embedding_model,
                                            hf_token=hf_token,
                                            local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['embedding'] = f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}"

    segmentation_model = diarization_config['pipeline']['params']['segmentation']
    segmentation_model_dir = download_hf_model(model_name=segmentation_model,
                                               hf_token=hf_token,
                                               local_model_dir=local_model_dir)
    diarization_config['pipeline']['params']['segmentation'] = f"{CONTAINER_MODEL_DIR}/{segmentation_model_dir}/pytorch_model.bin"

    with open(f"{local_model_dir}/{DIARIZATION_MODEL}/config.yaml", 'w') as file:
        yaml.safe_dump(diarization_config, file)

    # Read in the Speaker Embedding model config to update it with its local path
    speechbrain_hyperparams_path = f"{local_model_dir}/{embedding_model_dir}/hyperparams.yaml"
    with open(speechbrain_hyperparams_path, 'r') as file:
        speechbrain_hyperparams = file.read()
    speechbrain_hyperparams = speechbrain_hyperparams.replace(embedding_model_dir, f"{CONTAINER_MODEL_DIR}/{embedding_model_dir}")
    with open(speechbrain_hyperparams_path, 'w') as file:
        file.write(speechbrain_hyperparams)
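For example, a one-time download before packaging (the token value is a placeholder):

fetch_models(hf_token="hf_xxxxxxxxxxxxxxxxxxxx", local_model_dir="./models")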
Choose the appropriate AWS Deep Learning Container for serving the model
After the model artifacts are saved using the preceding sample code, you can choose pre-built AWS Deep Learning Containers (DLCs) from the following GitHub repo. When selecting the Docker image, consider the following settings: framework (Hugging Face), task (inference), Python version, and hardware (for example, GPU). We recommend using the following image: 763104351884.dkr.ecr.[REGION].amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04. This image has all the necessary system packages pre-installed, such as ffmpeg. Remember to replace [REGION] with the AWS Region you are using.
For other required Python packages, create a requirements.txt file with a list of packages and their versions. These packages will be installed when the AWS DLC is built. The following are the additional packages needed to host the WhisperX model on SageMaker.
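A plausible version of this file (the package choices and version pins are assumptions, not the exact list used in the prototype):

# requirements.txt -- versions are illustrative assumptions
git+https://github.com/m-bain/whisperx.git
faster-whisper==0.7.1
pyannote.audio==2.1.1
ffmpeg-python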
Create an inference script to load the models and run inference
Next, we create a custom inference.py script to outline how the WhisperX model and its components are loaded into the container and how the inference process should be run. The script contains two functions: model_fn and transform_fn. The model_fn function is invoked to load the models from their respective locations. Subsequently, these models are passed to the transform_fn function during inference, where transcription, alignment, and diarization processes are performed. The following is a code sample for inference.py:
import io
import json
import logging
import tempfile
import time
import torch
import whisperx

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'


def model_fn(model_dir: str) -> dict:
    """
    Deserialize and return the models
    """
    logging.info("Loading WhisperX model")
    model = whisperx.load_model(whisper_arch=f"{model_dir}/guillaumekln/faster-whisper-large-v2",
                                device=DEVICE,
                                language="en",
                                compute_type="float16",
                                vad_options={'model_fp': f"{model_dir}/whisperx/vad/pytorch_model.bin"})

    logging.info("Loading alignment model")
    align_model, metadata = whisperx.load_align_model(language_code="en",
                                                      device=DEVICE,
                                                      model_name="WAV2VEC2_ASR_BASE_960H",
                                                      model_dir=f"{model_dir}/wav2vec2")

    logging.info("Loading diarization model")
    diarization_model = whisperx.DiarizationPipeline(model_name=f"{model_dir}/pyannote/speaker-diarization/config.yaml",
                                                     device=DEVICE)

    return {
        'model': model,
        'align_model': align_model,
        'metadata': metadata,
        'diarization_model': diarization_model
    }


def transform_fn(model: dict, request_body: bytes, request_content_type: str, response_content_type="application/json") -> (str, str):
    """
    Load in audio from the request, transcribe and diarize, and return JSON output
    """
    # Start a timer so that we can log how long inference takes
    start_time = time.time()

    # Unpack the models
    whisperx_model = model['model']
    align_model = model['align_model']
    metadata = model['metadata']
    diarization_model = model['diarization_model']

    # Load the media file (the request_body as bytes) into a temporary file, then use WhisperX to load the audio from it
    logging.info("Loading audio")
    with io.BytesIO(request_body) as file:
        tfile = tempfile.NamedTemporaryFile(delete=False)
        tfile.write(file.read())
        audio = whisperx.load_audio(tfile.name)

    # Run transcription
    logging.info("Transcribing audio")
    result = whisperx_model.transcribe(audio, batch_size=16)

    # Align the outputs for better timings
    logging.info("Aligning outputs")
    result = whisperx.align(result["segments"], align_model, metadata, audio, DEVICE, return_char_alignments=False)

    # Run diarization
    logging.info("Running diarization")
    diarize_segments = diarization_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Calculate the time it took to perform the transcription and diarization
    end_time = time.time()
    elapsed_time = end_time - start_time
    logging.info(f"Transcription and Diarization took {int(elapsed_time)} seconds")

    # Return the results to be stored in S3
    return json.dumps(result), response_content_type
Within the model's directory, alongside the requirements.txt file, ensure the presence of inference.py in a code subdirectory. The models directory should resemble the following:
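Reconstructed from the download code above, the layout would look roughly like this (the speechbrain subdirectory name depends on the embedding model referenced in the pyannote config and is an assumption):

models
├── code
│   ├── inference.py
│   └── requirements.txt
├── guillaumekln
│   └── faster-whisper-large-v2
├── pyannote
│   └── speaker-diarization
├── speechbrain
│   └── spkrec-ecapa-voxceleb
├── wav2vec2
└── whisperx
    └── vad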
Create a tarball of the models
After you create the models and code directories, you can use the following command lines to compress the model into a tarball (.tar.gz file) and upload it to Amazon S3. At the time of writing, using the faster-whisper Large V2 model, the resulting tarball representing the SageMaker model is 3 GB in size. For more information, refer to Model hosting patterns in Amazon SageMaker, Part 2: Getting started with deploying real time models on SageMaker.
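A typical sequence (the bucket name and key are placeholders):

# Compress the contents of the models directory, including the code subdirectory, into a tarball
tar -czvf model.tar.gz -C models .

# Upload the tarball to S3
aws s3 cp model.tar.gz s3://my-bucket/whisperx/model.tar.gz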
Create a SageMaker model and deploy an endpoint with an asynchronous predictor
Now you can create the SageMaker model, endpoint config, and asynchronous endpoint with AsyncPredictor using the model tarball created in the previous step. For instructions, refer to Create an Asynchronous Inference Endpoint.
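A minimal sketch with the SageMaker Python SDK, where the S3 paths and the instance type are placeholder assumptions and the framework versions match the DLC recommended earlier in this post:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.async_inference import AsyncInferenceConfig

role = sagemaker.get_execution_role()

model = HuggingFaceModel(
    model_data="s3://my-bucket/whisperx/model.tar.gz",  # tarball from the previous step
    role=role,
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
)

# Deploying with an async inference config returns an AsyncPredictor
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # assumed GPU instance type
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/whisperx/output/"
    ),
)

# Submit a media file already in S3; the response object tracks the eventual output
response = predictor.predict_async(input_path="s3://my-bucket/media/episode1.mp4")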
Evaluate diarization performance
To assess the diarization performance of the WhisperX model in different scenarios, we selected three episodes each from two English titles: one drama title consisting of 30-minute episodes, and one documentary title consisting of 45-minute episodes. We utilized pyannote's metrics toolkit, pyannote.metrics, to calculate the diarization error rate (DER). In the evaluation, manually transcribed and diarized transcripts provided by ZOO served as the ground truth.
We defined the DER as follows:
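In terms of the quantities defined next:

DER = (FA + Miss + Error) / Total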
Total is the length of the ground truth video. FA (False Alarm) is the length of segments that are considered speech in the prediction, but not in the ground truth. Miss is the length of segments that are considered speech in the ground truth, but not in the prediction. Error, also called Confusion, is the length of segments that are assigned to different speakers in the prediction and the ground truth. All units are measured in seconds. Typical values for DER can vary depending on the specific application, dataset, and quality of the diarization system. Note that DER can be larger than 1.0. A lower DER is better.
To calculate the DER for a piece of media, a ground truth diarization is required as well as the WhisperX transcribed and diarized outputs. These must be parsed into lists of tuples containing a speaker label, speech segment start time, and speech segment end time for each segment of speech in the media. The speaker labels don't need to match between the WhisperX and ground truth diarizations; the results are based primarily on the timing of the segments. pyannote.metrics takes these tuples of ground truth diarizations and output diarizations (referred to in the pyannote.metrics documentation as reference and hypothesis) to calculate the DER, as in the sketch below.
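A minimal sketch of this computation, with hypothetical segment tuples (the helper function and the example values are illustrative):

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def to_annotation(segments):
    """Convert (speaker_label, start_seconds, end_seconds) tuples into a pyannote Annotation."""
    annotation = Annotation()
    for speaker, start, end in segments:
        annotation[Segment(start, end)] = speaker
    return annotation

# Hypothetical parsed segments; labels don't need to match between the two diarizations
reference = to_annotation([("ALICE", 0.0, 10.0), ("BOB", 12.0, 20.0)])
hypothesis = to_annotation([("SPEAKER_00", 0.0, 9.5), ("SPEAKER_01", 12.5, 20.0)])

metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.3f}")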
The following table summarizes our results.

Video Type   | DER   | Correct | Miss   | Error  | False Alarm
Drama        | 0.738 | 44.80%  | 21.80% | 33.30% | 18.70%
Documentary  | 1.29  | 94.50%  | 5.30%  | 0.20%  | 123.40%
Average      | 0.901 | 71.40%  | 13.50% | 15.10% | 61.50%
These results reveal a significant performance difference between the drama and documentary titles, with the model achieving notably better results (using DER as an aggregate metric) for the drama episodes compared to the documentary title. A closer analysis of the titles provides insights into potential factors contributing to this performance gap. One key factor could be the frequent presence of background music overlapping with speech in the documentary title. Although preprocessing media to enhance diarization accuracy, such as removing background noise to isolate speech, was beyond the scope of this prototype, it opens avenues for future work that could potentially improve the performance of WhisperX.
Conclusion
In this post, we explored the collaborative partnership between AWS and ZOO Digital, employing machine learning techniques with SageMaker and the WhisperX model to enhance the diarization workflow. The AWS team played a pivotal role in assisting ZOO in prototyping, evaluating, and understanding the effective deployment of custom ML models specifically designed for diarization. This included incorporating auto scaling for scalability using SageMaker.
Harnessing AI for diarization will lead to substantial savings in both cost and time when producing localized content for ZOO. By aiding transcribers in swiftly and precisely creating and identifying speakers, this technology addresses the traditionally time-consuming and error-prone nature of the task. The conventional process often involves multiple passes through the video and additional quality control steps to minimize errors. The adoption of AI for diarization enables a more targeted and efficient approach, thereby increasing productivity within a shorter timeframe.
We've outlined key steps to deploy the WhisperX model on a SageMaker asynchronous endpoint, and encourage you to try it yourself using the provided code. For further insights into ZOO Digital's services and technology, visit ZOO Digital's official website. For details on deploying the OpenAI Whisper model on SageMaker and various inference options, refer to Host the Whisper Model on Amazon SageMaker: exploring inference options. Feel free to share your thoughts in the comments.
About the Authors
Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her primary areas of interest include deep learning, with a focus on GenAI, computer vision, NLP, and time series data prediction. In her spare time, she relishes spending quality moments with her family, immersing herself in novels, and hiking in the national parks of the UK.
Ethan Cumberland is an AI Research Engineer at ZOO Digital, where he works on using AI and machine learning as assistive technologies to improve workflows in speech, language, and localisation. He has a background in software engineering and research in the security and policing domain, focusing on extracting structured information from the web and leveraging open-source ML models for analysing and enriching collected data.
Gaurav Kaila leads the AWS Prototyping team for UK & Ireland. His team works with customers across diverse industries to ideate and co-develop business-critical workloads with a mandate to accelerate adoption of AWS services.