Avneesh Saluja, Andy Yao, Hossein Taghavi
When watching a movie or an episode of a TV show, we experience a cohesive narrative that unfolds before us, often without giving much thought to the underlying structure that makes it all possible. However, movies and episodes are not atomic units, but rather composed of smaller elements such as frames, shots, scenes, sequences, and acts. Understanding these elements and how they relate to each other is crucial for tasks such as video summarization and highlights detection, content-based video retrieval, dubbing quality assessment, and video editing. At Netflix, such workflows are performed hundreds of times a day by many teams around the world, so investing in algorithmically-assisted tooling around content understanding can reap outsized rewards.
While segmentation of more granular units like frames and shot boundaries is either trivial or can rely primarily on pixel-based information, higher-order segmentation¹ requires a more nuanced understanding of the content, such as the narrative or emotional arcs. Furthermore, some cues can be better inferred from modalities other than the video, e.g. the screenplay or the audio and dialogue track. Scene boundary detection, in particular, is the task of identifying the transitions between scenes, where a scene is defined as a continuous sequence of shots that take place in the same time and location (often with a relatively static set of characters) and share a common action or theme.
In this blog post, we present two complementary approaches to scene boundary detection in audiovisual content. The first method, which can be seen as a form of weak supervision, leverages auxiliary data in the form of a screenplay by aligning screenplay text with timed text (closed captions, audio descriptions) and assigning timestamps to the screenplay’s scene headers (a.k.a. sluglines). In the second approach, we show that a relatively simple, supervised sequential model (bidirectional LSTM or GRU) that uses rich, pretrained shot-level embeddings can outperform the current state-of-the-art baselines on our internal benchmarks.
Screenplays are the blueprints of a movie or show. They are formatted in a specific way, with each scene beginning with a scene header indicating attributes such as the location and time of day. This consistent formatting makes it possible to parse screenplays into a structured format. At the same time, a) changes made on the fly (directorial or actor discretion) or b) changes made in post-production and editing are rarely reflected in the screenplay, i.e. it is not rewritten to reflect them.
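To make the parsing step concrete, here is a minimal sketch of slugline-based screenplay parsing. The regex and the `Scene` container are illustrative assumptions based on standard screenplay formatting ("INT./EXT. LOCATION - TIME"), not our production parser.

```python
import re
from dataclasses import dataclass, field

# Standard sluglines look like "INT. KITCHEN - NIGHT" or "EXT. PARKING LOT - DAY".
SLUGLINE_RE = re.compile(r"^\s*(INT\.|EXT\.|INT\./EXT\.)\s+(.+?)(?:\s+-\s+(.+))?\s*$")

@dataclass
class Scene:
    header: str          # the raw slugline
    location: str        # e.g. "KITCHEN"
    time_of_day: str     # e.g. "NIGHT"; may be empty
    lines: list = field(default_factory=list)  # dialogue/action lines in the scene

def parse_screenplay(text: str) -> list[Scene]:
    """Split a raw screenplay into scenes keyed on their sluglines (illustrative)."""
    scenes = []
    for line in text.splitlines():
        match = SLUGLINE_RE.match(line)
        if match:
            scenes.append(Scene(header=line.strip(),
                                location=match.group(2).strip(),
                                time_of_day=(match.group(3) or "").strip()))
        elif scenes and line.strip():
            scenes[-1].lines.append(line.strip())
    return scenes
```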
In order to leverage this noisily aligned data source, we need to align time-stamped text (e.g. closed captions and audio descriptions) with screenplay text (dialogue and action² lines), keeping in mind a) the on-the-fly changes that can result in semantically similar but not identical line pairs and b) the possible post-shoot changes that are more significant (reordering, removing, or inserting entire scenes). To address the first challenge, we use pretrained sentence-level embeddings, e.g. from an embedding model optimized for paraphrase identification, to represent text in both sources. For the second challenge, we use dynamic time warping (DTW), a method for measuring the similarity between two sequences that may vary in time or speed. While DTW assumes a monotonicity condition on the alignments³ that is frequently violated in practice, it is robust enough to recover from local misalignments, and the vast majority of salient events (like scene boundaries) are well-aligned.
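The sketch below illustrates this idea under stated assumptions: it uses a publicly available paraphrase model from the `sentence-transformers` library (the specific model name is illustrative; we only assume an embedding model optimized for paraphrase identification) and a textbook O(nm) DTW recursion over cosine distances.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dtw_align(screenplay_lines, timed_lines, model_name="paraphrase-MiniLM-L6-v2"):
    """Align screenplay lines to timed-text lines with DTW over cosine distances.

    Returns a list of (screenplay_index, timed_text_index) pairs on the optimal
    monotone warping path. Model choice and distance are illustrative.
    """
    model = SentenceTransformer(model_name)
    a = model.encode(screenplay_lines, normalize_embeddings=True)
    b = model.encode(timed_lines, normalize_embeddings=True)
    cost = 1.0 - a @ b.T                      # cosine distance matrix, shape (n, m)

    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)     # accumulated cost with inf boundary
    acc[0, 0] = 0.0
    for i in range(1, n + 1):                 # classic O(n*m) DTW recursion
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],      # skip timed line
                                                 acc[i, j - 1],      # skip screenplay line
                                                 acc[i - 1, j - 1])  # match both

    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```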
As a result of DTW, the scene headers acquire timestamps that can indicate possible scene boundaries in the video. The alignments can also be used to, e.g., augment audiovisual ML models with screenplay information like scene-level embeddings, or transfer labels assigned to audiovisual content to train screenplay prediction models.
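One simple (assumed, illustrative) way to propagate timestamps from the warping path back to sluglines is to take the start time of the first timed-text line aligned to any line of a scene as that scene's estimated boundary:

```python
def slugline_timestamps(path, line_to_scene, timed_start_secs):
    """Propagate timestamps from aligned timed text back to scene headers.

    path:             (screenplay_idx, timed_idx) pairs from dtw_align above
    line_to_scene:    screenplay line index -> scene index (from the parser)
    timed_start_secs: start time in seconds of each timed-text line
    Returns {scene_index: estimated_scene_start_seconds}.
    """
    scene_start = {}
    for sp_idx, tt_idx in path:               # path is monotone, so first hit wins
        scene = line_to_scene[sp_idx]
        if scene not in scene_start:
            scene_start[scene] = timed_start_secs[tt_idx]
    return scene_start
```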
The alignment method above is a great way to get up and running with the scene change task, since it combines easy-to-use pretrained embeddings with a well-known dynamic programming technique. However, it presupposes the availability of high-quality screenplays. A complementary approach (which, in fact, can use the above alignments as a feature) that we present next is to train a sequence model on annotated scene change data. Certain workflows at Netflix capture this information, and that is our primary data source; publicly-released datasets are also available.
From an architectural perspective, the model is relatively simple: a bidirectional GRU (biGRU) that ingests shot representations at each step and predicts whether a shot is at the end of a scene.⁴ The richness of the model comes from these pretrained, multimodal shot embeddings, a preferable design choice in our setting given the difficulty of obtaining labeled scene change data and the relatively larger scale at which we can pretrain various embedding models for shots.
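A minimal PyTorch sketch of this per-shot classifier is shown below; the embedding and hidden dimensions are assumptions, and the actual model may differ in details such as layer count, dropout, or loss weighting.

```python
import torch
import torch.nn as nn

class SceneBoundaryBiGRU(nn.Module):
    """Bidirectional GRU over a sequence of shot embeddings; one scene-end logit per shot."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)    # 2x for the two directions

    def forward(self, shot_embeddings: torch.Tensor) -> torch.Tensor:
        # shot_embeddings: (batch, num_shots, embed_dim)
        states, _ = self.encoder(shot_embeddings)
        return self.head(states).squeeze(-1)        # (batch, num_shots) logits

# Training would use a per-shot binary cross-entropy loss, e.g.:
# loss = nn.BCEWithLogitsLoss()(model(shots), scene_end_labels.float())
```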
For video embeddings, we leverage an in-house model pretrained on video clips paired with aligned text (the aforementioned “timestamped text”). For audio embeddings, we first perform source separation to try to separate foreground (speech) from background (music, sound effects, noise), embed each separated waveform separately using wav2vec2, and then concatenate the results. Both early and late-stage fusion approaches are explored; in the former (Figure 4a), the audio and video embeddings are concatenated and fed into a single biGRU, and in the latter (Figure 4b) each input modality is encoded with its own biGRU, after which the hidden states are concatenated prior to the output layer.
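Early fusion amounts to concatenating the two embeddings before feeding them to the single biGRU sketched above. The late-fusion variant is sketched below under the same assumptions about dimensions: one encoder per modality, with the hidden states fused just before the output layer.

```python
import torch
import torch.nn as nn

class LateFusionBiGRU(nn.Module):
    """Late fusion (Figure 4b style): one biGRU per modality, hidden states concatenated
    before the shared output layer. Dimensions are illustrative assumptions."""

    def __init__(self, video_dim: int = 768, audio_dim: int = 1536, hidden_dim: int = 256):
        super().__init__()
        self.video_enc = nn.GRU(video_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.audio_enc = nn.GRU(audio_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(4 * hidden_dim, 1)    # two bidirectional encoders

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (batch, num_shots, video_dim); audio: (batch, num_shots, audio_dim)
        v_states, _ = self.video_enc(video)
        a_states, _ = self.audio_enc(audio)
        fused = torch.cat([v_states, a_states], dim=-1)
        return self.head(fused).squeeze(-1)         # (batch, num_shots) logits
```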
We find:
- Our results match and sometimes even outperform the state-of-the-art (benchmarked using the video modality only and on our evaluation data). We evaluate the outputs using the F-1 score for the positive label, and also relax this evaluation to an “off-by-n” F-1, i.e., the model gets credit if it predicts a scene change within n shots of the ground truth (see the sketch after this list). This is a more realistic measure for our use cases, given the human-in-the-loop setting these models are deployed in.
- As with prior work, adding audio features improves results by 10–15%. A major driver of variation in performance is late vs. early fusion.
- Late fusion is consistently 3–7% better than early fusion. Intuitively, this result makes sense: the temporal dependencies between shots are likely modality-specific and should be encoded separately.
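To make the relaxed metric concrete, here is an illustrative off-by-n F-1 computation; the exact matching rule is not specified above, so the greedy nearest-match below is an assumption.

```python
import numpy as np

def off_by_n_f1(pred_ends: np.ndarray, true_ends: np.ndarray, n: int = 0) -> float:
    """F-1 for the positive (scene-end) label, crediting predictions within n shots.

    pred_ends, true_ends: binary arrays over shots, 1 = predicted/actual scene end.
    n = 0 recovers the strict F-1.
    """
    pred_idx = np.flatnonzero(pred_ends)
    true_idx = np.flatnonzero(true_ends)
    matched, tp = set(), 0
    for p in pred_idx:
        # Greedily match each prediction to the nearest unmatched ground-truth boundary.
        candidates = [t for t in true_idx if abs(int(t) - int(p)) <= n and t not in matched]
        if candidates:
            matched.add(min(candidates, key=lambda t: abs(int(t) - int(p))))
            tp += 1
    precision = tp / max(len(pred_idx), 1)
    recall = tp / max(len(true_idx), 1)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
```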
We have presented two complementary approaches to scene boundary detection that leverage a variety of available modalities: screenplay, audio, and video. Logically, the next steps are to a) combine these approaches and use screenplay features in a unified model and b) generalize the outputs across multiple shot-level inference tasks, e.g. shot type classification and memorable moments identification, as we hypothesize that this direction will be useful for training general-purpose video understanding models for longer-form content. Longer-form content also contains more complex narrative structure, and we envision this work as the first in a series of projects that aim to better integrate narrative understanding into our multimodal machine learning models.
Special thanks to Amir Ziai, Anna Pulido, and Angie Pollema.