From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix | by Netflix Technology Blog | Aug, 2025


By Dao Mi, Pablo Delgado, Ryan Berti, Amanuel Kahsay, Obi-Ike Nwoke, Christopher Thrailkill, and Patricio Garza

At Netflix, data engineering has always been a critical function that enables the business to understand content, power recommendations, and drive business decisions. Historically, the function focused on building robust tables and pipelines to capture facts, derive metrics, and provide well-modeled data products to partners in analytics and data science functions. But as Netflix’s studio and content production scaled, so too have the challenges, and the opportunities, of working with complex media data.

Today, we’re excited to share how our team is formalizing a new specialization of data engineering at Netflix: Media ML Data Engineering. This evolution is embodied in our latest collaboration with our platform teams, the Media Data Lake, which is designed to harness the full potential of media assets (video, audio, subtitles, scripts, and more) and enable the latest advances in machine learning, including the latest transformer model architectures. As part of this initiative, we’re intentionally applying data engineering best practices, ensuring that our approach is both innovative and grounded in proven methodologies.

The Evolution: From Traditional Tables to Media Tables

Traditional data engineering at Netflix focused on building structured tables for metrics, dashboards, and data science models. These tables were primarily structured text or numerical fields, ideal for business intelligence, analytics, and statistical modeling.

However, the nature of media data is fundamentally different:

  • It’s multi-modal (video, audio, text, images).
  • It contains derived fields from media (embeddings, captions, transcriptions, etc.).
  • It’s unstructured and massive in scale when parsed out.
  • It’s deeply intertwined with creative workflows and business asset lineage.

As our studio operations (see below) expanded, we saw the need for a new approach: one that could provide centralized, standardized, and scalable access to all types of media assets and their metadata for both analytical and machine learning workflows.

The Rise of Media ML Data Engineering

Enter Media ML Data Engineering, a new specialization at Netflix that bridges the gap between traditional data engineering and the unique demands of media-centric machine learning. This role sits at the intersection of data engineering, ML infrastructure, and media production. Our mission is to provide seamless access to media assets and derived data (including outputs from machine learning models) for researchers, data scientists, and other downstream data consumers.

Key Responsibilities

  • Centralized Media Data Access: Building, cataloging, and maintaining the data and pipelines that populate the Media Data Lake, a data platform for storing and serving media assets and their metadata.
  • Asset Standardization: Standardizing media assets across modalities (video, images, audio, text) to ensure consistency and quality for ML applications, in partnership with domain engineering teams.
  • Metadata Management: Unifying and enriching asset metadata, making it easier to track asset lineage, quality, and coverage.
  • ML-Ready Data: Exposing large corpora of assets for early-stage algorithm exploration, benchmarking, and productionization.
  • Collaboration: Partnering closely with domain experts, algorithm researchers, upstream content engineering teams, and (machine learning & data) platform colleagues to ensure our data meets real-world needs.

This new role is critical for bridging the gap between creative media workflows and the technical demands of cutting-edge ML.
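
To make the metadata management responsibility concrete, here is a minimal sketch of what tracking lineage and coverage could look like. The asset names, the `parent` field, and the `lineage` and `coverage` helpers are all hypothetical and chosen for illustration; they do not describe Netflix's actual data model.

```python
from typing import Optional

# Hypothetical asset records: each derived asset points at its parent asset.
assets = {
    "master_video": {"parent": None, "modality": "video"},
    "audio_track": {"parent": "master_video", "modality": "audio"},
    "transcript": {"parent": "audio_track", "modality": "text"},
    "trailer_video": {"parent": None, "modality": "video"},
}


def lineage(asset_id: str) -> list[str]:
    """Walk parent links from an asset back to its root source asset."""
    chain: list[str] = []
    current: Optional[str] = asset_id
    while current is not None:
        chain.append(current)
        current = assets[current]["parent"]
    return chain


def coverage(modality: str, derived_modality: str) -> float:
    """Fraction of root assets of `modality` that have at least one
    descendant of `derived_modality`."""
    roots = [
        asset_id
        for asset_id, record in assets.items()
        if record["parent"] is None and record["modality"] == modality
    ]
    if not roots:
        return 0.0
    # The last element of a lineage chain is the root the asset derives from.
    covered_roots = {
        lineage(asset_id)[-1]
        for asset_id, record in assets.items()
        if record["modality"] == derived_modality
    }
    return sum(1 for root in roots if root in covered_roots) / len(roots)


print(lineage("transcript"))      # ['transcript', 'audio_track', 'master_video']
print(coverage("video", "text"))  # 0.5
```

In this toy example only one of the two source videos has a transcript downstream of it, so text coverage for video is 50%. A metric of this kind is one way to express the "coverage" mentioned above.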

Introducing the Media Data Lake

To enable the next generation of media analytics and machine learning, we’re building the Media Data Lake, a data lake designed specifically for media assets at Netflix using cutting-edge vector storage solutions. We have partnered with our data platform team to pilot integrating LanceDB into our Big Data Platform.

Architecture and Key Components

  • Media Table: The core of the Media Data Lake, this structured dataset captures essential metadata and references to all media assets. It’s designed to be extensible, supporting both traditional metadata and outputs from ML models (including transformer-based embeddings, media understanding research, and more).
  • Data Model: We are developing a robust data model to standardize how media assets and their attributes are represented, making it easier to query and join across schemas.
  • Data API: A Pythonic interface that will provide programmatic access to the Media Table, supporting both interactive exploration and automated workflows.
  • UI Components: Off-the-shelf UI components enable teams to visually explore assets in the Media Data Lake, accelerating discovery and iteration for ICs.
  • Online and Offline System Architecture: Real-time access for lightweight queries and exploration of raw media assets; scalable large batch processing for ML training, benchmarking, and research.
  • Compute: A distributed batch inference layer capable of processing on GPUs, plus media data processing at scale on CPUs.
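
As a rough illustration of how a Media Table and a Pythonic Data API might fit together, the sketch below uses an in-memory stand-in. Every name here (`MediaAsset`, `MediaTable`, `attach`, `where`, `batches`, the example URIs) is an assumption made for the example; it is not the actual Netflix API and does not use LanceDB.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator


@dataclass
class MediaAsset:
    """One row of a hypothetical media table: a reference plus metadata."""
    asset_id: str
    modality: str  # e.g. "video", "audio", "subtitle", "script"
    uri: str       # pointer to the media bytes, not the bytes themselves
    metadata: dict[str, Any] = field(default_factory=dict)
    derived: dict[str, Any] = field(default_factory=dict)  # ML outputs


class MediaTable:
    """In-memory stand-in for a media table, for illustration only."""

    def __init__(self) -> None:
        self._rows: dict[str, MediaAsset] = {}

    def add(self, asset: MediaAsset) -> None:
        self._rows[asset.asset_id] = asset

    def attach(self, asset_id: str, name: str, value: Any) -> None:
        """Extend a row with a derived field (embedding, caption, ...)."""
        self._rows[asset_id].derived[name] = value

    def where(self, predicate: Callable[[MediaAsset], bool]) -> list[MediaAsset]:
        """Interactive path: small filtered lookups."""
        return [row for row in self._rows.values() if predicate(row)]

    def batches(self, size: int) -> Iterator[list[MediaAsset]]:
        """Offline path: yield rows in fixed-size batches for batch jobs."""
        rows = list(self._rows.values())
        for start in range(0, len(rows), size):
            yield rows[start:start + size]


table = MediaTable()
table.add(MediaAsset("a1", "video", "s3://example-bucket/a1.mp4", {"title_id": 42}))
table.add(MediaAsset("a2", "audio", "s3://example-bucket/a2.wav", {"title_id": 42}))
table.add(MediaAsset("a3", "video", "s3://example-bucket/a3.mp4", {"title_id": 7}))
table.attach("a1", "clip_embedding", [0.1, 0.3, 0.5])

videos = table.where(lambda asset: asset.modality == "video")
print([asset.asset_id for asset in videos])        # ['a1', 'a3']
print([len(batch) for batch in table.batches(2)])  # [2, 1]
```

The design point the sketch tries to capture is that rows hold references and metadata rather than media bytes, that derived ML outputs can be attached without reshaping the table, and that the same table serves both small interactive lookups and large batch scans.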

Starting Small with New Technology

Our initial focus this past year has been on delivering a “data pond”: a mini-version of the Media Data Lake targeted at video/audio datasets for early-stage model training, evaluation, and research. All data for this phase comes from AMP, our internal asset management system and annotation store, and the scope is intentionally small to ensure that a solid, extensible foundation can be built while introducing a new technology into the company. We are able to explore the raw media assets and build up an intuitive understanding of the media through lightweight queries to AMP.

Media Tables: The New Foundation for ML and Innovation

One of the most exciting developments is the rise of media tables: structured datasets that not only capture traditional metadata, but also include the outputs of advanced ML models.

These media tables power a range of innovative applications, such as:

  • Translation & Audio Quality Measures: Managing audio clips and features via text-to-speech models for engineering localization quality metrics.
  • Media Fidelity Restoration: Research on restoration of videos to HDR for remastering and other image technology use cases.
  • Story Understanding and Content Embedding: Structuring narrative elements extracted from textual evidence and video of a title to increase operational efficiency in title launch preparation and ratings, e.g., detection of smoking, gore, and NSFW scenes in our titles.
  • Media Search: Leveraging multi-modal vector search to find similar keyframes, shots, and dialogue to facilitate research and experimentation.

These tables are designed to scale, support complex queries, and serve both research and other data science and analytical needs.
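
The Media Search use case above rests on vector similarity. As a minimal sketch of the idea, the example below ranks a handful of made-up keyframe embeddings by cosine similarity to a query vector. It is a brute-force, pure-Python illustration with hypothetical shot names and toy three-dimensional vectors; a production vector store would typically use an index rather than scanning every row.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    index: dict[str, list[float]], query: list[float], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    scored = [(key, cosine_similarity(vec, query)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones would come from an ML model and be much larger.
keyframes = {
    "shot_001": [1.0, 0.0, 0.0],
    "shot_002": [0.9, 0.1, 0.0],
    "shot_003": [0.0, 1.0, 0.0],
}

results = search(keyframes, [1.0, 0.0, 0.0], k=2)
print([key for key, _ in results])  # ['shot_001', 'shot_002']
```

Because embeddings from different modalities can live in the same vector space, the same ranking step can in principle compare a text query against keyframes or dialogue, which is what makes the search multi-modal.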

The Human Side: New Roles and Collaboration

Media ML Data Engineering is a team sport. Our data engineers partner with domain experts, data scientists, ML researchers, upstream business ops, and content engineering teams to ensure our data solutions are fit for purpose. We also work closely with our friendly platform teams so that technological breakthroughs that are useful beyond our small corner of the universe can become horizontal abstractions that benefit the rest of Netflix. This collaborative model enables rapid iteration, high data quality, innovative use cases, and technology reuse.

Looking Ahead

The evolution from traditional data engineering to Media ML Data Engineering, anchored by our Media Data Lake, is unlocking new frontiers for Netflix:

  • Richer, more accurate ML models trained on high-quality, standardized media data.
  • Supercharged ML model evaluations through rapid iteration cycles on the data.
  • Faster experimentation and productization of new AI-powered features.
  • Deeper insights into our content and creative workflows through metrics built from features inferred by Media ML algorithms.

As we continue to grow the Media Data Lake, be on the lookout for subsequent blog posts sharing our learnings and tools with the broader media ML and data engineering community.

This article was updated on August 25, 2025.
