Introducing Configurable Metaflow | by Netflix Technology Blog | Dec, 2024

David J. Berg*, David Casler^, Romain Cledat*, Qian Huang*, Rui Lin*, Nissan Pow*, Nurcan Sonmez*, Shashank Srikanth*, Chaoying Wang*, Regina Wang*, Darin Yu*
*: Model Development Team, Machine Learning Platform
^: Content Demand Modeling Team

A month ago at QConSF, we showcased how Netflix uses Metaflow to power a diverse set of ML and AI use cases, managing thousands of unique Metaflow flows. This followed a previous blog on the same topic. Many of these projects are under constant development by dedicated teams with their own business goals and development best practices, such as the system that supports our content decision makers, or the system that ranks which language subtitles are most valuable for a specific piece of content.

As a central ML and AI platform team, our role is to empower our partner teams with tools that maximize their productivity and effectiveness, while adapting to their specific needs (not the other way around). This has been a guiding design principle for Metaflow since its inception.

Metaflow infrastructure stack

Standing on the shoulders of our extensive cloud infrastructure, Metaflow facilitates easy access to data, compute, and production-grade workflow orchestration, as well as built-in best practices for common concerns such as collaboration, versioning, dependency management, and observability, which teams use to set up ML/AI experiments and systems that work for them. As a consequence, Metaflow users at Netflix have been able to run millions of experiments over the past few years without wasting time on low-level concerns.

While Metaflow aims to be un-opinionated about some of the upper levels of the stack, some teams within Netflix have developed their own opinionated tooling. As part of Metaflow's adaptation to their specific needs, we constantly try to understand what has been developed and, more importantly, what gaps these solutions are filling.

In some cases, we determine that the gap being addressed is very team specific, or too opinionated at too high a level in the stack, and we therefore decide not to develop it within Metaflow. In other cases, however, we realize that we can develop an underlying construct that aids in filling that gap. Note that even then, we don't always aim to completely fill the gap, and instead focus on extracting a more generic, lower-level concept that can be leveraged by that particular user but also by others. One such recurring pattern we noticed at Netflix is the need to deploy sets of closely related flows, often as part of a larger pipeline involving table creations, ETLs, and deployment jobs. Frequently, practitioners want to experiment with variants of these flows, testing new data, new parameterizations, or new algorithms, while keeping the overall structure of the flow or flows intact.

A natural solution is to make flows configurable using configuration files, so variants can be defined without changing the code. Thus far, there hasn't been a built-in solution for configuring flows, so teams have built bespoke solutions leveraging Metaflow's JSON-typed Parameters, IncludeFile, and deploy-time Parameters, or rolled their own home-grown solution (often with great pain). However, none of these solutions make it easy to configure all aspects of a flow's behavior, decorators in particular.

Requests for a feature like Metaflow Config

Outside Netflix, we have seen similar frequently asked questions on the Metaflow community Slack, as shown in the user quotes above:

Today, to answer the FAQ, we introduce a new, small but mighty feature in Metaflow: a Config object. Configs complement the existing Metaflow constructs of artifacts and Parameters by allowing you to configure all aspects of the flow, decorators in particular, prior to any run starting. At the end of the day, artifacts, Parameters, and Configs are all stored as artifacts by Metaflow, but they differ in when they are persisted, as shown in the diagram below:

Different data artifacts in Metaflow

Said another way:

  • An artifact is resolved and persisted to the datastore at the end of each task.
  • A parameter is resolved and persisted at the start of a run; it may therefore be modified up to that point. One common use case is to use triggers to pass values to a run right before executing. Parameters can only be used within your step code.
  • A config is resolved and persisted when the flow is deployed. When using a scheduler such as Argo Workflows, deployment happens when create'ing the flow. In the case of a local run, "deployment" happens just prior to the execution of the run; think of "deployment" as gathering all that is needed to run the flow. Unlike parameters, configs can be used more widely in your flow code; notably, they can be used in step- or flow-level decorators as well as to set defaults for parameters. Configs can of course also be used within your flow.

As an example, you can specify a Config that reads a pleasantly human-readable configuration file, formatted as TOML. The Config specifies a triggering @schedule and @resources requirements, as well as application-specific parameters for this particular deployment:

[schedule]
cron = "0 * * * *"

[model]
optimizer = "adam"
learning_rate = 0.5

[resources]
cpu = 1

Using the newly released Metaflow 2.13, you can configure a flow with a Config like the one above, as demonstrated by this flow:

import pprint
from metaflow import FlowSpec, step, Config, resources, config_expr, schedule

@schedule(cron=config_expr("config.schedule.cron"))
class ConfigurableFlow(FlowSpec):
    config = Config("config", default="myconfig.toml", parser="tomllib.loads")

    @resources(cpu=config.resources.cpu)
    @step
    def start(self):
        print("Config loaded:")
        pprint.pp(self.config)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ConfigurableFlow()

There is quite a lot going on in the code above; a few highlights:

  • You can refer to configs before they have been defined using config_expr.
  • You can define arbitrary parsers; using a string means the parser doesn't even have to be present remotely!

From the developer's point of view, Configs behave like dictionary-like artifacts. For convenience, they support the dot syntax (when possible) for accessing keys, making it easy to access values in a nested configuration. You can also unpack the whole Config (or a subtree of it) with Python's standard dictionary unpacking syntax, '**config'. The standard dictionary subscript notation is also available.
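As a toy illustration (this is a hypothetical stand-in, not Metaflow's actual implementation), the three access patterns just described behave roughly like a dictionary with attribute access layered on top:

```python
# Hypothetical stand-in for a Config object, illustrating dot syntax,
# subscript notation, and ** unpacking on a nested configuration.
class DotDict(dict):
    """A dict that also exposes its keys as attributes, recursively."""
    def __getattr__(self, key):
        try:
            value = self[key]
        except KeyError:
            raise AttributeError(key) from None
        # Wrap nested dicts so dot access chains through subtrees.
        return DotDict(value) if isinstance(value, dict) else value

config = DotDict({"model": {"optimizer": "adam", "learning_rate": 0.5},
                  "resources": {"cpu": 1}})

print(config.model.optimizer)      # dot syntax: adam
print(config["resources"]["cpu"])  # subscript notation: 1
print(dict(**config.model))        # ** unpacking of a subtree
```

The real Config offers the same ergonomics on top of the persisted artifact, so nested values read naturally in both flow code and decorators.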

Since Configs turn into dictionary artifacts, they get versioned and persisted automatically as artifacts. You can access the Configs of any past run easily through the Client API. As a result, your data, models, code, Parameters, Configs, and execution environments are all stored as a consistent bundle, neatly organized in Metaflow namespaces, paving the way for easily reproducible, consistent, low-boilerplate, and now easily configurable experiments and robust production deployments.

While you can get far by accompanying your flow with a simple config file (stored in your favorite format, thanks to user-definable parsers), Configs unlock a number of advanced use cases. Consider these examples from the updated documentation:

A major benefit of Config over previous, more hacky solutions for configuring flows is that it works seamlessly with other features of Metaflow: you can run steps remotely and deploy flows to production, even when relying on custom parsers, without having to worry about packaging Configs or parsers manually or keeping Configs consistent across tasks. Configs also work with the Runner and Deployer.

When used in conjunction with a configuration manager like Hydra, Configs enable a pattern that is highly relevant for ML and AI use cases: orchestrating experiments over multiple configurations, or sweeping over parameter spaces. While Metaflow has always supported sweeping over parameter grids easily using foreaches, it hasn't been easily possible to alter the flow itself, e.g. to change @resources or @pypi/@conda dependencies for every experiment.

In a typical case, you trigger a Metaflow flow that consumes a configuration file, changing how a run behaves. With Hydra, you can invert the control: it is Hydra that decides what gets run based on a configuration file. Thanks to Metaflow's new Runner and Deployer APIs, you can create a Hydra app that operates Metaflow programmatically, for instance to deploy and execute hundreds of variants of a flow in a large-scale experiment.
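As a rough sketch of that inversion of control, a sweep driver could generate one Metaflow invocation per variant, each with a different inline config. The flow file name here is hypothetical, and passing an inline config via Metaflow's --config-value option is our assumption (check the documentation for the exact invocation); a real Hydra app would typically hand these variants to the Runner or Deployer APIs instead of raw command lines:

```python
# Hypothetical sweep driver: build one Metaflow 'run' invocation per
# (cpu, tensor_size) variant, passing each variant's config inline.
import itertools
import json

def variant_commands(flow_file, cpus, tensor_sizes):
    """Return one CLI command per combination in the sweep grid."""
    commands = []
    for cpu, size in itertools.product(cpus, tensor_sizes):
        config = {"resources": {"cpu": cpu}, "tensor_size": size}
        commands.append([
            "python", flow_file,
            "--config-value", "config", json.dumps(config),
            "run",
        ])
    return commands

cmds = variant_commands("benchmark_flow.py",
                        cpus=[1, 2, 4], tensor_sizes=[256, 1024])
print(len(cmds))  # 6 variants, one command each
```

Each command could then be executed (or deployed) concurrently, which is essentially what the benchmarking demo below does at a larger scale.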

Take a look at two interesting examples of this pattern in the documentation. As a teaser, this video shows Hydra orchestrating the deployment of tens of Metaflow flows, each of which benchmarks PyTorch using a varying number of CPU cores and tensor sizes, updating a visualization of the results in real time as the experiment progresses:

Example using Hydra with Metaflow

To give a motivating example of what configurations look like at Netflix in practice, let's consider Metaboost, an internal Netflix CLI tool that helps ML practitioners manage, develop, and execute their cross-platform projects, somewhat similar to the open-source Hydra discussed above but with specific integrations into the Netflix ecosystem. Metaboost is an example of an opinionated framework developed by a team already using Metaflow. In fact, part of the inspiration for introducing Configs in Metaflow came from this very use case.

Metaboost serves as a single interface to three different internal platforms at Netflix that manage ETL/Workflows (Maestro), Machine Learning Pipelines (Metaflow), and Data Warehouse Tables (Kragle). In this context, having a single configuration system to manage an ML project holistically gives users increased project coherence and decreased project risk.

Configuration in Metaboost

Ease of configuration and templatizing are core values of Metaboost. Templatizing in Metaboost is achieved through the concept of bindings, whereby we can bind a Metaflow pipeline to an arbitrary label, and then create a corresponding bespoke configuration for that label. The binding-connected configuration is then merged into a global set of configurations containing such information as the git repository, branch, etc. Binding a Metaflow also signals to Metaboost that it should instantiate the Metaflow flow once per binding into our orchestration cluster.

Imagine an ML practitioner on the Netflix Content ML team, sourcing features from hundreds of columns in our data warehouse, and developing a multitude of models against a growing suite of metrics. When a brand new content metric comes along, with Metaboost the first version of the metric's predictive model can be created simply by swapping the target column against which the model is trained.

Subsequent versions of the model will result from experimenting with hyperparameters, tweaking feature engineering, or conducting feature diets. Metaboost's bindings, and their integration with Metaflow Configs, can be leveraged to scale the number of experiments as fast as a scientist can create experiment-based configurations.

Scaling experiments with Metaboost bindings — backed by Metaflow Config

Consider a Metaboost ML project named `demo` that creates and loads data into custom tables (ETL managed by Maestro), and then trains a simple model on this data (ML pipeline managed by Metaflow). The project structure of this repository might look like the following:

├── metaflows
│   ├── custom                     -> custom python code, used by Metaflow
│   │   ├── data.py
│   │   └── model.py
│   └── training.py                -> defines our Metaflow pipeline
├── schemas
│   ├── demo_features_f.tbl.yaml   -> table DDL, stores our ETL output, Metaflow input
│   └── demo_predictions_f.tbl.yaml -> table DDL, stores our Metaflow output
├── settings
│   ├── settings.configuration.EXP_01.yaml -> defines the additive config for Experiment 1
│   ├── settings.configuration.EXP_02.yaml -> defines the additive config for Experiment 2
│   ├── settings.configuration.yaml        -> defines our global configuration
│   └── settings.environment.yaml          -> defines parameters based on git branch (e.g. READ_DB)
├── tests
├── workflows
│   ├── sql
│   ├── demo.demo_features_f.sch.yaml -> Maestro workflow, defines ETL
│   └── demo.main.sch.yaml            -> Maestro workflow, orchestrates ETLs and Metaflow
└── metaboost.yaml                    -> defines our project for Metaboost

The settings directory above contains the following YAML configuration files:

# settings.configuration.yaml (global configuration)
model:
  fit_intercept: True
conda:
  numpy: '1.22.4'
  "scikit-learn": '1.4.0'

# settings.configuration.EXP_01.yaml
target_column: metricA
features:
  - runtime
  - content_type
  - top_billed_talent

# settings.configuration.EXP_02.yaml
target_column: metricA
features:
  - runtime
  - director
  - box_office

Metaboost will merge each experiment configuration (*.EXP*.yaml) into the global configuration (settings.configuration.yaml) individually at Metaboost command initialization. Let's take a look at how Metaboost combines these configurations with a Metaboost command:
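The additive merge just described can be sketched as a recursive dictionary overlay. This is a simplified, illustrative stand-in; Metaboost's real merge logic may differ in its details:

```python
# Simplified sketch of an additive config merge: experiment keys are laid
# over the global configuration, with nested dictionaries merged recursively.
def deep_merge(base, overlay):
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

global_cfg = {"model": {"fit_intercept": True},
              "conda": {"numpy": "1.22.4", "scikit-learn": "1.4.0"}}
exp_01 = {"target_column": "metricA",
          "features": ["runtime", "content_type", "top_billed_talent"]}

merged = deep_merge(global_cfg, exp_01)
print(merged["model"])          # global keys survive the merge
print(merged["target_column"])  # experiment keys are added on top
```

Because experiment files only add or override keys, each binding's merged result keeps the full global configuration plus its own deltas, which is exactly what the command output below shows.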

(venv-demo) ~/projects/metaboost-demo [branch=demoX]
$ metaboost metaflow settings show --yaml-path=configuration

binding=EXP_01:
  model:                  -> defined in settings.configuration.yaml (global)
    fit_intercept: true
  conda:                  -> defined in settings.configuration.yaml (global)
    numpy: 1.22.4
    "scikit-learn": 1.4.0
  target_column: metricA  -> defined in settings.configuration.EXP_01.yaml
  features:               -> defined in settings.configuration.EXP_01.yaml
    - runtime
    - content_type
    - top_billed_talent

binding=EXP_02:
  model:                  -> defined in settings.configuration.yaml (global)
    fit_intercept: true
  conda:                  -> defined in settings.configuration.yaml (global)
    numpy: 1.22.4
    "scikit-learn": 1.4.0
  target_column: metricA  -> defined in settings.configuration.EXP_02.yaml
  features:               -> defined in settings.configuration.EXP_02.yaml
    - runtime
    - director
    - box_office

Metaboost understands it should deploy/run two independent instances of training.py: one for the EXP_01 binding and one for the EXP_02 binding. You can also see that Metaboost is aware that the tables and ETL workflows are not bound, and should only be deployed once. These details of which artifacts to bind and which to leave unbound are encoded in the project's top-level metaboost.yaml file.

(venv-demo) ~/projects/metaboost-demo [branch=demoX]
$ metaboost project list

Tables (metaboost table list):
  schemas/demo_predictions_f.tbl.yaml (binding=default):
    table_path=prodhive/demo_db/demo_predictions_f
  schemas/demo_features_f.tbl.yaml (binding=default):
    table_path=prodhive/demo_db/demo_features_f

Workflows (metaboost workflow list):
  workflows/demo.demo_features_f.sch.yaml (binding=default):
    cluster=sandbox, workflow.id=demo.branch_demox.demo_features_f
  workflows/demo.main.sch.yaml (binding=default):
    cluster=sandbox, workflow.id=demo.branch_demox.main

Metaflows (metaboost metaflow list):
  metaflows/training.py (binding=EXP_01): -> EXP_01 instance of training.py
    cluster=sandbox, workflow.id=demo.branch_demox.EXP_01.training
  metaflows/training.py (binding=EXP_02): -> EXP_02 instance of training.py
    cluster=sandbox, workflow.id=demo.branch_demox.EXP_02.training

Below is a simple Metaflow pipeline that fetches data, executes feature engineering, and trains a LinearRegression model. The work to integrate Metaboost Settings into a user's Metaflow pipeline (implemented using Metaflow Configs) is as easy as adding a single mix-in to the FlowSpec definition:

from metaflow import FlowSpec, Parameter, conda_base, step
from custom.data import feature_engineer, get_data
from metaflow.metaboost import MetaboostSettings

@conda_base(
    libraries=MetaboostSettings.get_deploy_time_settings("configuration.conda")
)
class DemoTraining(FlowSpec, MetaboostSettings):
    prediction_date = Parameter("prediction_date", type=int, default=-1)

    @step
    def start(self):
        # get show_settings() for free with the mixin
        # and get convenient debugging info
        self.show_settings(exclude_patterns=["artifact*", "system*"])

        self.next(self.get_features)

    @step
    def get_features(self):
        # feature engineer on our extracted data
        self.fe_df = feature_engineer(
            # loads data from our ETL pipeline
            data=get_data(prediction_date=self.prediction_date),
            features=self.settings.configuration.features +
                     [self.settings.configuration.target_column]
        )

        self.next(self.train)

    @step
    def train(self):
        from sklearn.linear_model import LinearRegression

        # trains our model
        self.model = LinearRegression(
            fit_intercept=self.settings.configuration.model.fit_intercept
        ).fit(
            X=self.fe_df[self.settings.configuration.features],
            y=self.fe_df[self.settings.configuration.target_column]
        )
        print(f"Fit slope: {self.model.coef_[0]}")
        print(f"Fit intercept: {self.model.intercept_}")

        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DemoTraining()

The Metaflow Config is added to the FlowSpec by mixing in the MetaboostSettings class. Referencing a configuration value is as easy as using the dot syntax to drill into whichever parameter you'd like.

Finally, let's take a look at the output from our sample Metaflow above. We execute experiment EXP_01 with

metaboost metaflow run --binding=EXP_01

which upon execution will merge the configurations into a single settings file (shown previously) and serialize it as a yaml file to the .metaboost/settings/compiled/ directory.

You can see the actual command and args that were sub-processed in the Metaboost Execution section below. Note the --config argument pointing to the serialized yaml file, which is subsequently accessible via self.settings. Also note the convenient printing of configuration values to stdout during the start step, using a mixed-in function named show_settings().

(venv-demo) ~/projects/metaboost-demo [branch=demoX]
$ metaboost metaflow run --binding=EXP_01

Metaboost Execution:
- python3.10 /root/repos/cdm-metaboost-irl/metaflows/training.py
    --no-pylint --package-suffixes=.py --environment=conda
    --config settings
    .metaboost/settings/compiled/settings.branch_demox.EXP_01.training.mP4eIStG.yaml
    run --prediction_date 20241006

Metaflow 2.12.39+nflxfastdata(2.13.5);nflx(2.13.5);metaboost(0.0.27)
executing DemoTraining for user:dcasler
Validating your flow...
    The graph looks good!
Bootstrapping Conda environment... (this could take a few minutes)
    All packages already cached in s3.
    All environments already cached in s3.

Workflow starting (run-id 50), see it in the UI at
    https://metaflowui.prod.netflix.net/DemoTraining/50

[50/start/251640833] Task is starting.
[50/start/251640833] Configuration Values:
[50/start/251640833] settings.configuration.conda.numpy = 1.22.4
[50/start/251640833] settings.configuration.features.0 = runtime
[50/start/251640833] settings.configuration.features.1 = content_type
[50/start/251640833] settings.configuration.features.2 = top_billed_talent
[50/start/251640833] settings.configuration.model.fit_intercept = True
[50/start/251640833] settings.configuration.target_column = metricA
[50/start/251640833] settings.environment.READ_DATABASE = data_warehouse_prod
[50/start/251640833] settings.environment.TARGET_DATABASE = demo_dev
[50/start/251640833] Task finished successfully.

[50/get_features/251640840] Task is starting.
[50/get_features/251640840] Task finished successfully.

[50/train/251640854] Task is starting.
[50/train/251640854] Fit slope: 0.4702672504331096
[50/train/251640854] Fit intercept: -6.247919678070083
[50/train/251640854] Task finished successfully.

[50/end/251640868] Task is starting.
[50/end/251640868] Task finished successfully.

Done! See the run in the UI at
https://metaflowui.prod.netflix.net/DemoTraining/50

Takeaways

Metaboost is an integration tool that aims to ease the project development, management, and execution burden of ML projects at Netflix. It employs a configuration system that combines git-based parameters, global configurations, and arbitrarily bound configuration files for use during execution against internal Netflix platforms.

Integrating this configuration system with the new Config in Metaflow is extremely easy (by design), only requiring users to add a mix-in class to their FlowSpec, similar to this example in the Metaflow documentation, and then reference the configuration values in steps or decorators. The example above templatizes a training Metaflow for the sake of experimentation, but users could just as easily use bindings/configs to templatize their flows across target metrics, business initiatives, or any other arbitrary lines of work.

It couldn't be simpler to get started with Configs! Just

pip install -U metaflow

to get the latest version, and head to the updated documentation for examples. If you're impatient, you can find and execute all config-related examples in this repository as well.

If you have any questions or feedback about Config (or other Metaflow features), you can reach out to us on the Metaflow community Slack.

We would like to thank Outerbounds for their collaboration on this feature; for rigorously testing it, and for developing a repository of examples to showcase some of the possibilities it offers.
