Workflow orchestrators: Metaflow, Kedro, Luigi, Airflow, Flyte, Prefect and others

There are probably more workflow orchestrators available than there are people at your company.

Apr 05, 2023

Why reinvent the wheel? Because it’s fun.

It’s a lot easier to say “I can make that tool better” than to ask, “How do I create new value for my client?”

Now, if your client is thousands of time-constrained data scientists who hate all of these things but want them to work exactly how they hope they do, then sure - try to make that tool better.

Just to get you warmed up

Star History Matters

Why? 1) it’s a signal of how much documentation and SO help there is, 2) it’s a signal of how stable it is, 3) it’s a signal of your nerdiness. I’m sure you’ll like these plots as much as I do, you savvy DS.

Luigi vs. Argo, Kubeflow, metaflow, mlflow, prefect, airflow, zenml, kedro

I love star-history.com (used for the plot below); the idea is stolen from this blog.

Luigi vs. Dagster, temporal, flyte

Star Ratings aren’t everything

A few things:

There’s faster adoption today because there are more data engineers than there were before.
Just because something’s popular doesn’t mean it’s more valuable. Take the R popularity plot below, per SO.

The plot above is misleading, because R is mostly used for data science (like pandas):

Add pandas in and you see it’s roughly 50-50 in terms of the questions people are asking.

Your tool depends on your use case

Most of the tools above are built for Data Engineers - folks who care about deploying pipelines in production on some server or something. Kedro and Metaflow are the only two I’ve seen specifically marketed for data scientists, while others are a mix of DS/ML Eng/Data Eng.

I can’t compare all these for you. I’m just writing this blog post for my own use cases. Here’s my workflow as a DS:

I want to be able to run it locally as easily as I run it remotely. Most of my prototyping is on my Mac, but then I want to scale my compute remotely.
Debugging: I want to be able to inspect outputs locally.
I want to be able to re-run tasks easily (say, if task 5 dies) without running all prior steps. Step 1 might be load/clean data. Step 2 might be build ML model. Step 3 might be make plots.
Parallelization/dynamic inputs: I want to scale.
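The re-run requirement in the list above can be sketched in plain Python. This is a minimal illustration of the idea, not any orchestrator's API: the `run_step` helper, the cache directory, and the three step functions are all hypothetical. Each step's output is pickled to disk, so a later run skips any step whose output already exists.

```python
import pickle
import tempfile
from pathlib import Path

def run_step(name, func, cache_dir, *args):
    """Run func(*args) unless a cached result for `name` already exists."""
    path = Path(cache_dir) / f"{name}.pkl"
    if path.exists():
        # Reuse the persisted output instead of recomputing it.
        return pickle.loads(path.read_bytes()), True
    result = func(*args)
    path.write_bytes(pickle.dumps(result))
    return result, False

# Hypothetical stand-ins for the three steps described above.
def load_data():
    return [1.0, 2.0, 3.0, 4.0]

def build_model(data):
    return {"mean": sum(data) / len(data)}

def make_plots(model):
    return f"plot of mean={model['mean']}"

def run_pipeline(cache_dir):
    """Run all steps in order; return the final output and which steps were cached."""
    data, hit1 = run_step("load_data", load_data, cache_dir)
    model, hit2 = run_step("build_model", build_model, cache_dir, data)
    plot, hit3 = run_step("make_plots", make_plots, cache_dir, model)
    return plot, [hit1, hit2, hit3]

if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()
    print(run_pipeline(cache_dir))   # first run: nothing is cached yet
    # Simulate the last step dying: drop its output, keep steps 1 and 2.
    (Path(cache_dir) / "make_plots.pkl").unlink()
    print(run_pipeline(cache_dir))   # only make_plots is recomputed
```

Real orchestrators key their caches on more than the step name (inputs, code version), so treat this only as the shape of the behavior I'm after.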

I don’t care about retrying tasks (if my tasks fail, it’s because my data are bad and I want them to fail). I feel like retrying is for when you rely on internet connections and hit timeouts or something.

Flyte vs. Metaflow

Flyte and Metaflow have online demos.

Design:
- Flyte: Decorators: I like Prefect because it’s all decorators. Metaflow has a strange interface with the class objects. But Flyte is also all decorators.
- Metaflow: Classes: Metaflow is designed with a single class with methods that chain and all the data are stored in that class, which is persisted to disk.
Running locally/remotely:
- Both: Run locally/remotely. Can scale on Kubernetes (k8s).
- Flyte: Runs out of the box (no k8s required); just run the Python script. You can also launch and run it on a local cluster.
- Metaflow: Scales to k8s or AWS Batch.
Interact/debug the output: (need to cache)
- Flyte: Need to cache each task
  - Locally:
  - Remotely: it saves all the output in pickle files. For example, after running a workflow, one node had 3 sklearn model objects in it. Just supply the execution id to fetch them. Remember you can run a task “remotely” on a local cluster.
- Metaflow:
Re-run specific tasks: Flyte lets you re-run specific tasks, though I haven’t tried this and it seems pretty complicated. I’m sure I could build a CLI that makes this more efficient, but Metaflow has a pretty clean interface for resuming a task.
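Going back to the design difference above (decorators versus a single class), here is a toy sketch of the two styles in plain Python. It deliberately does not use flytekit or Metaflow, so every name in it (`task`, `REGISTRY`, `ToyFlow`, `run`) is hypothetical. It only shows the shape of the code you end up writing: standalone decorated functions that pass values to one another, versus one class whose chained methods store everything on `self`, which is then persisted to disk.

```python
import pickle
import tempfile
from pathlib import Path

# --- Style 1: decorated functions (the all-decorators shape) ---
REGISTRY = {}

def task(func):
    """Hypothetical decorator: register a plain function as a task."""
    REGISTRY[func.__name__] = func
    return func

@task
def clean(raw):
    return [x for x in raw if x is not None]

@task
def total(values):
    return sum(values)

def decorated_workflow(raw):
    # Data flows explicitly through return values and arguments.
    return total(clean(raw))

# --- Style 2: one class with chained methods (the single-class shape) ---
class ToyFlow:
    """Hypothetical flow: each step stores results on self and names the next step."""

    def __init__(self, raw):
        self.raw = raw

    def start(self):
        self.values = [x for x in self.raw if x is not None]
        return self.end          # chain to the next step

    def end(self):
        self.result = sum(self.values)
        return None              # no further steps

    def run(self, path):
        step = self.start
        while step is not None:
            step = step()
        # Persist all the state held on the instance to disk.
        Path(path).write_bytes(pickle.dumps(self.__dict__))
        return self.result

if __name__ == "__main__":
    raw = [1, None, 2, 3]
    print(decorated_workflow(raw))
    out = Path(tempfile.mkdtemp()) / "flow_state.pkl"
    print(ToyFlow(raw).run(out))
    print(sorted(pickle.loads(out.read_bytes())))  # attribute names saved to disk
```

In the second style, inspecting a finished run means loading the saved state and reading its attributes, which is why the class interface feels strange at first but makes debugging straightforward.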

Interacting with output

from flytekit.remote import FlyteRemote
from flytekit.configuration import Config

remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)
# Specify the execution id of the run to inspect
execution_id = "fac7321e7902348da9c3"
execution = remote.fetch_execution(name=execution_id)
remote.sync(execution, sync_nodes=True)

# Count the items in the workflow's first output ("o0")
len(execution.outputs["o0"])
# 3

When running it locally:

coder@union-sandbox-00u8jkhhfd2pivdrp5d7-55685d656c-tmt7s:~/flyte$ pyflyte run \
  flyte_demo/workflows/parallelism.py training_workflow \
  --hp_grid '[{"C": 0.1, "max_iter":1000}, {"C": 0.01, "max_iter":1000}, {"C": 0.001, "max_iter":1000}]'


TrainArgs(hyperparameters={'C': 0.1, 'max_iter': 1000.0}, data=StructuredDataset(uri=None, file_format='parquet'))
TrainArgs(hyperparameters={'max_iter': 1000.0, 'C': 0.01}, data=StructuredDataset(uri=None, file_format='parquet'))
TrainArgs(hyperparameters={'max_iter': 1000.0, 'C': 0.001}, data=StructuredDataset(uri=None, file_format='parquet'))
[LogisticRegression(C=0.1, max_iter=1000.0), LogisticRegression(C=0.01, max_iter=1000.0), LogisticRegression(C=0.001, max_iter=1000.0)]
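The `--hp_grid` argument above is a JSON list, and the run produces one model per entry. That fan-out can be sketched in plain Python as follows. This is an illustration under assumptions, not the demo's actual `parallelism.py`: `train_one` and `train_grid` are hypothetical names, `train_one` returns a description instead of fitting a `LogisticRegression`, and a thread pool stands in for the orchestrator's parallel map.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def train_one(hyperparameters):
    """Hypothetical stand-in for a training task: describe the model it would fit."""
    c = hyperparameters["C"]
    max_iter = hyperparameters["max_iter"]
    return f"LogisticRegression(C={c}, max_iter={max_iter})"

def train_grid(hp_grid_json, max_workers=3):
    """Parse the JSON grid and run one training call per entry in parallel."""
    hp_grid = json.loads(hp_grid_json)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the grid.
        return list(pool.map(train_one, hp_grid))

if __name__ == "__main__":
    grid = '[{"C": 0.1, "max_iter":1000}, {"C": 0.01, "max_iter":1000}, {"C": 0.001, "max_iter":1000}]'
    for model in train_grid(grid):
        print(model)
```

Note that plain `json.loads` keeps `max_iter` as the integer 1000, whereas the Flyte output above shows `1000.0`.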

When running remotely (with the --remote flag added), it provides an execution id:

coder@union-sandbox-00u8jkhhfd2pivdrp5d7-55685d656c-tmt7s:~/flyte$ pyflyte run \
  --remote \
  flyte_demo/workflows/parallelism.py training_workflow \
  --hp_grid '[{"C": 0.1, "max_iter":1000}, {"C": 0.01, "max_iter":1000}, {"C": 0.001, "max_iter":1000}]' 

Go to https://sandbox.union.ai/console/projects/flytesnacks/domains/development/executions/f535a4fe73efd4a248fc to see execution in the console.

Dashboard:

Comparisons

Why not join the approaches?

It doesn’t have to be one-size-fits-all.

Blog post showing how to integrate Airflow with Metaflow

Data Science Daily
