Data Quality Assurance with Great Expectations and Kubeflow Pipelines

Introduction

The importance of data quality validation in machine learning is hard to overstate. Nevertheless, major ML platforms still lack tools for establishing a data QA process. Provectus recently contributed a Great Expectations component to the Kubeflow repository that allows ML engineers to test and validate data inside Kubeflow Pipelines. In this article, we explain what the component is and show how you can use it to keep the data for your ML pipelines in check.

Kubeflow Pipeline with Great Expectations component for data validation

The Rise of ML Data Quality

Skipping data quality assurance in the ML model life cycle is costly and tends to backfire. You can have all the stellar MLOps tools and high-quality models you want, but if your data is garbage, none of that matters: unprepared data cancels out their value.

Available research continues to confirm this year after year.

For instance, a recent 2020 Gartner survey found that organizations estimate the average cost of poor data quality at $12.8 million per year, and this number will likely rise as business environments become increasingly complex.

The risks of optimizing for the wrong thing are at an all-time high, and no one can afford to skip testing the data behind their models and the products built on them. No matter what framework you’re using to orchestrate your ML, having data assurance in place is a must to eliminate the #1 enemy of effective ML and companies’ budgets: bad data. At Provectus, we address it by making Data Quality Assurance a part of all our end-to-end ML and Data Infrastructure solutions.

Kubeflow Pipelines, a popular open-source end-to-end ML orchestration platform, is no exception to this rule. When you use Kubeflow Pipelines as your go-to tool for building and managing your ML workflows on Kubernetes, you need to have data quality assurance in place. You definitely want to eliminate the GIGO (“garbage in, garbage out”) situation and get the second picture:

Data Quality assurance is a must for effective data pipelines. Image by author.

Though the need for data QA is already widely recognized by the ML community, data quality tools are still in their infancy and are largely absent from many major ML platforms, including Kubeflow Pipelines. Up until recently, Kubeflow users couldn’t establish a proper process to make sure that their data is tested and validated before training their model.

At Provectus, we decided to fix this and contributed a pull request that adds a Great Expectations data validation component to the Kubeflow repository. From now on, you can add the GE component to your Kubeflow pipelines to test and validate .csv datasets against the rules you set in the Expectation Suite. If a dataset is invalid, the component stops pipeline execution. If it’s valid, you know your data is healthy and the pipeline continues to execute. The GE component also outputs an HTML Data Doc with a validation report in both scenarios.

What is Great Expectations

Great Expectations is an open-source data validation framework written in Python that allows you to test, profile, and document data to measure and maintain its quality at any stage of your ML pipeline. It’s a great way to avoid inconsistent records and the issues they cause in data pipelines.

Great Expectations tests and validates data, produces reports, and offers logging and alerting. Image by author.

For example, you may ingest data from various outside data providers and need to check the output of an ETL job before adding it to your database. Say you’re doing ML for a medical enterprise and constantly receive datasets from insurance companies. Often this data is of low quality and suffers from typical issues such as values that do not match a dictionary, incorrect data types, faulty formatting, duplication, and shifts in value distributions.

To validate data, you have to write so-called Expectations: assertions that describe the desired state of your datasets. Expectations are basically unit tests that your data will be run against. You can either select asserts from a built-in glossary of expectations or write your own.

How does Great Expectations fit into your MLOps?

The ML model life cycle consists of three major stages:

  1. Data Preparation. Data is collected, cleaned, structured and enriched to prepare it for later stages.
  2. Model Development. The model is designed, built, trained, tested, and fine-tuned, in preparation for deployment.
  3. Deployment & Operations. A fine-tuned model is packaged and deployed to production, then monitored for quality and reliability.

At what stage of the ML life cycle would you say data requires quality assurance?

Three main stages of the ML model life cycle. Image by author.

That’s right: just like code and models, the data at each ML stage should be covered with tests.

Great Expectations is a tool that can be used for data testing in each of the stages, and we brought it to Kubeflow Pipelines to do exactly that.

Depending on how you structure your MLOps, your Kubeflow pipeline can include all or some of these blocks. Data QA belongs to several of them. Here is where you might want to fit Data QA in your ML pipeline to validate your data:

Steps of the ML life cycle that require data validation where the Great Expectations component can help. Image by author.

Great Expectations is super helpful for checking data at the ingestion, data cleaning, feature engineering, and model output stages. In the use case we look at below, the first three validation steps shown on the diagram fit into the first component. Let’s dive right in and look at how to get started with the GE component in Kubeflow Pipelines.

To learn more about how Great Expectations fits into the MLOps life cycle, check out this post from the Great Expectations blog:

How does Great Expectations fit into MLOps?

Quickstart: Getting Started with Great Expectations on Kubeflow Pipelines

Great Expectations for Kubeflow. Image by author

Setting up and using Great Expectations with Kubeflow Pipelines is a breeze. Besides the prerequisites, you’ll need just a few more steps to get started:

  1. Load GE component definition (component.yaml)
  2. Add Expectation Suite file to a pipeline
  3. Integrate the GE component into a pipeline

After that, you will be able to leverage Great Expectations in your Kubeflow workloads.

0. Prerequisites

For this tutorial, you need:

  1. A Kubeflow cluster. To get a production infrastructure with Kubeflow on AWS up and running, you can use Swiss Army Kube — a simple blueprint for creating GitOps deployments. Check out the quickstart here: SAKK Quickstart.
  2. An Expectation Suite — a JSON file that contains your rules (Expectations) for a dataset. Scaffold a boilerplate and edit it to define your Expectations following this official guide: How to create a new Expectation Suite using suite scaffold. Alternatively, you can create custom Expectations: How to create custom Expectations.
  3. Optionally, you might want a cloud account to store the Expectation Suite on S3, Cloud Storage, Azure, or any other location that Pandas supports (e.g. a static HTTP location).
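
To make the second prerequisite more concrete, here is a minimal sketch of what an Expectation Suite file can contain, assembled with plain Python. The suite name and the column names (`tips`, `trip_seconds`) are hypothetical, and the exact JSON layout depends on your Great Expectations version, so treat this as an illustration rather than a canonical suite.

```python
import json


def build_suite(name: str) -> dict:
    """Assemble a small Expectation Suite as a plain dictionary."""
    return {
        "expectation_suite_name": name,
        "expectations": [
            {
                # Every row must have a value in the (hypothetical) tips column.
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "tips"},
                "meta": {},
            },
            {
                # Trip duration must fall between zero and one day, in seconds.
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "trip_seconds", "min_value": 0, "max_value": 86400},
                "meta": {},
            },
        ],
        "meta": {},
    }


def save_suite(suite: dict, path: str) -> None:
    """Write the suite to disk as JSON, the format the GE component expects."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(suite, f, indent=2)


suite = build_suite("taxi_trips")
suite_json = json.dumps(suite, indent=2)
```

Calling `save_suite(suite, "expectation_suite.json")` would produce a file you can keep next to the pipeline code or upload to cloud storage.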

Use Case

Let’s take the following Kubeflow pipeline as an example. It implements an ML workflow that trains a model to predict the tip amount that a customer paid for a taxi trip:

An example classification ML model that determines a company responsible for a taxi trip in Chicago.

This pipeline uses the default Component Store provided by Kubeflow Pipelines SDK (more on this in the next section).

After compilation and upload to Kubeflow, the pipeline should look like this (no GE component yet):

Example Kubeflow pipeline. Image by author.

1. Load Great Expectations component

1.1 Load definition file via Kubeflow ComponentStore

First, load the GE component definition file component.yaml. The easiest way to do so is using the default ComponentStore provided by Kubeflow Pipelines SDK. Load ComponentStore from kfp.components:

The store creates components that are defined in the master branch of Kubeflow Pipelines SDK repository. You can create a custom store to fix a repository version:

Then, you can use the store to load the component definition:
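
The store-based loading described above can be sketched as follows. This assumes KFP SDK v1, where `ComponentStore` is available in `kfp.components`, and assumes that the GE component lives under `great-expectations/validate/CSV` in the repository's components folder. The helper names are hypothetical.

```python
# Component name relative to the store's search prefix (assumed location).
GE_COMPONENT_NAME = "great-expectations/validate/CSV"


def components_prefix(git_ref: str = "master") -> str:
    """Return the raw GitHub prefix of the components folder at a given git ref."""
    return f"https://raw.githubusercontent.com/kubeflow/pipelines/{git_ref}/components/"


def load_validate_csv_op(git_ref: str = "master"):
    """Create a store pinned to `git_ref` and load the GE component from it."""
    # Imported lazily so that the URL helper works without kfp installed.
    from kfp.components import ComponentStore

    store = ComponentStore(url_search_prefixes=[components_prefix(git_ref)])
    return store.load_component(GE_COMPONENT_NAME)
```

Passing a commit hash or release tag as `git_ref` instead of `master` keeps the component definition from changing underneath your pipeline.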

1.2 Load definition by URL or from file

Another way is to load the component.yaml file by its URL using the load_component_from_url function:

Alternatively, you can save the definition to a file and load the component from this file:
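
Both options from this section can be sketched with one small helper. The repository path of the definition file is an assumption, as are the helper names; `load_component_from_url` and `load_component_from_file` are functions from `kfp.components`.

```python
# Assumed path of the GE component definition inside the kubeflow/pipelines repo.
COMPONENT_PATH = "components/great-expectations/validate/CSV/component.yaml"


def component_url(git_ref: str = "master") -> str:
    """Build the raw GitHub URL of the definition file at a given git ref."""
    return f"https://raw.githubusercontent.com/kubeflow/pipelines/{git_ref}/{COMPONENT_PATH}"


def is_url(source: str) -> bool:
    """Tell a remote definition apart from a local file path."""
    return source.startswith(("http://", "https://"))


def load_validate_csv_op(source: str):
    """Load the component from a URL or from a local component.yaml file."""
    # Imported lazily so that the helpers above work without kfp installed.
    from kfp import components

    if is_url(source):
        return components.load_component_from_url(source)
    return components.load_component_from_file(source)
```

For example, `load_validate_csv_op(component_url())` would fetch the definition over HTTP, while `load_validate_csv_op("component.yaml")` would read a copy saved next to your pipeline code.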

2. Add Expectation Suite to data pipeline

The GE component needs an Expectation Suite to be passed to it as a file. You can pass it either by loading it from cloud storage or by injecting it into the pipeline definition.

For example, suppose you have an Expectation Suite stored in AWS S3. To load it from S3, you need to create a component that will do the loading. Write the definition for this loading component in a .yaml file:

Load it using Python:

Similar components can be implemented for other cloud storage services in accordance with their infrastructure requirements.

Finally, add this step to the pipeline to load the suite:

Alternatively, you can store the Expectation Suite locally and add it directly to the pipeline:
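
A minimal sketch of the local option, assuming the suite is saved as `expectation_suite.json` next to the pipeline code. The helper name is hypothetical. It reads the file as text and checks that the content is well-formed before the text is handed to the pipeline.

```python
import json
from pathlib import Path


def read_suite_text(path: str) -> str:
    """Return the suite file's content after a basic sanity check."""
    text = Path(path).read_text(encoding="utf-8")
    suite = json.loads(text)  # raises ValueError if the file is not valid JSON
    if not isinstance(suite, dict) or not isinstance(suite.get("expectations"), list):
        raise ValueError(f"{path} does not contain an 'expectations' list")
    return text
```

The returned string can then be passed as the Expectation Suite argument of the GE component inside the pipeline function. Checking the file locally catches a malformed suite before the pipeline is compiled and uploaded.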

3. Integrate the GE component into a data pipeline

The integration is done by passing the paths to the data and the Expectation Suite to the GE component:

Here, “csv_path” needs to be an output of a component that ingests data into the pipeline (e.g. reads dataset from S3), and “expectation_suite” needs to be a file or an output of a pipeline component.

4. Compile the pipeline and upload it to Kubeflow

After steps 1–3, we’ll have the following code for the pipeline, assuming that the Expectation Suite is saved locally in the same folder (code we added in steps 1–3 is in bold):

In the Kubeflow interface, the pipeline with the GE component will look like this:

Example Kubeflow pipeline with Great Expectations component (“Validate csv using Great Expectations”). Image by author.

5. Check your data with Great Expectations

Now you have a pipeline that contains the GE component and can use it to validate the datasets coming into this pipeline. As you can see, the “Validate CSV” GE component is now one of the dependencies of “Xgboost train”, along with “Chicago Taxi”.

Every time you run the pipeline, the “Validate CSV” component will check your dataset against the rules in the Expectation Suite. If it successfully validates the data, the rest of the pipeline will run and all its steps will turn green (1). If the GE step detects data issues and fails, it will turn red and terminate further execution down the pipeline (2).

1: validation by the GE component passed successfully. 2: validation by the GE component failed and terminated further execution down the pipeline. Image by author.
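
The pass/fail behavior described above relies on a simple contract: a pipeline step that exits with a non-zero status is marked as failed, and the steps that depend on it do not run. The sketch below imitates that contract using only the standard library. It is a simplified stand-in, not the GE component's actual implementation; the two supported rule types and the sample columns are chosen for illustration.

```python
import csv
import io
import sys

# Hypothetical rules in the style of an Expectation Suite.
EXPECTATIONS = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "tips"}},
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "trip_seconds", "min_value": 0, "max_value": 86400}},
]


def check_rows(rows, expectations):
    """Return a list of failure messages; an empty list means the data passed."""
    failures = []
    for row_no, row in enumerate(rows, start=1):
        for expectation in expectations:
            kind = expectation["expectation_type"]
            kwargs = expectation["kwargs"]
            column = kwargs["column"]
            value = row.get(column) or ""
            if kind == "expect_column_values_to_not_be_null":
                if value == "":
                    failures.append(f"row {row_no}: {column} is empty")
            elif kind == "expect_column_values_to_be_between":
                if value == "":
                    continue  # empty values are left to the not-null rule
                try:
                    in_range = kwargs["min_value"] <= float(value) <= kwargs["max_value"]
                except ValueError:
                    in_range = False  # non-numeric text cannot be in range
                if not in_range:
                    failures.append(f"row {row_no}: {column}={value} is out of range")
    return failures


def validate_or_exit(csv_text, expectations):
    """Exit with status 1 when any rule fails, which marks the step as failed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    failures = check_rows(rows, expectations)
    if failures:
        print("\n".join(failures), file=sys.stderr)
        sys.exit(1)
    print(f"validated {len(rows)} rows")
```

Run against a clean dataset, `validate_or_exit` returns normally and downstream steps can proceed. Run against a dataset with an empty `tips` value or a negative `trip_seconds`, it prints the failures and exits with status 1.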

Data Doc

After execution, the GE component also creates a Data Doc: an HTML report for data validation. To open it, click the link under the Output Artifact section of the Validate step in the Input/Output panel of the Kubeflow UI:

Click the output artifact link to open Data Doc report. Image by author.

Here’s what the report looks like:

Data Doc report generated by the Great Expectations component in Kubeflow. Image by author.

The Data Doc shows the available expectation suites, validation results, and statistics of passed and failed expectations, and allows the data team to drill down into values. Here you can see the state and value of each check defined in the Expectation Suite.

Conclusion

In this article, we briefly looked at the importance of data quality for machine learning pipelines and how it can be ensured in Kubeflow using the new Great Expectations component for Kubeflow Pipelines. Key takeaways:

  • Data quality is the #1 priority when it comes to building robust ML pipelines.
  • Mind the GIGO principle — you can’t have robust pipelines and ML models without testing and validating your data first.
  • Test data before model training — running pipelines with incorrect data is costly, so test it as early as possible up the stream.
  • The Great Expectations component is a go-to tool to ensure data quality in your Kubeflow pipelines and currently the only DQ tool for Kubeflow.
  • Data Doc helps to collaborate with other data team members on analyzing and managing data quality of your Kubeflow pipelines.

Special Thanks

Thanks to Anton Kiselev and Yaroslav Beshta for developing and contributing the Great Expectations component for Kubeflow.

Glossary

Expectation Suite is a collection of rules (Expectations) that describe your dataset. It is stored as a JSON file. You can create one by scaffolding a boilerplate and editing it to define the rules your .csv dataset will be validated against, or by writing it from scratch.

GE component definition is a specification of the Kubeflow component, stored in the component.yaml file, that describes the component to Kubeflow Pipelines. From this predefined specification, ComponentStore.load_component creates a component op that you can invoke in Kubeflow pipelines.

Expectations are the requirements that you define in the Expectation Suite file in JSON format. They act as unit tests for your data. Great Expectations will validate your datasets by checking them against your expectations. To create expectations, you can either select asserts from a built-in glossary of expectations or write your own from scratch.

Glossary of Expectations — a list of all built-in Expectations.

Data Doc is a structured, formatted validation report compiled by Great Expectations for your dataset, featuring Expectations and Validations. One example of Data Docs is HTML documentation.

Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

Kubeflow Pipelines SDK (kfp package) is a set of Python packages that you can use to specify and run your ML workflows (defining and manipulating pipelines and components).

Pipeline — a description of an ML workflow, including all of the components that make up the steps in the workflow and how the components interact with each other.

XGBoost (Extreme Gradient Boosting) is an open-source library that provides an efficient implementation of gradient boosting for C++, Java, Python, R, Julia, Perl, and Scala.

GIGO principle — a concept that flawed, low-quality “garbage” data produces flawed, low-quality output (“garbage”).