The importance of data quality validation in machine learning is hard to overstate. Nevertheless, major ML platforms still lack tools for establishing a data QA process. Recently, Provectus contributed a Great Expectations component to the Kubeflow repository that allows ML engineers to test and validate data inside Kubeflow Pipelines. In this article, we explain what it is and show how you can use it to keep the data for your ML pipelines in check.
The Rise of ML Data Quality
The lack of data quality assurance in the ML model life cycle always costs you and backfires. You can have all the stellar MLOps tools and high-quality models you want, but if your data is garbage, none of that matters: unprepared data will wipe out their value.
Available research continues to confirm this year after year.
For instance, a recent 2020 Gartner survey found that organizations estimate the average cost of poor data quality at $12.8 million per year, and this number will likely rise as business environments become increasingly complex.
The risk of optimizing for the wrong thing is always high, and no one can afford to skip testing the data behind their models and products. No matter what framework you use to orchestrate your ML, having data quality assurance in place is a must to eliminate the #1 enemy of effective ML and company budgets: bad data. At Provectus, we address this by making Data Quality Assurance a part of every end-to-end ML and Data Infrastructure solution we deliver.
Kubeflow Pipelines, a popular open-source end-to-end ML orchestration platform, is no exception to this rule. When you use Kubeflow Pipelines as your go-to tool for building and managing your ML workflows on Kubernetes, you need to have data quality assurance in place. You definitely want to eliminate the GIGO (“garbage in, garbage out”) situation and get the second picture:
Though the need for data QA is widely recognized by the ML community, data quality tools are still in their infancy and largely absent from major ML platforms, including Kubeflow Pipelines. Until recently, Kubeflow users couldn’t establish a proper process to make sure their data is tested and validated before training a model.
At Provectus, we decided to fix this and contributed a pull request that adds a Great Expectations data validation component to the Kubeflow repository. From now on, you can add the GE component to your Kubeflow pipelines to test and validate .csv datasets against the rules you set in the Expectation Suite. If a dataset is invalid, the component stops pipeline execution; if it’s valid, you know your data is healthy and the pipeline continues to execute. In either case, the GE component also outputs an HTML Data Doc with a validation report.
What is Great Expectations
Great Expectations is an open-source data validation framework written in Python that allows you to test, profile, and document data to measure and maintain its quality at any stage of your ML pipeline. It’s a great way to avoid ending up with inconsistent records and the resulting issues in data pipelines.
For example, you may ingest data from various outside providers and need to check the output of an ETL job before adding it to your database. Say you’re doing ML for a medical enterprise and constantly receive datasets from insurance companies. This data is often of low quality and suffers from typical issues: values that don’t match a dictionary, incorrect data types, faulty formatting, duplication, and shifted value distributions.
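Rules like these can be encoded as Expectations and collected in an Expectation Suite, which is just a JSON file. As a rough, self-contained sketch (the suite name, column names, and bounds are made up for illustration), here is such a file built as a plain Python dict:

```python
import json

# A minimal Expectation Suite, built as a Python dict and serialized
# to JSON. The suite name, columns, and bounds are illustrative.
suite = {
    "expectation_suite_name": "taxi_trips.basic",
    "expectations": [
        {
            # Every row must have a non-null trip_id.
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "trip_id"},
        },
        {
            # Tips should fall within a plausible range.
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "tips", "min_value": 0, "max_value": 100},
        },
    ],
}

with open("expectation_suite.json", "w") as f:
    json.dump(suite, f, indent=2)
```

Each entry names a built-in expectation type and the parameters it should be checked with; this is the file the GE component will later consume.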
To validate data, you write so-called Expectations: assertions that describe the desired state of your datasets. Expectations are essentially unit tests that your data will be run against. You can either pick assertions from the built-in Glossary of Expectations or write your own. How does Great Expectations fit into your MLOps?
The ML model life cycle consists of three major stages:
- Data Preparation. Data is collected, cleaned, structured and enriched to prepare it for later stages.
- Model Development. The model is designed, built, trained, tested, and fine-tuned, in preparation for deployment.
- Deployment & Operations. A fine-tuned model is packaged and deployed to production, then monitored for quality and reliability.
At what stage of the ML life cycle does data require quality assurance, would you say?
That’s right: just like code and models, the data at each ML stage should be covered with tests.
Great Expectations is a tool that can be used for data testing in each of the stages, and we brought it to Kubeflow Pipelines to do exactly that.
Depending on how you structure your MLOps, your Kubeflow pipeline can include all or some of these blocks, and data QA belongs in several of them. Here is where you might want to fit data QA into your ML pipeline to validate your data:
Great Expectations is super helpful for checking data at the ingestion, data cleaning, feature engineering, or model output stages. In the use case below, the first three validation steps shown on the diagram fit into the first component. Let’s dive right in and look at how to get started with the GE component in Kubeflow Pipelines.
To learn more about how Great Expectations fits into the MLOps life cycle, check out this post from the Great Expectations blog:
Quickstart: Getting Started with Great Expectations on Kubeflow Pipelines
Setting up and using Great Expectations with Kubeflow Pipelines is a breeze. Besides the prerequisites, you’ll need just a few more steps to get started:
- Load the GE component definition (component.yaml)
- Add an Expectation Suite file to the pipeline
- Integrate the GE component into the pipeline
After that, you will be able to leverage Great Expectations in your Kubeflow workloads.
For this tutorial, you need:
- A Kubeflow cluster. To get a production infrastructure with Kubeflow on AWS up and running, you can use Swiss Army Kube — a simple blueprint for creating GitOps deployments. Check out the quickstart here: SAKK Quickstart.
- An Expectation Suite — a JSON file that contains your rules (Expectations) for a dataset. Scaffold a boilerplate and edit it to define your Expectations following this official guide: How to create a new Expectation Suite using suite scaffold. Alternatively, you can create custom Expectations: How to create custom Expectations.
- Optionally, a cloud account to store the Expectation Suite: Amazon S3, Google Cloud Storage, Azure, or any other location that Pandas supports (e.g. a static HTTP location).
Let’s take the following Kubeflow pipeline as an example. It implements an ML workflow that trains a model to predict the tip amount that a customer paid for a taxi trip:
This pipeline uses the default Component Store provided by Kubeflow Pipelines SDK (more on this in the next section).
After compilation and upload to Kubeflow, the pipeline should look like this (no GE component yet):
1. Load Great Expectations component
1.1 Load definition file via Kubeflow ComponentStore
The default store creates components defined in the master branch of the Kubeflow Pipelines repository. You can create a custom store to pin a repository version:
Then, you can use the store to load the component definition:
1.2 Load definition by URL or from file
Alternatively, you can save the definition to a file and load the component from this file:
2. Add the Expectation Suite to the pipeline
The GE component needs an Expectation Suite passed to it as a file. You can provide it either by loading it from cloud storage or by injecting it into the pipeline definition.
For example, suppose you have an Expectation Suite stored in AWS S3. To load it from S3, you need a component that does the loading. Describe this loading component in a .yaml definition file:
Load it using Python:
Similar components can be implemented for other cloud storage providers in accordance with their infrastructure requirements.
Finally, add this step to the pipeline to load the suite:
Alternatively, you can store the Expectation Suite locally and add it directly to the pipeline:
3. Integrate the GE component into a data pipeline
The integration is done by passing the paths to the data and the Expectation Suite to the GE component:
Here, “csv_path” must be an output of a component that ingests data into the pipeline (e.g. reads a dataset from S3), and “expectation_suite” must be a file or an output of a pipeline component.
4. Compile the pipeline and upload it to Kubeflow
After steps 1–3, we’ll have the following pipeline code, assuming the Expectation Suite is saved locally in the same folder (the code added in steps 1–3 is in bold):
In Kubeflow interface, the pipeline with GE component will look like this:
5. Check your data with Great Expectations
Now you have a pipeline that contains the GE component and can leverage it to validate the datasets coming into this pipeline. As you can see, the “Validate CSV” GE component is now one of the dependencies of “Xgboost train,” along with “Chicago Taxi.”
Every time you run the pipeline, the “Validate CSV” component will check your dataset against the rules in the Expectation Suite. If the data passes validation, the rest of the pipeline will run and all its steps will turn green (1). If the GE step detects data issues and fails, it will terminate further execution down the pipeline and turn red (2).
After execution, the GE component also creates a Data Doc: an HTML data validation report. To open it, click the link under the Output Artifact section of the Validate step in the Input/Output panel of the Kubeflow UI:
Here’s what the report looks like:
The Data Doc shows available Expectation Suites, validation results, and statistics on passed and failed expectations, and allows the data team to drill down into values. Here you can see the state and value of each check defined in the Expectation Suite.
In this article, we briefly looked at the importance of data quality for machine learning pipelines and how it can be ensured in Kubeflow using the new Great Expectations component for Kubeflow Pipelines. Key takeaways:
- Data quality is the #1 priority when it comes to building robust ML pipelines.
- Mind the GIGO principle: you can’t have robust pipelines and ML models without testing and validating your data first.
- Test data before model training: running pipelines on incorrect data is costly, so test it as early upstream as possible.
- The Great Expectations component is a go-to tool for ensuring data quality in your Kubeflow pipelines, and currently the only DQ tool for Kubeflow.
- Data Doc helps data team members collaborate on analyzing and managing the data quality of your Kubeflow pipelines.
Thanks to Anton Kiselev and Yaroslav Beshta for developing and contributing the Great Expectations component for Kubeflow.
An Expectation Suite is a description of your dataset, stored as a JSON file. There are several ways to create one: scaffold a boilerplate and edit it to define the rules (Expectations) that your .csv dataset will be validated against, or build an Expectation Suite from scratch.
A GE component definition is a Kubeflow component specification, stored in a component.yaml file, that describes the component to Kubeflow Pipelines. From this predefined specification, ComponentStore.load_component automatically creates a component op that lets you invoke the component in Kubeflow pipelines.
Expectations are the requirements you define in the Expectation Suite file in JSON format. They act as unit tests for your data: Great Expectations validates your datasets by checking them against your expectations. To create expectations, you can either pick assertions from the built-in Glossary of Expectations or write your own from scratch.
Glossary of Expectations — a list of all built-in Expectations.
Data Doc — a structured, formatted validation report compiled by Great Expectations for your dataset, featuring Expectations and Validations. One example of a Data Doc is HTML documentation.
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.
Pipeline — a description of an ML workflow, including all of the components that make up the steps in the workflow and how the components interact with each other.
XGBoost (Extreme Gradient Boosting) is an open-source library that provides an efficient implementation of gradient boosting for C++, Java, Python, R, Julia, Perl, and Scala.
GIGO principle — a concept that flawed, low-quality “garbage” data produces flawed, low-quality output (“garbage”).