How to Handle Missing Values?

Practical Advice from Experts: Preventing, Managing Missing Data + Using Synthetic Data

Under the 'Community Twist' tag, we offer practical solutions to questions frequently asked on various forums. Usually, two or three expert commentators reveal how they address the issue in their work. Enjoy, and share if you find it useful!

{Technical level: Advanced}

Dealing with missing data is a frequent and tricky issue you'll face when working with real-world datasets. In this article, we'll explain why missing values happen, how to reduce them, and what to do when you can't avoid them. We'll also look at how synthetic data can help solve this problem.

For a well-rounded view, we've consulted three data experts:

  1. Ryan Kearns, a Founding Data Scientist at Monte Carlo

  2. David Berenstein, a Developer Advocate at Argilla

  3. Abhishek Pawar, a Senior Data Scientist at Precisely

Let’s dive in!

1. What Is Missing Data?

Missing values are like missing pieces in an otherwise complete puzzle. In a dataset that tracks people's height and weight, a missing value appears whenever one or both of those details are absent for a person. Just as a puzzle with a missing piece can't give you the full image, gaps in data make it difficult to get a complete understanding of what you're studying. Some gaps occur because people choose not to share the information, but that's not the only reason.
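To make this concrete, here's a minimal sketch in pandas (the people and measurements are made up for illustration) showing how such gaps typically appear and how to spot them:

```python
import pandas as pd
import numpy as np

# A toy height/weight dataset with gaps, mirroring the puzzle analogy above
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cai", "Dee"],
    "height_cm": [162.0, np.nan, 175.5, 180.2],   # Ben's height is missing
    "weight_kg": [55.3, 72.1, np.nan, np.nan],    # Cai and Dee didn't report weight
})

# Count missing values per column
print(people.isna().sum())

# Rows where the "picture" is incomplete
print(people[people.isna().any(axis=1)])
```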

In general, missing values can be attributed to several factors, including:

  1. Human Error: Mistakes made during data collection or processing. These errors may arise from oversight, misinterpretation, or other inadvertent actions.

  2. Machine Error: Technical glitches or equipment malfunctions. When data collection relies on machinery or automated systems, any disruptions in their functioning can lead to gaps in the dataset.

  3. Respondent Refusal: "I don't want to answer!" Refusal may stem from privacy concerns, discomfort, or other personal reasons.

  4. Drop-Outs: Respondents in longitudinal studies are lost over time. When research spans a long period, people may either withdraw or be lost to follow-up. This is particularly problematic if they contribute crucial information.

The presence of missing values in your data can have a significant impact on the quality of your predictions. But is there a way to minimize the occurrence of missing data in your business?

2. Strategies for Minimizing Missing Data

There are several best practices focused on ensuring data quality and avoiding missing data. Ryan Kearns shared how these practices are implemented at Monte Carlo.

The best practice is a combination of being declarative about your invariants and using automated data observability to handle scale.

High-importance, load-bearing data assets need to be treated with priority (you'll hear phrases like "certified gold data"; Airbnb, for example, has its "Midas Certified" standard). For these datasets, employ code-based checks such as:

"I expect these 5 categorical values in this column, so if I ever see 4 or 6 distinct values, that's a problem."

On the scalability point, we actually invest a lot into our Monte Carlo Monitors-as-Code setup, which configures monitors for things like the above. dbt tests can also provide this sort of coverage. We benefit from a mix of monitoring: critical assets, like the feature table immediately serving an ML model, get detailed testing and validation on every important field, while the rest of the pipeline upstream of this gets "lighter" monitoring, like the freshness and volume metadata monitoring that MC has out-of-the-box.

Ryan Kearns, a Founding Data Scientist at Monte Carlo
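To illustrate the kind of declarative, code-based check Ryan describes, here is a minimal sketch in plain pandas rather than Monte Carlo's or dbt's actual tooling; the column name and expected categories are hypothetical:

```python
import pandas as pd

# Declare the invariant up front: the exact set of categories we expect
EXPECTED_STATUSES = {"new", "active", "paused", "churned", "reactivated"}

def check_categorical_invariant(df: pd.DataFrame, column: str, expected: set) -> None:
    """Fail loudly if the observed categories drift from the declared set."""
    observed = set(df[column].dropna().unique())
    unexpected = observed - expected
    missing = expected - observed
    if unexpected or missing:
        raise ValueError(
            f"Invariant violated on '{column}': "
            f"unexpected={unexpected or 'none'}, missing={missing or 'none'}"
        )

# Example usage on a toy frame: passes silently because exactly the
# five declared categories are present
orders = pd.DataFrame({"status": ["new", "active", "paused", "churned", "reactivated"]})
check_categorical_invariant(orders, "status", EXPECTED_STATUSES)
```

In a real pipeline, a check like this would run on every load, so a 4th or 6th category shows up as a failed job rather than a silent data quality drift.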

Ryan stressed several important steps for ensuring good data and building a data-focused culture in your company. For startups or teams new to data, there are a few extra points to consider:

  • Training Data Users: Teaching your data team the right skills is crucial. Training and courses can help them understand why data quality matters, follow agreed conventions, and use the best tools effectively.

  • Building a Data Quality Mindset: This means everyone is on the same page about the vision, values, and goals related to data quality, and it involves fostering a sense of ownership and accountability for data quality among data users.

However, there are situations where avoiding missing values is simply not feasible. This can happen for various reasons, including the irreversibility of data collection, occasional errors, or the nature of the data itself. Here is what Abhishek Pawar told us:

Missing data often carries inherent signals or patterns within the dataset. Every instance of missing data is typically indicative of a specific underlying cause. Therefore, to effectively mitigate these biases, it is imperative to gain a comprehensive understanding of the data's origin, generation process, and the mechanisms by which it was captured within the system.

Abhishek Pawar, a Senior Data Scientist at Precisely

In such scenarios, our objective shifts to mitigating the impact of these missing values.
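A practical first step in that direction is to profile where the gaps actually fall. Below is a minimal sketch, assuming your data sits in a pandas DataFrame (the helper name is ours, not from any of the tools mentioned): it measures how often each column is missing and whether missingness in one column co-occurs with missingness in others, which can hint that the data is not missing completely at random.

```python
import pandas as pd

def missingness_profile(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Summarize how often each column is missing, and how missingness
    in one column correlates with missingness in the others."""
    indicators = df.isna().astype(int)  # 1 where a value is missing, 0 otherwise
    summary = pd.DataFrame({"missing_rate": indicators.mean()})
    # Strongly correlated missingness across columns suggests a shared cause,
    # e.g., respondents skipping a whole section of a survey
    corr = indicators.corr()
    return summary, corr

# Hypothetical usage:
# summary, corr = missingness_profile(df)
# print(summary.sort_values("missing_rate", ascending=False))
# print(corr)
```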

3. Potential Biases Due to Missing Data

Modern machine learning relies heavily on data, so the absence or inadequate representation of specific kinds of data can severely degrade the performance of predictive systems. Ryan Kearns provided examples:

For example, if one of your APIs is misconfigured and sending junk data, you might end up serving irrelevant ads to customers using the browser where that API is supposed to be working.

You can also make misinformed decisions on the business intelligence side: if data fails to reach a downstream Tableau asset, you might end up convinced that your sales in some segments have tanked when in reality the volume just isn't getting through the pipeline.

Ryan Kearns

David Berenstein reminded us about the biases we’ve seen during the rise of pre-trained language models and later large language models.

Since the introduction of fine-tuning via transfer learning, missing data can be the root cause of significant biases and inequalities in core models like BERT.

We saw this when new models were introduced first, and sometimes solely, within their own domain, language, or cultural space, and, despite their more general nature, we can see this ever so clearly with the rise of LLMs.

David Berenstein, a Developer Advocate at Argilla

One of the strategies to mitigate the effect of missing values involves model tracking and deployment monitoring. What else can you do? Let's explore more ways to manage missing data and how to choose the best one for you.
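To give a flavor of what that deployment monitoring can look like, here is a minimal sketch of a per-batch missing-rate check; the thresholds and column names are illustrative, not taken from any of the tools mentioned above:

```python
import pandas as pd

# Per-column tolerances for missing values, declared up front
MISSING_RATE_THRESHOLDS = {"height_cm": 0.05, "weight_kg": 0.10}

def monitor_missing_rates(batch: pd.DataFrame, thresholds: dict) -> list[str]:
    """Return alert messages for columns whose missing rate exceeds its threshold."""
    alerts = []
    for column, limit in thresholds.items():
        rate = batch[column].isna().mean()
        if rate > limit:
            alerts.append(f"{column}: missing rate {rate:.1%} exceeds limit {limit:.0%}")
    return alerts

# In a scoring pipeline you might call this before the model sees the batch:
# for alert in monitor_missing_rates(incoming_batch, MISSING_RATE_THRESHOLDS):
#     log.warning(alert)  # or page the on-call, fail the run, etc.
```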
