Using visualizations to explore a dataset

Understanding how Exploratory Data Analysis works

May 28, 2021

After setting up the metrics, hypothesis, and overall structure of the data, it's now time to think about how Data Science uses data visualization.

I've spent a lot of time talking about the necessity of structure when approaching data: whether it's making sure you have the right metrics, crafting hypotheses to figure out what you're testing, or things like that.

The emphasis on structuring data well may seem almost excessive, but it's essential to make sure that you have a usable and accurate dataset.

So now let’s revisit the scenario that I’ve talked about before. I’ve downloaded the dataset, set up a Jupyter Notebook, and then formatted the data into a usable dataset.

What’s next? Checking out your dataset through visualizations. Visualizations in Data Science are typically used in three ways: Exploratory Data Analysis, building a model, and presenting to your audience. Let's start with the most common use of data visualization in data science: Exploratory Data Analysis.

But to explain this, let's go back to why I began to learn visualization in the first place: creating a better presentation.

Your stakeholders trust that you know what you’re doing

When I transitioned from academia to industry, I got my first piece of presentation advice from my mentor when she was critiquing a lousy presentation that I had made.

"Cut out the methodology and analysis sections: it's too long," my mentor said. But what she said next has stuck with me to this day.

"We trust that you know what you're doing."

In academia, we are taught a formal presentation structure: describing the larger context, the problem, the methods, analysis, results/discussion, and conclusion.

But that approach is often not ideal with your stakeholders: not only do they have a diverse range of educational backgrounds, but they often aren't that concerned with the rigor of your methods.

Instead, they're concerned about what you found and what to do about it.

In other words, they trust you to figure out the best methods of addressing a specific problem, but they want to know the results of your efforts and what to know or do next.

And the reason why this is the case might be something you haven't realized: you're likely the expert on the data. By spending days, weeks, or months reviewing and analyzing the data, you probably know the data's ins and outs better than anyone else on the team.

If someone new joined your team and wanted to know about this dataset, everyone else on the team would direct them to you.

So you don't need to explain every detail of your data analysis: what your stakeholders want is a summary. And that's where visualizations come into play.

The visualizations you create through Exploratory Data Analysis are the perfect tool to help summarize your work and point your team in the right direction.

Exploratory Analysis and Anscombe’s Quartet

“Having all the information in the world at our fingertips doesn’t make it easier to communicate: it makes it harder.” - Cole Nussbaumer Knaflic

When you're presented with a large block of data, filled with hundreds or thousands of rows, what's the first thing that Data Scientists do?

Assuming that it's well-structured and formatted, they do what's called Exploratory Data Analysis, which is the process of investigating a dataset and summarizing its fundamental characteristics.

This is one of the most common uses of data visualization in Data Science for one main reason: as humans, we can detect patterns, trends, and anomalies much more easily if the data is seen through a visual filter.

To illustrate this, let's talk about a very famous dataset called Anscombe's Quartet.

Anscombe's quartet

It's hard to gain any valuable insight just by looking at the data table: the numbers seem to be somewhat different, but you can't quickly tell whether those differences matter or what there is to learn from them.

Now, let's look at the same dataset, but visualized this time.

When we look at the visualized dataset, we can see very different patterns between the four datasets. We can see how the datasets differ, and if there are any outliers in the data, we know to take a second look at them. These could be valid points that are exceptional cases, but they are more likely erroneous data that might throw off calculations (one famous example of this is hospitals assigning the error code 999 to age).
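To make the point concrete, here is a minimal sketch in Python using the published values of Anscombe's quartet. The helper names (pearson_r, plot_quartet) are my own, and the plotting function assumes matplotlib is installed. The summary numbers for the four sets come out nearly identical, which is exactly why the table alone tells you so little.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet, hardcoded so the summary needs no extra libraries.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Summary statistics per set: (mean of x, mean of y, correlation).
summary = {
    name: (mean(xs), mean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"Set {name}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.2f}")


def plot_quartet():
    """Scatter-plot the four sets side by side (assumes matplotlib is installed)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
    for ax, (name, (xs, ys)) in zip(axes, quartet.items()):
        ax.scatter(xs, ys)
        ax.set_title(f"Set {name}")
    plt.show()
```

Calling plot_quartet() in a notebook draws the four scatter plots, where the curve in set II and the single stray points in sets III and IV are visible at a glance.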

Data Scientists use data visualization as part of the Exploratory Data Analysis phase to quickly discover many things about the data set:

Seeing whether your hypothesis has any merit
Seeing if there are errors or outliers
Seeing if there is missing data or variables that need to be eliminated
Seeing if the data needs to be transformed into a different format (such as AM/PM to 24-hour clock format)
Seeing if there are relationships between variables based on the shape of the visualization they produce
Taking a smaller sample of the dataset and seeing what it looks like

Etc.
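A few of these checks can be sketched in plain Python. The records, field names, and the SENTINEL_AGE constant below are made up for illustration, so treat this as a rough sketch of the idea rather than a recipe for any particular dataset.

```python
from datetime import datetime

# Hypothetical records; the field names and values are invented for this sketch.
records = [
    {"age": 34, "visit_time": "09:15 AM"},
    {"age": 999, "visit_time": "02:30 PM"},   # 999 entered as an error code
    {"age": None, "visit_time": "11:45 PM"},  # missing value
    {"age": 58, "visit_time": "12:05 AM"},
]

SENTINEL_AGE = 999  # assumed error code, as in the hospital example

# Row positions where age is missing or holds the error code.
missing_age = [i for i, row in enumerate(records) if row["age"] is None]
sentinel_age = [i for i, row in enumerate(records) if row["age"] == SENTINEL_AGE]


def to_24_hour(time_text):
    """Convert a time like '02:30 PM' to 24-hour 'HH:MM' format."""
    return datetime.strptime(time_text, "%I:%M %p").strftime("%H:%M")


times_24h = [to_24_hour(row["visit_time"]) for row in records]

print("Rows with missing age:", missing_age)
print("Rows with error-code age:", sentinel_age)
print("Times in 24-hour format:", times_24h)
```

On a chart, the 999 value would show up as a point far away from every other age, which is usually how these error codes get noticed in the first place.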

This provides them with a picture of what's interesting about a particular dataset. While they might see numbers that stand out upon reading, visualizing the data elements is crucial to understanding which relationships, patterns, or trends might be valuable to explore further.

And it's effortless to do once the data is well-structured.

I won't go into coding specifics (as I'm not a coding expert), but a simple snippet of code allows you to generate a map that can give you incredible insight. And this visualization will quickly enable you to see what you need to focus on next.

Visualizing car accidents suggests we need to pay attention to Downtown Seattle.
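For a rough idea of what sits behind a map like this, here is a minimal sketch of the counting step. The coordinates are made-up points rather than real collision records, and grid_cell is a hypothetical helper. It groups points into grid cells and finds the busiest one; a mapping library such as folium would then draw those counts on top of a street map.

```python
from collections import Counter

# Made-up (latitude, longitude) points for illustration; not real collision data.
collisions = [
    (47.6062, -122.3321),
    (47.6097, -122.3331),
    (47.6071, -122.3344),
    (47.6089, -122.3302),
    (47.6615, -122.3128),
    (47.5512, -122.2774),
    (47.6688, -122.3842),
]


def grid_cell(lat, lon, precision=2):
    """Snap a coordinate to a grid cell by rounding to `precision` decimals."""
    return (round(lat, precision), round(lon, precision))


# Count how many collisions fall into each grid cell.
cell_counts = Counter(grid_cell(lat, lon) for lat, lon in collisions)

# The cell with the most collisions is the one a heat map would highlight.
hottest_cell, hottest_count = cell_counts.most_common(1)[0]
print(f"Busiest cell: {hottest_cell} with {hottest_count} collisions")
```

In this invented sample, most of the points fall into the same cell, which is the kind of cluster that jumps out immediately once it is drawn on a map.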

Visualizing these things allows us to quickly understand a dataset, what patterns might emerge from your data, and what relationships you might want to pay attention to. These are valuable things that help you figure out what you want to say about the data and quickly draw attention to problems.

In that case, why not use them when you're presenting to allow your audience to see the same thing?

Visualizing a better presentation

A picture is worth a thousand words. An interface is worth a thousand pictures. — Ben Shneiderman

You don't have all the time in the world to present your findings: you either have limited time (60 minutes) or limited attention (your stakeholder's short-term memory).

So walking them through every step you took, from importing the data through to the results, will likely result in them remembering the wrong thing. For example, if you spent the first 10 minutes on your setup, they might remember that you tested with 130 users and asked three questions instead of remembering a meaningful relationship between time on page and customer retention.

And that's assuming that they're there for the entire meeting.

So how do you make sure that they get the main points? By using the visuals you've created in Exploratory Data Analysis. You created them to learn more about the dataset, and seeing them piqued your curiosity to explore further and examine relationships, trends, or insights.

This means that they can make great starting points for your presentation. For example, if you wanted to talk about a specific outlier in the data, you could talk about your process, how you plotted the data, and how you came across this point in the data that didn't make sense. Or you could show this chart.

Source: https://commons.wikimedia.org/wiki/File:Outlier_statistics.svg

If the outlier on this graph turns out not to be erroneous data but instead something about your company that you wanted to highlight, then you've already got something that will draw your audience's attention. Charts that you generated to examine specific phenomena can be turned into slides that summarize your work with only a couple of changes, making them powerful tools for guiding your audience towards the data you want to present.

But to do this, you have to ask a single question: What did I learn from this analysis?

Visualize your thought process

Exploratory Data Analysis is, in essence, your thought process regarding learning about a dataset. It visualizes how you went about making sure that the data was structured, which variables you thought might be related, and how you figured out your next steps.

It's a better method of guiding your users through a dataset and your work. By consolidating all of those words into a single picture, you provide a snapshot of what stood out and led you to do the things you did.

If they can see what stands out in the data, and nothing they see looks like an error in your process, they might be able to understand and accept the conclusions you've drawn.

And doing so allows them to get the answers they care about at a level of detail that is not overwhelming.

But this isn't the only use Data Science has for visualization. It's also a crucial part of another common task: building a model.

Data-Informed Design by Kai Wong
