In the data science and data mining communities, many practitioners apply various algorithms to data without ever attempting to visualize it. This is a big mistake, because visualizing the data often greatly helps to understand it: some phenomena become obvious as soon as the data is plotted. In this blog post, I will give a few examples to convince you that visualization can greatly help to understand data.
An example of why using statistical measures may not be enough
The first example that I will give is Anscombe's quartet, devised by the statistician Francis Anscombe. It is a set of four datasets, each consisting of (X, Y) points. These four datasets are defined as follows:
| Dataset I: x | Dataset I: y | Dataset II: x | Dataset II: y | Dataset III: x | Dataset III: y | Dataset IV: x | Dataset IV: y |
|---|---|---|---|---|---|---|---|
| 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
| 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
| 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
| 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
| 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
| 14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
| 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
| 4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
| 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
| 7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
| 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
To get a feel for the data, the first thing that many would do is to calculate some statistical measures such as the mean, the variance, and the standard deviation. These measures describe the central tendency of the data and its dispersion. If we do this for the four datasets above, we obtain:
Dataset 1: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 2: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 3: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 4: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
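For readers who would like to verify these numbers, here is a minimal sketch using Python with NumPy (any similar tool would do) that computes the mean and sample variance for dataset I; the three other datasets can be checked in the same way:

```python
import numpy as np

# Dataset I of the quartet (values taken from the table above)
x1 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

print(x1.mean())       # 9.0
print(x1.var(ddof=1))  # 11.0 (sample variance)
print(y1.mean())       # about 7.50
print(y1.var(ddof=1))  # about 4.13 (sample variance)
```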
So these datasets appear quite similar. They have the same values for all of the above statistical measures. How about calculating the correlation between X and Y for each dataset to see how strongly the points are correlated?
Dataset 1: correlation 0.816
Dataset 2: correlation 0.816
Dataset 3: correlation 0.816
Dataset 4: correlation 0.816
OK, so these datasets are very similar, aren’t they? Let’s try something else. Let’s calculate the regression line of each dataset (that is, the linear equation that best fits the data points).
Dataset 1: y = 3.00 + 0.500x
Dataset 2: y = 3.00 + 0.500x
Dataset 3: y = 3.00 + 0.500x
Dataset 4: y = 3.00 + 0.500x
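Again, these values can be checked with a few lines of Python (a small sketch with NumPy, shown for dataset I only):

```python
import numpy as np

# Dataset I (same values as in the table above)
x1 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Pearson correlation between x and y
r = np.corrcoef(x1, y1)[0, 1]
print(round(r, 3))  # 0.816

# Least-squares line y = intercept + slope * x (degree-1 polynomial fit)
slope, intercept = np.polyfit(x1, y1, 1)
print(round(intercept, 2), round(slope, 3))  # 3.0 0.5
```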
Again the same! Should we stop here and conclude that these datasets are the same?
This would be a big mistake because actually, these four datasets are quite different! If we visualize these four datasets with a scatter plot, we obtain the following:
This shows that these datasets are actually quite different. The lesson from this example is that by visualizing the data, differences sometimes become quite obvious.
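For readers who want to reproduce this figure, here is a minimal sketch using matplotlib (my choice of plotting library; any other would do just as well):

```python
import matplotlib.pyplot as plt

# The four datasets of the quartet (values from the table above)
quartet = {
    "Dataset I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
                    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Dataset II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
                    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Dataset III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
                    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "Dataset IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# One scatter plot per dataset, in a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```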
Visualizing the relationship between two attributes
Simple visualization techniques like scatter plots are also very useful for quickly analyzing the relationship between pairs of attributes in a dataset. For example, by looking at the two following scatter plots, we can quickly see that the first one shows a positive correlation between the X and Y axes (when values on the X axis are greater, values on the Y axis are generally also greater), while the second one shows a negative correlation (when values on the X axis are greater, values on the Y axis are generally smaller).
If we plot two attributes on the X and Y axes of a scatter plot and there is no correlation between the attributes, the result may look like the following figures:
These examples again show that visualizing data can help to quickly understand the data.
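To make this concrete, here is a small sketch that generates three toy datasets (synthetic data created just for this example) showing a positive correlation, a negative correlation, and no correlation, and plots them side by side:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y_pos = x + rng.normal(0, 1, 200)    # positive correlation: y grows with x
y_neg = -x + rng.normal(0, 1, 200)   # negative correlation: y decreases as x grows
y_none = rng.uniform(0, 10, 200)     # no correlation: y is independent of x

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, y, title in zip(axes, [y_pos, y_neg, y_none],
                        ["positive correlation", "negative correlation", "no correlation"]):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```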
Visualizing outliers
Visualization techniques can also be used to quickly identify outliers in the data. For example, in the following chart, the data point at the top can be quickly identified as an outlier (an abnormal value).
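Such a chart can be produced with a small sketch like the following, where I use synthetic data and insert one abnormal value on purpose:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.arange(50)
y = rng.normal(10, 1, 50)  # normal values, all close to 10
y[25] = 25                 # one abnormal value (the outlier)

plt.scatter(x, y, s=15)
plt.show()
```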
Visualizing clusters
In data mining, several clustering algorithms have been proposed to identify clusters of similar values in the data. For low-dimensional data, these clusters can often also be discovered visually. For example, in the following data, it is quite apparent that there are two main clusters (groups of similar values), without applying any algorithm.
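Here is one last sketch, again with synthetic data: two well-separated groups of points are generated, and a plain scatter plot is enough to see the two clusters without running any clustering algorithm:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Two groups of 100 points each, centered at (2, 2) and (7, 7)
cluster_a = rng.normal(loc=(2, 2), scale=0.5, size=(100, 2))
cluster_b = rng.normal(loc=(7, 7), scale=0.5, size=(100, 2))
points = np.vstack([cluster_a, cluster_b])

plt.scatter(points[:, 0], points[:, 1], s=10)
plt.show()
```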
Conclusion
In this blog post, I have shown a few simple examples of how visualization can help to quickly see patterns in the data without actually applying any fancy models or performing calculations. I have also shown that statistical measures can be quite misleading if no visualization is done, with the classic example of Anscombe's quartet.
In this blog post, the examples are mostly done using scatter plots with two attributes at a time, to keep things simple. But there exist many other types of visualizations.
—
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.