Today, I will discuss the topic of Big Data, which is a very popular topic nowadays. The popularity of big data can be seen for example in universities. Many universities are currently searching for professors who do research on “big data”. Moreover, all the big conferences on data mining have workshops, tutorials and keynote speeches on “big data”. Besides, many researchers try to call their research “big data” or try to find some “big data” projects to get more funding.
What is big data? There have been many attemps to define what is big data. One key aspect is that there should be a huge amount of data (e.g. terabytes) that cannot be handled by a single computer. Other aspects that are proposed to defined big data is that the data is heterogeneous (e.g. combine many different types of data), that it is evolving (e.g. new data is always coming), and that it is decentralized and distributed.
In this post, I will argue that Small Data also have difficult problems that need to be solved. Actually, even in very small datasets, there exists some very difficult problems that we have not solved yet and that are more difficult than some big data problems. In general, I will call problems that are difficult to solve by an algorithm “Big Problems“. These problems can be found either in Small Data or Big Data.
An example of a Big Problem in Small Data
I will give an example of a Big Problem that can occurs with just a few kilobytes of data.
In Data Mining, there is one popular task that is called frequent itemset mining (proposed by Agrawal & Srikant, 1993). It is defined as follows. Let be a set of transactions, where each transaction is a set of symbols. For example, consider the three following transactions:
T1: {beer}, {egg}, {bread}
T2 : {beer, {cheese}, {wine}, {bread},{milk}
T3: {cheese}, {wine}, {bread}
The goal of frequent itemset mining is to enumerate all subsets that are common to more than minsup transactions, where minsup is set by the user. For example, if minsup = 2 transactions, the result would be {{beer}, {cheese}, {wine}, {bread}, {beer, bread}, {cheese, wine}, {cheese, bread}, {wine, bread}, {cheese, wine, bread}}.
This problem is an enumeration problem, that has many more difficult variations such as when quantities and weights are considered (e.g. high utility itemset mining). However, even for the basic version of this problem, the search space can be huge, even for very small datasets. I will give a simple example. Consider 5 transactions such that three transactions contains 100 common symbols (items). Set minsup = 2 transactions, and there will be more than 2^100 – 1 groups of items to be output. In this case, no algorithm would terminate or even if it would, it would run out of disk space of memory to save the result. This example shows that even with such a small set of data, we can find very difficult problems (Big Problems).
An example of an Easy Problem in Big Data
On the other hand, there are many Big Data problems that are easy to solve. For example, consider calculating the average sale price of items sold in a retail store during the evening. Calculating the average price of a set of customer transactions can be easily paralellizde and there is no major challenge to do that.
Conclusion.
So in conclusion, what I want to highlight in this post is that although the amount of data plays a role in the difficulty to solve a problem, we should perhaps not put too much emphasis on only this aspect. The reason is that it is not solely the amount of data that makes a problem difficult, It is also the type of the problem and the search space that is involved. For example, making a program that can play Go and beat the best humans at that game is extremelly difficult even if it does not involve Big Data. We should also remember that we have not solved all the Big Problems in Small Data
—–
That is all I wanted to write for now. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.