This post continues my report on PAKDD 2014 in Tainan (Taiwan).
The panel about big data
On Friday, there was a great panel about big data with seven top researchers from the field of data mining. I will try to faithfully report some of the interesting opinions and ideas that I heard during the panel. Of course, the text below is my interpretation.
Learning from large data
Geoff Webb discussed the challenges of learning from large quantities of data. He mentioned that the majority of research focuses on how to scale up existing algorithms rather than on designing new algorithms. He also mentioned that different algorithms have different learning curves: some models may work very well with small data, while other models may work better with big data. In fact, models that can fit large amounts of complex data may tend to overfit when trained on small data.
In his opinion, we should not just try to scale up state-of-the-art algorithms but design new algorithms that can cope with huge quantities of data, high dimensionality and fine-grained data. We need low-bias, very efficient and probably out-of-core algorithms.
Another interesting point is that there is a popular myth that any algorithm will work well if we can train it with big data. That is not true. Different algorithms have different learning curves (they produce different error rates depending on the size of the training data).
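To give an idea of what "out of core" means in practice, here is a minimal sketch of my own (it was not shown at the panel). It assumes a hypothetical synthetic data stream and uses plain online logistic regression as a stand-in for a real algorithm. The point is only that the examples are consumed one at a time and never stored, so memory use does not grow with the size of the data.

```python
import math
import random


def example_stream(n, seed=0):
    """Yield (features, label) pairs one at a time, as if read from disk.

    Hypothetical synthetic data: the label is 1 when x1 + x2 > 0.
    """
    rng = random.Random(seed)
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        yield x, 1 if x[0] + x[1] > 0 else 0


def predict_proba(weights, bias, x):
    """Logistic model: estimated probability that the label is 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_online(stream, n_features, learning_rate=0.5):
    """One pass of stochastic gradient descent over the stream.

    Only the weights are kept in memory, never the examples themselves.
    """
    weights = [0.0] * n_features
    bias = 0.0
    for x, y in stream:
        error = predict_proba(weights, bias, x) - y
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * error
    return weights, bias


def accuracy(weights, bias, stream):
    """Fraction of streamed examples that the model classifies correctly."""
    correct = total = 0
    for x, y in stream:
        predicted = 1 if predict_proba(weights, bias, x) >= 0.5 else 0
        correct += int(predicted == y)
        total += 1
    return correct / total


if __name__ == "__main__":
    w, b = train_online(example_stream(20000, seed=1), n_features=2)
    print(accuracy(w, b, example_stream(2000, seed=2)))
```

In a real setting the generator would read records from disk or from the network instead of producing random points, but the training loop would look the same.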
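The idea of a learning curve can be illustrated with a small experiment. The sketch below is my own toy example, not something presented at the panel. It assumes a hypothetical two-dimensional problem (the label depends on whether the two coordinates have the same sign) and compares a high-bias model (nearest centroid) with a low-bias one (1-nearest neighbour) for several training set sizes.

```python
import random


def make_data(n, rng):
    """Hypothetical toy problem: the label is 1 when x and y have the same sign."""
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [(p, 1 if p[0] * p[1] > 0 else 0) for p in points]


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest_centroid(train):
    """High-bias model: predict the class whose mean point is closest."""
    centroids = {}
    for label in (0, 1):
        members = [p for p, y in train if y == label]
        if members:
            centroids[label] = (
                sum(p[0] for p in members) / len(members),
                sum(p[1] for p in members) / len(members),
            )
    return lambda p: min(centroids, key=lambda c: squared_distance(p, centroids[c]))


def one_nearest_neighbour(train):
    """Low-bias model: predict the label of the closest training point."""
    return lambda p: min(train, key=lambda item: squared_distance(p, item[0]))[1]


def error_rate(model, test):
    """Fraction of test points that the model labels incorrectly."""
    return sum(model(p) != y for p, y in test) / len(test)


def learning_curves(sizes, n_test=500, seed=0):
    """Return {model name: [test error rate for each training size]}."""
    rng = random.Random(seed)
    test = make_data(n_test, rng)
    curves = {"nearest centroid": [], "1-NN": []}
    for n in sizes:
        train = make_data(n, rng)
        curves["nearest centroid"].append(error_rate(nearest_centroid(train), test))
        curves["1-NN"].append(error_rate(one_nearest_neighbour(train), test))
    return curves


if __name__ == "__main__":
    for name, errors in learning_curves([10, 50, 100, 400]).items():
        print(name, errors)
```

On this particular problem the nearest centroid model cannot represent the class boundary, so more data should not help it much, while 1-nearest neighbour should keep improving as the training set grows. Other problems would give differently shaped curves.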
Big data and the small footprint
Another interesting opinion was given by Edward Chang. He mentioned that simple methods can often outperform complex classifiers when the number of training examples is large. He also mentioned that complex algorithms are hard to parallelize and that the solution may thus be to use simple algorithms for big data. As an example, he mentioned that he tried to parallelize “deep learning” algorithms for two years and failed because it was too complex.
Another key idea is that data mining on big data should have a small footprint in terms of memory and power consumption. The latter point is especially important for wearable computers. But of course, some of the processing could be done in the cloud.
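As an illustration of why some simple methods are easy to parallelize, here is a small sketch of my own (not from the panel). It assumes a count-based model in the style of Naive Bayes and hypothetical data shards, where each example is a list of words with a label. Each shard can be counted independently, and the partial counts are then simply added together.

```python
from collections import Counter


def count_shard(shard):
    """Compute the counts that a count-based model needs for one data shard.

    Each example is (list of words, label). Shards are hypothetical partitions
    of the data that could be processed on different machines.
    """
    counts = Counter()
    for words, label in shard:
        counts[("label", label)] += 1
        for word in words:
            counts[("word", word, label)] += 1
    return counts


def merge(counters):
    """Combine per-shard counts by addition; the order of the shards is irrelevant."""
    total = Counter()
    for counts in counters:
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total


if __name__ == "__main__":
    shards = [
        [(["big", "data"], "tech"), (["deep", "data"], "tech")],
        [(["health", "care"], "health")],
    ]
    print(merge(count_shard(s) for s in shards))
```

Because the merge step is just an addition, the result is the same as counting the whole dataset on a single machine. An algorithm whose steps depend on each other is much harder to split this way.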
Should we focus on the small data problems?
Another very interesting point of view was presented by George Karypis. We are told that big data is everywhere and that there is more and more data. We have responded by proposing technologies such as MapReduce, linear models, deep learning, sampling, sub-linear algorithms, etc. However, in his view, we should stop spending time on big data problems that are relevant to only a few companies (e.g. Google, Microsoft).
We should rather focus on “deep data”. This means data that may be small but that is highly complex, is computationally expensive to analyze and requires a “deep” understanding, yet can easily fit on today's workstations and small-scale clusters.
We should focus on applications that are useful rather than concentrating too much work on big data.
On the need to cross disciplines
Another refreshing point of view was that of Shonali Krishnaswamy.
She mentioned that data mining on mobile platforms may be hard due to complex computations, limited resources and users who have short attention spans.
Moreover, to be able to perform data mining on big data, we will need to cross disciplines by getting inspired by work from the fields of: (1) parallel/distributed algorithms, (2) mobile/pervasive computing, (3) interfaces/visualizations, (4) decision sciences and (5) perhaps semantic agents.
Issues in healthcare
There was also some discussion about issues in healthcare by Jiming Liu. I will not go into too much detail about this one since it is not closely related to my own work. But some of the challenges that were mentioned are how to deal with diversity, complexity, timeliness, diverse data sources, tempo-spatial scales with respect to the problem, complex interactions and structural biases, how to perform data-driven modelling, how to test results and services, and how to access and share data.
There was also another discussion by Longbing Cao about the need for coupling. I did not take many notes about this one, so I will not discuss it here.
Continue reading my PAKDD 2014 report (part 2) here
That is all I wanted to write for now. If you like this blog, you can tweet about it and/or follow my Twitter account @philfv to get notified about new posts.
Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining library SPMF, offering more than 65 data mining algorithms.