Report of the PAKDD 2014 conference (part 1)

I am currently at the PAKDD 2014 conference in Tainan (Taiwan). In this post, I will report some interesting information about the conference and the talks that I have attended.


Importance of Succinct Data Structures for Data Mining

I have attended a very nice tutorial at PAKDD by R. Raman about succinct data structures for data mining. I will try to report some of the main points of this tutorial here as faithfully as possible (though this is my own interpretation).

A simple definition of a succinct data structure is as follows: it is a data structure that uses less memory than a naive data structure for storing the same information.

Why should we care?

  • A first reason is that to perform fast data mining, it is usually important to have all the data in memory, but sometimes the data cannot fit. In the age of big data, if we use very compact data structures, then we can fit more data into memory, and perhaps we do not need distributed algorithms to handle big data.  An example provided in the tutorial is a paper by Cammy & Zhao that used a compact structure on a single computer to beat a distributed MapReduce implementation performing the same task.  If all the data fits in the memory of a single computer, performance may be better because data access is faster on a single computer than when the computation is distributed.
  • A second reason is that if a data structure is more compact, a computer may fit more of it in its CPU cache, so access to the data may even be faster.  Therefore, compressing data with a succinct data structure does not always have a negative effect on execution time (a small sketch follows this list).
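
To make the compactness argument concrete, here is a minimal Java sketch (my own illustration, not from the tutorial; in practice java.util.BitSet provides this) of storing n boolean flags packed into 64-bit words, which takes roughly 1/8 of the memory of a boolean[] and therefore lets more of the data sit in the cache:

public class PackedBits {
    // n flags stored in ceil(n/64) long words (~n/8 bytes),
    // versus roughly n bytes for a plain boolean[] in Java.
    private final long[] words;

    public PackedBits(int n) {
        words = new long[(n + 63) / 64];
    }

    public void set(int i) {
        words[i / 64] |= 1L << (i % 64);
    }

    public boolean get(int i) {
        return (words[i / 64] & (1L << (i % 64))) != 0;
    }

    public static void main(String[] args) {
        PackedBits bits = new PackedBits(1000);
        bits.set(42);
        System.out.println(bits.get(42)); // true
        System.out.println(bits.get(43)); // false
    }
}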

What characteristics should a compressed data structure provide?

  • One important characteristic is that it should compress the information, and an algorithm using the data structure should ideally be able to work directly on it without decompressing the data.
  • Another desirable characteristic is that it should provide the same interface as an uncompressed data structure. In other words, an algorithm should be able to replace an uncompressed data structure with a compressed one without being modified.
  • A compressed data structure is usually composed of the data and an index that provides quick access to the data.  The index should be smaller than the data (see the sketch after this list).
  • There is often a trade-off between redundancy in the data structure and query time: reducing redundancy may increase query time.
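
To illustrate the "data + small index" and "same interface" points, here is a minimal Java sketch of a classic rank-supported bit vector (my own simplified version of a standard construction, not code from the tutorial). The raw bits are the data; the index is one precomputed counter per 512-bit superblock (about 6% overhead), and rank queries never decompress anything:

public class RankBitVector {
    private final long[] bits;      // the data: a raw bit vector
    private final int[] superRank;  // the index: number of 1s before each
                                    // 512-bit superblock (one int per 8 words)

    public RankBitVector(long[] bits) {
        this.bits = bits;
        this.superRank = new int[bits.length / 8 + 1];
        int count = 0;
        for (int w = 0; w < bits.length; w++) {
            if (w % 8 == 0) superRank[w / 8] = count;
            count += Long.bitCount(bits[w]);
        }
    }

    // rank1(i): how many 1 bits occur among positions 0..i-1.
    public int rank1(int i) {
        int word = i / 64;
        int r = superRank[word / 8];
        for (int w = (word / 8) * 8; w < word; w++) r += Long.bitCount(bits[w]);
        if (i % 64 != 0) r += Long.bitCount(bits[word] & ((1L << (i % 64)) - 1));
        return r;
    }

    public static void main(String[] args) {
        long[] data = new long[16];
        data[0] = 0b1011L;              // bits 0, 1 and 3 are set
        RankBitVector v = new RankBitVector(data);
        System.out.println(v.rank1(4)); // 3
    }
}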

There exist various measures to assess how many bits are necessary to encode some information: naive, information-theoretic, entropy-based…  If we design a succinct data structure that uses more memory than what these measures indicate is necessary, then we are doing something wrong.
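
To give a concrete instance of such a bound (my own example, not from the tutorial): there are C_n = (2n choose n) / (n + 1) distinct binary trees on n nodes (the Catalan number), so any encoding needs at least log2(C_n) ≈ 2n bits in the worst case, while a pointer-based tree uses on the order of n log n bits. A representation using 2n + o(n) bits, like the one sketched below, is therefore considered succinct.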

In the tutorial, it was also mentioned that several libraries provide succinct data structure implementations, such as Sux4J, SDSL-lite, SDSL…

Many examples of succinct data structures were also provided, such as binary trees implemented as bit vectors, multibit trees, wavelet trees, etc.
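
To illustrate the first of these, here is a minimal sketch of one way to encode a binary tree as a bit vector (a simplified LOUDS-style encoding; my own illustration, not code from the tutorial). Nodes are numbered in level order, and navigation is done with the rank operation, computed naively here where a real succinct structure would use a small index like the one sketched above:

public class SuccinctBinaryTree {
    // For node i (level order), bit 2i says whether i has a left child and
    // bit 2i+1 whether it has a right child: 2n bits in total, instead of
    // two 64-bit pointers per node. A boolean[] is used here for clarity;
    // in practice the bits would be packed into long words.
    private final boolean[] bits;

    public SuccinctBinaryTree(boolean[] bits) { this.bits = bits; }

    // Number of 1s in bits[0..j] (naive O(j); O(1) with an o(n)-bit index).
    private int rank1(int j) {
        int r = 0;
        for (int k = 0; k <= j; k++) if (bits[k]) r++;
        return r;
    }

    // Level-order number of the left/right child of node i, or -1 if none.
    public int leftChild(int i)  { return bits[2 * i]     ? rank1(2 * i)     : -1; }
    public int rightChild(int i) { return bits[2 * i + 1] ? rank1(2 * i + 1) : -1; }

    public static void main(String[] args) {
        // Tree: node 0 has children 1 (left) and 2 (right); node 1 has a right child 3.
        boolean[] b = {true, true, false, true, false, false, false, false};
        SuccinctBinaryTree t = new SuccinctBinaryTree(b);
        System.out.println(t.leftChild(0));  // 1
        System.out.println(t.rightChild(0)); // 2
        System.out.println(t.rightChild(1)); // 3
        System.out.println(t.leftChild(2));  // -1 (leaf)
    }
}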

On Applications of Association Rule Mining

Another very interesting talk was given by G. Webb. The talk first compared association rule mining with methods from the field of statistics for studying associations in data. It was explained that:

  • statistics often tries to find a single model that fits the data, whereas association rule mining discovers multiple local models (associations) and lets the user choose the best ones (the rules that best explain the data);
  • association rule mining scales to high-dimensional data, whereas classical statistical techniques cannot be applied to a large number of variables.

So why is association rule mining not used more in real applications? It was argued that researchers in this field focus too much on performance (speed, memory) rather than on developing algorithms that can find unusual and important patterns.  By focusing only on finding frequent rules, too much “junk” is presented to the user (frequent rules that are obvious).  It was shown that in some applications it is actually not the frequent rules that are important, but the rare ones that have high statistical significance or matter to the user.

So what is important to the user? It is a little subjective. However, a few principles can help determine what is NOT important to the user.

  • 1) If the frequency of an association can be predicted by assuming independence, then the association is not important. For example, finding that all persons having prostate cancer are male is an uninteresting association, because it is obvious that only males can get prostate cancer (see the sketch after this list).
  • 2) Redundant associations should not be presented to the user. If an item X is a necessary consequence of a set of items Y, then {X} U Y is associated with everything that Y is associated with, and we do not need all these rules. In general, we should keep either the simple rules or the complex ones, not both (i.e., remove the redundant rules).
  • 3) Statistical tests can be applied to filter out non-significant associations.
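
To make principles 1 and 3 concrete, here is a minimal Java sketch (my own illustration using standard measures, not code from the talk). It checks whether a rule X → Y occurs more often than independence would predict (the lift measure) and whether the deviation is statistically significant (a chi-square test on the 2x2 contingency table; with 1 degree of freedom, the critical value at the 0.05 level is 3.841):

public class RuleFilter {

    // n: total transactions; cX, cY: supports of X and Y; cXY: support of X ∪ Y.
    public static double lift(int n, int cX, int cY, int cXY) {
        double expected = (double) cX * cY / n; // expected count under independence
        return cXY / expected;                  // ≈ 1 means "predictable" (principle 1)
    }

    public static double chiSquare(int n, int cX, int cY, int cXY) {
        // Observed counts of the four cells: X·Y, X·notY, notX·Y, notX·notY.
        long[] obs = { cXY, cX - cXY, cY - cXY, n - cX - cY + cXY };
        // Expected counts of the same cells under independence.
        double[] exp = {
            (double) cX * cY / n,
            (double) cX * (n - cY) / n,
            (double) (n - cX) * cY / n,
            (double) (n - cX) * (n - cY) / n
        };
        double chi2 = 0;
        for (int k = 0; k < 4; k++) {
            double d = obs[k] - exp[k];
            chi2 += d * d / exp[k];
        }
        return chi2;
    }

    public static void main(String[] args) {
        int n = 10000, cX = 2000, cY = 3000, cXY = 900;
        System.out.println(lift(n, cX, cY, cXY));      // 1.5 (positive association)
        System.out.println(chiSquare(n, cX, cY, cXY)); // ~268, far above 3.841
    }
}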

Also, it is desirable to mine associations efficiently and to be able to explain to the user why some rules were eliminated, if necessary.

Also, if possible, we may use top-k algorithms, where the user chooses the number k of patterns to be found rather than setting a minimum support (minsup) threshold. The reason is that sometimes the best associations are rare.
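
A minimal sketch of the top-k idea (my own illustration; real top-k algorithms interleave this with the mining itself and use the rising threshold to prune the search): keep the k best rules seen so far in a min-heap ordered by the chosen interestingness measure:

import java.util.Comparator;
import java.util.PriorityQueue;

public class TopKPatterns {

    static final class ScoredRule {
        final String rule;
        final double score; // any interestingness measure, e.g. lift
        ScoredRule(String rule, double score) { this.rule = rule; this.score = score; }
    }

    private final int k;
    // Min-heap on the score: the root is the weakest rule currently kept.
    private final PriorityQueue<ScoredRule> heap =
            new PriorityQueue<>(Comparator.comparingDouble(r -> r.score));

    public TopKPatterns(int k) { this.k = k; }

    // Offer a candidate rule; only the k best are retained.
    public void offer(String rule, double score) {
        if (heap.size() < k) {
            heap.add(new ScoredRule(rule, score));
        } else if (score > heap.peek().score) {
            heap.poll();                          // evict the weakest rule
            heap.add(new ScoredRule(rule, score));
        }
    }

    // The score a new candidate must beat: a threshold that rises as mining proceeds.
    public double currentThreshold() {
        return heap.size() < k ? Double.NEGATIVE_INFINITY : heap.peek().score;
    }

    public static void main(String[] args) {
        TopKPatterns top = new TopKPatterns(2);
        top.offer("A -> B", 1.5);
        top.offer("C -> D", 3.0);
        top.offer("E -> F", 2.0);                   // evicts "A -> B"
        System.out.println(top.currentThreshold()); // 2.0
    }
}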

These were the main ideas that I noticed in this presentation.

Continue reading my PAKDD 2014 report (part 2) here

—–

That is all I wanted to write for now. If you like this blog, you can tweet about it and/or subscribe to my Twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is an assistant professor in Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.
