SPMF 2.52 is released

This is just a short blog post to let you know that a new version of the SPMF library, version 2.52, has been released.


This new version contains two new algorithms for high utility itemset mining and one for high utility quantitative itemset mining.

  • The TKU-CE algorithm for heuristically mining the top-k high-utility itemsets with cross-entropy (thanks to Wei Song, Lu Liu, Chuanlong Zheng et al. for the original code)
  • The TKU-CE+ algorithm for heuristically mining the top-k high-utility itemsets with cross-entropy, with optimizations (thanks to Wei Song, Lu Liu, Chuanlong Zheng et al. for the original code)
  • The TKQ algorithm for mining the top-k quantitative high utility itemsets (thanks to M. Nouioua et al. for the original code)

Besides, since December, four more algorithms have been released (in SPMF 2.50 and 2.51):

  • The SFU-CE algorithm for mining skyline frequent high utility itemsets using the cross-entropy method (thanks to Wei Song, Chuanlong Zheng et al. for the original code)
  • The POERMH algorithm for mining partially ordered episode rules in a sequence of events, using the head support (thanks to Yangming Chen et al. for the original code)
  • The SFUI_UF algorithm for mining skyline utility itemsets using utility filtering (thanks to Wei Song, Chuanlong Zheng et al. for the original code)
  • The HAUIM-GMU algorithm for mining high average utility itemsets (thanks to Wei Song, Lu Liu et al. for the original code)

Need your contributions!

For the SPMF project, we are always looking for new contributors. If you are interested in participating (e.g. contributing code of new algorithms, bug fixes, etc.), you can contact me at philfv AT qq.com.

Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.

Posted in Data Mining, open-source, Pattern Mining, spmf, Utility Mining

Typhoon Path Prediction using Deep Learning

Typhoons can be very destructive. Predicting their paths is important to be prepared when they arrive. In this blog post, I will talk briefly about an applied research topic: predicting the paths of typhoons. This blog post is based on a recent research paper published in Neural Computing and Applications, of which I am a co-author:

Xu, G., Xian, D., Fournier-Viger, P., Li, X., Ye, Y., Hu, X. (2022). AM-ConvGRU: A Spatio-Temporal Model for Typhoon Path Prediction. Neural Computing and Applications, Springer.

Over the years, many models have been proposed for typhoon path prediction, but their accuracy could still be improved. In general, we want models that are as accurate as possible.

Predicting typhoon paths is a difficult problem because it involves spatial and temporal data described by numerous features. Moreover, some features are 2D while others are 3D, and combining them is also a challenge.

To address this issue, in the above paper, we presented a deep learning framework to perform accurate predictions of the paths of typhoons. The model is called Attention-based Multi ConvGRU (AM-ConvGRU).

For that research project, typhoon data was obtained from two sources: (1) the China Meteorological Administration (CMA) and (2) the European Centre for Medium-Range Weather Forecasts (ECMWF). The first provides 2D typhoon data, while the second provides 3D typhoon data. The data covers typhoons in the Western North Pacific (WNP) basin. Here is a visualization of typhoon paths from the paper:

After obtaining the data, it has to be preprocessed. In particular, the 2D typhoon data is transformed into 53 features, according to a method called CLIPER. Some of these features are depicted in the table below as an example.

Similarly, the 3D typhoon data has to be prepared. This is done by dividing the Earth into a grid of 1 degree by 1 degree at several geopotential height levels, and then looking more closely at the zone around the typhoon center. I will skip the details. But the result is a 3D time series structure:

After that, the deep learning model is trained using the 2D and 3D typhoon data. This is an overview of the model’s architecture:

I will skip the details.

To evaluate the proposed model, it was compared with state-of-the-art models. It was shown that the proposed model can generally provide better predictions.

To show the output a little more clearly, here is an illustration comparing predictions by the proposed model and a baseline model to the historical path of Typhoon Mangkhut:

It can be seen that the proposed model is closer to the historical path than the baseline by over 50 km. Here is another example for Typhoon Talim:

The improvement in distance error for the proposed model over the baseline is over 100 km.

Hope this has been interesting. This is just a very short overview of the topic of typhoon path prediction. If you are interested, please check the paper!

Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.


Posted in artificial intelligence, Data Mining, Machine Learning

Brief report about ADMA 2021

This week, I attended the 16th International Conference on Advanced Data Mining and Applications (ADMA 2021), which was held online due to the COVID pandemic.

What is ADMA ?

ADMA is a medium-scale conference that focuses on data mining and its applications (hence its name). The ADMA conference is generally held in China, but it has been held twice in Australia and once in Singapore. I have participated in this conference several times. If you want to read my reports about previous ADMA conferences, you can click here: ADMA 2019, ADMA 2018, ADMA 2013 and ADMA 2014.

This time, the conference was called ADMA 2021, although it was held from the 2nd to the 4th of February 2022. The conference was held in 2022 because it was postponed due to the COVID-19 pandemic. ADMA 2021 was co-located with the Australasian Joint Conference on Artificial Intelligence (AJCAI 2021), which is a national conference about AI.

Proceedings

The proceedings are published by Springer in the Lecture Notes in Artificial Intelligence series as two volumes. The proceedings contain 61 papers, among which 26 were presented orally at the conference while the remaining were presented as posters.

The papers were selected from 116 submissions, which means an overall acceptance rate of 61/116 ≈ 52.6%, and 26/116 ≈ 22.4% for the papers presented orally.

Schedule

The ADMA conference was held over three days. There were three keynote talks, two panels, six invited talks, two hours of poster sessions, and some regular paper sessions. The schedule is below.

The conference was held according to the Australian time zone, which means that I had to wake up at 6 AM in China to see the first events.

A virtual conference

The ADMA conference was hosted on the Zoom platform for viewing the presentations, and on another platform called GatherTown for social interactions. The GatherTown platform is used by several conferences. Using this platform, each attendee can create an avatar and explore a 2D world. Then, when you move close to the avatar of another person, you can have a discussion with that person via webcam and microphone. This recreates a little of the atmosphere of a real conference. Here are a few screenshots of this virtual environment:

Several options to edit your avatar
The welcome room of ADMA 2021
Another room
A chat room with a few chairs and a table
One of the poster rooms

Day 1 – Panel on Responsible AI

On the first day, there was a panel on responsible AI with 4 invited panelists. The discussion covered topics such as how to improve the brand of Australia for AI, the need for more funding for responsible AI in Australia, AI regulations, AI ethics, deepfakes, etc.

Day 1 – Paper sessions

On the first day, there were also some paper sessions, covering several topics such as personalized question recommendation, cheating detection, a new dataset, and medical applications.

Day 1 – Poster sessions

The poster session was held in GatherTown. During the poster session, I mostly stayed beside my poster in case some people would come to talk with me. It works as follows: when some persons approach your poster, a webcam discussion starts with them. There were over 60 persons online at that time, and I discussed with maybe 4 or 5 of them. Here are a few screenshots from the poster session:

Waiting for people to come see my virtual poster
An example of view that we get when looking at a poster (my poster)

Overall, this idea of using GatherTown is interesting. It enables some social interaction, which would otherwise be lacking at a virtual conference. However, one thing that I think could be improved about poster sessions in GatherTown is that there is no index or search function to find a poster. Thus, to find a poster, we must go around the room, which takes time. Another area for improvement is that when showing a poster to another attendee, that person cannot see your mouse cursor. That is something that the GatherTown developers could improve.

Pattern mining papers

As readers of this blog know, I am interested in pattern mining research. So here, I have made a list of the main pattern mining papers presented at the conference:

It is interesting to see that three of these papers are related to high utility pattern mining, a popular research direction in pattern mining. The last paper is related to process mining, which is also a popular topic concerning the application of pattern mining and data mining to analyze business processes.

Award ceremony

I missed the award ceremony because it started very early (6:00 AM) in my time zone (China), so I will not report the details about the awards, but I got the news afterward that I received this award:

Next ADMA conference (ADMA 2022)

The next ADMA conference will be called ADMA 2022 and will be held in Brisbane, Australia, probably around December.

Conclusion

Overall, ADMA 2021 was a good conference. That is all for today!


Philippe Fournier-Viger is a full professor, working in China, and founder of the SPMF open-source data mining library.

Posted in Big data, Conference, Data Mining, Data science

(video) TKQ : Top-K Quantitative High Utility Itemset Mining

In this blog post, I will share a short video about a new algorithm for top-k quantitative high utility itemset mining, which will be presented at ADMA 2021.

Here is the link to watch the paper presentation:
https://www.philippe-fournier-viger.com/spmf/videos/poerm_video.mp4

And here is the reference to the paper:

Nouioua, M., Fournier-Viger, P., Gan, W., Wu, Y., Lin, J. C.-W., Nouioua, F. (2021). TKQ: Top-K Quantitative High Utility Itemset Mining. Proc. 16th Intern. Conference on Advanced Data Mining and Applications (ADMA 2021) Springer LNAI, 12 pages [ppt]

The source code and datasets will be made available in the next release of the SPMF data mining library.

If you are interested in this topic, you can also read my blog post that explains the key ideas of high utility quantitative itemset mining.

That is all I wanted to write for today!

Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Data Mining, Pattern Mining, Video

How many association rules in a dataset?

This is a very short blog post about calculating the number of possible association rules in a dataset. I will assume that you already know what association rule mining is.

Let’s say that you have a dataset that contains r distinct items. With these items, it is possible to make many rules. Since neither the left side nor the right side of a rule can be empty, the left side of a rule can contain from 1 to r−1 items. Let’s say that the number of items on the left side of a rule is called k and that the number of items on the right side is called j.

Then, the total number of association rules that can be made from these r items is the sum, for k = 1 to r−1, of C(r, k) multiplied by the sum, for j = 1 to r−k, of C(r−k, j), which simplifies to 3^r − 2^(r+1) + 1.

For example, let’s say that we have r = 6 distinct items. Then, the number of possible association rules is 602.

This may seem a rather complex expression, but it is correct. I first saw it in the book “Introduction to Data Mining” by Tan, Steinbach and Kumar. If you want to type this expression in LaTeX, here is the code:

\sum_{k=1}^{r-1}\left[\binom{r}{k} \times \sum_{j=1}^{r-k}\binom{r-k}{j}\right] = 3^r - 2^{r+1}+1
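To double-check the closed form, here is a small Python snippet (my own, not from the original post) that compares the double sum with 3^r − 2^(r+1) + 1 for small values of r:

```python
from math import comb

def rule_count_sum(r):
    """Count rules by choosing k items for the left side, then j items
    for the right side from the remaining r - k items."""
    return sum(comb(r, k) * sum(comb(r - k, j) for j in range(1, r - k + 1))
               for k in range(1, r))

def rule_count_closed(r):
    """Closed form: 3^r - 2^(r+1) + 1."""
    return 3 ** r - 2 ** (r + 1) + 1

# The two expressions agree for all small r.
for r in range(1, 12):
    assert rule_count_sum(r) == rule_count_closed(r)

print(rule_count_closed(6))  # 602, as in the example above
```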

By the way, this expression is also correct for counting the number of possible partially-ordered sequential rules or partially-ordered episode rules.

Related to this, the number of itemsets that can be made from a transaction dataset having r distinct items is 2^r − 1.

In that expression, the −1 is because we exclude the empty set. Here is the LaTeX code for that expression:

2^{r}-1
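As a quick sanity check (again my own snippet, not from the post), one can enumerate all non-empty itemsets explicitly and compare the count with 2^r − 1:

```python
from itertools import combinations

def count_itemsets(items):
    """Enumerate every non-empty subset of the items and count them."""
    return sum(1 for k in range(1, len(items) + 1)
               for _ in combinations(items, k))

items = ['a', 'b', 'c', 'd', 'e', 'f']  # r = 6 distinct items
assert count_itemsets(items) == 2 ** len(items) - 1
print(count_itemsets(items))  # 63
```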

Conclusion

This was just a short blog post about pattern mining to discuss the size of the search space in association rule mining. Hope you have enjoyed it.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Pattern Mining

Serious issues with Time Series Anomaly Detection Research

In this short blog post, I will talk about some serious issues that have been raised about many studies on anomaly detection for time series. More precisely, it was recently shown that the results of many papers on time series anomaly detection should not be trusted, due in part to trivial benchmark datasets and unsuitable performance evaluation measures. Two research groups have highlighted these serious problems.

(1) Keynote talk “Why we should not believe 95% of papers on Time Series Anomaly Detection” by Eamonn Keogh

The very well-known time series researcher Eamonn Keogh recently gave a keynote talk where he argued that the results of most anomaly detection papers should not be trusted. The slides of his presentation can be found here. Besides, a video of his talk can be found on YouTube. Basically, it was observed that:

  • Several experiments cannot be reproduced because the datasets are private.
  • Many of the public datasets used in time series anomaly detection are deeply flawed. They contain few anomalies, and the anomalies are often mislabeled. Thus, predicting the mislabeled anomalies can indicate overfitting rather than good performance.
  • Many benchmark datasets are trivial. It was estimated that maybe 90% of benchmark datasets can be solved with one line of code or a decades-old method. This means that many complex deep learning models are completely unnecessary on such datasets, as a single line of code gives the same or better performance. Here is a slide that illustrates this problem:
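To illustrate the “one line of code” point, here is a small example in the spirit of the talk (my own synthetic data and detector, not one of the actual benchmarks or Keogh’s exact one-liners): on a simple series with one injected anomaly, the detector is essentially a single expression that finds where the series changes most abruptly.

```python
import math
import random

# Synthetic series: a smooth sine wave with mild noise and one injected spike.
random.seed(0)
ts = [math.sin(i * 0.06) + random.gauss(0, 0.05) for i in range(1000)]
ts[700] += 3.0  # the injected anomaly

# The "one line of code" detector: the point where consecutive values differ most.
diffs = [abs(ts[i + 1] - ts[i]) for i in range(len(ts) - 1)]
predicted = diffs.index(max(diffs))

print(predicted)  # 699 or 700: right at the injected anomaly
```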

Based on these observations and others, Keogh et al. proposed a new set of benchmark datasets for anomaly detection. I recommend watching this talk on YouTube if you have time. It is really insightful and shows deep problems with anomaly detection research for time series.

(2) Paper from AAAI 2022

There is also a recent paper published at AAAI 2022 that makes similar observations. The paper is called “Towards a Rigorous Evaluation of Time-series Anomaly Detection”. Link: arxiv.org/abs/2109.05257.

The authors basically show that a common evaluation protocol called point adjustment (PA), used for time series anomaly detection, can greatly overestimate the performance of anomaly detection models. The authors did an experiment where they compared several state-of-the-art deep learning models for anomaly detection with three trivial models: (case 1) random anomaly scores, (case 2) a model that gives the input data as anomaly scores, and (case 3) scores produced by a randomized model. They found that in several cases, these three trivial models performed better than the state-of-the-art deep learning models. Here is a picture of that result table (more details in the paper):
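To make the issue concrete, here is a minimal sketch (my own simplification, not the paper’s exact protocol or data) of how point adjustment can reward even a random detector: once any point inside a long anomaly segment is flagged, the whole segment counts as detected, so the F1 score jumps.

```python
import random

def point_adjust(y_true, y_pred):
    """Point adjustment (PA): if any point inside a true anomaly segment
    is flagged, count the whole segment as detected."""
    adjusted = list(y_pred)
    i, n = 0, len(y_true)
    while i < n:
        if y_true[i] == 1:
            j = i
            while j < n and y_true[j] == 1:
                j += 1
            if any(adjusted[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Ground truth: 1000 points with two long anomaly segments.
y_true = [0] * 1000
for start, end in [(100, 200), (600, 750)]:
    y_true[start:end] = [1] * (end - start)

# A trivial "detector": flag about 5% of the points at random.
random.seed(42)
y_pred = [1 if random.random() < 0.05 else 0 for _ in range(1000)]

print(round(f1(y_true, y_pred), 2))                        # low F1 without PA
print(round(f1(y_true, point_adjust(y_true, y_pred)), 2))  # inflated F1 with PA
```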

These results are quite shocking… They mean that the results of several papers cannot really be trusted.

In the paper, the authors propose a solution: using an alternative metric to evaluate models.

Conclusion

The above observations about the state of research on time series anomaly detection cast doubt on several studies in that area. They show that many studies are flawed and that models are poorly evaluated. I do not work on this topic, but I think it is interesting, and it can remind all researchers to be very careful about how they evaluate their models, so as not to fall into a similar trap.

This is just a short overview. Hope that this is interesting!


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Big data, Data Mining, Data science, Time series

Merry Christmas and Happy New Year!

Hi everyone! This is just a short message to wish a merry Christmas and a happy new year to all the readers of this blog and users of the SPMF software!

I will take a short break for a few days and then be back with more content for the blog. I am also currently preparing a new version of the SPMF software that will include new data mining algorithms.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in General

Brief report about IEEE ICDM 2021

This blog post provides a brief report about the IEEE ICDM 2021 conference (International Conference on Data Mining), which was held virtually from New Zealand from the 7th to the 10th of December 2021.

What is ICDM?

ICDM is one of the very top data mining conferences. ICDM 2021 is the 21st edition of the conference. The focus of this conference is on data mining and machine learning. I have attended ICDM a few times. For example, you can read my report about ICDM 2020.

Opening ceremony

The opening ceremony started with a performance by local people. Then, there were some greetings from the general chair, Prof. Yun Sing Koh.

Then, Prof. Xindong Wu, the founder of the conference, talked about ICDM. Here are a few slides from this presentation:

It was said that this year there were over 500 participants.

This is the acceptance rate of ICDM over the years:

The program co-chairs then gave some more details about the ICDM 2021 review process. In particular, this year there were 990 submissions; 98 were accepted as regular papers (9.9% acceptance rate) and 100 as short papers, for an overall acceptance rate of 20%. All papers were reviewed in a triple-blind manner.

In terms of papers by country, the largest number of accepted papers came from China, and then from the USA.

The most popular topics were about deep learning, neural networks and classification.

Most of the program committee members are from the USA and China.

The workshop chairs then talked about the workshops. This year, there were 18 workshops on various topics. Overall, the acceptance rate for workshop papers was about 50%. All the workshop papers are published in formal proceedings by IEEE, indexed by EI, and stored in the IEEE Digital Library. In particular, this year I co-organized the UDML 2021 workshop on utility-driven mining and learning. Here are more details about the workshops:

Each workshop had from 4 to 12 accepted papers.

Then, the virtual platform chair, Heitor Murilo Gomes, introduced the virtual platform used by ICDM 2021.

Keynote talks

There were several keynote talks.

The first keynote talk was by Prof. Masashi Sugiyama from Japan about robust machine learning against various factors (weak supervision, noisy labels and beyond). Here are a few slides:

Paper presentation

There were many paper presentations, but I will not report on them in detail.

Conclusions

The ICDM 2021 conference was interesting. It is a leading conference in data mining. Looking forward to attending it again next year for ICDM 2022 in Orlando, Florida.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Big data, Conference, Data Mining

Brief report about UDML 2021 (4th International Workshop on Utility-Driven Mining and Learning)

Today, it was the 4th International Workshop on Utility-Driven Mining and Learning (UDML 2021), held at the IEEE ICDM 2021 conference. I am a co-organizer of the workshop and will give a brief report about it.

What is UDML?

UDML is a workshop that has been held for the last four years. It was first held at the KDD 2018 conference, and then at the ICDM 2019, ICDM 2020 and ICDM 2021 conferences.

The focus of the UDML workshop is how to integrate the concept of utility in data mining and machine learning. Utility is a broad concept that represents the importance or value of patterns or models. For instance, in the context of analyzing customer data, the utility may represent the profit made from sales of products, while in the context of a multi-agent system, utility may be a measure of how desirable a state is. For many machine learning or data mining problems, it is desirable to find patterns that have a high utility or models that optimize some utility function. This is the core topic of the workshop, but the workshop is also open to other related topics. For example, most pattern mining papers can fit within the scope of the workshop, as utility can take the broader interpretation of finding interesting patterns.

The program

UDML 2021 was held online due to the COVID pandemic. We had a great program with an invited keynote talk and 8 accepted papers selected from about 14 submissions. All papers were reviewed by several reviewers. The papers are published in the IEEE ICDM Workshop proceedings, which ensures good visibility. Besides, a special issue of the journal Intelligent Data Analysis was announced for extended versions of the workshop papers.

Keynote talk by Prof. Tzung-Pei Hong

The first part of the workshop was the keynote by Prof. Tzung-Pei Hong from the National University of Kaohsiung, who kindly accepted to give a talk. The talk was very interesting. Basically, Prof. Hong showed that the problem of erasable itemset mining can be converted to the problem of frequent itemset mining, and that the opposite is also possible. This implies that one can simply reuse the very efficient frequent itemset mining algorithms, with some small modifications, to solve the problem of erasable itemset mining. The details of how to convert one problem to the other were presented. Besides, an experimental comparison was presented using the Apriori algorithm (for frequent itemset mining) and the META algorithm (for erasable itemset mining). It was found that META is faster for smaller or sparser databases, but that in other cases Apriori was faster. Here are a few slides from this presentation:

Paper presentations

Eight papers were presented:

Paper ID: DM368, Md. Tanvir Alam, Amit Roy, Chowdhury Farhan Ahmed, Md. Ashraful Islam, and Carson Leung, “Mining High Utility Subgraphs”
Paper ID: S10201, Cedric Kulbach and Steffen Thoma, “Personalized Neural Architecture Search” (best paper award)
Paper ID: S10213, Uday Kiran Rage and Koji Zettsu, “A Unified Framework to Discover Partial Periodic-Frequent Patterns in Row and Columnar Temporal Databases”
Paper ID: S10210, Wei Song, Caiyu Fang, and Wensheng Gan, “TopUMS: top-k utility mining in stream data”
Paper ID: S10211, Mourad Nouioua, Philippe Fournier-Viger, Jun-Feng Qu, Jerry Chun-Wei Lin, Wensheng Gan, and Wei Song, “CHUQI-Miner: Mining Correlated Quantitative High Utility Itemsets”
Paper ID: S10209, Chi-Jen Wu, Wei-Sheng Zeng, and Jan-Ming Ho, “Optimal Segmented Linear Regression for Financial Time Series Segmentation”
Paper ID: S10202, Jerry Chun-Wei Lin, Youcef Djenouri, Gautam Srivastava, and Jimmy Ming-Tai Wu, “Large-Scale Closed High-Utility Itemset Mining”
Paper ID: S10203, Yangming Chen, Philippe Fournier-Viger, Farid Nouioua, and Youxi Wu, “Sequence Prediction using Partially-Ordered Episode Rules”

These papers cover various topics in pattern mining, such as high utility itemset mining, subgraph mining, high utility quantitative itemset mining and periodic pattern mining, but also some machine learning topics such as linear regression and neural networks.

Best paper award of UDML 2021

This year, a best paper award was given. The paper was selected based on the review scores and a discussion among the organizers. The recipient of the award is the paper “Personalized Neural Architecture Search”, which presented an approach to search for a good neural network architecture that optimizes some criteria (in other words, a form of utility).

Conclusion

That was a brief overview of UDML 2021. The workshop was quite successful. Thus, we plan to organize the UDML workshop again next year, likely at the ICDM conference, but we are also considering KDD as another possibility. There will also be another special issue for next year’s workshop.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Conference, Data Mining, Pattern Mining, Utility Mining

What is a good pattern mining algorithm?

Today, I will continue talking about pattern mining, and in particular about what makes a good pattern mining algorithm. There are a lot of algorithms for discovering patterns in data that are not useful in real life. I will argue that a good pattern mining algorithm should ideally have some of the following desirable properties (not all are required):

  • Can be used in many scenarios or applications: Some algorithms are designed for tasks that are not realistic or that make assumptions that do not hold in real life. It is important to design algorithms that can be used in real-life scenarios.
  • Is flexible: An algorithm should ideally provide optional features so that it can be used in different situations where requirements are different.
  • Has excellent performance: An algorithm should be efficient, especially if the goal is to analyze large datasets. In particular, it can be desirable to have algorithms with linear scalability that can handle big data. Efficiency can be measured in terms of runtime and memory.
  • Has few parameters: An algorithm that has too many parameters is generally hard to use. However, it is OK to have optional parameters that provide more features to users. Some algorithms do not have any parameters at all; this is the case, for example, of skyline pattern mining algorithms and some compression-based pattern mining algorithms.
  • Is interactive: Some pattern mining systems provide interactive features, such as giving the user the ability to guide the search for patterns by providing feedback about the patterns that are discovered. Some systems also let the user perform targeted queries about specific items rather than finding all possible patterns.
  • Offers visualization: Visualization capabilities are also useful to help browse through numerous patterns.
  • Can deal with complex data: It is also desirable to design algorithms that can deal with complex data types such as sequences and graphs, as real-life data is often complex.
  • Can discover statistically significant or correlated patterns: Many algorithms can find millions of patterns, but many of them are spurious. In other words, some patterns may be weakly correlated or appear just by chance. To find significant patterns, a key approach is to use statistical tests or correlation measures.
  • Lets the user select constraints to filter out patterns: For users, a key feature is the ability to set constraints on the patterns to be found, such as a length constraint, so as to reduce the number of patterns.
  • Summarizes or compresses the data: Another important feature of a good pattern mining algorithm is the ability to find patterns that summarize or compress the data. Rather than finding millions of patterns, it can be useful to find representative patterns that capture the characteristics of the data well.
  • Discovers pattern types that are interesting: The type of patterns discovered by an algorithm must be useful, or even surprising. It should not be too complex or too simple.
  • Can find an approximate solution: Because exact pattern mining algorithms are often slow or do not scale well, designing algorithms that can give an approximate solution is also important.

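As a toy illustration of the point about correlated patterns, here is how a simple correlation measure like lift can be computed (my own sketch, with made-up transactions): a lift close to 1 suggests the two sides of a rule co-occur merely by chance, while a higher value indicates positive correlation.

```python
def lift(transactions, lhs, rhs):
    """Lift of the rule lhs -> rhs: P(lhs and rhs together) / (P(lhs) * P(rhs))."""
    n = len(transactions)
    p_lhs = sum(1 for t in transactions if lhs <= t) / n
    p_rhs = sum(1 for t in transactions if rhs <= t) / n
    p_both = sum(1 for t in transactions if lhs <= t and rhs <= t) / n
    return p_both / (p_lhs * p_rhs)

# A tiny made-up transaction dataset.
transactions = [
    {'bread', 'milk'}, {'bread', 'milk', 'eggs'}, {'bread', 'milk'},
    {'eggs', 'tea'}, {'bread', 'tea'}, {'milk', 'bread'},
]
print(round(lift(transactions, {'bread'}, {'milk'}), 2))  # 1.2: mild positive correlation
```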
This is the list of properties that I think are the most important for pattern mining algorithms. Hope it has been interesting. Leave a comment below if you want to add something else. 🙂


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

Posted in Data Mining, Data science, Pattern Mining