An Introduction to Data Mining

In this blog post, I will introduce the topic of data mining. The goal is to give a general overview of what data mining is.


What is data mining?

Data mining is a field of research that emerged in the 1990s and is very popular today, sometimes under different names such as “big data” and “data science”, which have a similar meaning. To give a short definition, data mining can be defined as a set of techniques for automatically analyzing data to discover interesting knowledge or patterns in the data.

The reason why data mining has become popular is that storing data electronically has become very cheap and that transferring data can now be done very quickly, thanks to the fast computer networks that we have today. Thus, many organizations now have huge amounts of data stored in databases that need to be analyzed.

Having a lot of data in databases is great. However, to really benefit from this data, it is necessary to analyze it to understand it. Data that we cannot understand, or from which we cannot draw meaningful conclusions, is useless. So how can the data stored in large databases be analyzed? Traditionally, data has been analyzed by hand to discover interesting knowledge. However, this is time-consuming and prone to error, important information may be missed, and it is simply not realistic on large databases. To address this problem, automatic techniques have been designed to analyze data and extract interesting patterns, trends, or other useful information. This is the purpose of data mining.

In general, data mining techniques are designed either to explain and understand the past (e.g. why a plane crashed) or to predict the future (e.g. predict whether there will be an earthquake tomorrow at a given location).

Data mining techniques are used to make decisions based on facts rather than intuition.

What is the process for analyzing data?

To perform data mining, a process consisting of seven steps is usually followed. This process is often called the “Knowledge Discovery in Databases” (KDD) process.

  1. Data cleaning: This step consists of cleaning the data by removing noise or other inconsistencies that could be a problem for analyzing the data.
  2. Data integration: This step consists of integrating data from various sources to prepare the data that needs to be analyzed. For example, if the data is stored in multiple databases or files, it may be necessary to integrate it into a single file or database before analyzing it.
  3. Data selection: This step consists of selecting the relevant data for the analysis to be performed.
  4. Data transformation: This step consists of transforming the data to a proper format that can be analyzed using data mining techniques. For example, some data mining techniques require that all numerical values are normalized.
  5. Data mining:  This step consists of applying some data mining techniques (algorithms) to analyze the data and discover interesting patterns or extract interesting knowledge from this data.
  6. Evaluating the knowledge that has been discovered: This step consists of evaluating the knowledge that has been extracted from the data. This can be done in terms of objective and/or subjective measures.
  7. Visualization:  Finally, the last step is to visualize the knowledge that has been extracted from the data.
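As an illustration of the data transformation step (step 4), here is a minimal sketch (my own example, not part of the original post) of min-max normalization, which rescales numerical values into the [0, 1] range as some data mining techniques require:

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero for constant attributes
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# After normalization, attributes on very different scales (e.g. age
# in years vs. income in dollars) become directly comparable.
ages = [18, 35, 52, 70]
print(min_max_normalize(ages))
```

After this transformation, the smallest value maps to 0.0 and the largest to 1.0, regardless of the attribute's original scale.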

Of course, there can be variations of the above process. For example, some data mining software is interactive, and some of these steps may be performed several times or concurrently.

What are the applications of data mining?

There is a wide range of data mining techniques (algorithms), which can be applied in all kinds of domains where data has to be analyzed. Some examples of data mining applications are:

  • fraud detection,
  • stock market price prediction,
  • analyzing the behavior of customers in terms of what they buy

In general, data mining techniques are chosen based on:

  • the type of data to be analyzed,
  • the type of knowledge or patterns to be extracted from the data,
  • how the knowledge will be used.

What are the relationships between data mining and other research fields?

Actually, data mining is an interdisciplinary field of research partly overlapping with several other fields such as database systems, algorithms, computer science, machine learning, data visualization, image and signal processing, and statistics.

There are some differences between data mining and statistics, although both are related and share many concepts. Traditionally, descriptive statistics has focused on describing data using measures, while inferential statistics has put more emphasis on hypothesis testing to draw significant conclusions from the data or create models. Data mining, on the other hand, is often more focused on the end result than on statistical significance. Several data mining techniques do not really care about statistical tests or significance, as long as some measures such as profitability or accuracy have good values. Another difference is that data mining is mostly interested in automatic analysis of the data, and often in technologies that can scale to large amounts of data. Data mining techniques are sometimes called “statistical learning” by statisticians. Thus, these topics are quite close.

What are the main data mining software?

To perform data mining, there are many software programs available. Some of them are general-purpose tools offering many algorithms of different kinds, while others are more specialized. Also, some software programs are commercial, while others are open-source.

I am personally the founder of SPMF, a free and open-source data mining library specialized in discovering patterns in data. But there are many other popular software tools such as Weka, Knime, RapidMiner, and the R language, to name a few.

Data mining techniques can be applied to various types of data

Data mining software is typically designed to be applied to various types of data. Below, I give a brief overview of the various types of data typically encountered, which can be analyzed using data mining techniques.

  • Relational databases: This is the typical type of database found in organizations and companies. The data is organized in tables. While traditional query languages such as SQL make it possible to quickly find information in databases, data mining can find more complex patterns in the data, such as trends, anomalies, and associations between values.
  • Customer transaction databases: This is another very common type of data, found in retail stores. It consists of transactions made by customers. For example, a transaction could be that a customer bought bread and milk with some oranges on a given day. Analyzing this data is very useful to understand customer behavior and adapt marketing or sales strategies.
  • Temporal data: Another popular type of data is temporal data, that is, data where the time dimension is considered. A sequence is an ordered list of symbols. Sequences are found in many domains, e.g. the sequence of webpages visited by a person, a protein sequence in bioinformatics, or the sequence of products bought by a customer. Another popular type of temporal data is time series. A time series is an ordered list of numerical values, such as stock market prices.
  • Spatial data: Spatial data can also be analyzed. This includes, for example, forestry data, ecological data, and data about infrastructure such as roads and the water distribution system.
  • Spatio-temporal data: This is data that has both a spatial and a temporal dimension. For example, this can be meteorological data, data about crowd movements or the migration of birds.
  • Text data: Text data is widely studied in the field of data mining. One of the main challenges is that text data is generally unstructured: text documents often do not have a clear structure or are not organized in a predefined manner. Some example applications on text data are (1) sentiment analysis and (2) authorship attribution (guessing who the author of an anonymous text is).
  • Web data: This is data from websites. It is basically a set of documents (webpages) with links, thus forming a graph. Some examples of data mining tasks on web data are: (1) predicting the next webpage that someone will visit, (2) automatically grouping webpages by topics into categories, and (3) analyzing the time spent on webpages.
  • Graph data: Another common type of data is graphs. It is found for example in social networks (e.g. graph of friends) and chemistry (e.g. chemical molecules).
  • Heterogeneous data: This is data that combines several types of data, possibly stored in different formats.
  • Data streams: A data stream is a high-speed and non-stop stream of data that is potentially infinite (e.g. satellite data, video cameras, environmental data). The main challenge with data streams is that the data cannot be entirely stored on a computer and must thus be analyzed in real time using appropriate techniques. Some typical data mining tasks on streams are detecting changes and trends.
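To make the data stream constraint concrete, here is a minimal sketch (my own illustration, not from the original post) of real-time analysis: only a fixed-size window of recent values is kept in memory, and a value is flagged as a change when it deviates strongly from the recent average.

```python
from collections import deque

def detect_changes(stream, window_size=5, threshold=10.0):
    """Yield (index, value) for values far from the recent average."""
    window = deque(maxlen=window_size)  # old values are discarded:
    for i, value in enumerate(stream):  # the full stream is never stored
        if len(window) == window_size:
            mean = sum(window) / window_size
            if abs(value - mean) > threshold:
                yield (i, value)
        window.append(value)

# A sudden spike in otherwise stable sensor readings is detected.
readings = [20, 21, 20, 22, 21, 20, 45, 21, 20]
print(list(detect_changes(readings)))  # [(6, 45)]
```

The key design point is the bounded `deque`: memory use stays constant no matter how long the stream runs, which is exactly what stream mining requires.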

What types of patterns can be found in data?

As previously discussed, the goal of data mining is to extract interesting patterns from data. The main types of patterns that can be extracted from data are the following (of course, this is not an exhaustive list):

  • Clusters: Clustering algorithms are often applied to automatically group similar instances or objects into clusters (groups). The goal is to summarize the data to better understand it or make decisions. For example, clustering techniques such as K-Means can be used to automatically group customers having a similar behavior.
  • Classification models: Classification algorithms aim at extracting models that can be used to classify new instances or objects into several categories (classes). For example, classification algorithms such as Naive Bayes, neural networks, and decision trees can be used to build models that predict whether a customer will pay back their debt, or whether a student will pass or fail a course. Models can also be extracted to perform predictions about the future (e.g. sequence prediction).
  • Patterns and associations: Several techniques have been developed to extract frequent patterns or associations between values in a database. For example, frequent itemset mining algorithms can be applied to discover which products are frequently purchased together by customers of a retail store. Some other types of patterns are, for example, sequential patterns, sequential rules, periodic patterns, episodes, and frequent subgraphs.
  •  Anomalies/outliers: The goal is to detect things that are abnormal in data (outliers or anomalies). Some applications are for example: (1) detecting hackers attacking a computer system, (2) identifying potential terrorists based on suspicious behavior, and (3) detecting fraud on the stock market.
  • Trends, regularities:  Techniques can also be applied to find trends and regularities in data.  Some applications are for example to (1) study patterns in the stock-market to predict stock prices and take investment decisions, (2) discovering regularities to predict earthquake aftershocks, (3) find cycles in the behavior of a system, (4) discover the sequence of events that lead to a system failure.
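To illustrate the frequent pattern mining idea mentioned above, here is a minimal sketch (my own example, not SPMF code) that counts how often each pair of products is bought together and keeps the pairs reaching a minimum support threshold:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support=2):
    """Return the product pairs appearing in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        # Each sorted 2-item combination of a transaction is counted once.
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

transactions = [
    {"bread", "milk", "oranges"},
    {"bread", "milk"},
    {"milk", "oranges"},
]
print(frequent_pairs(transactions))
# {('bread', 'milk'): 2, ('milk', 'oranges'): 2}
```

Real algorithms such as Apriori or FP-Growth extend this idea to itemsets of any size while pruning the search space, but the support-counting principle is the same.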

In general, the goal of data mining is to find interesting patterns. As previously mentioned, what is interesting can be measured either in terms of objective or subjective measures. An objective measure is, for example, the occurrence frequency of a pattern (whether it appears often or not), while a subjective measure is whether a given pattern is interesting for a specific person. In general, a pattern can be said to be interesting if: (1) it is easy to understand, (2) it is valid for new data (not just for previous data), (3) it is useful, and (4) it is novel or unexpected (it is not something that we already know).

Conclusion

In this blog post, I have given a broad overview of what data mining is. This blog post was quite general. I actually wrote it because I am teaching a course on data mining, and this will be some of the content of the first lecture. If you have enjoyed reading, you may subscribe to my Twitter feed (@philfv) to get notified about future blog posts.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Posted in Big data, Data Mining, Data science | 23 Comments

Unethical Reviewers in Academia!

In this blog post, I will talk about a common problem in academia, which is the unethical behavior of some reviewers who ask authors to cite several of their papers.

It is quite common that a reviewer will ask authors to cite his papers to increase his citation count. I have encountered this problem many times for my own papers when submitting to journals. Sometimes the reviewer will try to hide his identity by asking to cite four or five papers and including one or two of his own among them. But sometimes it is very obvious, as the reviewer will directly ask to cite many papers that are all from the same author. For example, just a few weeks ago, I received a notification for one of my papers where the reviewer wrote:

The related work needs improvement: Please add the following works:
…. title of paper 1 …
…. title of paper 2 ..
…. title of paper 3 ….
…. title of paper 4 …

That reviewer asked to cite four papers by the same person. In that case, it is very easy to guess who the reviewer is. In some cases, I have even seen two reviewers of the same paper both asking the author to cite their papers. Each of them was asking to cite about five of their papers. This was completely ridiculous and gave a very bad impression about the review process. This unethical behavior is quite common. If you submit many papers to journals, you will sooner or later encounter this problem, even for top 20% journals.

Why does it happen? The reason is that many universities consider citation count as an important metric for performance evaluation. Thus, some authors will try to artificially increase their citation count by forcing other authors to cite their papers.

So what are the solutions?

  • Authors facing this problem will often accept to cite the papers from the reviewer because they are afraid that the reviewer will reject the paper if they don’t. This is understandable. However, if the authors accept, this will encourage the reviewer to continue this unethical behavior for other papers. Thus, the best solution is to send an e-mail to the editor to let them know about it. This is what I do when I am in this situation. If you let the editor know, the editor will normally take this into account and may even take some punitive action, such as removing the reviewer from the journal.
  • To avoid this problem before it happens, some editors will read the reviews carefully and delete unethical requests by reviewers. However, this does not always happen, because editors are often very busy and may not spend the time to read all the comments made by reviewers. But it is good that some journals such as IEEE Access put a disclaimer in the notification to inform authors that they are not required to cite papers that are not relevant to the article. This is a good way of preventing this problem.
  • Reviewers should only ask to cite papers that are relevant to the paper under review and will contribute to improving its quality. To avoid conflicts of interest, a reviewer can suggest citing a paper rather than telling authors that they must cite it. This is more acceptable.

Conclusion

In this blog post, I have talked about some unethical behavior that many people have encountered when submitting to journals, and sometimes also to conferences. The reason why I wrote this blog post is that I have encountered this situation for two of my papers in the last two months, and I have become quite tired of seeing this happen in academia.

If it also happened to you, please leave a comment below with your story. I will be happy to read it!


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in Academia, Research | 1 Comment

How to Improve Work Efficiency for Researchers?

In this blog post, I will talk about the topic of increasing work efficiency for researchers. This is an important topic, as during a researcher’s career the workload tends to increase over time, but there are still only 24 hours in a day. Thus, becoming more efficient is important. Being efficient also means having more time to do other things after work, such as spending time with your family and friends. I will share a few ideas below about how researchers can improve their efficiency.


Working on what is important

To improve efficiency, it is important to work on what is really important. For every task that a researcher wants to do, he should first evaluate how much time he will spend on the task and what the expected benefits are. The reason is that sometimes the time spent on a task could be used to do something else that would bring more benefits for the same amount of time. For example, someone writing a research paper could spend a day improving the quality of some figure, or instead spend that day proofreading the paper and improving the writing style. There are sometimes tasks that we want to do that are not really important and require a lot of time. In that case, maybe we don’t need to do them.

Having a schedule and planning tasks

It is also a good habit to have a schedule to keep track of all the things that you need to do. Moreover, you can order tasks by priority to focus on the more important ones. It is also important to set goals and then try to make a plan of all the tasks that need to be done to achieve these goals.


For scheduling and planning, one can keep a calendar and also a to-do list of important things to do. It is also good to keep a small notebook to write down research ideas when you have them, so as not to forget them.

It is also good to do all the similar tasks on the same day. For example, if you have many papers to review, you can decide to review all of them in one afternoon rather than doing one every few days. Generally, this will be more efficient.

Working in a better environment

The work environment is also very important. It can be good for example to clean your desk, or find a quiet environment to work such as a library, to be more efficient. If you are in a noisy environment, it can also be useful to use some noise cancelling earphones or noise blocking earmuffs.


And of course, one should avoid working in a distracting environment, such as while watching TV, or working in positions that decrease productivity, such as lying in bed.

Using software to reduce distractions

There also exist some software tools that help you stay focused. For example, on Windows, I use a tool called AutoHideDesktopIcons that will hide the desktop, the taskbar, and all opened windows except the current window. This helps remove many distractions.


There also exist writing tools with a minimal user interface, designed to let you focus on writing. This is the case, for example, of WriteMonkey on Windows. The user interface of WriteMonkey is basically just a blank page, which can really help you concentrate on writing (see below).

[Screenshot: the WriteMonkey user interface]

Collaborating with others and giving work to others

Another way of becoming more efficient is to share your workload with other people. For example, if you invite someone else to participate in your paper, this person will do some of the work, and your own workload will be reduced. If you are a team leader, you can also delegate some work to your team members, or even hire a personal assistant or someone else to do some work for you (e.g. paying someone to proofread your papers).

Conclusion

In this blog post, I gave a few tips about how to become more efficient at research. I could certainly say much more about this, but I wanted to give a few ideas. Please share your other ideas or views in the comment section below.


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in Academia, Research | 1 Comment

Five recent books on pattern mining

In this blog post, I will list a few interesting and recent books on the topic of pattern mining (discovering interesting patterns in data). The list mainly covers books from the last five years.

High utility pattern mining: Theory, Applications and algorithms (2019). This is the most recent book, edited by me. It is about probably the hottest topic in pattern mining right now, which is high utility pattern mining. The book contains 12 chapters written by experts in this field about discovering different kinds of high utility patterns in data. It gives a good introduction to the field, as it contains five survey papers, and also describes some of the latest research. Link: https://link.springer.com/book/10.1007/978-3-030-04921-8

Supervised Descriptive Pattern Mining (2018). A book that focuses on techniques for mining descriptive patterns such as emerging patterns, contrast patterns, class association rules, and subgroup discovery, which are other important techniques in pattern mining. https://link.springer.com/book/10.1007/978-3-319-98140-6

Pattern Mining with Evolutionary Algorithms (2016). A book that focuses on the use of evolutionary algorithms to discover interesting patterns in data. This is another emerging topic in the field of pattern mining. https://link.springer.com/book/10.1007/978-3-319-33858-3

Frequent pattern mining (2014). This book does not cover the latest research as it is already almost five years old. But it gives an interesting overview of some popular techniques in frequent pattern mining. http://link.springer.com/book/10.1007%2F978-3-319-07821-2

Spatiotemporal Frequent Pattern Mining from Evolving Region Trajectories (2018). This is a recent book, which focuses on spatio-temporal pattern mining. Adding the time and spatial dimensions to pattern mining is another interesting research issue. https://link.springer.com/book/10.1007/978-3-319-99873-2#about

That is all I wanted to write for today. If you know about some other good books related to pattern mining that have been published in recent years, please let me know and I will add them to this list. Also, I am looking forward to editing another book related to pattern mining soon… What would be a good topic? If you have some suggestions, please let me know in the comment section below!


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 150 data mining algorithms.

Posted in Big data, Data Mining, Pattern Mining, Utility Mining | Leave a comment

How to convert Latex to HTML?

In this blog post, I will explain a simple way of transforming a Latex document to HTML. Why do this? There are many reasons. For example, you may have formatted some text in Latex and would like to quickly integrate it in a webpage.


The wrong way

First, there is a wrong way of doing this: creating a PDF from your Latex document and then using a tool to convert from PDF to HTML. If you try this and the document is even just slightly complex, the result may be very bad… and the HTML code may be horrible, with many unnecessary tags.

The good way

Thus, the best way to convert Latex to HTML is to use some dedicated tool. There are several free tools, but many are designed to run on Linux. If you are using Windows, it may thus take you some time to find the right tool.

Luckily, the popular Latex distributions like MiKTeX and TexLive include an executable of a tool to convert from Latex to HTML that works on Windows. Thus, if you have the full TexLive distribution, you do not need to download or install anything else. Below, I will describe how to do this with TexLive on Windows.

Using TexLive on Windows

First, you need to open the command line and go to the directory containing your Latex document. Let’s say that your Latex document is called article.tex. Then, you can run this command:

   htlatex article.tex

The result will be a new file called article.html.

The result is usually quite good. For example, I have converted a research paper that I wrote about high utility episode mining, and the result looks like this:

[Screenshot: HTML output of the converted paper]

I would say that about 90% of the paper was converted correctly. There are some other parts that I have not shown, like the pseudocode of some algorithms, that were not formatted properly. But I would say that the conversion is overall really good.

Conclusion

In this blog post, I have shown a simple way of converting Latex to HTML on Windows using the TexLive distribution. If you are using MiKTeX or Linux, similar commands can be used.


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in Academia, Latex, Research | 3 Comments

Datasets of 30 English novels for pattern mining and text mining

Today, I want to announce that I have just made public datasets of 30 English novels by 10 authors of the 19th century. These datasets can be used for testing algorithms for sequential pattern mining and sequential rule mining, as well as for some text mining applications such as authorship attribution (guessing the author of an anonymous text) and sequence prediction.

All the datasets were prepared from public domain texts that were converted to a suitable format for text analysis by Jean-Marc Pokou et al. (2016), so that they can be used with the SPMF library.

These books were written by 10 different English novelists of the 19th century. The total number of words/sentences in the corpus of each author is as follows:
Catharine Traill (276,829/ 6,588),
Emerson Hough (295,166/ 15,643),
Henry Adams (447,337/ 14,356),
Herman Melville (208,662/ 8,203),
Jacob Abbott (179,874/ 5,804),
Louisa May Alcott (220,775/ 7,769),
Lydia Maria Child (369,222/ 15,159),
Margaret Fuller (347,303/ 11,254),
Stephen Crane (214,368/ 12,177),
Thornton W. Burgess (55,916/ 2,950).

The list of books is:

  • Catharine Traill: A Tale of The Rice Lake Plains; Lost in the Backwoods; The Backwoods of Canada
  • Emerson Hough: The Girl at the Halfway House; The Law of the Land; The Man Next Door
  • Henry Adams: Democracy, an American novel; Mont-Saint-Michel and Chartres; The Education of Henry Adams
  • Herman Melville: I and My Chimney; Israel Potter; The Confidence-Man: His Masquerade
  • Jacob Abbott: Alexander the Great; History of Julius Caesar; Queen Elizabeth
  • Louisa May Alcott: Eight Cousins; Rose in Bloom; The Mysterious Key and What It Opened
  • Lydia Maria Child: A Romance of the Republic; Isaac T. Hopper; Philothea
  • Margaret Fuller: Life Without and Life Within; Summer on the Lakes, in 1843; Woman in the Nineteenth Century
  • Stephen Crane: Active Service; Last Words; The Third Violet
  • Thornton W. Burgess: The Adventures of Buster Bear; The Adventures of Chatterer the Red Squirrel; The Adventures of Grandfather Frog

Each dataset has two versions: (1) sequences of words and (2) sequences of Part-of-Speech (POS) tags (obtained using the Stanford NLP Tagger).

Here are the links to download the books:

If you use the above book datasets, you may want to cite this paper:

Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91

In that paper, we discovered skip-grams (sequential patterns) and n-grams (consecutive sequential patterns) of part-of-speech tags to guess the authors of books.
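To illustrate the two kinds of features, here is a minimal sketch (my own illustration, not the code from the paper) of extracting n-grams and skip-grams from a POS tag sequence: n-grams are consecutive subsequences, while skip-grams also allow gaps between tags (here bounded by a window size).

```python
from itertools import combinations

def ngrams(tags, n):
    """All consecutive subsequences of length n."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def skipgrams(tags, n, max_window):
    """Length-n subsequences taken, in order, from any window of tags."""
    grams = set()
    for i in range(len(tags)):
        window = tags[i:i + max_window]
        grams.update(combinations(window, n))  # preserves left-to-right order
    return grams

pos = ["DT", "NN", "VB", "DT", "JJ", "NN"]
print(ngrams(pos, 2))  # consecutive pairs such as ('DT', 'NN')
print(("DT", "JJ") in skipgrams(pos, 2, 3))  # True: 'DT _ JJ' with one gap
```

The frequency profile of such tag patterns is then compared across authors, since an author's grammatical habits tend to be stable across books.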

More datasets can also be found on the dataset webpage of the SPMF software.


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in Data Mining, Data science | Leave a comment

The PAKDD 2020 conference (a brief report)

In this report, I will talk about the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2020), held from the 11th to the 14th of May 2020.

The PAKDD conference

PAKDD is a top international conference on data mining / big data in the Pacific-Asia part of the world. I have attended this conference several times and written reports about several editions of the conference. If you are interested, you can read these reports here: PAKDD 2014, PAKDD 2015, PAKDD 2017, PAKDD 2018 and PAKDD 2019.

PAKDD Proceedings

As usual, the conference proceedings of PAKDD 2020 are published by Springer in the Lecture Notes in Artificial Intelligence (LNAI) series. This ensures that the proceedings are indexed in DBLP and other major indexes, and gives good visibility to the papers.

This year, there were 628 submissions to PAKDD 2020. From those, 135 papers were accepted, which means an acceptance rate of 21.5%.

The conference went online

This year, the PAKDD 2020 conference was planned to be held in Singapore. But due to the COVID-19 pandemic around the world, the conference was held online instead. Part of the registration fee was reimbursed to the authors, because the organizers saved money by holding the conference online. And of course, since the conference was online, all social events such as the banquet and reception were cancelled.

All authors were asked to submit, before the conference, a pre-recorded 13-minute video of their paper in 720p resolution with their slides. Then, during the conference, authors had to be available to answer questions online after the presentation of their paper. Thus, each paper was allotted a total of 17 minutes. This is somewhat less than in previous years, where long presentations had about 30 minutes, if I remember well.

The conference could be accessed through the Zoom online meeting system. To attend the different sessions, a password was required, which was made available to registered attendees.

Some video etiquette tips were given to authors.

As for the proceedings, since the conference was online, they were made available for download from the conference website in PDF format.

Day 1 – Tutorials and workshop day

On the first day, there were 5 workshops and 2 tutorials.

I first went to have a look at the literature-based discovery workshop using Zoom. There were about 22 people in that workshop at 9:26 AM, watching a presentation about using evolutionary algorithms for matching biomedical ontologies.

Then, I popped into the Data Science for Fake News workshop at 9:40 AM to see how it was. Although it was supposed to start at 9 AM, the workshop had not started. Using the chatroom, I asked and was told that it was delayed until 10 AM (perhaps some technical problem, or someone missing due to time zones?).

Thus, I next went to check the Game Intelligence & Informatics workshop, where there were about 11 people watching the presentations at 9:47 AM. Game intelligence is a quite interesting topic. Here is a screenshot from that workshop, where game strategies were analyzed:

Then, at 9:57 AM, I went to have a quick look at the Tutorial on Deep Explanations in Machine Learning via Interpretable Visual Methods, which was in the fourth parallel session. There were about 44 people watching it, so it seemed to be the most popular session. This topic is interesting, as neural networks can be very effective but are mostly black-box models. In that tutorial, they talked about how to interpret such models, and they also discussed some other ways of interpreting knowledge in data mining, such as how to visualize association rules (screenshot below).
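As a quick reminder of what is behind association rules (this is my own toy sketch, not material from the tutorial), here is how the classic support and confidence measures can be computed; the transaction data below is made up for illustration:

```python
# Toy transaction database (made up for this example)
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing all items of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 = 2/3
```

Visualization techniques like those shown in the tutorial typically plot rules according to such measures (e.g. support on one axis, confidence on another).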

So far, all of this was quite interesting, and there were some good questions in the sessions that I attended.

In the afternoon at 2 PM, I attended the 9th Workshop on Biologically Inspired Data Mining (BDM 2020). This workshop has been running for many years at PAKDD, and I personally like it as it covers various topics such as genetic algorithms, particle swarm optimization (PSO), ant colony optimization, and applications of such algorithms. There were about 18 people attending the workshop at 2:11 PM. First, the organizer Shafiq Alam gave an overview of the motivations for biologically inspired data mining, explaining that optimization algorithms like genetic algorithms can be used to quickly find an approximate solution to hard problems, if we accept losing a little accuracy. Then, some results on using PSO for clustering and recommendation were presented, followed by paper presentations and a discussion about current trends.
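To illustrate this idea (this is my own toy sketch, not code from the workshop), here is a minimal genetic algorithm that finds an approximate solution to a small 0/1 knapsack instance; the item values, weights and parameters are all made up:

```python
import random

# Toy 0/1 knapsack instance (hypothetical data, for illustration only)
values = [10, 5, 15, 7, 6, 18, 3]
weights = [2, 3, 5, 7, 1, 4, 1]
CAPACITY = 10

def fitness(individual):
    """Total value of the selected items, or 0 if the knapsack is overweight."""
    w = sum(wi for wi, bit in zip(weights, individual) if bit)
    v = sum(vi for vi, bit in zip(values, individual) if bit)
    return v if w <= CAPACITY else 0

def evolve(pop_size=30, generations=50, mutation_rate=0.1, seed=42):
    rng = random.Random(seed)
    n = len(values)
    # Random initial population of bit vectors (1 = item selected)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Crossover and mutation to refill the population
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]          # one-point crossover
            if rng.random() < mutation_rate:
                i = rng.randrange(n)
                child[i] = 1 - child[i]        # flip one random bit
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

The point is the trade-off mentioned by the organizer: the algorithm explores only a tiny fraction of the 2^7 candidate solutions, so the result is not guaranteed optimal, but it is usually a good solution found quickly.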

At the same time in the afternoon, there was a tutorial on deep Bayesian networks, which had about 31 attendees at 2:19 PM, and a workshop on Learning Data Representations for Clustering, which had about 14 attendees at 2:21 PM. Overall, it seems that the tutorials were the most popular sessions of this first day.

Day 2

From 8:30 to 9:00 AM, there was the conference opening. There were about 59 people in that session at 8:58 AM. Some awards were announced:

It was followed by a keynote from Prof. Bing Liu about open-world AI and “continual learning”, which discussed the need for software that can learn continuously. Here are a few slides:

This was followed by two industry talks, one by Usama Fayyad and another by Ankur Teredesai. Below are a few slides from the talk of A. Teredesai about AI for health, which I watched. He discussed how data mining and AI can help healthcare. In particular, he talked about epidemiological models for diseases such as COVID-19. At 11:18 AM, there were about 27 people in that session. The talk was interesting, but there were some internet connection problems at some point, such that the audio was hard to hear for a few minutes. After that, it was OK.

Then, in the afternoon, there were paper presentations.

Day 3

In the morning at 8:30 AM, there was a keynote talk by Inderjit S. Dhillon about multi-output prediction. There were about 42 people watching at 8:51 AM. Here is a screenshot of that talk:

In the afternoon, there was a keynote talk by Prof. Samuel Kaski titled “Data Analysis with Humans”, about how humans can participate in the machine learning process. There were about 34 people attending the talk at 2:08 PM. He first illustrated that different problems (and methods) require different levels of human intervention.

Generally, the user can participate in different ways in the machine learning or data mining process.

First, the user can be a passive data source. Second, the user can participate more actively in the machine learning or data mining process to guide the software program.

Here is a slide about the first approach:

Then, there were more slides and details, but I did not take note of everything.

Then, after that, there were more paper presentations.

Day 4

On the last day, there was the most influential paper talk, a keynote talk by Prof. Jure Leskovec in the afternoon, and more paper presentations.

Papers about pattern mining

Now, I will talk a little bit about papers related to pattern mining, as it is one of my topics of interest. I presented a paper about a new algorithm named LTHUI-Miner to discover high utility itemsets that are trending in non-predefined time periods in customer transaction databases. This is the work of my master's degree student:

Fournier-Viger, P., Yang, Y., Lin, J. C.W., Frnda, J. (2020). Mining Locally Trending High Utility Itemsets. Proc. 24th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2020), Springer, LNAI [video]

You can watch the video of my presentation here.

Another paper related to pattern mining published at PAKDD this year is about discovering frequent subsequences in a set of sequences using an algorithm called Tree-Miner:

Tree-Miner: Mining sequential patterns from SP-Tree. Redwan Ahmed Rizvee (University of Dhaka), Chowdhury Farhan Ahmed (University of Dhaka), Mohammad Fahim Arefin (University of Dhaka)
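For context (this naive check is my own illustration, not the Tree-Miner algorithm itself, which relies on an SP-Tree structure), testing whether a candidate pattern appears as a subsequence of a sequence, and counting its support in a sequence database, can be sketched as follows; the sequence database below is made up:

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence as a (not necessarily contiguous)
    subsequence, i.e. its items appear in order."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # 'in' advances the iterator

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))

# Toy sequence database (hypothetical)
db = [
    ["a", "b", "c", "d"],
    ["a", "c", "d"],
    ["b", "a", "d"],
    ["a", "b", "d"],
]
print(support(["a", "d"], db))  # "a" before "d" in all 4 sequences -> 4
```

A pattern is called frequent when its support is at least a user-defined minimum threshold; algorithms like Tree-Miner avoid checking every candidate naively like this.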

Is this online format a success?

Overall, the online format of this conference was fine. But I missed the social activities of an offline conference, like the coffee breaks, where we can talk with other researchers to exchange ideas and meet new people. For me, this is one of the most interesting aspects of a conference.

Also, as a suggestion, it would have been nice if there had been a playback feature to watch presentations that we missed. In my case, I am in the same time zone as Singapore, so it was convenient for me to watch the presentations, but I can imagine that people from some other countries (e.g. some parts of Canada, with a 12-hour time difference) would have a harder time watching some presentations.

Special journal issues

Some papers were invited for a special issue of the JDSA journal. It is always nice to be invited to a special issue. However, although this journal is published by Springer, a problem is that it is still quite new and, to my knowledge, not yet indexed in databases like SCI or EI. In some countries, like the one where I work, this is important, and papers in non-indexed journals do not count for much. For this reason, I had to decline the invitation to extend my paper. I would have preferred to be invited to a special issue of a more established journal, as some other conferences do.

In the call for papers, there was also a mention that some papers would be invited for an issue of the KAIS journal. This is a quite good journal, but apparently that was only for the few very best papers.

Conclusion

Overall, it was an interesting conference. Due to the virus situation, the conference was held online, and the organizers managed to organize it very well in this situation. Looking forward to PAKDD 2021 next year.

Update: You can now also read my report about PAKDD 2024 in Taipei.


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.


“Pattern Mining: Theory and Practice” (textbook in Thai, with SPMF)

Hi all, this is to announce that a new textbook in Thai has been published about pattern mining, which includes many examples using the SPMF software. The textbook, named “Pattern Mining: Theory and Practice”, is written by Panida Songram from Mahasarakham University (Thailand) and can be used for teaching or self-learning, by students or practitioners. I have known the author for many years and I am very happy that she let me host a copy of the book, which you can download from this link:
Pattern Mining: Theory and Practice (PDF, 14.2 MB)

The book gives a good coverage of pattern mining. It explains algorithms but also contains many practical examples of how to use SPMF. Some key topics in the book are itemset mining, sequential pattern mining and multi-dimensional sequential pattern mining.

That is all I wanted to share for today. If you can read Thai, I highly recommend downloading this book. 😉




(video) Mining Locally Trending High Utility Itemsets

Today, I want to share with you the video presentation that I have prepared for my paper at PAKDD 2020. It presents a new problem where we want to discover locally trending high utility itemsets (LTHUIs). An LTHUI is a set of items purchased by customers that is trending (generating revenue that follows an upward or downward trend) during some non-predefined time periods. It is a variation of the popular high utility itemset mining problem.
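To make the underlying notion of utility concrete, here is a minimal sketch of the utility computation at the core of high utility itemset mining (the unit profits and transactions below are made up for illustration; the trend detection over time periods introduced in the paper is not shown):

```python
# Unit profit of each item (hypothetical values)
profit = {"apple": 5, "bread": 1, "milk": 2}

# Each transaction maps an item to the quantity purchased (made-up data)
transactions = [
    {"apple": 2, "bread": 3},
    {"apple": 1, "milk": 4},
    {"bread": 2, "milk": 1},
]

def utility(itemset, transaction):
    """Utility of an itemset in one transaction (0 if some item is absent)."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(profit[item] * transaction[item] for item in itemset)

def total_utility(itemset, database):
    """Total utility of an itemset across the whole database."""
    return sum(utility(itemset, t) for t in database)

print(total_utility({"apple"}, transactions))          # 2*5 + 1*5 = 15
print(total_utility({"apple", "milk"}, transactions))  # 1*5 + 4*2 = 13
```

An itemset is a high utility itemset when its total utility is at least a user-defined threshold; LTHUI-Miner additionally looks at how this utility evolves over time to detect upward or downward trends.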

VIDEO LINK: http://philippe-fournier-viger.com/spmf/videos/pakdd720p.mp4

Hope you will enjoy this video! If you want more details about this topic, you can read this paper:

Fournier-Viger, P., Yang, Y., Lin, J. C.W., Frnda, J. (2020). Mining Locally Trending High Utility Itemsets. Proc. 24th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD 2020), Springer, LNAI, 12 pages.

The source code will be released soon in the SPMF data mining software.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.


Success and Health for Researchers

Many researchers and students want to be successful in their field. To achieve this, they make many sacrifices, such as working long hours at the lab every day from morning to evening. This matters because, honestly, success comes with hard work. But it is also important to keep a good life balance to stay healthy. In this blog post, I will talk about the importance of good life and work habits for researchers.

First, let me tell you a bit about my story. Since the start of my graduate studies, I have worked countless hours to improve myself. For example, during my master's degree and Ph.D. studies, I would basically not take any rest during the whole year, and would work maybe 12 hours every day. That has allowed me to be successful in my field, receive large grants during my studies, publish many papers, and then land some good jobs in academia. Nowadays, as I have a family, I cannot work as much as when I was a student, but I still work hard, and I am much more efficient than I was before, due to the skills that I have gained. For example, I can write a paper much more quickly. I still work very late at night almost every day.

Health is important

Now, what I have learnt over the years is that work is not everything. Health is also very important. Working long hours at the lab can eventually bring several health problems, such as wrist pain, neck and back problems, and eye problems. Luckily, I do not have any major problems, but it is something to be aware of, as such problems typically appear later down the road.

My advice

First, it is important to eat healthy food.

Second, it is important to have a good posture while working. For example, it is worthwhile to find a good chair, to adjust the height of your table and screen, and to use an appropriate mouse and keyboard, so as to be comfortable.

Third, it is important to avoid sitting for too long, and to sometimes rest your eyes. Several studies have shown that sitting for long periods of time may lead to various diseases. Thus, every hour, it is good to stand up and go for a walk for a few minutes, for example.

Fourth, it is equally important to do some exercise every week. Even a few hours of exercise per week, like running, swimming or playing badminton, can make you feel better. I personally like to run for 30 minutes to an hour every day.

Also, if you are tired or always sitting on a chair, you may consider working in a standing position. I have recently started doing this, and it really feels great. I even wonder why I did not do this before! It is very good for posture and the back. Here is a picture of my setup at home:

working in a standing position

Some people recommend alternating between standing and sitting positions to avoid getting tired. But personally, I have no problem working for several hours in a standing position. If you don't have a support like mine in the picture, you could also use some boxes to raise your computer.

Another piece of advice: if you work on a laptop, consider using an external screen or an external keyboard. The reason is that if you put your laptop low, the keyboard will perhaps be at an appropriate height, but the screen will be too low and you will have to bend your neck. On the other hand, if you put your laptop higher, the screen will be at an appropriate height for your eyes, but the keyboard will be too high. Using an external screen or keyboard solves this problem.

Conclusion

In this blog post, I have discussed the importance of good life habits to stay a healthy researcher and avoid health problems later in life. If you have other suggestions related to this, please post them in the comment section below!
