I have made two new videos to explain interesting topics about pattern mining. The first video is an introduction to sequential rule mining, while the second video explains in more details how the CMRules algorithm for sequential rule mining works!

If you want to try these algorithms, you can check the SPMF open-source software, which offers fast implementations of these algorithms.

Hope you will enjoy the videos. I will make more videos about pattern mining soon. By the way, you can also check my website about The Pattern Mining course. It gives videos and slides for a free online course on pattern mining. It explains all the main topics about pattern mining and is good for students who are starting to do research in this area. But this course is in beta version, which means that I am still updating it. More videos and content will be added over time.

Hi all, this is to let you know that I have made another video to explain some interesting pattern mining topics. This time, I will talk about periodic pattern mining.

Today, I will show how to draw the powerset of a set using Latex and TIKZ, to produce some nice figures that can be used in paper written using Latex. The code that is shown below is adapted from code from StackOverflow.

First, we will draw the powerset of the set {a,b} as a Hasse diagram:

Today, I am happy to announce a cool new feature in SPMF 2.54, which is a tool to automatically run performance experiments to compare several algorithms when a parameter is varied. This is a useful feature to compare algorithms when writing a research paper. The new tool to do performance experiments let you choose oneor more algorithms, indicate default parameter values and that a parameter must be varied. A time-out can be specified to avoid running algorithms for too long. Besides, each algorithm execution is done in a separate Java virtual machine to ensure that the comparison is always fair (e.g. memory will not accumulate from previous algorithm executions). When running the tool, the results can also be easily exported to Excel and Latex to generate charts for research papers. This tool can save a lot of time for performing experimental comparisons of algorithms!

Briefly the main way to use the new tool is to run the graphical user interface of SPMF and then select the tool from the list of algorithms:

This will open-up a new window, where we can configure the parameters for running the experiment:

For example, without going into details, here we choose five algorithms called Apriori, Eclat, FPGrowth_itemsets, FPClose and FPMax. We also select an input file called Chess.text on which we will run the algorithms and we select a directory called EXPERIMENTS to save all the results that will be generated to files. We also say that the algorithms have one parameter which will be varied (We indicate that a parameter is varied using a special code ##), and the parameter will be varied from 0.95 to 0.80. We say that the time limit for each execution will be 10 seconds, and we select the option that we want not only to compare time and memory usage but also the number of patterns (lines in the output) and save all results as Latex figures, as well.

Then we click “Run the experiments” to run the experiments. We will get summarized results as shown below for execution time, memory and number of patterns:

This was generated in just a few seconds. If we would have run these experiments by hand, it would have took us a lot more time. Now, after we have these results, since they are tab-separated we can directly import them into Excel to generate charts:

Besides, since we selected the option to generate PGFPlots figures for Latex, SPMF will have generated latex documents as output, that we can directly compile to use in Latex documents:

I will continue improving this tool to generate experiments in the next few weeks. In particular, I am now working on an option to generate scalability experiments as well, which should be released soon. Also, I will modify the tool to make it easier to compare algorithms from SPMF with algorithms that are not in SPMF.

Today, I will talk about signs that show that something is wrong or suspicious with some researchers based on their Google Scholar profiles. I will talk about this because I have recently received many CVs and while browsing Google Scholar, and I found some very suspicious profiles.

Normal Google scholar profiles

First, lets talk about what is a normal Google Scholar profile. Generally, a normal Google scholar profile shows a slow increase in the number of citations over time, and sometimes it will stagnates or decrease towards the end of the career of a researcher. Here are some examples of normal profiles:

And sometimes, the citations are increasing a bit more quickly due to some great contributions, but still the curve is quite smooth:

Abnormal Google scholar profiles (case 1)

Now let’s talk about some abnormal Google Scholar profiles. Let’s look at a first example:

What is wrong? Well, here we have a researcher that received 500 citations in 2017, and then about 20 citations in 2018, before going back to 400 citations per year in 2021. I don’t see any logical explanation that would explain a big drop like this other than citation manipulation. Even if we assume that the person did not publish any papers for two years, papers should still continue to be cited. So it seems to indicate that there is some manipulation of the citations that has been done.

Abnormal Google scholar profiles (case 2)

A third example is even more suspicious. Someone who had 1406 citations in 2017 suddenly dropped to 84 citations per year in 2018. I cannot see any reason for this other than some manipulations of citations.

Abnormal Google scholar profiles (case 3)

And here is another suspicious Google Scholar profiles.

Again, why would citations drop like this and never go up again? There is no logical reason other than citation manipulation.

Conclusion

I wrote this blog post to show that the Google Scholar Profiles can reveal many interesting things. Some profiles clearly show something abnormal. I will not write any names. That is not the point of this blog post. My goal is just to show that suspicious behavior can be easily detected in Google Scholar.

For young researcher, my advice is to always be honest, focus on the quality of papers rather than quantity, and do not try to cheat. To become a good researcher, it is important to build a good reputation and this start by doing quality work, being honest and following the rules.

Hope that this has been interesting. If you have any comments, please leave them in the comment section below.

In this blog post, I will show how to draw a beautiful FP-Tree data structure in a Latex document. The FP-Tree is a tree-like structure that was proposed in the FP-Growth algorithm for itemset mining, and is also used in many other pattern mining algorithms. I will show how to draw an FP-tree in Latex using the TIKZ library. An FP-tree consists of a table and a tree that are linked using some pointers (dashed arrows). The result is like this:

I took me a while to obtain this result, so here is the code:

\documentclass[a4paper, 12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{tikz}
\usetikzlibrary{positioning,matrix}
\begin{document}
\begin{figure}
\centering
\begin{tikzpicture}%%%%% THE TREE
\begin{scope}[->,font=\small,draw,circle, every node/.style={fill=white!10,shape=circle,draw},
edge from parent/.style={black,thick,draw},
level 1/.style={sibling distance=2.5cm},
level 2/.style={sibling distance=2.5cm}]
\node (TREE) {root}
child {node {$a$:3}
child {node {$b$:1} }
child {node {$c$:2}
child {node{$d$:1}}}}
child {node {$c$:2}
child {node {$d$:2}}
};
% Node links within the tree
\draw[->, dashed] (TREE-1-2) -- (TREE-2);
\draw[->, dashed] (TREE-1-2-1) -- (TREE-2-1);
\end{scope}
%%%%% THE TABLE
\begin{scope}[xshift=-5cm,yshift=-2cm,every tree node/.style={shape=rectangle,draw}]
\matrix (TABLE) [matrix of nodes,
row sep=-\pgflinewidth, column sep=-\pgflinewidth,
nodes={draw, text height=5mm,
align=center, minimum width=15mm, inner sep=0mm, minimum height=7mm}]{
Item & Link\\
$a$ & ~\\
$b$ & ~ \\
$c$ & ~\\
$d$ & ~\\
};
\end{scope}
%%%% LINKS FROM THE TABLE TO THE TREE
\draw[->,dashed] (TABLE-4-2.center) to (TREE-1-2.west);
\draw[->,dashed] (TABLE-5-2.center) to (TREE-1-2-1.west);
\end{tikzpicture}
\caption{An Example of FP-tree}
\label{fig:my_label}
\end{figure}
\end{document}
}

Hope that this will be useful! If you like it or if you want to suggest some improvement, please let me know in the comment section below or by e-mail.

The main focus of this workshop is about the concept of utility or importance in data mining and machine learning. The workshop is suitable for papers on pattern mining (e.g. high utility itemsets, frequent pattern mining, sequential pattern mining and other topics) but also for machine learning papers where these is a concept of utility or importance. In fact, the scope of the workshop can be quite broad, so please do not hesitate to submit your papers.

All the accepted papers are published by IEEE in the ICDM Workshop proceedings, which are indexed by EI, DBLP etc. The dates are as follows:

Paper submission deadline: 2nd September 2022

Paper notifications: 23rd September 2022

Camera-ready deadline and copyright forms: 1st October 2022

Today, I will explain how to analyze the complexity of pattern mining algorithms for topics such as itemset mining, sequential pattern mining, subgraph mining, sequential rule mining and periodic pattern mining. This is something that is often asked by reviewers especially when submitting papers about new pattern mining algorithms to journals. I explain the main idea of how to analyze complexity below, by breaking it down into several questions.

What to analyze?

Generally, we should analyze the complexity of a pattern mining algorithm in terms of time and also in terms of space.

The complexity depends on what?

Generally, the complexity depends on the number of items, which I will call n, and the number of records (e.g. transactions), which I will call m. Sometimes, we will also consider other aspects such as the average record length, which I will call z.

How to analyze the overall complexity?

To analyze the complexity, we need to think about every operations that an algorithm is doing and how it is affected by the factors such as m and n. It may seem to be difficult to talk about the complexity of pattern mining algorithms because an algorithm might process thousands of patterns and records, and do some complex operations as well like building tree structures, reading the database, etc. But in the end, we can break this down into several steps and analyze each step one by one, and then combine what we found to get the overall complexity, and it is not so difficult 😉

Generally, the time complexity of an algorithm can be broken down into two main parts like this:

The Time_Complexity_for_initial_operations refers to some basic operations that the algorithm must do before starting to search for patterns. The Time_Complexity_search_for_patterns refers to operations that are done during the search for patterns after performing the initial operations.

For the space complexity, we can also break it down into two parts:

Now lets look at this in more details. Below, we will discuss each part separately.

How to analyze the complexity of initial operations?

First we will first talk about the complexity of initial operations, that is Time_Complexity_for_initial_operations and Space_Complexity_for_initial_operations.

Initially a pattern mining algorithm will typically read the database one or more times and also build some initial data structures before it starts searching for patterns.

As example, let’s talk about the Eclat algorithm for itemset mining. Eclat initially reads the database once to build a vertical data structure for each item. Reading the database once can be said to require O(m * z) time, since the algorithm scans m transactions and transactions have an average length of z.

If the algorithm also builds some data structure, we need to take this into account in the complexity as well.

For example, an algorithm like Eclat builds a data structure for each item during the initial database scan. We need to build this data structure for each item. And building this data structure for an item requires to do one operation for each transaction that contains the item. If we assume the worst case, where each item appears in each transaction, we can say that building the data structures for all items is O(m * n) in the worst case. I say O(m*n) because for each item (n) we must add up to m entries in its data structure. So globally, for Eclat, we can say that the time complexity for initial operations (Time_Complexity_for_initial_operations) is O(m * z)+O(m * n)in the worst case. It is important to say “in the worst case” because this is what we analyzed.

We should also not forget to analyze the space complexity. It is done in a similar way.

If we take Eclat as example, the space complexity for initial operations is just to build the data structure for each item, so it is: Space_Complexity_for_initial_operations) is O(m * n)in the worst case based on the previous discussion.

We could also take other algorithms as example like FPGrowth. An algorithm like FPGrowth is more complex. It will build a tree structure as initial operations. In that case, we need to talk about the operations for building the tree and also about the space for storing that tree. I will not go into details but the size of the tree is in the worst case equal to the database size (if transactions do not overlap) so it would be something like O(m * z). This a rough analysis and we could analyze this into more details as I did not talk about pointers in that tree and the header list. But that is the rough idea.

How to analyze the complexity of the search for patterns?

The second part that we need to analyze to get the overall complexity of a pattern mining algorithm is the complexity of the search for patterns. This is what I called Time_Complexity_search_for_patterns and Space_Complexity_search_for_patterns.

So let’s start. Analyzing the complexity of the search for patterns may not seem easy because usually there is some recursive function that is called to search for patterns or even some kind of iterative process (e.g. in the Apriori algorithm). But we can still analyze this. So how to do? The easiest way to think about this is to think about how many patterns there are in the search space and then what is the complexity for processing each pattern. In other words:

So first, we must think about how many patterns there are in the search space. Usually, we don’t really know because it depends on the input data. So we can only talk about the worst case. The number of patterns is usually like this:

For itemset mining problems, if there are n items, there are 2^n -1 itemsets in the search space, in the worst case. Why? This is simply the number of possible combinations that we can make from n items. I will not do the proof.

For association rule mining or sequential rule mining problems, there are up to 3^m – 2^(m+1) + 1 rules in the search space, in the worst case. Again, I will not do the proof. See this blog post for more details about why.

Now, after we know how many patterns there are in the search space, we need to think about the complexity for processing each pattern.

As example, lets consider the Eclat algorithm. For the Eclat algorithm, for each pattern having more than 1 item, we need to build a data structure. This data structure contain m entries in the worst case (one entry for each transaction). To build the data structure, we must do the join of the data structures of two itemsets. This operation requires to scan one time each data structure to compare them and to build the new data structure which also contains up to m entries. All of this is done in O(m)time because we scan each structure once and it has up to m entries. The space complexity of the data structure that we build for a pattern is O(m) since it has at most one entry for each transaction in the worst case. So now, we can combine this information to get the complexity of the search for patterns of Eclat. Since we have 2^n -1 itemsets in the search space (in the worst case) and constructing the structure of each itemset is O(m) time, then Time_Complexity_search_for_patterns = O(2^n -1) * O(m) in the worst case, which we simplifies as O(2^n * m) For the Space complexity of Eclat, it is the same: Space_Complexity_search_for_patterns = O(2^n -1) * O(m) in the worst case, which we simplifies as O(2^n * m). Again, it is important to say that this is the worst case.

We could analyze other algorithms in the same way. For example, for an algorithm like FP-Growth, we will need to build an FP-tree for each frequent itemset in the search space. Since this is also an itemset mining problem, there are 2^n -1 itemsets to process, in the worst case (where all itemsets are frequent). Then, we would have to analyze the operations that are performed for each itemset (building an FP-tree) and the space required (for storing the FP-tree). I will not do the details here but gives some links to some papers as example at the end of this blog post.

Let me also briefly mention how to analyze Apriori-based algorithm that perform a level-wise search. For these algorithms like Apriori, iterations are performed. The number of iterations is in the worst case equal to the number of items, that is n. So, we could also analyze the complexity in terms of these iterations. For example, if Apriori does a database scan for each iterations, then the time complexity of those database scans would be O(m * n * z) since there are n iterations, and for each iteration we scan m transactions and each transaction has on average z items. But to analyze the time complexity of Apriori we also need to think that during each iterations we need to combine each pattern with other pattern to generate new pattern. This is a bit more complicated and it depends on how the algorithm is implemented to do those combinations so I will not go into details. But I just want to mention that iterative algorithms like Apriori should be analyzed by also taking the number of iterations into account.

Lastly, related to the complexity of the search for patterns, I would like to mention something else that is important to say in a complexity analysis. It is that above, I have discussed the worst case complexity of the search for patterns. However, for many algorithms, the search will become easier as larger patterns are explored. For example, for an algorithm like FPGrowth, we build an FP-tree for each frequent itemset and in the worst case that tree can be as large as the original database. This may seem terrible, but in practice when FP-Growth searches for larger itemsets, the FP-trees become smaller and smaller, and the branches of the tree become shorter and thus have more chances to overlap reducing the tree size. Thus, we can say that in practice, the size of these trees becomes smaller. Something similar can be said for algorithms such as Eclat. For Eclat, as itemsets becomes larger, the data structures that are stored for itemsets (the transaction id lists) tend to become smaller. Thus, less memory is consumed and joining the data structures becomes less costly. This does not change the complexity, but it helps the reader to understand that after all, the algorithm can behave well.

Putting this all together, to get the overall complexity

After we have analyzed the complexity for initial operations and the complexity of the search for patterns, we can put this all together to get the overall complexity of an algorithm.

For example, for Eclat, we have the time complexity in the worst case as Time_complexity_of_the_algorithm = Time_Complexity_for_initial_operations + Time_Complexity_search_for_patterns = O(m * z)+O(m * n) + O(2^n -1) * O(m) and the space complexity: Space_complexity_of_the_algorithm = Space_Complexity_for_initial_operations + Space_Complexity_search_for_patterns = O(m * n) + O(2^n -1) * O(m)

That is the main idea. We could also perhaps go into more details but this gives a good overview of the complexity of this algorithm.

Some examples of complexity analysis

Below, I have given the rough idea about how to analyze the complexity of pattern mining algorithms. If you want some examples, you may check the below papers, where there is some analysis of different types of algorithms:

TSPIN: I analyze the complexity of an itemset mining algorithm based on FP-Growth for mining periodic patterns. This also shows how to analyze the complexity of a top-k pattern mining algorithm.

EFIM: I analyze the complexity of an algorithm for high utility itemset mining that does a depth-first search and use a pattern-growth approach.

FHUQI-Miner: I analyze the complexity of a quantiative high utility itemset mining algorithm, which does a depth-first search, and is based on Eclat / HUI-Miner.

CEPB: In this paper, I analyze the complexity of a kind of sequential pattern mining algorithm that uses a pattern-growth approach, and perform a depth-first search.

LHUI-Miner: I analyze the complexity of another algorithm based on HUI-Miner for high utility itemset mining.

Conclusion

That was a blog post to give a rough idea about how to analyze the complexity of pattern mining algorithms. Hope that this blog post is helpful! If you have any comments, you may leave them in the comment section below. Thanks for reading.

Today, I will talk briefly about short papers at academic conferences, and the problems that they create (in my opinion). For those who don’t know, several conferences can reject or accept papers as short or full papers. A short paper means that the paper has fewer pages in the proceedings than the full papers. For instance, someone may submit a 12-page paper to a conference, which may then be accepted as a short paper that must fit in 6 pages.

Why there are short papers? The reason is quite simple. It allows conference organizers to accept more papers for their conference and thus to have more attendees, which means to earn more money through the registration fees (because it is the same price anyway for a short or a full paper). Short papers also allows conferences to accept some papers that are sometimes of low quality (almost rejected in some cases).

Why I dislike short papers? I think they are not a good idea. Let me explain why:

They force authors to remove many important details to try to squeeze their content in a small number of pages. For example, I recently got a paper accepted as short paper, and to make it fit within the page limit, I would have to either remove much of the technical details or remove some experiments. Either way, this will decrease the quality of the paper and the reader will miss some important details.

Not able to address the reviewers’ comments. In many cases, reviewers will give feedback on the paper saying that content must be added such as additional experiments. But if the paper is accepted as a short paper and the paper must be reduced by several pages, authors will be unable to address these comments, and authors may even have to do the opposite and remove experiments and details.

Quality control. Many conferences that accept short papers will not send the revised papers to reviewers again to ensure the quality of the revision. But after cutting let’s say 50% of the content, the quality has a good chance to decrease.

Less chances of having an impact. Short papers are less likely to have an impact or to be cited.

Probably less reproducible. Short papers are generally harder to reproduce because authors may have to left out much details.

The authors have to spend extra time to shorten their paper. To make a paper shorter, it can requires a few days of extra work.

It will likely not be possible to publish the results elsewhere as a full paper. Generally, if someone publish some results or ideas in a short paper, it will likely not be possible to make a full paper on the same idea at another conference (due to the overlap). Someone people may still do this by splitting their ideas into multiple papers but this is not a good practice.

For all these reasons, I dislike short papers, and often withdraw my papers if they are accepted as short papers, and then improve them and submit them to other conferences.

There are also some publishers such as Springer that acknowledge in some way the problem of short papers by mentioning that conferences are not expected to have too many short papers and that it is encouraged to not put a strict limit on the number of pages to avoid reducing the quality of papers.

That is all for today. If you have any comments, you may post them in the comment section below.

I have uploaded a new 38 minutes video to explain rare itemset mining. This video is like a small course, where you can learn about infrequent itemsets, minimal rare itemsets and perfectly rare itemsets and some algorithms : AprioriInverse and AprioriRare.