Introduction to frequent subranking mining

Rankings are made in many fields, as we naturally tend to rank objects, persons or things, in different contexts. For example, in a singing or a sport competition, some judges will rank participants from worst to best and give prizes to the best participants. Another example is persons that rank movies or songs according to their tastes on a website by giving them scores.

If one has ranking data from several persons, it is  possible that the rankings appear quite different. However, using appropriate techniques, it is possible to extract information that is common to several ranking that can help to understand the rankings. For example, although a group of people may disagree on how they rank movies, many people may agree that Jurassic Park 1 was much better than Jurassic Park 2. The task of finding subrankings common to several rankings is a problem called frequent subranking mining. In this blog post, I will give an introduction to this problem. I will first describe the problem and some techniques that can be used to mine frequent subrankings.

The problem of subranking mining

The input of the frequent subranking mining problem is a set of rankings. For example, consider the following database containing four rankings named r1, r2, r3 and r4 where some food items are ranked by four persons. 

Ranking ID Ranking
r1 Milk < Kiwi < Bread < Juice
r2 Kiwi < Milk < Juice < Bread
r3 Milk < Bread < Kiwi < Juice
r4   Kiwi < Bread < Juice < Milk

The first ranking r1 indicates that the first person prefers juice to bread, prefers bread to kiwi, and prefers kiwi to milk. The other lines follow the same format.

To discover frequent subrankings, the user must specify a value for a parameter called the minimum support (minsup). Then, the output is the set of all frequent subrankings, that is all subrankings that appear in at least minsup rankings of the input database. Let me explain this with an example. Consider the ranking database of the above table and that minsup = 3. Then, the subranking  Juice < Milk is said to be a frequent subranking because it appears at least 3 times in the database. In fact, it appears exactly three times as shown below:

Ranking ID Ranking
r1 Milk < Kiwi < Bread < Juice
r2 Kiwi < Milk < Juice < Bread
r3 Milk < Bread < Kiwi < Juice
r4   Kiwi < Bread < Juice < Milk

The number of occurrence of a subranking is called its support (or occurrence frequency). Thus the support of he subranking Milk < Juice is 3. Another example is the subranking Kiwi < Bread < Juice which has a support of 2, since it appears in two rankings of the input database:

Ranking ID Ranking
r1 Milk < Kiwi < Bread < Juice
r2 Kiwi < Milk < Juice < Bread
r3 Milk < Bread < Kiwi < Juice
r4   Kiwi < Bread < Juice < Milk

Because he support of  Kiwi < Bread < Juice is less than the minimum support threshold, it is NOT a frequent subranking.

To give a full example, if we set minsup = 3, the full set of frequent subrankings is:

Milk < Bread   support : 3
Milk < Juice     support : 3
Kiwi < Bread   support : 3
Kiwi < Juice    support : 4
Bread < Juice support : 3

In this example, all frequent subrankings contains only two items. But if we set minsup = 2, we can find some subranking containing more than two items such as Kiwi < Bread < Juice, which has a support of 2.

This is the basic idea about the problem of frequent subranking mining, which was proposed in this paper:

Henzgen, S., & Hüllermeier, E. (2014). Mining Rank Data. International Conference on Discovery Science.

Note that in the paper, it is also proposed to then use the frequent subrankings to generate association rules.

How to discover the frequent subrankings?

In the paper by Henzgen & Hüllermeier, they proposed an Apriori algorithm to mine frequent subrankings. However, it can be simply observed that the problem of subrank mining can already be solved using the existing sequential pattern mining algorithms such as GSP (1996), PrefixSpan (2001), CM-SPADE (2014), and CM-SPAM (2014).  This was explained in an extended version of the “Mining rank data” paper published on Arxiv (2018) and other algorithms specially designed for subranking mining were proposed.

Thus, one can simply apply sequential pattern mining algorithms to solve the problem. I will show how to use the SPMF software for this purpose. First, we need to encode the ranking database as a sequence database. I have thus created a text file called kiwi.txt as follows:

@CONVERTED_FROM_TEXT
@ITEM=1=milk
@ITEM=2=kiwi
@ITEM=3=bread
@ITEM=4=juice
1 -1 2 -1 3 -1 4 -1 -2
2 -1 1 -1 4 -1 3 -1 -2
1 -1 3 -1 2 -1 4 -1 -2
2 -1 3 -1 4 -1 1 -1 -2

In that format, each line is a ranking. The value -1 is a separator and -2 indicates the end of a ranking.  Then, if we apply the CM-SPAM implementation of SPMF with minsup = 3, we obtain the following result:

milk -1 #SUP: 4
kiwi -1 #SUP: 4
bread -1 #SUP: 4
juice -1 #SUP: 4
milk -1 bread -1 #SUP: 3
milk -1 juice -1 #SUP: 3
kiwi -1 bread -1 #SUP: 3
kiwi -1 juice -1 #SUP: 4
bread -1 juice -1 #SUP: 3

which is what we expected, except that CM-SPAM also outputs single items (the first four lines above). If we dont want to see the single items, we can apply CM-SPAM with the constraint that we need at least 2 items, then we get the exact result of all frequent subrankings:

milk -1 bread -1 #SUP: 3
milk -1 juice -1 #SUP: 3
kiwi -1 bread -1 #SUP: 3
kiwi -1 juice -1 #SUP: 4
bread -1 juice -1 #SUP: 3

We can also apply other constraints on subrankings such as a maximum number of items using CM-SPAM. If you want to try it, you can download SPMF, and follows the instructions on the download page to install it. Then, you can create the kiwi.txt file, and run CM-SPAM as follows:

You will notice that in the above window, the minsup parameter is set to 0.7 instead of 3. The reason is that in the SPMF implementation, the minimum support is expressed as a percentage of the number of ranking (sequences) in the database. Thus, if we have 4 rankings, we multiply by 0.7, and we also obtain that minsup is equal to 3 rankings (a subranking will be considered as frequent if it appears in at least 3 rankings of the database). 

What is also interesting, is that we can apply other sequential pattern mining algorithms of SPMF to find different types of subrankingsVMSP to find the maximal frequent subrankings, CM-CLASP to find the closed subrankings, or even VGEN to find the generator subrankings.  We could also apply sequential rule mining algorithms such as RuleGrowth and CMRules to find rules between subrankings.

Conclusion

In this blog post, I discussed the basic problem of mining subrankings from rank data and that it is a special case of sequential pattern mining. 


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

(Visited 79 times, 1 visits today)

Comments

Introduction to frequent subranking mining — 2 Comments

  1. Thanks for writing this interesting post. I totally agree that mining subrankings is just a special case of mining sequential patterns. Thus, it’s hard to point out the clear differences between these two topics.

    Additionally, I don’t think it will be easy to find a real-world dataset which is suitable for subranking mining. As we can see that each transaction (or ranking) requires all items. In contrary, most real-world ranking datasets are incomplete, for example, on Netflix, people just give the scores for the movies they have watched, but not for all the movies in the database.

    • Hi Dang,

      Yes, I agree with you. That is a good observation. I think it would indeed be a challenge to deal with sparse matrices.

      In that paper about frequent subrankings, they only use one real dataset. It contains data about how some persons have ranked about 10 different types of Sushi. From what I understand, this database has no values that are missing. But, that seems like a very special case. Maybe there could be a way to extend the subranking approach to deal with missing values using something similar to uncertain itemset mining where there is some existential probability to estimate what is the likely value for a missing value. I don’t know. But yes, since the problem is basically a special case of sequential pattern mining, I think it is not too interesting.

      Best regards,
      Philippe

Leave a Reply

Your email address will not be published. Required fields are marked *