The story of the most influential paper award of PAKDD 2024

Recently, I attended the PAKDD 2024 conference, where I was happy to receive the most influential paper award with my co-authors. This is a test-of-time type of award, given to the paper from PAKDD 2014 that received the largest number of citations or had the largest impact over the last ten years. In this blog post, I will briefly tell the story of this paper and why it has been successful. Then, I will talk about some applications of the algorithms presented in the paper, and about how to get such an award.

The paper

I have received the award with my co-authors for this paper:

Fournier-Viger, P., Gomariz, A., Campos, M., Thomas, R. (2014). Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information. Proc. 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2014), Part 1, Springer, LNAI 8443, pp. 40-52. [ppt][source code]

This paper is about sequential pattern mining (SPM). Let me explain quickly what this is. SPM is a popular data mining task that is used to analyze sequence data. So what is a sequence? A sequence is an ordered list of symbols. Here are a few examples of sequences that we may find in real-life applications.

In this first example, we have a sequence of customer purchases indicating that a customer has bought an apple, then some bread, and then some cake. But sequences can also be found in many other domains. Another example is a sequence of words in a text:

In that sequence, the words are sequentially ordered. Yet another example, from a different domain, is a sequence of locations visited by a person driving a car in a city:

If we have data represented as sequences, we can apply the task of sequential pattern mining to find patterns (subsequences) that appear frequently in those sequences. The idea is to discover patterns that could reveal something about those sequences. For example, we may want to analyze the sequences of purchases made by several customers to find some sequences of purchases common to multiple customers. Let me show you how sequential pattern mining works with a simple example. Consider a database of four sequences representing the purchases made by four different customers:

In that example, the letters a, b, c, and d represent the purchase of apple, bread, cake, and dates, respectively. The first sequence, called S_1, indicates that the first customer has bought apple and bread at the same time, followed by buying cake, and then by purchasing apple. The other three sequences (S_2, S_3 and S_4) have a similar meaning.

Now, if we want to do sequential pattern mining with this data, we must set a parameter called the minimum support threshold (abbreviated as minsup). Consider that we decide to set this parameter as minsup = 3. It means that we want to find all the sequential patterns (subsequences) that appear in at least 3 sequences of the input sequence database.

Let me show you what is the output for minsup = 3:

For the input sequence database on the left and minsup = 3, the output is the list of sequential patterns presented on the right side of the above figure. Take the pattern <{a,b},{c}> as an example. This pattern is said to have a support of 3 because it occurs in three sequences of the input database, as highlighted in yellow below:

As can be seen in the above figure, apple and bread appear together and are followed by cake in the sequences S_1, S_2 and S_3. Hence, the pattern <{a,b},{c}> is said to have a support of 3, and since this value is no less than minsup, <{a,b},{c}> is also said to be a frequent sequential pattern and is output.

It can be observed in the sequence S_2 that there can be a gap between {a,b} and {c}, and this is allowed, since {a,b} still appears before {c}.

So to summarize, the task of sequential pattern mining is to find all the frequent sequential patterns in a database, given some minsup threshold set by the user. And a frequent sequential pattern is a subsequence that appears in at least minsup sequences.

In the above example, with minsup = 3, the output is 6 sequential patterns. For a given sequence database and minsup value, there is always exactly one solution, which is the set of patterns to be discovered. The challenge is to design efficient algorithms to find this solution.
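To make these definitions more concrete, here is a small self-contained Java sketch (written for this blog post; it is not code from the paper or from SPMF) that checks whether a sequential pattern occurs in a sequence and counts its support in a toy database. The sequence S_1 follows the example above, while S_2 to S_4 are only illustrative, since the figure with the full example database is not reproduced here.

import java.util.*;

/** Illustrative sketch: counting the support of a sequential pattern. */
public class SupportCount {

	/** Checks if 'pattern' is a subsequence of 'sequence': each itemset of the
	 *  pattern must be included in an itemset of the sequence, at strictly
	 *  increasing positions (gaps are allowed). */
	static boolean contains(List<Set<Character>> sequence, List<Set<Character>> pattern) {
		int pos = 0; // current position in the sequence
		for (Set<Character> itemset : pattern) {
			while (pos < sequence.size() && !sequence.get(pos).containsAll(itemset)) {
				pos++;
			}
			if (pos == sequence.size()) {
				return false; // no remaining itemset contains this part of the pattern
			}
			pos++; // the next pattern itemset must appear strictly after this one
		}
		return true;
	}

	/** The support of a pattern = the number of sequences containing it. */
	static int support(List<List<Set<Character>>> database, List<Set<Character>> pattern) {
		int count = 0;
		for (List<Set<Character>> sequence : database) {
			if (contains(sequence, pattern)) {
				count++;
			}
		}
		return count;
	}

	public static void main(String[] args) {
		// S_1 = <{a,b},{c},{a}> as in the example; S_2 to S_4 are illustrative only
		List<Set<Character>> s1 = List.of(Set.of('a', 'b'), Set.of('c'), Set.of('a'));
		List<Set<Character>> s2 = List.of(Set.of('a', 'b'), Set.of('d'), Set.of('c'));
		List<Set<Character>> s3 = List.of(Set.of('a', 'b'), Set.of('c'), Set.of('d'));
		List<Set<Character>> s4 = List.of(Set.of('b'), Set.of('c'), Set.of('d'));
		List<List<Set<Character>>> database = List.of(s1, s2, s3, s4);

		// The pattern <{a,b},{c}> has support 3 in this toy database
		List<Set<Character>> pattern = List.of(Set.of('a', 'b'), Set.of('c'));
		System.out.println("Support of <{a,b},{c}>: " + support(database, pattern));
	}
}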

So what was the paper about? In that paper, we presented a new optimization called co-occurrence pruning that can considerably speed up sequential pattern mining algorithms, in some cases by up to 10 times. The improved algorithms presented in the paper are called CM-SPAM, CM-SPADE and CM-CLASP, which are improved versions of the classical SPAM, SPADE and CLASP algorithms.
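To give an intuition of the co-occurrence pruning idea, here is a simplified Java sketch (an illustration of the general principle only; the actual CMAP structure in the paper also distinguishes i-extensions from s-extensions and is integrated inside the candidate generation of SPAM, SPADE and CLASP). It precomputes, for each item, the items that follow it in at least minsup sequences, and then uses this map to discard candidate extensions without performing the costly join operation.

import java.util.*;

/** Simplified illustration of co-occurrence pruning (not the paper's exact CMAP). */
public class CooccurrencePruning {

	/** For each item x, the set of items that appear after x (in a later itemset)
	 *  in at least minsup sequences of the database. */
	static Map<Character, Set<Character>> buildCMAP(List<List<Set<Character>>> database, int minsup) {
		Map<Character, Map<Character, Integer>> counts = new HashMap<>();
		for (List<Set<Character>> sequence : database) {
			// count each pair (x, y) at most once per sequence
			Set<String> seenPairs = new HashSet<>();
			for (int i = 0; i < sequence.size(); i++) {
				for (char x : sequence.get(i)) {
					for (int j = i + 1; j < sequence.size(); j++) {
						for (char y : sequence.get(j)) {
							if (seenPairs.add(x + "," + y)) {
								counts.computeIfAbsent(x, k -> new HashMap<>())
								      .merge(y, 1, Integer::sum);
							}
						}
					}
				}
			}
		}
		// keep only the pairs that reach the minimum support
		Map<Character, Set<Character>> cmap = new HashMap<>();
		counts.forEach((x, followers) -> followers.forEach((y, c) -> {
			if (c >= minsup) {
				cmap.computeIfAbsent(x, k -> new HashSet<>()).add(y);
			}
		}));
		return cmap;
	}

	/** Pruning check: a pattern ending with 'lastItem' cannot be frequent when
	 *  extended with 'newItem' if the two items do not co-occur in at least
	 *  minsup sequences, so the costly join for that candidate can be skipped. */
	static boolean canPrune(Map<Character, Set<Character>> cmap, char lastItem, char newItem) {
		return !cmap.getOrDefault(lastItem, Set.of()).contains(newItem);
	}
}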

The paper won the most influential paper award, having received over 340 citations according to Google Scholar, as shown below:

Citations to the paper from 2014 to 2024

These citations are mainly of two types: (1) applications of the improved algorithms (CM-SPAM, CM-SPADE and CM-CLASP) to real-life problems, and (2) papers that have used the proposed optimization to develop other similar algorithms.

The story of this paper

So now, let me explain the story behind this paper by going back to 2013-2014. The paper was written by me and three co-authors, shown below:

At that time, I was a young professor working on pattern mining, and Antonio and Rincy were students, while Manuel was the supervisor of Antonio. But initially, we did not know each other.

From my side, I had started to develop the SPMF open-source pattern mining software in 2008, which is free software in Java offering efficient implementations of many pattern mining algorithms. Then, around 2013, I received several e-mails from Rincy to discuss sequential pattern mining algorithms:

In particular, Rincy wanted to know which sequential pattern mining algorithm was the best. He really wanted to find the answer and did several interesting experiments about this with SPMF. But at that time, we did not have the source code of all the best algorithms to make an exhaustive comparison.

Then, I started to discuss by e-mail with Antonio from Spain, who had just published a paper at PAKDD 2013 about the CLASP algorithm for closed sequential pattern mining. Antonio agreed to share with me the code of many additional algorithms, including GSP, SPADE, SPAM, PrefixSpan, CloSpan and CLASP.

So now, we had many algorithm implementations for comparing sequential pattern mining algorithms. I then continued discussing with Rincy through e-mails:

As I recall, he made an important observation: vertical algorithms such as SPAM generate too many candidates, and the main cost of such algorithms is the join operation that is performed to calculate the support of each candidate. Thus, if we could find a way to reduce the number of candidates, we might be able to speed up the algorithms…

Then, based on that observation, I designed the new co-occurrence pruning optimization that is presented in the paper. The optimization was implemented by me and Antonio in several algorithms, we all participated in writing the paper, and Manuel also helped with the paper. Then, we submitted it to PAKDD 2014… and it got accepted!

At that time, as I remember, the paper was accepted with maybe two accept recommendations and one weak accept. Thus, it was not regarded as the best paper of PAKDD 2014, but 10 years later, it is arguably the paper that had the biggest impact. From this, we may draw the conclusion that reviewers are not always right. 🙂

Why was it successful?

I believe that the main reasons why the paper was successful are the following:

  • The paper is about a fundamental topic (sequential pattern mining) that can have applications in many fields.
  • We showed a clear performance improvement over the state-of-the-art algorithms and compared with many algorithms on several datasets.
  • I promoted the paper by talking about it to other researchers on numerous occasions, by having a website and a blog, and also by mentioning this paper in several of my own papers, including a survey on sequential pattern mining.
  • I published the source code of the algorithms and datasets in the SPMF pattern mining software. Thus, it is easy for anyone to reuse my code, apply it to other domains or extend it (see the small example after this list).
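As an illustration of how easily these implementations can be reused, below is a minimal sketch of how CM-SPAM can be called from Java through the SPMF library, following the pattern of the example files distributed with SPMF. The exact package name, the runAlgorithm signature and whether the minimum support is given as a fraction or an absolute count are assumptions here and should be checked against the documentation of the SPMF version you use.

// Minimal sketch of running CM-SPAM through SPMF (based on the example files
// distributed with the library; the package name and parameter semantics are
// assumptions to verify against the SPMF documentation).
import ca.pfv.spmf.algorithms.sequentialpatterns.spam.AlgoCMSPAM;

public class RunCMSPAM {
	public static void main(String[] args) throws Exception {
		String input = "contextPrefixSpan.txt"; // a sequence database in SPMF format
		String output = "patterns.txt";         // output file for the sequential patterns

		AlgoCMSPAM algo = new AlgoCMSPAM();
		algo.runAlgorithm(input, output, 0.5);  // 0.5 = minimum support threshold
		algo.printStatistics();                 // runtime, memory usage and pattern count
	}
}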

Applications

The algorithms from that paper, and from the SPMF pattern mining software in general, have been used in a wide range of applications from multiple fields. Some are listed in the picture below:

In particular, two representative applications of the sequential pattern mining algorithm CM-SPAM are presented in those two papers written by my team:

Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier. DOI: 10.1016/j.cose.2022.102741

Nawaz, M. S., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier, 158: 106814. DOI: 10.1016/j.compbiomed.2023.106814

In the first paper above, we applied sequential pattern mining to analyze the behavior of malware programs such as computer viruses, worms and trojans. In this case, the data are sequences of API calls made by programs, and we extract sequential patterns to detect (classify) different types of malware. More precisely, the sequential patterns were used as features to train different classifiers, and excellent performance was achieved compared to state-of-the-art approaches.
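To give an idea of how sequential patterns can serve as classification features, here is a simplified Java sketch of the general methodology (an illustration only, not the exact MalSPM pipeline; the API call names below are hypothetical).

import java.util.*;

/** Simplified sketch: using sequential patterns as binary classification features. */
public class PatternFeatures {

	/** True if 'pattern' occurs as a subsequence of 'sequence' (order preserved, gaps allowed). */
	static boolean occursIn(List<String> sequence, List<String> pattern) {
		int i = 0;
		for (String call : sequence) {
			if (i < pattern.size() && call.equals(pattern.get(i))) {
				i++;
			}
		}
		return i == pattern.size();
	}

	/** One binary feature per discovered pattern: 1 if the program's API call sequence
	 *  contains the pattern, 0 otherwise. The resulting vectors can then be fed to
	 *  any standard classifier. */
	static int[] toFeatureVector(List<String> apiCalls, List<List<String>> patterns) {
		int[] features = new int[patterns.size()];
		for (int k = 0; k < patterns.size(); k++) {
			features[k] = occursIn(apiCalls, patterns.get(k)) ? 1 : 0;
		}
		return features;
	}

	public static void main(String[] args) {
		// Hypothetical API call names, for illustration only
		List<String> apiCalls = List.of("OpenFile", "ReadFile", "Connect", "Send", "CloseFile");
		List<List<String>> patterns = List.of(
				List.of("OpenFile", "ReadFile"),
				List.of("Connect", "Send"),
				List.of("Send", "ReadFile"));
		System.out.println(Arrays.toString(toFeatureVector(apiCalls, patterns))); // prints [1, 1, 0]
	}
}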

In the second paper above, a similar methodology based on sequential pattern mining is used, but for analyzing biological viruses. In this case, the sequences are genome sequences. Excellent results are also obtained.

Those two papers are examples, but in fact, sequential pattern mining can be applied in numerous other domains.

How to get such an award?

So, before concluding, how can one get such an award? I provide a summary in the picture below:

Conclusion

I hope that you have enjoyed this blog post! If you have any comments, you may write in the comment section below.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


A brief report about PAKDD 2024

This week, I attended PAKDD 2024 in the city of Taipei. It was a great conference with good keynote speakers, activities and opportunities for learning and networking. In this blog post, I will give a brief overview of the conference and some news about what will happen in the following years.

What is PAKDD?

PAKDD (Pacific-Asia Conference on Knowledge Discovery and Data Mining) is an international conference focused on data mining and, in recent years, also machine learning. PAKDD is the main data mining conference in the Pacific-Asia area. It is a long-standing conference; this year was the 28th edition (PAKDD 2024).

I like PAKDD conferences and have attended them many times. If you are interested, you may also read my previous reports about PAKDD 2014, PAKDD 2015, PAKDD 2017, PAKDD 2018, PAKDD 2019, and PAKDD 2020.

Conference proceedings

As usual, the conference proceedings of PAKDD are published in the Springer LNAI (Lecture Notes in Artificial Intelligence) series. As the number of papers has been increasing over the years, the proceedings are nowadays published as six books:

The proceedings were made available on the conference website and were not given as a book or USB drive as was done in the past. I assume that this is to be environmentally friendly, which is reasonable.

Acceptance rate at PAKDD 2024

This year, there were 720 submissions. From those, 175 papers were accepted, including 133 for oral and 42 for poster presentations. Thus, the overall acceptance rate is about 24%. The papers were evaluated by a program committee consisting of 595 researchers.

Location

This year, the location was the city of Taipei, on the island of Taiwan. It is a nice, modern city, and the conference was held at the Taipei International Convention Center (TICC), which is well located in the center of the city and quite easy to access from the airport. So, it was a good location.

Workshops

At the conference, there were also six workshops on a variety of topics, including Fintech, affective computing, clustering, robust machine learning, temporal analytics, and pattern mining.

And in particular, I have co-organized the UDML 2024 workshop on Utility-Driven Mining and Learning (see my report about UDML 2024 here). At this workshop, we had an excellent keynote speech with Prof. Jian Pei:

The talk was about the role of data valuation in federated learning. A key point was that we cannot expect different actors to collaborate in federated learning if their own interests (e.g. in terms of money) are not taken into account. Some models were described to solve this issue.

Other activities

There was also an industry exhibition with several companies, which was refreshing. Several of the companies were from Taiwan and use machine learning and data science techniques. There was also a company offering cloud services.

Some other interesting activities were the poster session, tutorials and keynote speeches. I talked with several interesting people at the poster session. For the keynote speeches of the main conference, there was a keynote by Ed H. Chi, a researcher from Google DeepMind, about LLMs (Large Language Models). Another keynote was by Prof. Vipin Kumar about environmental data science, and another by Prof. Huan Liu, also about LLMs.

Here is a picture of the poster session:

Social activities

The conference was overall very well organized. There were several social activities to allow researchers to talk with each other. On the first day, there was a welcome reception at the TICC in the evening:

There was also a tour of the National Palace Museum on the evening of the third day, followed by a banquet at the Silk Palace restaurant. Here is a picture from the banquet:

During the banquet, there was a good music performance, featuring music from around the world:

There was also a performance where an artist drew different scenes using sand, such as this one:

Several awards were also announced at the banquet. Here we can see Prof. Vincent S. Tseng receiving the well-deserved Distinguished Service Award:

I also received the most influential paper award with my co-authors for a paper on sequential pattern mining that was published at PAKDD 2014 and received the most citations over the last ten years.

Some other important awards were given as follows:

  • Distinguished Research Contribution Award: Jiawei Han
  • Early Career Research Award: Yu-Feng Li
  • Best Paper Award: Interpreting Pretrained Language Models via Concept Bottlenecks by Zhen Tan, Lu Cheng, Song Wang, Bo Yuan, Jundong Li, Huan Liu
  • Best Student Paper Award: Towards Cost-Efficient Federated Multi-Agent RL with Learnable Aggregation by Yi Zhang, Sen Wang, Zhi Chen, Xuwei Xu, Stano Funiak, Jiajun Liu

It was also announced at the banquet that PAKDD 2025 will be held in Sydney, Australia:

That should be quite exciting. And there were some rumors that PAKDD 2026 might be in Hong Kong.

Conclusion

This was a brief overview of the PAKDD 2024 conference in Taipei. I hope you have enjoyed this blog post. I am looking forward to PAKDD 2025.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Report on the UDML 2024 workshop @ PAKDD 2024

Today was the 6th International Workshop on Utility-Driven Mining and Learning (UDML 2024), held at the PAKDD 2024 conference. The workshop was a success. There were many people in attendance (around 20), which is good considering that PAKDD is not a very large conference and that about six workshops and tutorials were running at the same time.

Keynote speech by Prof. Jian Pei

A highlight of the UDML workshop was the invited talk by Prof. Jian Pei from Duke University. He is a famous researcher in data science who has made many very important contributions to the field. The talk was called “Data valuation in federated learning” and was very interesting. Prof. Pei first introduced the topic of federated learning, a popular topic, and explained that a key issue with many current models is that the monetary aspect is not taken into account. In fact, many researchers assume that different companies or organizations will want to share their data or collaborate to create models using federated learning, but do not consider that actors need a reward to do so, which could, for example, be in monetary form.

To solve this problem, his team proposed models for federated learning that would ensure some form of fairness and other desirable properties. This is just a brief summary of the idea of this talk. Here are a few slides from the presentation:

Research papers

This year, the UDML workshop was competitive, with 23 submissions and 9 papers accepted. The papers were on various topics, including machine learning, but mainly focused on pattern mining (high utility pattern mining, sequential pattern mining, itemset mining, and some applications).

The proceedings of the workshop can be downloaded from this page. Here are a few pictures of some of the speakers:

Best paper award

In the opening ceremony of UDML, it was announced that the best paper award was given to the following paper, which proposes a novel algorithm for finding co-location patterns in spatial data:

A detection of multi-level co-location patterns based on column calculation and DBSCAN clustering
Ting Yang, Lizhen Wang, Lihua Zhou and Hongmei Chen

Congratulations to the winners!

Group photo

At the end of the workshop, some photos were taken with some of the attendees.

Conclusion

That was a brief overview of the UDML 2024 workshop for this year. In a follow-up blog post, I will try to tell you more about the other activities of the PAKDD 2024 conference. PAKDD is quite an interesting conference, especially for meeting researchers from the data mining and machine learning community in the Pacific-Asia area.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Upcoming SPMF features for v.2.62 – More Dataset Stats Tools

Today, I just want to talk to you about some upcoming features of the next SPMF version, which will be called 2.62. One feature that I am currently adding is more tools to calculate statistics about datasets, as you can see in the picture below:

Previously, there were only a few tools of this type, only for sequence databases, graph databases, and transaction databases. In the next version 2.62, there will be such a tool for all the most important dataset types that can be read by SPMF.
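To illustrate what such a statistics tool computes, here is a small Java sketch (written for this post, not the actual SPMF tool) that reads a sequence database in the standard SPMF format, where items are positive integers, -1 marks the end of an itemset and -2 marks the end of a sequence, and prints a few basic statistics.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Illustrative sketch (not the actual SPMF tool): basic statistics about a
 *  sequence database in SPMF format (-1 ends an itemset, -2 ends a sequence). */
public class SequenceDBStats {
	public static void main(String[] args) throws IOException {
		List<String> lines = Files.readAllLines(Paths.get("sequences.txt"));
		int sequenceCount = 0;
		long itemsetCount = 0;
		long itemOccurrences = 0;
		for (String line : lines) {
			if (line.isEmpty() || line.startsWith("#") || line.startsWith("@")) {
				continue; // skip blank lines, comments and metadata
			}
			sequenceCount++;
			for (String token : line.trim().split("\\s+")) {
				if (token.equals("-1")) {
					itemsetCount++;     // end of an itemset
				} else if (!token.equals("-2")) {
					itemOccurrences++;  // a regular item
				}
			}
		}
		System.out.println("Number of sequences: " + sequenceCount);
		System.out.println("Average itemsets per sequence: "
				+ (sequenceCount == 0 ? 0 : (double) itemsetCount / sequenceCount));
		System.out.println("Average items per sequence: "
				+ (sequenceCount == 0 ? 0 : (double) itemOccurrences / sequenceCount));
	}
}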

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


UDML 2024 Workshop program @ PAKDD 2024

The UDML 2024 workshop on Utility-Driven Mining and Learning is coming soon. This year is the 6th edition, and it will be held next week at PAKDD 2024 in Taipei. I am glad to announce that we have a very good program this year, with a keynote speech by Prof. Jian Pei (a world-class computer scientist), as well as 9 papers accepted from 23 submissions:

UDML SESSION 1 (13:40-15:10)

Session chair: Vincent S. Tseng, Philippe Fournier-Viger

13:40: Opening

13:45: Keynote talk: Data valuation in federated learning (Prof. Jian Pei) ***

14:25: Explainability of Highly Associated Fuzzy Churn Patterns
Yu-Chung Wang, Jerry Chun-Wei Lin and Lars Arne Jordanger

14:40: Incremental skyline frequent-utility itemset mining
Xiaojie Zhang, Guoting Chen, Linqi Song and Wensheng Gan

14:55: A detection of multi-level co-location patterns based on column calculation and DBSCAN clustering
Ting Yang, Lizhen Wang, Lihua Zhou and Hongmei Chen

Coffee break (15:10-15:30)

UDML SESSION 2 (15:30-17:00)

Session chair: Philippe Fournier-Viger

15:30: An Effective Correlated High Utility Itemset Mining Algorithm
Priscilla Okai Owiredu, Vincent Mwintieru Nofong, Selasi Kwashie and Michael Bewong

15:45: Facial Landmark Detection: An Attentive Dropout-Based Occlusion-Adaptive Deep Network
Muhammad Sadiq, Liang Junwei, Geng Yu and Zhang Yunsheng

16:00: Learning Path Recommendation for MOOCs Using Sequential Patterns and Matching Similarity
Wei Song and Qihao Zhang

16:15: A New Closed High Utility Itemsets with Loose Restriction
Qinghua Zhang, Mu-En Wu, Chien-Ping Chung and Jimmy Ming-Tai Wu

16:30: Genetic Algorithm for Efficient Descriptive Pattern Mining
Muhammad Zohaib Nawaz, M. Saqib Nawaz and Philippe Fournier-Viger

16:45: Metamorphic Testing of High-Utility Itemset Mining
Tzung-Pei Hong, Rang Lee, Bay Vo and Shu-Min Lee

Special issue and best paper award

All the accepted papers are also invited to submit an extended version to a special issue of the Expert Systems journal. A best paper award will also be announced next week!


SPMF: bug fix about screen resolution

Hi all, this is just to let you know that I found a problem with the user interface of SPMF on low-resolution screens in update 2.60. The table for setting the parameters of algorithms was not appearing properly. I fixed the bug and replaced the files spmf.jar and spmf.zip on the website. If you had this issue, you may download the software again.

If you have any other issues with the new version of SPMF, please send me an e-mail to philfv AT qq DOT com.

In particular, if you try to compile the Java code, you may have to update your Java version to the latest version.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


SPMF 2.60 is released!

This is a short message today to announce that the new version of SPMF 2.60 is finally released!

This is a major version, as it contains many new things. The full list of changes can be found on the download page. Some of the main improvements are 18 new algorithms, 21 new tools to visualize different types of data, several improvements to the user interface (some less visible than others), and several new tools, such as a workflow editor for running more than one algorithm one after the other, and new tools for data generation and transformation. Here is a picture of a few of the new windows in the graphical user interface:

Besides, for developers of algorithms, a collection of new data structures optimized for primitive types (int, double, etc.) is provided in the package ca.pfv.spmf.datastructures.collections, which can replace several standard Java data structures to speed up algorithms or reduce memory usage. Here is a screenshot of some of those data structures:
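To illustrate why collections for primitive types can reduce memory usage, here is a minimal sketch of a growable list of ints (an illustration of the general idea only, not the actual classes shipped in ca.pfv.spmf.datastructures.collections). Unlike an ArrayList<Integer>, it stores raw int values and thus avoids boxing each value into an Integer object.

import java.util.Arrays;

/** Minimal sketch of a growable list of primitive ints (illustration only;
 *  not the actual implementation in ca.pfv.spmf.datastructures.collections).
 *  Storing raw ints avoids the memory overhead of boxing each value. */
public class IntList {
	private int[] data = new int[8];
	private int size = 0;

	public void add(int value) {
		if (size == data.length) {
			data = Arrays.copyOf(data, data.length * 2); // grow when full
		}
		data[size++] = value;
	}

	public int get(int index) {
		if (index >= size) {
			throw new IndexOutOfBoundsException("Index: " + index + ", size: " + size);
		}
		return data[index];
	}

	public int size() {
		return size;
	}
}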

I have also fixed several bugs in the software (thanks to all users who reported them). It is possible that some bugs remain, especially because there is a lot of new code. If you find any problems, please let me know at philfv AT qq DOT com. You can also let me know about your suggestions for improvements, if you have some ideas. 🙂 If you want to contribute code to SPMF, please contact me (for example, if you would like me to integrate your algorithm into the software).

Thanks again to all users of SPMF and the contributors, who support this project and make it better.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


How to download an offline copy of the SPMF documentation?

Today, I will show you how to download an offline copy of the SPMF documentation.

In the upcoming version 2.60 of SPMF, you can run this algorithm to open the developer tools window:

Then you can click here to open the tool to download an offline copy of the SPMF documentation:

This will open a window to start the download:

Then, you will have a local copy of the SPMF documentation on your computer, where the main page is documentation.html:

If you want to download a copy of the SPMF documentation directly using Java code, here is how it is done:


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * Copyright (c) 2022 Philippe Fournier-Viger
 *
 * This file is part of the SPMF DATA MINING SOFTWARE
 * (http://www.philippe-fournier-viger.com/spmf).
 *
 * SPMF is free software: you can redistribute it and/or modify it under the
 * terms of the GNU General Public License as published by the Free Software
 * Foundation, either version 3 of the License, or (at your option) any later
 * version.
 *
 * SPMF is distributed in the hope that it will be useful, but WITHOUT ANY
 * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
 * A PARTICULAR PURPOSE. See the GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along with
 * SPMF. If not, see <http://www.gnu.org/licenses/>.
 */
/**
 * This is a tool to download an offline copy of the SPMF documentation.
 * 
 * @author Philippe Fournier-Viger
 *
 */
public class AlgoSPMFDownloadDoc {

	/** The URLs that have been already downloaded */
	Set<String> alreadyDownloaded;

	/** Method to run this algorithm
	 */
	public void runAlgorithm() {
		alreadyDownloaded = new HashSet<String>();
		String mainUrl = "https://philippe-fournier-viger.com/spmf/index.php?link=documentation.php";
		String folderPath = "doc";
		createDirectory(folderPath);
		savePage(mainUrl, folderPath + "/documentation.html", mainUrl);

		BufferedReader br = null;
		try {
			// Download the main documentation page
			URL url = new URL(mainUrl);
			HttpURLConnection conn = (HttpURLConnection) url.openConnection();
			br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
			String inputLine;
			StringBuilder content = new StringBuilder();
			while ((inputLine = br.readLine()) != null) {
				content.append(inputLine);
				content.append(System.lineSeparator());
			}

			// Replace all .php references with .html in the content
			String updatedContent = content.toString().replaceAll("\\.php", ".html");

			// Save CSS files
			Pattern cssPattern = Pattern.compile("href=\"(.*?\\.css)\"");
			Matcher cssMatcher = cssPattern.matcher(updatedContent);
			while (cssMatcher.find()) {
				String cssLink = cssMatcher.group(1);
				savePage(cssLink, folderPath + "/" + cssLink.substring(cssLink.lastIndexOf('/') + 1), mainUrl);
			}

			// Save pages and images that start with "Example"
			Pattern examplePattern = Pattern.compile("<a href=\"([^\"]+)\">Example");
			Matcher exampleMatcher = examplePattern.matcher(updatedContent);
			while (exampleMatcher.find()) {
				String exampleLink = exampleMatcher.group(1);
				savePage(exampleLink, folderPath + "/" + exampleLink, mainUrl);

			}

		} catch (MalformedURLException e) {
			System.err.println("The URL provided is not valid: " + mainUrl);
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("An I/O error occurred while processing the URL: " + mainUrl);
			e.printStackTrace();
		} finally {
			if (br != null) {
				try {
					br.close();
				} catch (IOException e) {
					System.err.println("An error occurred while closing the BufferedReader.");
					e.printStackTrace();
				}
			}
		}
	}

	/**
	 * Method to create a folder
	 * @param folderPath the path
	 */
	private void createDirectory(String folderPath) {
		try {
			Files.createDirectories(Paths.get(folderPath));
		} catch (IOException e) {
			System.err.println("An error occurred while creating the directory: " + folderPath);
			e.printStackTrace();
		}
	}

	/**
	 * Method to save a webpage
	 * @param urlString the url
	 * @param filePath the filepath where it should be saved
	 * @param baseUri the base URI
	 */
	private void savePage(String urlString, String filePath, String baseUri) {
		if (alreadyDownloaded.contains(urlString)) {
			return;
		}
		alreadyDownloaded.add(urlString);

		BufferedReader reader = null;
		try {
			URL url;
			// Check if the URL is absolute or relative
			if (urlString.startsWith("http://") || urlString.startsWith("https://")) {
				url = new URL(urlString);
			} else {
				// Convert relative URL to absolute URL
				URI base = new URI(baseUri);
				url = base.resolve(urlString).toURL();
			}

			HttpURLConnection conn = (HttpURLConnection) url.openConnection();
			reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
			StringBuilder contentBuilder = new StringBuilder();
			String inputLine;
			while ((inputLine = reader.readLine()) != null) {
				contentBuilder.append(inputLine);
				contentBuilder.append(System.lineSeparator());
			}

			// Change the file extension from .php to .html
			filePath = filePath.replace(".php", ".html");

			// Update links in the content
			String content = contentBuilder.toString();
			content = content.replaceAll("href=\"([^\"]+).php\"", "href=\"$1.html\"");
			content = content.replaceAll("https://www.philippe-fournier-viger.com/spmf/index.php\\?link=documentation\\.html", "documentation.html");

	        // Find and save images
	        Pattern imgPattern = Pattern.compile("src=\"([^\"]+\\.(png|jpg))\"");
	        Matcher imgMatcher = imgPattern.matcher(content);
	        while (imgMatcher.find()) {
	            String imgLink = imgMatcher.group(1);
	            String imgName = imgLink.substring(imgLink.lastIndexOf('/') + 1);
	            saveImage(imgLink, "doc/" + imgName, baseUri);
	        }
	        
			// Save the updated content to file
			Files.write(Paths.get(filePath), content.getBytes(StandardCharsets.UTF_8));
		} catch (URISyntaxException e) {
			System.err.println("The URI provided is not valid: " + urlString);
			e.printStackTrace();
		} catch (MalformedURLException e) {
			System.err.println("A malformed URL has occurred for the URI: " + urlString);
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("An I/O error occurred while saving the page: " + urlString);
			e.printStackTrace();
		} finally {
			if (reader != null) {
				try {
					reader.close();
				} catch (IOException e) {
					System.err.println("An error occurred while closing the BufferedReader.");
					e.printStackTrace();
				}
			}
		}
	}
	
	/**
	 * Method to save an image
	 * @param urlString the url
	 * @param filePath the filepath where it should be saved
	 * @param baseUri the base URI
	 */
	private void saveImage(String urlString, String filePath, String baseUri) {
		if (alreadyDownloaded.contains(urlString)) {
			return;
		}
		alreadyDownloaded.add(urlString);
		
	    InputStream in = null;
	    try {
	        URL url;
	        // Check if the URL is absolute or relative
	        if (urlString.startsWith("http://") || urlString.startsWith("https://")) {
	            url = new URL(urlString);
	        } else {
	            // Convert relative URL to absolute URL
	            URI base = new URI(baseUri);
	            url = base.resolve(urlString).toURL();
	        }
	        
	        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
	        in = conn.getInputStream();
	        Files.copy(in, Paths.get(filePath), StandardCopyOption.REPLACE_EXISTING);
	    } catch (URISyntaxException e) {
	        System.err.println("The URI provided is not valid: " + urlString);
	        e.printStackTrace();
	    } catch (MalformedURLException e) {
	        System.err.println("A malformed URL has occurred for the URI: " + urlString);
	        e.printStackTrace();
	    } catch (IOException e) {
	        System.err.println("An I/O error occurred while saving the image: " + urlString);
	        e.printStackTrace();
	    } finally {
	        if (in != null) {
	            try {
	                in.close();
	            } catch (IOException e) {
	                System.err.println("An error occurred while closing the InputStream.");
	                e.printStackTrace();
	            }
	        }
	    }
	}

}

I hope that this blog post has been interesting. The new version 2.60 of SPMF will be released in the next few days.

—
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Is EasyChair still good?

Today, I will talk about EasyChair, one of the oldest conference management systems used in academia, especially in computer science (it was founded in 2002). Like many other researchers, I have been a user of EasyChair for over a decade, especially as an author and also as a program committee member for various conferences. I used to think that EasyChair was relatively easy to use. But in recent years, it seems that EasyChair has become more business-oriented, and it is basically no longer free to use since 2022 (unless an event receives no more than 20 submissions, which is quite small). So, is EasyChair still good?

To find out, I recently decided to give EasyChair a try for organizing a small academic event. The user interface was quite familiar; however, the design itself has not changed much over the last decade (as I recall). I saw that EasyChair had a 20-submission limit for the free license but thought that I would still use it, as I did not expect to receive many submissions. But eventually, I faced a problem: I received 21 submissions. So EasyChair sent me two warning e-mails saying that if I did not upgrade within 7 days, “reviewer access to your conference can be restricted or disabled“.

What does this warning mean? Well, unfortunately, I did not read that e-mail in time because I am quite busy and my EasyChair account is linked to a secondary e-mail address. So I soon found out what it means: EasyChair not only disabled the reviewer access, but it also locked me out of the conference chair account and asked me for money to regain access:

This happened after I had set the paper decisions and sent the notifications to authors, and just before authors would submit their camera-ready papers, that is, at the most critical moment.

At that moment, EasyChair gave me the option of either paying a relatively large amount or forever losing access to the data about the event, such as the list of papers, the authors’ e-mail addresses and the reviews.

Luckily, I had saved the author e-mails and the list of accepted papers in an Excel file; otherwise, I would have had to pay a relatively large amount of money to recover them. And since my academic event is non-profit, I do not have money to pay for this.

That situation was very inconvenient and, honestly, it is one of the reasons why I will not use EasyChair anymore.

If we put this bad experience aside, let’s talk about other aspects of EasyChair, such as the user interface. While it is relatively easy to use, I think that it looks outdated:

In terms of features provided to conference managers, I see that EasyChair has been enhanced with many additional features over the years, but for small events, I don’t think that many of these features are essential.

And in recent years, some excellent alternatives to EasyChair have emerged, such as Microsoft CMT, which is free. As far as I can see, the major conferences in computer science have moved to Microsoft CMT. The latter has, in my opinion, a better user interface, is fast, and is easier to use. There are also several other conference management systems that are free to use, such as the excellent OpenConf, whose community edition (which includes many essential features) can be downloaded and installed on your own server.

Given that EasyChair now has a very small 20-submission limit for the free license and that it locks out an account if the number of submissions exceeds 20, I will not use EasyChair anymore for organizing events. I think that there are many better alternatives.

What do you think? Do you still use EasyChair? Do you like it? Do you think there are better conference management systems? Leave a comment below to share your opinion.


Some interesting statistics about SPMF

While I am preparing the next version (2.60) of the SPMF data mining software in Java, here are some interesting statistics about the project, which I have generated directly from the metadata provided by SPMF:

The number of algorithms implemented per person (based on metadata)
Note: this is generated automatically from the metadata of each algorithm in SPMF (using the class DescriptionOfAlgorithm). Some author names are spelled in multiple ways and may contain some errors. The full list of contributors to SPMF is displayed on the SPMF website.

  • Philippe Fournier-Viger (206)
  • Yang Peng (12)
  • Antonio Gomariz Penalver (9)
  • Jayakrushna Sahoo (6)
  • Jerry Chun-Wei Lin (5)
  • Lu Yang (5)
  • Chen YangMing (5)
  • Wei Song et al. (5)
  • Yangming Chen (5)
  • Wei Song (4)
  • Ting Li (4)
  • Azadeh Soltani (4)
  • Peng Yang and Philippe Fournier-Viger (4)
  • Nader Aryabarzan (4)
  • Vincent M. Nofong modified from Philippe Fournier-Viger (3)
  • Cheng-Wei Wu et al. (3)
  • Zhihong Deng (3)
  • Prashant Barhate (3)
  • Chaomin Huang et al. (3)
  • Jiaxuan Li (3)
  • Zhitian Li (3)
  • Antonio Gomariz Penalver & Philippe Fournier-Viger (3)
  • Yimin Zhang (2)
  • Chaomin Huang (2)
  • Nouioua et al. (2)
  • Ting Li et al. (2)
  • Philippe Fournier-Viger and Yuechun Li (2)
  • Song et al. (2)
  • Fournier-Viger et al. (2)
  • Saqib Nawaz et al. (2)
  • Chao Cheng and Philippe Fournier-Viger (2)
  • Zevin Shaul et al. (2)
  • Alan Souza (2)
  • Rathore et al. (2)
  • Bay Vo et al. (2)
  • Junya Li (2)
  • Ryan Benton and Blake Johns (2)
  • Siddharth Dawar et al. (2)
  • Yanjun Yang (2)
  • Siddhart Dawar et al. (2)
  • Huang et al. (1)
  • M. (1)
  • C.W. Wu et al. (1)
  • Philippe Fournier-Viger and Cheng-Wei Wu (1)
  • Sacha Servan-Schreiber (1)
  • Dhaval Patel (1)
  • jnfrancis (1)
  • Cheng-Wei. et al. (1)
  • Ganghuan He and Philippe Fournier-Viger (1)
  • Siddharth Dawar (1)
  • Improvements by Nouioua et al. (1)
  • Philippe Fournier-Viger and Chao Cheng (1)
  • Yang Peng et al. (1)
  • Salvemini E (1)
  • Java conversion by Xiang Li and Philippe Fournier-Viger (1)
  • Alex Peng et al. (1)
  • Hoang Thanh Lam (1)
  • Souleymane Zida (1)
  • F. (1)
  • Shifeng Ren (1)
  • Lanotte (1)
  • github: limuhangk (1)
  • Youxi Wu et al. (1)
  • Hazem El-Raffiee (1)
  • Jiakai Nan (1)
  • Ahmed El-Serafy (1)
  • Souleymane Zida and Philippe Fournier-Viger (1)
  • Feremans et al. (1)
  • Han J. (1)
  • Shi-Feng Ren (1)
  • Fumarola F (1)
  • Vikram Goyal (1)
  • P. F. (1)
  • Petijean et al. (1)
  • Srinivas Paturu (1)
  • Malerba D (1)
  • & Malerba (1)
  • Ashish Sureka (1)
  • Fumarola (1)
  • Ying Wang and Peng Yang and Philippe Fournier-Viger (1)
  • Sabarish Raghu (1)
  • Wu et al. (1)
  • D. (1)
  • Srikumar Krishnamoorty (1)
  • Siddharth Dawar et al (1)
  • Ceci (1)
  • Wu (1)

The number of algorithms per category

  • HIGH-UTILITY PATTERN MINING (83)
  • FREQUENT ITEMSET MINING (54)
  • SEQUENTIAL PATTERN MINING (48)
  • TOOLS – DATA VIEWERS (22)
  • TIME SERIES MINING (16)
  • ASSOCIATION RULE MINING (16)
  • TOOLS – DATA TRANSFORMATION (15)
  • PERIODIC PATTERN MINING (13)
  • EPISODE MINING (10)
  • EPISODE RULE MINING (10)
  • CLUSTERING (10)
  • SEQUENTIAL RULE MINING (10)
  • GRAPH PATTERN MINING (6)
  • TOOLS – DATA GENERATORS (5)
  • TOOLS – STATS CALCULATORS (4)
  • TOOLS – SPMF GUI (4)
  • TOOLS – RUN EXPERIMENTS (1)
  • PRIVACY-PRESERVING DATA MINING (1)

The number of algorithms per type

  • DATA_MINING (259)
  • DATA_PROCESSOR (30)
  • DATA_VIEWER (25)
  • DATA_GENERATOR (5)
  • OTHER_GUI_TOOL (4)
  • DATA_STATS_CALCULATOR (4)
  • EXPERIMENT_TOOL (1)

The number of algorithms for each input data type

  • Transaction database (194)
  • Simple transaction database (80)
  • Transaction database with utility values (77)
  • Sequence database (73)
  • Simple sequence database (48)
  • Transaction database with timestamps (17)
  • Time series database (16)
  • Sequence database with timestamps (9)
  • Database of double vectors (8)
  • Labeled graph database (6)
  • Graph database (6)
  • Transaction database with utility values and time (5)
  • Multi-dimensional sequence database with timestamps (4)
  • Text file (4)
  • Multi-dimensional sequence database (4)
  • Time interval sequence database (3)
  • Sequence database with utility values (3)
  • Transaction database with utility values and taxonomy (3)
  • Transaction database with shelf-time periods and utility values (3)
  • Transaction database with utility values (HUQI) (3)
  • Sequence database with cost and binary utility (3)
  • Simple time interval sequence database (3)
  • Frequent closed itemsets (3)
  • Sequence database with cost and numeric utility (2)
  • Transaction database with utility values skymine format (2)
  • Transaction database with profit information (2)
  • Uncertain transaction database (2)
  • ARFF file (2)
  • Transaction database with utility and cost values (2)
  • Sequence database with strings (2)
  • Dynamic Attributed Graph (2)
  • Simple sequence database with strings (2)
  • Sequence database with utility and probability values (2)
  • Cost sequence database (2)
  • Sequential patterns (1)
  • Set of text documents (1)
  • Sequence database in non SPMF format (1)
  • Clusters (1)
  • Sequence (1)
  • Single sequence (1)
  • Transaction database with utility values (MEMU) (1)
  • Transaction database in non SPMF format (1)

The number of algorithms for each output data type

  • High-utility patterns (91)
  • High-utility itemsets (60)
  • Frequent patterns (56)
  • Sequential patterns (51)
  • Frequent itemsets (37)
  • Frequent sequential patterns (30)
  • Database of instances (22)
  • Association rules (16)
  • Episodes (15)
  • Time series database (14)
  • Periodic patterns (13)
  • Transaction database (12)
  • Periodic frequent patterns (12)
  • Sequential rules (11)
  • Episode rules (10)
  • Frequent closed itemsets (9)
  • Frequent closed sequential patterns (8)
  • Simple transaction database (8)
  • Sequence database (8)
  • Closed itemsets (8)
  • Top-k High-utility itemsets (7)
  • Closed high-utility itemsets (7)
  • Simple sequence database (6)
  • Frequent Sequential patterns (6)
  • Clusters (6)
  • Closed patterns (6)
  • Frequent episodes (6)
  • Frequent sequential rules (5)
  • Rare itemsets (5)
  • Rare patterns (5)
  • High average-utility itemsets (5)
  • Frequent episode rules (5)
  • Skyline patterns (4)
  • Subgraphs (4)
  • Generator patterns (4)
  • High-Utility episodes (4)
  • Generator itemsets (4)
  • Cost-efficient Sequential patterns (3)
  • Skyline High-utility itemsets (3)
  • Frequent sequential generators (3)
  • Frequent subgraphs (3)
  • Frequent itemsets with multiple thresholds (3)
  • Local Periodic frequent itemsets (3)
  • Correlated patterns (3)
  • Quantitative high utility itemsets (3)
  • Maximal patterns (2)
  • Cross-Level High-utility itemsets (2)
  • Multi-dimensional frequent closed sequential patterns (2)
  • Maximal itemsets (2)
  • High-utility probability sequential patterns (2)
  • Frequent maximal sequential patterns (2)
  • Density-based clusters (2)
  • Periodic frequent itemsets common to multiple sequences (2)
  • On-shelf high-utility itemsets (2)
  • Multi-dimensional frequent closed sequential patterns with timestamps (2)
  • Top-k frequent sequential rules (2)
  • Frequent maximal itemsets (2)
  • Frequent closed and generator itemsets (2)
  • Closed association rules (2)
  • Frequent time interval sequential patterns (2)
  • Perfectly rare itemsets (2)
  • Sequence Database with timestamps (2)
  • Top-k frequent sequential patterns (2)
  • Top-k High-Utility episodes (2)
  • Transaction database with utility values (2)
  • Closed and generator patterns (2)
  • Periodic high-utility itemsets (2)
  • Minimal rare itemsets (2)
  • Trend patterns (2)
  • Correlated High-utility itemsets (2)
  • Generator high-utility itemsets (2)
  • Association rules with lift and multiple support thresholds (2)
  • Rare correlated itemsets common to multiple sequences (1)
  • Productive Periodic frequent itemsets (1)
  • Peak high-utility itemsets (1)
  • Indirect association rules (1)
  • Top-k non-redundant association rules (1)
  • Top-k class association rules (1)
  • Database of double vectors (1)
  • Uncertain frequent itemsets (1)
  • Non-redundant Periodic frequent itemsets (1)
  • Transaction database with utility values and time (1)
  • Top-k Stable Periodic frequent itemsets (1)
  • Rare correlated itemsets (1)
  • Frequent sequential rules with strings (1)
  • Top-k frequent episodes (1)
  • Minimal itemsets (1)
  • High-utility association rules (1)
  • Uncertain patterns (1)
  • Ordered frequent sequential rules (1)
  • Top-k association rules (1)
  • High-utility itemsets with length constraints (1)
  • High-utility generator itemsets (1)
  • Multi-dimensional frequent sequential patterns with timestamps (1)
  • Local high-utility itemsets (1)
  • High-utility sequential rules (1)
  • Periodic frequent itemsets (1)
  • Density-based cluster ordering of points (1)
  • Frequent sequential patterns with occurrences (1)
  • Frequent sequential patterns with timestamps (1)
  • Frequent closed sequential patterns with timestamps (1)
  • Minimal high-utility itemsets (1)
  • Significant Trend Sequences (1)
  • Frequent fuzzy itemsets (1)
  • Periodic rare patterns (1)
  • Top-k Frequent subgraphs (1)
  • Minimal patterns (1)
  • Stable Periodic frequent itemsets (1)
  • Maximal high-utility itemsets (1)
  • Self-Sufficient Itemsets (1)
  • Text clusters (1)
  • Multi-dimensional frequent sequential patterns (1)
  • Top-k frequent non-redundant sequential rules (1)
  • Cost transaction database (1)
  • Hierarchical clusters (1)
  • Top-k frequent sequential patterns with leverage (1)
  • Compressing sequential patterns (1)
  • Minimal non-redundant association rules (1)
  • Sporadic association rules (1)
  • High-utility rules (1)
  • Locally trending high-utility itemsets (1)
  • Skyline Frequent High-utility itemsets (1)
  • High-utility sequential patterns (1)
  • Progressive Frequent Sequential patterns (1)
  • Erasable itemsets (1)
  • Attribute Evolution Rules (1)
  • Generators of high-utility itemsets (1)
  • Multiple Frequent fuzzy itemsets (1)
  • Correlated itemsets (1)
  • Multi-Level High-utility itemsets (1)
  • Association rules with lift (1)
  • Top-k sequential patterns with quantile-based cohesion (1)
  • Erasable patterns (1)
  • Frequent generator itemsets (1)
  • Frequent high-utility itemsets (1)

Conclusion

I hope that this is interesting. 🙂 If you have any comments, please leave them below.
