Vertical and horizontal databases in itemset mining

Itemset mining is a data mining task for discovering patterns that appear frequently in transaction databases. In this context, a pattern, also called a frequent itemset, is a set of values that frequently occur together in transactions (records) of a database. Frequent itemset mining has many applications in various fields, but the traditional application is for the analysis of customer transactions. Given a database of customer transactions (sets of purchased items), applying an itemset mining algorithm can reveal the frequent itemsets, that is the set of items that are frequently bought together by customers.

In itemset mining research, two terms are often used: horizontal database and vertical database. They refer to how the data is represented. In this blog post, my goal is to explain what these terms mean.

There are two main ways of representing data for itemset mining. One way is to use a horizontal data format, where each row is a transaction, described by a transaction ID and a set of items in that transaction. For example, this is a horizontal database:

TID   Items
1     A, B, C
2     B, C, D
3     A, C, D
4     A, B, D

Here, the first row has the Transaction ID (TID) 1 and indicates that a customer has purchased some items A, B, and C, which could for example represent Apple, Bread and Cake. The other rows follow the same format.

The other main way of representing data for itemset mining is to use a vertical data format, where each row corresponds to an item and gives the list of transaction IDs that contain that item. For example, this is a vertical database:

Item  TID_set
A     1, 3, 4
B     1, 2, 4
C     1, 2, 3
D     2, 3, 4

For example, the first row indicates that the item A appears in the first, third and fourth transactions (that is, the transactions with IDs 1, 3 and 4).

These two database formats are two different ways of representing the same data. In other words, it is possible to convert from one format to the other, that is to transform the first table into the second table, or transform the second table to obtain the first table.
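To make the conversion concrete, here is a minimal Python sketch. It is only an illustration that I wrote for this post (it is not code from SPMF), and the function names to_vertical and to_horizontal are my own choices. It assumes that the database is small enough to be stored in memory as dictionaries of sets.

```python
from collections import defaultdict

# The horizontal database from the example: transaction ID -> set of items.
horizontal = {
    1: {"A", "B", "C"},
    2: {"B", "C", "D"},
    3: {"A", "C", "D"},
    4: {"A", "B", "D"},
}


def to_vertical(horizontal_db):
    """Build item -> set of TIDs from TID -> set of items."""
    vertical_db = defaultdict(set)
    for tid, items in horizontal_db.items():
        for item in items:
            vertical_db[item].add(tid)
    return dict(vertical_db)


def to_horizontal(vertical_db):
    """Build TID -> set of items from item -> set of TIDs."""
    horizontal_db = defaultdict(set)
    for item, tids in vertical_db.items():
        for tid in tids:
            horizontal_db[tid].add(item)
    return dict(horizontal_db)


vertical = to_vertical(horizontal)
print(sorted(vertical["A"]))  # [1, 3, 4]
```

Converting the result back with to_horizontal(vertical) gives the original table again, which shows that no information is lost by changing the format.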

The choice of the data format can affect the performance and scalability of itemset mining algorithms. In general, the horizontal data format is more suitable for breadth-first search algorithms, such as the Apriori algorithm, which generates candidate itemsets level by level and scans the database multiple times to count their support. On the other hand, the vertical data format is more suitable for depth-first algorithms, such as the ECLAT algorithm, which generates candidate itemsets by intersecting the sets of transactions containing different itemsets.
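As a rough illustration of the depth-first, intersection-based idea, here is a small Python sketch in the spirit of ECLAT. It is a simplified version that I wrote for this post, not the original algorithm nor the SPMF implementation. The function name eclat, the min_support parameter, and the use of Python sets for the transaction ID lists are my own assumptions.

```python
def eclat(vertical_db, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    vertical_db maps each item to the set of TIDs that contain it.
    """
    frequent = {}

    def extend(prefix, candidates):
        # candidates: list of (item, TID set of prefix + item), all frequent.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[frozenset(itemset)] = len(tids)
            # Intersect with the TID sets of the items that come after.
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    next_candidates.append((other, shared))
            if next_candidates:
                # Depth-first: explore longer itemsets before moving on.
                extend(itemset, next_candidates)

    start = [(item, tids) for item, tids in sorted(vertical_db.items())
             if len(tids) >= min_support]
    extend((), start)
    return frequent


# The vertical database from the example above.
vertical = {"A": {1, 3, 4}, "B": {1, 2, 4}, "C": {1, 2, 3}, "D": {2, 3, 4}}
result = eclat(vertical, min_support=2)
print(result[frozenset({"A", "B"})])  # 2
```

With a minimum support of 2, this finds the four single items and the six pairs, while every set of three items appears in only one transaction and is thus discarded.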

Let me explain a little bit more about how the database format influences the design of itemset mining algorithms with an example. Let's say that we want to count the support (the number of occurrences) of the itemset {A, B} (the purchase of items A and B together).

To count the support of {A, B} in the first table (using the horizontal format), we need to scan all the transactions from the database and count each one that contains A and B together. After reading the four lines of the first table, we find that {A, B} appears two times (has a support of 2).
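This database scan can be sketched in a few lines of Python. Again, this is only an illustration under the assumption that the database fits in memory as a dictionary, and the function name support_horizontal is hypothetical.

```python
# The horizontal database from the example: transaction ID -> set of items.
horizontal = {
    1: {"A", "B", "C"},
    2: {"B", "C", "D"},
    3: {"A", "C", "D"},
    4: {"A", "B", "D"},
}


def support_horizontal(db, itemset):
    """Count the transactions that contain every item of the itemset."""
    wanted = set(itemset)
    # Every transaction is read once; "<=" tests whether wanted is a subset.
    return sum(1 for items in db.values() if wanted <= items)


print(support_horizontal(horizontal, {"A", "B"}))  # 2
```

Notice that the loop goes through all the transactions, even those that contain neither A nor B.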

Now if we want to count the support of {A, B} in the second table (using the vertical format instead), it is simpler. We only need to do the intersection of the row of A and the row of B. Let me explain this in detail. In the row of A, we have the list of transactions 1, 3, and 4. In the row of B, we have the list of transactions 1, 2, and 4. By doing the intersection of these two lists, we find that {A, B} appears in transactions 1 and 4, and thus that the support of {A, B} is 2. Thus, if we use the vertical database format, it can be more efficient for counting the support of {A, B} because we do not need to read the whole database but just to look at the rows of A and B. This is one reason why vertical itemset mining algorithms can perform quite well in some situations (but not always).
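The same count with the vertical format can be sketched as follows in Python. The function name tidset is my own, and the sketch assumes a non-empty itemset whose items all appear in the table.

```python
# The vertical database from the example: item -> set of transaction IDs.
vertical = {
    "A": {1, 3, 4},
    "B": {1, 2, 4},
    "C": {1, 2, 3},
    "D": {2, 3, 4},
}


def tidset(vertical_db, itemset):
    """Intersect the TID sets of all items of a non-empty itemset."""
    items = sorted(itemset)
    result = set(vertical_db[items[0]])  # copy, so the table is not modified
    for item in items[1:]:
        result &= vertical_db[item]
    return result


tids = tidset(vertical, {"A", "B"})
print(sorted(tids))  # [1, 4]
print(len(tids))     # 2, the support of {A, B}
```

Only the rows of A and B are read, and the rows of C and D are never touched.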

Besides the horizontal and vertical data formats, there are of course other ways of representing the data from a database. Another way is to use prefix trees, which is for example the internal data representation adopted by the FP-Growth algorithm.
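To give an idea of what a prefix tree looks like, here is a small Python sketch that stores the four example transactions in a tree with a counter on each node. It is a simplification made for illustration: the FP-tree of FP-Growth sorts the items of each transaction by descending frequency and keeps additional links between nodes, while this sketch simply uses the alphabetical order. The names Node and build_prefix_tree are hypothetical.

```python
class Node:
    """A prefix-tree node: a counter and the children indexed by item."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_prefix_tree(transactions):
    """Insert each transaction as a path of items, sharing common prefixes."""
    root = Node()
    for items in transactions:
        node = root
        for item in sorted(items):  # assumption: alphabetical item order
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


transactions = [
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"A", "C", "D"},
    {"A", "B", "D"},
]
root = build_prefix_tree(transactions)
print(root.children["A"].count)                # 3 transactions start with A
print(root.children["A"].children["B"].count)  # 2 transactions start with A, B
```

The transactions that start with the same items share the same nodes, which is how a prefix tree can compress a database.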

That is all for today. I just wanted to explain the difference between horizontal and vertical database formats. I hope that this was informative and helpful.


Philippe Fournier-Viger is a professor of computer science and founder of the SPMF data mining library.


CFP: PM4B 2025: A new workshop on pattern mining and machine learning in bioinformatics @ PAKDD 2025

Today, I would like to announce that we are creating a new workshop for PAKDD 2025 called the 1st Workshop on Pattern mining and Machine learning for Bioinformatics (PM4B 2025).

The goal of the workshop is to establish a collaborative platform for researchers and practitioners to share theoretical advancements and practical applications of pattern mining (PM) and machine learning (ML), with a focus on bioinformatics.

The dates are as follows:

  • Acceptance Notification: March 15, 2025
  • Camera Ready due: March 29, 2025
  • Workshop Date: June 10, 2025

For the workshop, there will be a best paper award, and the accepted papers will be invited for a special issue in the IEEE Journal of Biomedical and Health Informatics (SCI Q1, Impact factor: 6.7). You are welcome to submit your papers!

Website of the workshop: PM4B 2025

We chose PAKDD because it is a good international venue for data mining and machine learning researchers!


Merry X-mas and Happy New Year!

Today, it is just a short blog post to wish happy holidays, merry X-mas and Happy New Year to all users and developers of SPMF, from all around the world!

More surprises will come for 2025. I am currently ending the teaching semester.

After grading the final projects of students, I will start to work on finalizing the next version of SPMF!


About Academic conferences in China

Recently, I attended the launch of The Blue Book on the Development of International Academic Conferences in China 2024 (中国国际学术会议发展蓝皮书 2024) published by the AEIC (哎思科蓝) company. This report is very interesting as it gives many insights about academic conferences in China over the last few years (2019-2023).

Here are a few interesting pictures from this report. This is the number of conferences organized by administrative regions in China from 2019 to 2023:

This is the number of conferences organized by province-level regions in China from 2019 to 2023:

Among the top, we can see that Guangdong province had 1,150 conferences and Beijing had 1,154 conferences.

This is the number of conferences organized by countries from 2019 to 2023, where China has 10,137 conferences and the USA has 10,360 conferences:

And those are the main topics of the conferences in China (the English translation follows):

At the center, the main topic is artificial intelligence, which is not surprising! We also see Big data.

There is much more interesting information in the report, which has over 50 pages. I have only shared a few pictures from the report that I find interesting.


An ethical issue in the Elsevier “International Journal of Hydrogen Energy” ?

Today, I saw something quite incredible in the Elsevier journal “International Journal of Hydrogen Energy”, which you can see in the screenshot below:

The authors wrote:

As strongly requested by the reviewers, here we cite some references [[35][36][37][38][39][40][41][42][43][44][45][46][47]] although they are completely irrelevant to the present work.

which apparently means that the reviewers forced the authors to cite a dozen unrelated papers, presumably to boost their citation counts.

If we look more closely at these citations, it is not surprising that most of these papers appear to all have a few researchers in common:

  • M. El-Ghobashy, H. Hashim, M. Darwish, ... S.V. Trukhanov. Eco-friendly NiO/polydopamine nanocomposite for efficient removal of dyes from wastewater. Nanomaterials, 12 (7) (2022), 10.3390/nano12071103
  • Ahmed Maher Henaish, Moustafa A. Darwish, Osama M. Hemeda, Ilya A. Weinstein, Tarek S. Soliman, Alex V. Trukhanov, Sergei V. Trukhanov, Di Zhou, Ali M. Dorgham. Structure and optoelectronic properties of ferroelectric PVA-PZT nanocomposites. Opt Mater, 138 (2023), 10.1016/j.optmat.2022.113402
  • Marwa M. Hussein, Samia A. Saafan, H. F. Abosheiasha, Amira A. Kamal, … Sergei V. Trukhanov, Tatiana I. Zubar, Alex V. Trukhanov and Moustafa A. Darwish. Structural and dielectric characterization of synthesized nano-BSTO/PVDF composites for smart sensor applications. Mater Adv, 4 (22) (2023), pp. 5605-5617, 10.1039/D3MA00437F
  • Marwa M. Hussein, Samia A. Saafan, Hatem F. Abosheiasha, Di Zhou, Daria I. Tishkevich, Nikita V. Abmiotka, Ekaterina L. Trukhanova, Alex V. Trukhanov, Sergei V. Trukhanov, M. Khalid Hossain, Moustafa A. Darwish. Preparation, structural, magnetic, and AC electrical properties of synthesized CoFe2O4 nanoparticles and its PVDF composites. Mater Chem Phys, 317 (2024), 10.1016/j.matchemphys.2024.129041
  • Dmitry B. Migas, Vitaliy A. Turchenko, A. V. Rutkauskas, Sergey V. Trukhanov, Tatiana I. Zubar, Daria I. Tishkevich, Alex V. Trukhanov and Natalia V. Skorodumova. Temperature induced structural and polarization features in BaFe12O19. J Mater Chem C, 11 (36) (2023), pp. 12406-12414, 10.1039/D3TC01533E
  • D.I. Tishkevich, S.S. Grabchikov, S.B. Lastovskii, S.V. Trukhanov, D.S. Vasin, T.I. Zubar, A.L. Kozlovskiy, M.V. Zdorovets, V.A. Sivakov, T.R. Muradyan, A.V. Trukhanov. Function composites materials for shielding applications: correlation between phase separation and attenuation properties. J Alloys Compd, 771 (2019), pp. 238-245, 10.1016/j.jallcom.2018.08.209
  • S.V. Trukhanov, V.V. Fedotova, A.V. Trukhanov, et al. Cation ordering and magnetic properties of neodymium-barium manganites. Tech Phys, 53 (1) (2008), pp. 49-54, 10.1134/S106378420801009X
  • A.V. Trukhanov, D.I. Tishkevich, A.V. Timofeev, et al. Structural and electrodynamic characteristics of the spinel-based composite system. Ceram Int, 50 (12) (2024), pp. 21311-21317, 10.1016/j.ceramint.2024.03.241
  • A.V. Trukhanov, V.O. Turchenko, I.A. Bobrikov, et al. Crystal structure and magnetic properties of the BaFe12-xAlxO19 (x=0.1–1.2) solid solutions. J Magn Magn Mater, 393 (2015), pp. 253-259, 10.1016/j.jmmm.2015.05.076
  • S.V. Trukhanov, A.V. Trukhanov, A.N. Vasiliev, et al. Frustrated exchange interactions formation at low temperatures and high hydrostatic pressures in La0.70Sr0.30MnO2.85. J Exp Theor Phys, 111 (2) (2010), pp. 209-214, 10.1134/S106377611008008X
  • S.V. Trukhanov, T.I. Zubar, V.A. Turchenko, An.V. Trukhanov, T. Kmječ, J. Kohout, L. Matzui, O. Yakovenko, D.A. Vinnik, A.Yu. Starikov, V.E. Zhivulin, A.S.B. Sombra, D. Zhou, R.B. Jotania, C. Singh, A.V. Trukhanov. Exploration of crystal structure, magnetic and dielectric properties of titanium-barium hexaferrites. Mater Sci Eng, B, 272 (2021), 10.1016/j.mseb.2021.115345
  • Vinnik D.A., Starikov A.Y., Zhivulin V.E., Astapovich K.A., Turchenko V.A., Zubar T.I., Trukhanov S.V., Kohout J., Kmječ T., Yakovenko O., Matzui L., Sombra A.S.B., Zhou D., Jotania R.B., Singh C., Yang Y., Trukhanov A.V. Changes in the structure, magnetization, and resistivity of BaFe12–xTixO19. ACS Appl Electron Mater, 3 (4) (2021), pp. 1583-1593, 10.1021/acsaelm.0c01081
  • Vladimir E. Zhivulin, Evgeniy A. Trofimov, Olga V. Zaitseva, Daria P. Sherstyuk, Natalya A. Cherkasova, Sergey V. Taskaev, Denis A. Vinnik, Yulia A. Alekhina, Nikolay S. Perov, Kadiyala C.B. Naidu, Halima I. Elsaeedy, Mayeen U. Khandaker, Daria I. Tishkevich, Tatiana I. Zubar, Alex V. Trukhanov, Sergei V. Trukhanov. Preparation, phase stability, and magnetization behavior of high entropy hexaferrites. iScience, 26 (7) (2023), 10.1016/j.isci.2023.107077

I am not working in that field, so I cannot really judge whether these papers are unrelated to the current paper, and we cannot know for sure the identities of the reviewers. Thus, this would require further investigation from the journal to determine what happened in the review process of that paper, and whether the reviewers were really the ones who asked to cite these papers. We should therefore not jump to conclusions too quickly about this.

Having said that, this whole situation raises questions about the editorial process of the International Journal of Hydrogen Energy, and in particular: (1) why the handling editor did not notice that some reviewers apparently made unethical requests and (2) why the editor and reviewers did not notice, before publication, the sentence that the authors added to their manuscript to complain about the review process.

I can understand that some problems go unnoticed in the editorial process because there are probably many papers to handle. But now that the problem has been discovered, I guess that the journal should investigate this issue, if an investigation has not already been done or is not already in progress.

This blog post was just to share what I found about this story on social media. I am not involved in this paper, nor in this field or this journal. But I know that there are many unethical reviewers in academia who ask authors to cite their papers or papers from their friends. And it is still quite surprising to see this mentioned directly in a paper like this. Thus, I decided to share the story on this blog.


End of my term as associate editor for Array

I have been an associate editor for the Array journal of Elsevier for 4 years. I have been happy to do this work but it is now time for me to move on and focus on other things.

What is Array? It is a fairly new, multi-disciplinary, open-access journal that focuses on computer science. It can be viewed as a journal similar to IEEE Access but less well-known. It is currently indexed in ESCI (Emerging SCI index) in the second quartile but still has a low volume of published papers (about 50 per year according to statistics from LetPub).

I joined that journal in 2020 as one of the associate editors because I thought that it was a promising journal and I wanted to get more experience in this type of work. I did get some good experience with this journal but it did not grow as much as I would have expected. However, that is not the reason for me to leave this job. I am leaving simply because I am busier than before and I want to focus on other things that I think are more important.

At this stage of my career, I find that the work of associate editor is less interesting because, let’s be honest, it is extra work that I don’t need to do, it takes time, and it is unpaid. In general, a publisher can earn quite a lot of money from authors, especially for open-access journals, and typically only the editor-in-chief might receive a part of that money. That’s just how it works.

Having said that, I wish the best for the Array journal, and I am happy to have worked with them for a few years.


Computer Science Journals and Conferences with the most withdrawals in 2023

Today, we will look at Computer Science Journals and Conferences that have the largest number of retracted or withdrawn articles in 2023. A paper can be withdrawn for various reasons such as plagiarism, fake results, conflicts of interest, or a compromised review process.

Methodology

To check the number of withdrawn or retracted papers from computer science journals for a given year, I searched in the DBLP database using the following query (access = withdrawn, year = 2023):
https://dblp.org/search/publ?q=access%3Awithdrawn%3A%20year%3A2023%3A
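For those who want to repeat this search for another year, the URL above can be rebuilt from the query text with a few lines of Python. This is just a small sketch using the standard urllib module. The function name dblp_search_url is my own, and the query syntax is simply copied from the URL above, so I assume that it stays the same for other years.

```python
from urllib.parse import quote


def dblp_search_url(query):
    """Percent-encode a DBLP search query and append it to the search page URL."""
    # safe="" makes quote() also encode ":" (as %3A); spaces become %20.
    return "https://dblp.org/search/publ?q=" + quote(query, safe="")


url = dblp_search_url("access:withdrawn: year:2023:")
print(url)
# https://dblp.org/search/publ?q=access%3Awithdrawn%3A%20year%3A2023%3A
```

Replacing 2023 by another year in the query gives the corresponding search for that year.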

Thus, I only consider journals and conferences that are indexed in DBLP, which covers the best journals and conferences in computer science.

Results for 2023

For 2023, there are 911 withdrawn papers, including 494 papers from arXiv. The papers from arXiv do not count as formal publications and are thus ignored. For the remaining 417 papers, the table below shows the journals or conferences with at least 2 withdrawn papers.

Number  Journal Name                      Publication Count
1       Soft Comput.                      133
2       Multim. Tools Appl.               37
3       J. Comb. Optim.                   30
4       Int. J. Comput. Intell. Syst.     21
5       IET Softw.                        21
6       EURASIP J. Wirel. Commun. Netw.   17
7       J. Robotics                       16
8       Int. J. Syst. Assur. Eng. Manag.  10
9       Inf. Syst. E Bus. Manag.          10
10      IET Commun.                       9
11      Electron. Commer. Res.            9
12      J. Electronic Imaging             8
13      J. Electr. Comput. Eng.           7
14      Pers. Ubiquitous Comput.          6
15      IET Circuits Devices Syst.        5
16      Distributed Parallel Databases    4
17      J. Supercomput.                   4
18      EURASIP J. Inf. Secur.            3
19      IACR Cryptol. ePrint Arch.        3
20      Neural Comput. Appl.              3
21      Neural Process. Lett.             3
22      Intell. Serv. Robotics            3
23      CVPR                              3
24      J. Comput. Virol. Hacking Tech.   3
25      Clust. Comput.                    2
26      J. Intell. Fuzzy Syst.            2
27      Ann. Oper. Res.                   2
28      J. Inf. Sci.                      2
29      NeuroImage                        2
30      Appl. Intell.                     2
31      Remote. Sens.                     2
32      Stud Logica                       2
33      Int. J. Comput. Math.             2
34      Behav. Inf. Technol.              2
35      Eng. Appl. Artif. Intell.         2
36      Educ. Inf. Technol.               2
37      Wirel. Pers. Commun.              2

From this table, we can make a few interesting observations.

First, the largest number of withdrawn papers is by far in the Soft Computing journal of Springer with 133 papers. To try to understand what is going on with this journal, I clicked on the first 5 withdrawn papers from Soft Computing to see the reasons. The reasons were basically all the same:

“The publisher has retracted this article in agreement with the Editor-in-Chief. The article was submitted to be part of a guest-edited issue. An investigation by the publisher found a number of articles, including this one, with a number of concerns, including but not limited to compromised editorial handling and peer review process, inappropriate or irrelevant references or not being in scope of the journal or guest-edited issue. Based on the investigation’s findings the publisher no longer has confidence in the results and conclusions of this article.”

In other words, several withdrawn papers from that journal seem to have been in guest-edited special issues.

To go further in this analysis, we can compare with the previous years. I see that there were only 6 withdrawn papers in Soft Computing in 2022 and 22 in 2021, but 76 in 2020. Thus, there seems to be quite a large number in this journal with some big variations over the years, but in 2023, it has truly skyrocketed to 133.

By searching further, I can see that the Soft Computing journal is now “On hold” in terms of indexing in the SCI list of journals as of 2024/09/09. This means that this journal has been temporarily expelled from the list as a kind of punishment.

Now, let’s look at the second journal with the most withdrawn papers, which is Multimedia Tools and Applications with 37 papers, and is also published by Springer. The first five retracted papers were withdrawn for the following reasons:

  • “An investigation by the publisher found a number of concerns, including but not limited to citations which do not support claims made in the text, non-standard phrasing, and image irregularities.” (4 times)
  • “The article was submitted to be part of a guest-edited issue. An investigation by the publisher found a number of articles, including this one, with a number of concerns” (1 time)

So, in this case, the problem of papers in special issues occurs again, but also the problem of papers with irregularities.

I can see that Multimedia Tools and Applications has also been put “On hold” in terms of indexing in the SCI list of journals as of 2024/09/09.

Now after that, the third journal with the most withdrawn papers is Journal of Combinatorial Optimization, also published by Springer with 30 papers, then the International Journal of Computational Intelligence Systems of Springer with 21 papers.

Thus, the first four journals with the most retracted papers are from Springer. Then, the fifth journal is from Wiley.

Another interesting observation that we can make from this analysis is that CVPR is perhaps the only conference that shows up in this list with 3 withdrawn papers. This is quite surprising as CVPR is a top conference in the field of computer vision.

Thus, I checked what the reason is, and it is “IEEE removed this article due to lack of legal authorization for its publication”. I think that this means that the authors did not fill in the copyright form for this conference.

Conclusion

I hope that this blog post has been interesting.

For researchers, I recommend avoiding submitting to the journals that have been put “On hold” until the situation for these journals returns to normal. This is to ensure that your papers are indexed in SCI.

Please share your thoughts in the comment section below!


Reducing the cost of web hosting…

I have been hosting my websites for more than 15 years on a web hosting service called IONOS (formerly 1and1), which is not free. I do this because it gives me a lot of flexibility for making websites. However, today, I received an e-mail from IONOS telling me that they think I am using too much traffic (over 50 GB per month), and thus they are deactivating the CDN (Content Delivery Network) optimization that they offer for my websites. This won’t prevent my websites from being accessed but it may hurt the speed a little bit.

So why is there so much traffic on my websites? I do not have that many visitors. I think that the reason is that there are some large files that are hosted on my websites such as videos and datasets.

  • For videos, it might be a better idea to host them on an external website such as YouTube. I think that could reduce traffic by a good amount. I might change this. However, I like to have my videos on my own website so that they can be downloaded even from geographical locations where platforms like YouTube are not available.
  • For datasets, I will keep them on my websites as they are important, but I think that I will restrict the downloads to only the real visitors from my websites.

As for IONOS, I have to say that I am not happy with them. They charge a relatively high price, which they keep increasing, with not much improvement to their services. In fact, I think they have almost tripled their price since I started using their services. I have also had other problems with that company in the last few years, such as (1) four days of downtime for a SQL database, and (2) a database being reverted to its state of three years before, losing a good amount of data. Thus, I am thinking of moving to another web hosting provider next year. But it requires some time to set up my websites properly, so I will have to think about it.

That’s all I wanted to talk about for today.


SPMF 2.62 is released!

This is a short blog post to announce that SPMF 2.62 is released, and can be downloaded from the SPMF website's download page.

The previous version of SPMF (2.60) introduced a lot of new features, also with some code refactoring, and many modifications to the user interface (including some that are not easily visible). Thus, the version SPMF 2.62 has fixed some bugs and problems that had been noticed after the release of 2.60. Moreover, I have done some further fine-tuning of the graphical user interface to improve some tools that were introduced in 2.60.

Besides that, the main novelty of 2.62 is the inclusion of a dozen new tools to calculate statistics about different types of datasets. Here are the details about the changes in SPMF 2.62:

By the way, if you are wondering, what about version 2.61? Was it skipped? Actually, there was a version 2.61 between 2.60 and 2.62 but it was only released to a few people, as I wanted to wait a little bit more before releasing more features together. So this is why 2.61 appears to have been skipped!

If you have any comments, feel free to leave a comment below, or to e-mail me. An e-mail is a faster way of reaching out to me.


My research is open-source

Over the last decade, I have made the decision to publish most of the algorithms that I develop in my research, freely, as part of the open-source SPMF data mining software that I founded. The goal is that anyone can benefit from most of my research work, whether in academia or in industry. In particular, I am happy to see that many researchers have used the code and data that I provide in SPMF in their own research projects. As of today, over 1000 research papers from all over the world have applied SPMF in a very wide range of applications, from chemistry and music analysis to restaurant recommendation and e-learning. This success is thanks to the users of SPMF and also its several contributors who have provided code.

I believe that publishing research code and data as open-source can be greatly beneficial for young researchers. It can help to make the research more visible and increase the chance that it is used, cited, and thus that it has an impact!

This was just a short blog post for today! By the way, if you are developing Java data mining algorithms, feel free to contact me about integrating them in SPMF!


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.
