Today, I will talk about some features of SPMF 2.63, the upcoming version that I am working on and that is planned for release next month.
Improved Pattern Visualizer window
The first feature is an improvement of the Pattern Viewer tool so that it can display the richer SPMF format where items have names (strings) in patterns. Previously, a pattern file like this:
I am also working on adding other tools for visualization such as a Rule Viewer to visualize association rules and sequential rules like this:
The above screenshot is an early version. I will improve the appearance of this window and I still have to think about how to best integrate it in the software.
Today, I just wanted to show you some ideas of new features. If you have any ideas or comments, send me an e-mail or leave a comment below!
We are now in 2025, and like last year, I will analyze the list of computer science journals and conferences with the most withdrawn papers for the previous year (see this blog post for 2023).
For this analysis, I only consider journals and conferences that are indexed in DBLP, which covers the best journals and conferences in computer science. I also only consider venues with at least 2 withdrawn papers.
Results
| Rank | Journal Name | Withdrawn Paper Count |
|------|--------------|-----------------------|
| 1 ▲ | J. Intell. Fuzzy Syst. | 45 ▲ (last year: 2) |
| 2 – | Multim. Tools Appl. | 27 ▼ (last year: 37) |
| 3 ▲ | Ann. Oper. Res. | 23 ▲ (last year: 2) |
| 4 | Expert Syst. J. Knowl. Eng. | 20 ▲ (new) |
| 5 | Trans. Emerg. Telecommun. Technol. | 11 ▲ (new) |
| 6 | Int. J. Speech Technol. | 9 ▲ (new) |
| 7 ▲ | Pers. Ubiquitous Comput. | 9 ▲ (last year: 6) |
| 8 | Comput. Intell. | 9 ▲ (new) |
| 9 | Int. J. Pervasive Comput. Commun. | 7 ▲ (new) |
| 10 ▲ | EURASIP J. Inf. Secur. | 6 ▼ (last year: 5) |
| 11 | J. Ambient Intell. Humaniz. Comput. | 4 ▲ (new) |
| 12 ▲ | Neural Comput. Appl. | 3 – (last year: 3) |
| 13 | Biomed. Signal Process. Control. | 3 ▲ (new) |
| 14 | Int. J. Hum. Comput. Interact. | 3 ▲ (new) |
| 15 | Phys. Commun. | 2 ▲ (new) |
| 16 | Int. J. Commun. Syst. | 2 ▲ (new) |
| 17 ▲ | Wirel. Pers. Commun. | 2 – (last year: 2) |
By looking at this table and comparing it with 2023, some observations can be made.
First, the first position in the table has 45 withdrawn articles, which is far fewer than the 133 articles in Soft Computing from last year. In fact, Soft Computing is not in the table this year.
The top 3 positions in the table (J. Intell. Fuzzy Syst., Multim. Tools Appl., and Ann. Oper. Res.) were also in the table last year. Let’s analyse these three journals in more detail:
For the Multimedia Tools and Applications journal, the number of withdrawn papers decreased from 37 to 27. But according to LetPub, the journal was expelled from the SCIE index on October 22, 2024.
The number of withdrawn papers in the Journal of Intelligent & Fuzzy Systems increased from 2 to 45. According to LetPub, this journal is still on hold.
As for the Annals of Operations Research journal, the number of withdrawn papers increased from 2 to 23.
In the rest of the table, most of the entries are new. It is interesting that this year, there are no conferences in the ranking. Last year, it was quite surprising to see that the CVPR conference had three withdrawn papers.
Conclusion
This was a short blog post to give an update on this metric of withdrawn papers per journal and conference.
Hi all! Today, I am writing this blog post to announce that I have decided to close the data mining forum, which was hosted on this website at http://forum2.philippe-fournier-viger.com/ . The forum was a small website that was connected to my other websites. It was used for discussing data mining topics and was powered by a version of phpBB. I will now explain why I decided to close the forum and what happened.
First, let’s go back in time to a few weeks ago, in early March 2025. I was trying to access my main website, and I noticed that my website periodically became unavailable with this error: Error 500: Internal Server Error.
I would try to connect to any page of my website, and this error would sometimes occur and sometimes not. I first thought that there was a problem with the server hosting my website. I pay a web hosting company to host my websites. So, I logged into the administration panel of my websites and had a look. I did not find anything suspicious. Then, after a few hours, my website started to work again, so I thought that it was a temporary problem and that it was solved.
But no! Today, on April 7th 2025, the website went down again and was barely accessible for the whole morning. Then, I started to investigate again. I decided to download all the access logs from the server to see if I could get some idea about what was going on.
Here is what I found. First, I looked at the summary of the HTTP requests to my websites by months:
As you can see above, the number of requests was around 3 million per month in 2024, but in March 2025 it suddenly increased about tenfold to around 32 million requests per month, which is extremely suspicious.
Then, I looked at the data for the first days of April, and I found that the number of requests even peaked at 6 million per day, which is a ridiculously high number for my small website.
Then, I looked into the detailed log and found that more than 90 % of the requests were coming from Brazil and were made to access different pages from my forum. Here is a sample of some of those requests:
As can be seen in the screenshot above, dozens of requests were sent with the same timestamp from multiple IP addresses, mainly from Brazil.
I then did a reverse lookup of some of these IP addresses to find where they came from and found that they belong to some internet providers in Brazil.
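This kind of log analysis can be sketched in a few lines of Python. The sketch below assumes that the access logs are in the common Apache "combined" format, which is an assumption since the exact log format of the server is not stated here. The sample lines are hypothetical (the IP addresses come from ranges reserved for documentation), and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical sample lines in Apache "combined" log format.
# These are NOT taken from the real logs; the IPs are documentation-only addresses.
SAMPLE_LOG = """\
192.0.2.10 - - [07/Apr/2025:09:15:01 +0000] "GET /forum/viewtopic.php?t=12 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
198.51.100.7 - - [07/Apr/2025:09:15:01 +0000] "GET /forum/ucp.php?mode=login HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
192.0.2.10 - - [07/Apr/2025:09:15:02 +0000] "POST /forum/posting.php HTTP/1.1" 403 512 "-" "Mozilla/5.0"
203.0.113.5 - - [08/Apr/2025:10:00:00 +0000] "GET /index.php HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
"""

# Capture the client IP, the timestamp, and the requested path of each line.
LINE_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)')


def summarize(log_text):
    """Count requests per day, per client IP, and per top-level path segment."""
    per_day, per_ip, per_section = Counter(), Counter(), Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        per_day[timestamp.strftime("%Y-%m-%d")] += 1
        per_ip[match.group("ip")] += 1
        # "/forum/viewtopic.php?t=12" -> "forum", "/index.php" -> "index.php"
        per_section[match.group("path").split("/")[1]] += 1
    return per_day, per_ip, per_section


per_day, per_ip, per_section = summarize(SAMPLE_LOG)
print(per_day.most_common())
print(per_ip.most_common(3))
print(per_section.most_common(3))
```

Grouping by the first path segment is a simple way to see which part of a website (here, the forum) attracts most of the requests.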
It is not clear why this unusual traffic happened. But the most likely explanation is that some bots decided to try to spam my forum with advertisements and repeatedly tried to log in and post. In my forum, the bots were unable to post since I required manual approval for all new users. However, this did not discourage bots from accessing my webpages millions of times, to the point of causing all my websites to go down.
Facing this situation, I had to decide whether to try to block all requests from Brazil, or to improve the security of the website or of the forum itself. But since the requests were coming from many different IP addresses, blocking them is not simple. And I do not want to pay for an extra security service.
Thus, for this reason and because few people were using the forum in recent years, I have decided to just close it. As few people were using it, I think that it is not a big issue. In the future, I might prepare a more modern alternative to the forum, perhaps a Reddit group or a WhatsApp group. If you have suggestions, feel free to let me know below in the comment section.
And since I have closed the forum, the speed of this website and all my other websites has greatly increased!
So that’s the story about this! Hope this blog post has been interesting.
Update 1 (9th April – 1 day later) – traffic decrease: The number of HTTP requests has dropped sharply after closing the forum:
This confirms that the forum was a magnet for bots and spam, and it was a good decision to close it.
Update 2 (12th April) – robots traffic, and CDN
I have done further analysis of the traffic to my websites, and it is also interesting to see that much of the traffic comes from these bots:
And some bots do not bring any meaningful benefit to my websites. For instance, AhrefsBot and AwarioBot are primarily used for SEO monitoring and competitor analysis. Since I do not use these services, allowing their bots to crawl my website only consumes bandwidth without offering any benefits. Similarly, TurnitinBot indexes content for proprietary systems. Hence, to prevent these bots from crawling my website, I’ve added the following rewrite rule to my .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|AwarioBot|TurnitinBot|AhrefsBot|SemrushBot|DotBot) [NC]
RewriteRule .* - [F,L]
This rule ensures that these bots receive a 403 Forbidden response and are effectively blocked from accessing any part of the website. This should improve the website performance a little more.
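To get a feeling for which requests such a rule would reject, the matching logic can be imitated in Python. This is only a rough sketch: it reuses the same alternation as the RewriteCond pattern, uses `re.IGNORECASE` to mimic the `[NC]` flag, and uses an unanchored search as RewriteCond does. The function name `status_for` and the sample User-Agent strings are illustrative assumptions, not part of the server configuration.

```python
import re

# Same alternation as in the RewriteCond pattern; IGNORECASE mirrors the [NC] flag.
BLOCKED_BOTS = re.compile(
    r"(GPTBot|AwarioBot|TurnitinBot|AhrefsBot|SemrushBot|DotBot)",
    re.IGNORECASE,
)


def status_for(user_agent):
    """Return 403 if the User-Agent contains a blocked bot name, else 200."""
    # search() matches anywhere in the string, like an unanchored RewriteCond regex.
    return 403 if BLOCKED_BOTS.search(user_agent) else 200


# Illustrative User-Agent strings (assumed examples, not copied from the logs).
print(status_for("Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"))
print(status_for("mozilla/5.0 (compatible; semrushbot)"))
print(status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0"))
```

Note that this approach only filters bots that honestly announce themselves in the User-Agent header. A bot that pretends to be a regular browser would not be caught.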
Besides, today I also reactivated the CDN (Content Delivery Network) with Cloudflare for this website to boost the speed.
Genomic data is growing at an unprecedented rate, but storing and transmitting it efficiently remains a challenge. Several solutions have been proposed in the literature for compressing genome sequences in recent years, such as JARVIS3, GeCo3, and NUHT. However, several of these methods face challenges such as high computational complexity, extended runtimes, limited generalization, interpretability issues, overfitting, and sensitivity to hyperparameters. In particular, deep-learning-based methods can have very long runtimes and operate as black boxes.
Thus, in this blog post, I want to introduce a new algorithm called HMG (2025) that we just published in the Information Fusion journal to mitigate these limitations of previous work:
Nawaz, M. Z., Nawaz, M. S., Fournier-Viger, P., Nawaz, S., Lin, J. C.-W., Tseng, V.S. (2025). Efficient Genome Sequence Compression via the Fusion of MDL-Based Heuristics. Information Fusion. Volume 120, 103083 DOI: https://doi.org/10.1016/j.inffus.2025.103083
The key idea of HMG is to use a pattern-mining approach, where patterns are extracted based on the MDL (Minimum Description Length) principle. More precisely, HMG tries to find the set of patterns (k-mers) in genome sequences that most succinctly describes them. Then, HMG uses these patterns to compress the genome sequences. Due to the very large search space of possible k-mer sets, a heuristic approach is used. More precisely, HMG consists of two algorithms called HMG-GA and HMG-SA, which respectively employ a genetic algorithm (GA) and simulated annealing (SA) to rapidly find a near-optimal solution. Here is the flowchart of HMG (quite complex, but you may see the paper for details):
This novel approach for genome sequence compression has several advantages. In particular, it is very fast and achieves low bit-per-base compression.
In the paper, several experiments are presented on some benchmark datasets called D1, D2, D3 and D4 (see the paper for details). To give a glimpse of the results, the figure below from the paper shows the bits-per-base (BPB) and compression ratio (CR) results against several state-of-the-art genome sequence compressors:
In general, a lower BPB is better, and a higher CR is better. It can be seen in this figure that HMG-GA achieves very low BPB and comparable or high CR on all datasets.
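For readers who are not familiar with these two measures, here is a small Python sketch of how they are commonly computed. It assumes the usual definitions (BPB as compressed bits divided by the number of bases, and CR as original size divided by compressed size), which may differ in details from the exact definitions used in the paper. The toy sequence is hypothetical, and `zlib` is used only as a stand-in general-purpose compressor; it is not HMG.

```python
import zlib


def bits_per_base(compressed_size_bytes, num_bases):
    """Average number of bits used to store one base (lower is better)."""
    return 8 * compressed_size_bytes / num_bases


def compression_ratio(original_size_bytes, compressed_size_bytes):
    """Original size divided by compressed size (higher is better)."""
    return original_size_bytes / compressed_size_bytes


# Hypothetical toy sequence of 1,000 bases, stored as one byte per base.
sequence = "ACGT" * 250
raw = sequence.encode("ascii")

# zlib is only a stand-in compressor for this illustration.
compressed = zlib.compress(raw, 9)

bpb = bits_per_base(len(compressed), len(sequence))
cr = compression_ratio(len(raw), len(compressed))
print(f"BPB = {bpb:.3f}, CR = {cr:.1f}")
```

As a point of reference, storing each of the four bases A, C, G and T with a fixed 2-bit code gives a BPB of 2 and a CR of 4 with respect to one byte per base, so a genome compressor is only useful if it does better than that.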
But more importantly, HMG is very fast. Here is a figure comparing the compression and decompression times of various methods:
In this figure, the results are split into two sections: the compressors to the left of the blue vertical line are those that produced compression and decompression times for all four datasets. In contrast, the compressors to the right of the blue vertical line generated results for a subset of datasets (JARVIS3 and NUHT for DS1 only). It is found that HMG-GA(CC/SM) outperforms JARVIS2 and GeCo3 in both compression and decompression tasks. And among the methods that provide results for a subset of datasets, JARVIS3 emerges as the fastest, followed by NUHT.
Besides that, a very interesting advantage of the proposed HMG method is that it has multiple uses, unlike several other genome sequence compressors. Because the set of patterns discovered by HMG is interpretable, the patterns can also be used for the classification of genome sequences. In particular, in the HMG paper, we present different experiments showing how the patterns that are generated for compression can then be reused to classify genome sequences. Here is one table, for example, that shows the classification accuracy using the patterns for various datasets (more details in the paper):
If you are interested in categorical data clustering, I am glad to announce that a new and up-to-date survey paper named “Categorical data clustering: 25 years beyond K-modes” will appear on this topic in the Expert Systems with Applications journal.
I am glad to have participated as a co-author of this paper, which is the project of Prof. Tai Dinh, the main author. The survey paper provides an extensive coverage of categorical clustering, which includes for example algorithms such as k-modes and others. There is also a GitHub repository with code, which can be found in the paper.
The final version of the paper will be published soon by the journal. But you can already read the preprint version on arXiv at this link: https://arxiv.org/abs/2408.17244
Update: here is the reference to the published paper:
Tai Dinh, Wong Hauchi, Philippe Fournier-Viger, Daniil Lisik, Minh-Quyet Ha, Hieu-Chi Dam, Van-Nam Huynh: Categorical data clustering: 25 years beyond K-modes. Expert Syst. Appl. 272: 126608 (2025)
Itemset mining is a data mining task for discovering patterns that appear frequently in transaction databases. In this context, a pattern, also called a frequent itemset, is a set of values that frequently occur together in transactions (records) of a database. Frequent itemset mining has many applications in various fields, but the traditional application is the analysis of customer transactions. Given a database of customer transactions (sets of purchased items), applying an itemset mining algorithm can reveal the frequent itemsets, that is, the sets of items that are frequently bought together by customers.
In itemset mining research, two terms are often used: horizontal database and vertical database. They refer to how the data is represented. In this blog post, my goal is to explain what these terms mean.
There are two main ways of representing data for itemset mining. One way is to use a horizontal data format, where each row is a transaction, described by a transaction ID and a set of items in that transaction. For example, this is a horizontal database:
| TID | Items |
|-----|-------|
| 1 | A,B,C |
| 2 | B,C,D |
| 3 | A,C,D |
| 4 | A,B,D |
Here, the first row has the Transaction ID (TID) 1 and indicates that a customer has purchased some items A, B, and C, which could for example represent Apple, Bread and Cake. The other rows follow the same format.
The other main way of representing data for itemset mining is to use a vertical data format, where each row corresponds to an item and gives the list of transaction IDs that contain that item. For example, this is a vertical database:
| Item | TID_set |
|------|---------|
| A | 1,3,4 |
| B | 1,2,4 |
| C | 1,2,3 |
| D | 2,3,4 |
For example, the first row indicates that the item A appears in the first, third and fourth transactions (that is, the transactions with IDs 1, 3 and 4).
These two database formats are two different ways of representing the same data. In other words, it is possible to convert from one format to the other, that is to transform the first table into the second table, or transform the second table to obtain the first table.
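To make this conversion concrete, here is a minimal Python sketch that transforms the example horizontal database into the vertical one and back. The function names `to_vertical` and `to_horizontal` and the use of dictionaries of sets are choices made for this illustration; real implementations may use more compact structures such as bitsets.

```python
from collections import defaultdict

# The example horizontal database: transaction ID -> set of items.
horizontal = {
    1: {"A", "B", "C"},
    2: {"B", "C", "D"},
    3: {"A", "C", "D"},
    4: {"A", "B", "D"},
}


def to_vertical(horizontal_db):
    """Build item -> set of transaction IDs from TID -> set of items."""
    vertical = defaultdict(set)
    for tid, items in horizontal_db.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def to_horizontal(vertical_db):
    """Build TID -> set of items from item -> set of transaction IDs."""
    horizontal_db = defaultdict(set)
    for item, tids in vertical_db.items():
        for tid in tids:
            horizontal_db[tid].add(item)
    return dict(horizontal_db)


vertical = to_vertical(horizontal)
for item in sorted(vertical):
    print(item, sorted(vertical[item]))
```

Converting in one direction and then in the other gives back the original database, which shows that no information is lost.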
The choice of the data format can affect the performance and scalability of itemset mining algorithms. In general, the horizontal data format is more suitable for breadth-first search algorithms, such as the Apriori algorithm, which generates candidate itemsets level by level and scans the database multiple times to count their support. On the other hand, the vertical data format is more suitable for depth-first algorithms, such as the ECLAT algorithm, which generates candidate itemsets by intersecting the sets of transactions containing different itemsets.
Let me explain a little bit more about how the database format influences the design of itemset mining algorithms with an example. Let’s say that we want to count the support (the number of occurrences) of the itemset {A, B} (the purchase of items A and B together).
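To illustrate the depth-first idea, here is a simplified Python sketch in the spirit of ECLAT, applied to the example vertical database. It is only a teaching sketch under my own assumptions: it is not the implementation found in SPMF, it omits the optimizations of real implementations, and the function name `eclat` and the minimum support of 2 are chosen for illustration.

```python
def eclat(vertical_db, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    frequent = {}

    def extend(prefix, candidates):
        # candidates: list of (item, tidset) pairs, where each tidset is already
        # the set of transactions containing prefix together with that item.
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_support:
                continue  # infrequent, so no superset can be frequent either
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Depth-first step: intersect with the tidsets of the remaining items.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in candidates[i + 1:]]
            extend(itemset, suffix)

    # Process items in a fixed order so that each itemset is generated once.
    extend(frozenset(), sorted(vertical_db.items(), key=lambda pair: pair[0]))
    return frequent


# The example vertical database: item -> set of transaction IDs.
vertical = {"A": {1, 3, 4}, "B": {1, 2, 4}, "C": {1, 2, 3}, "D": {2, 3, 4}}

result = eclat(vertical, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

On this small database, the four single items and the six pairs of items are frequent for a minimum support of 2, while every itemset of three items appears in only one transaction and is thus discarded.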
To count the support of {A, B} in the first table (using the horizontal format), we need to scan all the transactions from the database and count each one that contains A and B together. After reading the four lines of the first table, we find that {A, B} appears two times (has a support of 2).
Now if we want to count the support of {A, B} in the second table (using the vertical format instead), it is simpler. We only need to do the intersection of the row of A and the row of B. Let me explain this in detail. In the row of A, we have the list of transactions 1, 3, and 4. In the row of B, we have the list of transactions 1, 2, and 4. By doing the intersection of these two lists, we find that {A, B} appears in transactions 1 and 4, and thus that the support of {A, B} is 2. Thus, if we use the vertical database format, it can be more efficient for counting the support of {A, B} because we do not need to read the whole database but just to look at the rows of A and B. This is one reason why vertical itemset mining algorithms can perform quite well in some situations (but not always).
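The two ways of counting the support can be written as a short Python sketch on the example database. The function names `support_horizontal` and `support_vertical` are made up for this illustration.

```python
# The example database in both formats.
horizontal = {
    1: {"A", "B", "C"},
    2: {"B", "C", "D"},
    3: {"A", "C", "D"},
    4: {"A", "B", "D"},
}
vertical = {"A": {1, 3, 4}, "B": {1, 2, 4}, "C": {1, 2, 3}, "D": {2, 3, 4}}


def support_horizontal(db, itemset):
    """Scan every transaction and count those that contain the whole itemset."""
    return sum(1 for items in db.values() if itemset <= items)


def support_vertical(db, itemset):
    """Intersect the TID sets of the items; only the rows of these items are read."""
    tidsets = [db[item] for item in itemset]
    return len(set.intersection(*tidsets))


itemset = {"A", "B"}
print(support_horizontal(horizontal, itemset))
print(support_vertical(vertical, itemset))
print(sorted(vertical["A"] & vertical["B"]))
```

Both functions return the same support, but the first one reads the four transactions while the second one only reads the two rows of A and B.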
Besides the horizontal and vertical data formats, there are of course other ways of representing the data from a database. Another way is to use prefix trees, which is for example the internal data representation adopted by the FP-Growth algorithm.
That is all for today. I just wanted to explain the difference between horizontal and vertical database formats. Hope that this was informative and helpful.
— Philippe Fournier-Viger is a professor of computer science and founder of the SPMF data mining library.
Today, I would like to announce that we are creating a new workshop for PAKDD 2025 called the 1st Workshop on Pattern mining and Machine learning for Bioinformatics (PM4B 2025).
The goal of the workshop is to establish a collaborative platform for researchers and practitioners to share theoretical advancements and practical applications of pattern mining (PM) and machine learning (ML), with a focus on bioinformatics.
The dates are as follows:
Paper submission due: February 22, 2025
Acceptance Notification: March 15, 2025
Camera Ready due: March 29, 2025
Workshop Date: June 10, 2025
For the workshop, there will be a best paper award, and the accepted papers will be invited for a special issue in the IEEE Journal of Biomedical and Health Informatics (SCI Q1, Impact factor: 6.7). You are welcome to submit your papers!
Today, it is just a short blog post to wish happy holidays, merry X-mas and Happy New Year to all users and developers of SPMF, from all around the world!
More surprises will come for 2025. I am currently ending the teaching semester.
After grading the final projects of students, I will start to work on finalizing the next version of SPMF!
Recently, I attended the launch of The Blue Book on the Development of International Academic Conferences in China 2024 (中国国际学术会议发展蓝皮书 2024) published by the AEIC (哎思科蓝) company. This report is very interesting as it gives many insights about academic conferences in China over the last few years (2019-2023).
Here are a few interesting pictures from this report. This is the number of conferences organized by administrative regions in China from 2019 to 2023:
This is the number of conferences organized by province-level regions in China from 2019 to 2023:
Among the top, we can see that Guangdong province had 1,150 conferences and Beijing had 1,154 conferences.
This is the number of conferences organized by countries from 2019 to 2023, where China has 10,137 conferences and the USA has 10,360 conferences:
And those are the main topics of the conferences in China (the English translation follows):
At the center, the main topic is artificial intelligence, which is not surprising! We also see Big data.
There is much more interesting information in the report. It has over 50 pages. I have just shared a few pictures from the report that I think are interesting.
Today, I saw something quite incredible in the Elsevier journal “International Journal of Hydrogen Energy”, which you can see in the screenshot below:
The authors wrote:
As strongly requested by the reviewers, here we cite some references [[35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47]] although they are completely irrelevant to the present work.
which apparently means that the reviewers forced the authors to cite a dozen unrelated papers, presumably to boost their citation counts.
If we look more closely at these citations, it is not surprising that most of these papers appear to have a few researchers in common:
M. El-Ghobashy, H. Hashim, M. Darwish, ... S.V. Trukhanov. Eco-friendly NiO/polydopamine nanocomposite for efficient removal of dyes from wastewater. Nanomaterials, 12 (7) (2022), 10.3390/nano12071103
Ahmed Maher Henaish, Moustafa A. Darwish, Osama M. Hemeda, Ilya A. Weinstein, Tarek S. Soliman, Alex V. Trukhanov, Sergei V. Trukhanov, Di Zhou, Ali M. Dorgham. Structure and optoelectronic properties of ferroelectric PVA-PZT nanocomposites. Opt Mater, 138 (2023), 10.1016/j.optmat.2022.113402
Marwa M. Hussein, Samia A. Saafan, H. F. Abosheiasha, Amira A. Kamal, … Sergei V. Trukhanov, Tatiana I. Zubar, Alex V. Trukhanov and Moustafa A. Darwish. Structural and dielectric characterization of synthesized nano-BSTO/PVDF composites for smart sensor applications. Mater Adv, 4 (22) (2023), pp. 5605-5617, 10.1039/D3MA00437F
Marwa M. Hussein, Samia A. Saafan, Hatem F. Abosheiasha, Di Zhou, Daria I. Tishkevich, Nikita V. Abmiotka, Ekaterina L. Trukhanova, Alex V. Trukhanov, Sergei V. Trukhanov, M. Khalid Hossain, Moustafa A. Darwish. Preparation, structural, magnetic, and AC electrical properties of synthesized CoFe2O4 nanoparticles and its PVDF composites. Mater Chem Phys, 317 (2024), 10.1016/j.matchemphys.2024.129041
Dmitry B. Migas, Vitaliy A. Turchenko, A. V. Rutkauskas, Sergey V. Trukhanov, Tatiana I. Zubar, Daria I. Tishkevich, Alex V. Trukhanov and Natalia V. Skorodumova. Temperature induced structural and polarization features in BaFe12O19. J Mater Chem C, 11 (36) (2023), pp. 12406-12414, 10.1039/D3TC01533E
D.I. Tishkevich, S.S. Grabchikov, S.B. Lastovskii, S.V. Trukhanov, D.S. Vasin, T.I. Zubar, A.L. Kozlovskiy, M.V. Zdorovets, V.A. Sivakov, T.R. Muradyan, A.V. Trukhanov. Function composites materials for shielding applications: correlation between phase separation and attenuation properties. J Alloys Compd, 771 (2019), pp. 238-245, 10.1016/j.jallcom.2018.08.209
S.V. Trukhanov, V.V. Fedotova, A.V. Trukhanov, et al. Cation ordering and magnetic properties of neodymium-barium manganites. Tech Phys, 53 (1) (2008), pp. 49-54, 10.1134/S106378420801009X
A.V. Trukhanov, D.I. Tishkevich, A.V. Timofeev, et al. Structural and electrodynamic characteristics of the spinel-based composite system. Ceram Int, 50 (12) (2024), pp. 21311-21317, 10.1016/j.ceramint.2024.03.241
A.V. Trukhanov, V.O. Turchenko, I.A. Bobrikov, et al. Crystal structure and magnetic properties of the BaFe12-xAlxO19 (x=0.1–1.2) solid solutions. J Magn Magn Mater, 393 (2015), pp. 253-259, 10.1016/j.jmmm.2015.05.076
S.V. Trukhanov, A.V. Trukhanov, A.N. Vasiliev, et al. Frustrated exchange interactions formation at low temperatures and high hydrostatic pressures in La0.70Sr0.30MnO2.85. J Exp Theor Phys, 111 (2) (2010), pp. 209-214, 10.1134/S106377611008008X
S.V. Trukhanov, T.I. Zubar, V.A. Turchenko, An.V. Trukhanov, T. Kmječ, J. Kohout, L. Matzui, O. Yakovenko, D.A. Vinnik, A.Yu. Starikov, V.E. Zhivulin, A.S.B. Sombra, D. Zhou, R.B. Jotania, C. Singh, A.V. Trukhanov. Exploration of crystal structure, magnetic and dielectric properties of titanium-barium hexaferrites. Mater Sci Eng, B, 272 (2021), 10.1016/j.mseb.2021.115345
Vinnik D.A., Starikov A.Y., Zhivulin V.E., Astapovich K.A., Turchenko V.A., Zubar T.I., Trukhanov S.V., Kohout J., Kmječ T., Yakovenko O., Matzui L., Sombra A.S.B., Zhou D., Jotania R.B., Singh C., Yang Y., Trukhanov A.V. Changes in the structure, magnetization, and resistivity of BaFe12–xTixO19. ACS Appl Electron Mater, 3 (4) (2021), pp. 1583-1593, 10.1021/acsaelm.0c01081
Vladimir E. Zhivulin, Evgeniy A. Trofimov, Olga V. Zaitseva, Daria P. Sherstyuk, Natalya A. Cherkasova, Sergey V. Taskaev, Denis A. Vinnik, Yulia A. Alekhina, Nikolay S. Perov, Kadiyala C.B. Naidu, Halima I. Elsaeedy, Mayeen U. Khandaker, Daria I. Tishkevich, Tatiana I. Zubar, Alex V. Trukhanov, Sergei V. Trukhanov. Preparation, phase stability, and magnetization behavior of high entropy hexaferrites. iScience, 26 (7) (2023), 10.1016/j.isci.2023.107077
I am not working in that field, so I cannot really judge whether these papers are unrelated to the paper in question, and we cannot know for sure the identities of the reviewers. Thus, this would require further investigation from the journal to determine what happened in the review process of that paper, and whether the reviewers are really the ones who asked to cite these papers. We should therefore not jump to conclusions too quickly about this.
Having said that, this whole situation raises questions about the editorial process of the International Journal of Hydrogen Energy, and in particular: (1) why the handling editor did not notice that some reviewers apparently made unethical requests and (2) why the editor and reviewers did not notice, before the paper was published, the sentence that the authors added to their manuscript to complain about the review process?
I can understand that some problems go unnoticed in the editorial process because there are probably many papers to handle. But now that the problem has been discovered, I guess that the journal should investigate this issue, if this has not been done yet and no investigation is already in progress.
Update: The paper has been retracted and Elsevier has published the following statement at https://www.sciencedirect.com/science/article/pii/S0360319924043957#:~:text=In%20the%20present%20work%2C%20the%20origin%20of%20the,but%20the%20tetrahedral%20interstice%20in%20Zr%20and%20Hf.:
This blog post was just to share what I found about this story on social media. I am not involved in this paper, nor in this field or this journal. But I know that there are many unethical reviewers in academia who ask authors to cite their papers or papers from their friends. And it is still quite surprising to see this mentioned directly in a paper like this. Thus, I decided to share the story on this blog.