How to propose a special issue for a journal?

Today, I will talk about how to propose a special issue for a journal.

What is a special issue?

A special issue is a collection of articles on a specific topic or theme that is published in a journal. Special issues can provide an opportunity to present the latest research, highlight emerging trends, or address gaps in the literature. If you have an idea for a special issue, you may want to apply to be a guest editor and organize it.

How to organize a special issue?

You can follow these steps:

1. Select a journal that is relevant to your topic and has a good reputation, or that is good according to other criteria that matter to you (e.g., a high impact factor or a high citation rate). You should choose a journal that is within the scope of your field and reaches a large audience. You can consult the websites of major publishers to search for journals in your field and compare their metrics and scope.

2. Consult the journal’s website for the guidelines and requirements for special issue proposals. Some journals may have a specific format, template, or online submission system for proposals. Others may ask you to email your proposal to the editor-in-chief or a designated contact person.

3. Prepare a proposal that outlines the rationale, objectives, scope, and expected outcomes of your special issue. You should include a tentative title, a list of potential topics and keywords, a timeline for submission and publication, and a brief introduction of yourself and any co-editors. You should also demonstrate the significance, relevance, and timeliness of your special issue, and explain how it will benefit the readers and the field. You may also need to provide some sample papers or abstracts to show the expected quality of your special issue.

4. Submit your proposal to the journal and wait for their response. The journal editors will review your proposal and decide whether to accept, reject, or request revisions. They may also ask you to provide more information, such as the names and affiliations of potential authors and reviewers, or a detailed budget and funding sources.

5. If your proposal is accepted, you will be responsible for managing the editorial process of your special issue. This includes inviting authors, soliciting submissions, coordinating peer review, editing and proofreading manuscripts, and ensuring that the special issue meets the journal’s standards and deadlines. You will also need to communicate with the journal editors and staff, and follow their instructions and policies.

Applying for a special issue can be a rewarding and challenging experience. It can help you showcase your expertise, expand your network, and contribute to the advancement of your field. However, it also requires a good amount of time, effort, and commitment. Therefore, you should carefully consider your motivation, resources, and expectations before applying for a special issue.

Conclusion

That is all for today. It was just a short blog post about how to propose a special issue for a journal.

==
Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.


ASCII Art for SPMF

Today, just for fun, I generated some ASCII arts for the SPMF data mining library using this tool: http://textkool.com/en/ascii-art-generator?hl=default&vl=default&font=Red%20Phoenix&text=Your%20text%20here%20

I think the result is not bad, but I am not sure if I would use it:

   _____ _____  __  __ ______ 
  / ____|  __ \|  \/  |  ____|
 | (___ | |__) | \  / | |__   
  \___ \|  ___/| |\/| |  __|  
  ____) | |    | |  | | |     
 |_____/|_|    |_|  |_|_|     

 ___    ___           ___   
(  _`\ (  _`\ /'\_/`\(  _`\ 
| (_(_)| |_) )|     || (_(_)
`\__ \ | ,__/'| (_) ||  _)  
( )_) || |    | | | || |    
`\____)(_)    (_) (_)(_)    

  ____  ____  __  __ _____ 
 / ___||  _ \|  \/  |  ___|
 \___ \| |_) | |\/| | |__   
  ___) |  __/| |  | |  __|  
 |____/|_|   |_|  |_|_|   

   ______..______  .___  ___.  ______ 
  /      ||   _  \ |   \/   | |   ___|
  |  (---`|  |_)  ||  \  /  | |  |__   
   \   \  |   ___/ |  |\/|  | |   __|  
.---)   | |  |     |  |  |  | |  |     
|______/  | _|     |__|  |__| |__|     
                                           

Which of the above do you prefer? Let me know in the comment section below!
As for me, I think I prefer the third one.


Efficiency problems with using java.util.HashMap

Today, I will talk about some efficiency problems when using the java.util.HashMap data structure provided by Java. This is an important topic for programmers who aim to implement efficient software that relies on maps.

Brief review about HashMap

First, let’s review why HashMap is important and what is special about this class. Hash maps are important because they allow us to store and retrieve data efficiently and conveniently. More precisely, they store data in the form of key-value pairs, where each key is unique. The main feature of a HashMap is to allow efficient retrieval of a value in almost constant time O(1) if we know the key. Key-value pairs are stored in a hash table using a hash function. The most important operations of a HashMap are:
  • put(key, value): add a key-value pair to the map,
  • get(key): retrieve the value associated with a key,
  • remove(key): remove a key-value pair from the map using its key.
Moreover, there are some other operations, such as:
  • containsKey(key): check if a key is in the map.
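To make these operations concrete, here is a minimal usage sketch (note that in java.util.HashMap, the membership test is named containsKey()):

```java
import java.util.HashMap;

public class HashMapBasics {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("apple", 3);                     // add a key-value pair
        map.put("banana", 5);
        int n = map.get("apple");                // retrieve a value by its key
        boolean has = map.containsKey("banana"); // check if a key is present
        map.remove("apple");                     // remove a key-value pair by its key
        System.out.println(n + " " + has + " " + map.containsKey("apple")); // 3 true false
    }
}
```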

Efficiency problems

Now, I will talk about some efficiency problems when using the java.util.HashMap implementation provided by Java.

(1) Handling of primitive types. First, many people will use a HashMap to store values of primitive types such as int, long, double, float, or short. For example, one could write:

int x = 0;
int y = 5;
HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
map.put(x,y);

This will create a new key-value pair (0,5) in a HashMap for storing integers. This works, but what is the problem with it? It is that Java has to convert values of type int to Integer, and we may have to convert back from Integer to int after retrieving a value from the map. This process, called autoboxing and unboxing in Java, is done automatically, but it results in the unnecessary creation of objects and thus higher memory consumption.

Solution: A solution is to use an alternative HashMap implementation that can directly store primitive types. Such implementations exist online, and this will also be the case with some new data structures proposed in the next version of the SPMF software. In some cases, I have observed that using a custom map for primitive types can reduce memory use by up to 50%.
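To illustrate the idea (this is a simplified sketch, not the actual SPMF implementation), here is a minimal open-addressing map that stores int keys and values directly in arrays, with no boxing:

```java
/** A minimal open-addressing int-to-int map that avoids autoboxing.
 *  Illustrative sketch only (no resizing or deletion); not SPMF's actual code. */
public class IntIntMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;

    public IntIntMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    // Find the slot for a key, probing linearly on collision
    private int indexOf(int key) {
        int i = (key & 0x7fffffff) % keys.length;
        while (used[i] && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        return i;
    }

    public void put(int key, int value) {
        int i = indexOf(key);
        keys[i] = key;
        values[i] = value;
        used[i] = true;
    }

    /** Returns the value for the key, or defaultValue if the key is absent. */
    public int get(int key, int defaultValue) {
        int i = indexOf(key);
        return used[i] ? values[i] : defaultValue;
    }
}
```

Because the keys and values are plain int arrays, no Integer objects are created when inserting or retrieving values.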

(2) Rehashing. A lot of interesting things can be learnt by looking at the source code of java.util.HashMap. One thing that we can learn is that it is implemented as an array of buckets, where collisions are handled using a linked list for each bucket. When the number of elements stored in the map exceeds a threshold determined by the load factor, the map is automatically resized to twice its size. Doubling the size of a HashMap reduces the number of collisions and thus maintains fast access to the key-value pairs in the map. However, this process requires rehashing, that is, recomputing the hash code of each key and reinserting all elements into the map. Luckily, this is done quite efficiently, as the objects representing the entries of the HashMap are reused when it is resized rather than creating new Entry objects. Still, resizing a map has a certain cost.

Solution: To reduce the cost of rehashing when using java.util.HashMap, it is recommended to initialize the HashMap to a proper size using a constructor such as:

    /**
     * Constructs an empty HashMap with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     */
    public HashMap(int initialCapacity, float loadFactor);

For example, if we expect a HashMap to contain 10,000 elements, we could initially set the capacity to a value greater than 10,000 and choose a load factor accordingly, so as to reduce the cost of resizing.
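For instance, assuming the default load factor of 0.75, a HashMap expected to hold 10,000 entries could be presized like this, so that no resizing occurs while it is being filled:

```java
import java.util.HashMap;

public class PresizeExample {
    public static void main(String[] args) {
        int expected = 10000;
        float loadFactor = 0.75f; // the default load factor of java.util.HashMap
        // Choose a capacity large enough that expected/capacity stays below
        // the load factor, so no resizing (and rehashing) occurs while filling:
        int capacity = (int) Math.ceil(expected / loadFactor);
        HashMap<Integer, Integer> map = new HashMap<>(capacity, loadFactor);
        for (int i = 0; i < expected; i++) {
            map.put(i, i * 2);
        }
        System.out.println(map.size()); // 10000
    }
}
```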

(3) Handling collisions. As said above, HashMap handles collisions using a linked list for each bucket. When there are too many collisions in a bucket, HashMap can also switch to a tree structure for that bucket to reduce the time complexity of accessing entries. But generally, we should aim at reducing collisions.

Solution: To reduce collisions, one way is to choose an appropriate hash function that will spread keys across different buckets. To do this, we can define a custom hashCode() method. For example, suppose that we want to use objects of a class called Book as keys in a map. Then, we could define an appropriate hashCode() method in the class Book, and the HashMap will use it to determine where key-value pairs are stored. A good hash function will reduce collisions.
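As a sketch of this idea (the class Book here is a made-up example, with illustrative fields), one could write:

```java
import java.util.HashMap;
import java.util.Objects;

class Book {
    final String title;
    final int year;

    Book(String title, int year) {
        this.title = title;
        this.year = year;
    }

    // equals() and hashCode() must be consistent:
    // two equal books must produce the same hash code.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Book)) return false;
        Book b = (Book) o;
        return year == b.year && title.equals(b.title);
    }

    @Override
    public int hashCode() {
        // Combines both fields to spread distinct books across buckets
        return Objects.hash(title, year);
    }
}

public class BookMapExample {
    public static void main(String[] args) {
        HashMap<Book, Double> prices = new HashMap<>();
        prices.put(new Book("Data Mining", 2014), 59.99);
        // A logically equal key hashes to the same bucket and finds the entry:
        System.out.println(prices.get(new Book("Data Mining", 2014))); // 59.99
    }
}
```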

(4) Inefficiency in how the HashMap is used. A big problem with HashMap is often not the structure itself but how the programmer uses it to store and access key-value pairs. Let me show you a first example of that problem. Here is some code to store key-value pairs of type (Integer, Integer) in a HashMap:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
// Put some value 6 in the map for the key 5
map.put(5,6);
...
// Retrieve the value for key 5 and increase it by 1.
Integer value = map.get(5);
if(value == null){
   map.put(5, 1);
}else{
   map.put(5, value+1);
}

This code first puts the key-value pair (5,6) in the map. Later, it retrieves the value for key 5, increases it by 1, and puts it back in the map as (5,7).

What is wrong with this code? Well, when map.get() is called, the hash code of 5 is computed and the key is searched for in the HashMap. Then, when map.put() is called, the hash code of 5 is computed again and the key is searched for again. Doing this work twice is unnecessary.

Solution: A better solution is to have a method that directly gets the value in the map and increases it by the required amount, so as to avoid searching twice. In early versions of Java (< 1.8), I did not see an efficient way of doing this with the HashMap class provided by java.util. In recent Java versions, this can be achieved using methods such as computeIfPresent().
But this can also be achieved with a custom HashMap class. This is what is done, for example, in the HashMap class that will be included in the next version of the SPMF software, which has a method to do this directly:

void getAndIncreaseValueBy(int key, int valueToAdd);  // Not in HashMap.
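With the standard java.util.HashMap on Java 8 and later, a similar single-pass update can also be written with merge(), which inserts the given value if the key is absent and otherwise combines it with the existing value:

```java
import java.util.HashMap;

public class MergeExample {
    public static void main(String[] args) {
        HashMap<Integer, Integer> map = new HashMap<>();
        map.put(5, 6);
        // One call instead of get() followed by put():
        // if key 5 is absent, store 1; otherwise add 1 to the current value.
        map.merge(5, 1, Integer::sum);
        System.out.println(map.get(5)); // 7
    }
}
```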

Here is a second example of inefficient programming with HashMap that I often see in the code of programmers who are not familiar with Java. The problem is calling the containsKey(), get(), and put() methods several times. For example:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
if(!map.containsKey(5)){
    map.put(5,6);
}

What is the problem here? Well, by calling containsKey() and then put(), the program searches twice for the key 5 in the HashMap.

Solution: In this case, the solution is simple. We can just rewrite like this:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
 map.put(5,6);

This code behaves almost the same. If no pair with key 5 exists in the map, (5,6) will be created; if one does exist, its value will be replaced by 6. Strictly speaking, there is one difference: the original code kept an existing value untouched, while put() overwrites it. How do I know this? Because I have read the source code of HashMap to understand how it works 😉
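If preserving an existing value matters, Java 8 and later offer putIfAbsent(), which still performs only a single lookup:

```java
import java.util.HashMap;

public class PutIfAbsentExample {
    public static void main(String[] args) {
        HashMap<Integer, Integer> map = new HashMap<>();
        map.put(5, 10);
        // Single lookup: inserts (5,6) only if key 5 is not already present.
        map.putIfAbsent(5, 6);
        System.out.println(map.get(5)); // 10: the existing value was kept
    }
}
```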

A third example of inefficient programming with HashMap is when someone needs to iterate over the keys and values of a HashMap. For example, let’s say that we have a map called map. Then someone could, for example, call map.entrySet().iterator() to obtain an iterator over all the entries of the map, map.keySet().iterator() to obtain an iterator over all the keys, or map.values().iterator() to obtain an iterator over all the values. This can be OK if we are careful about how we use it. An example of inefficient code is this:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
for(Integer key:  map.keySet()){
    Integer value = map.get(key);
     ...
    System.out.println(" key-value: " + key + " - " + value);
}

What is the problem in this code? It iterates over the keys and then calls the get() method to obtain the corresponding value of each key. This is not necessary.

Solution: To avoid calling get(), we should instead loop over map.entrySet():

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
for(Map.Entry<Integer,Integer> entry : map.entrySet()){
    System.out.println(" key-value: " + entry.getKey() + " - " + entry.getValue());
}

Using entrySet(), we immediately obtain the corresponding value of each key without calling get().

Conclusion

java.util.HashMap is a very useful structure in Java, but it should be used carefully. For this, I recommend reading the source code of HashMap to see how it is implemented (or at least reading the Javadoc) rather than just using its methods without thinking. By reading the source code or documentation, we can understand how it works and write more efficient code.

Besides, the HashMap class is powerful, but it also has some limitations. One limitation that I highlighted above is the inability to handle primitive values without autoboxing (e.g., transforming int to Integer, or short to Short). To solve this problem, the solution is to use custom HashMap classes that handle primitive types.

Another issue with HashMap is that it provides many features, but sometimes not all of them are needed. Thus, a lightweight custom HashMap implementation can actually be more worthwhile when the goal is efficiency. For example, in the next version of the SPMF software, which will be released soon, some well-optimized custom HashMap classes for primitive types will be provided to increase performance. I have observed that in some cases, this can reduce memory usage by over 50% for data mining algorithms that make intensive use of maps. But there are also drawbacks to using custom HashMap classes instead of the default one provided with Java. One of them is that it may make the code a little harder to understand.


SPMF’s architecture (5) The Graphical User Interface

This is the fifth post of a series of blog posts about the architecture of the SPMF data mining library. Today, I will specifically talk about the graphical user interface of SPMF, its main components and how it interacts with other components from SPMF.

An overview of the architecture of SPMF

Before I start talking about the graphical user interface of SPMF, let’s look at the overall architecture of SPMF:

SPMF is implemented in the Java language. It can be used in three ways: (1) using a simple graphical user interface, (2) using its command line interface, or (3) by integrating it into other software directly as a Java library or through some wrappers. When running SPMF as a standalone program, the Main class is the entry point of the software, which then calls the graphical interface or the command line interface. The user can use these interfaces to select an algorithm to be run with some parameters. Running an algorithm is then done by calling a module called the Command Processor, which obtains information about the available algorithms from another module called the Algorithm Manager.

Now, for today’s topic, I will talk in more detail about the graphical interface.

The Graphical Interface

The main class of the graphical interface in SPMF is called MainWindow and is located in the package ca.pfv.spmf.gui. The class MainWindow contains the code for the main window of SPMF, which looks like this:

This window is implemented using Java Swing, a library for building graphical user interfaces in Java.

In the package ca.pfv.spmf.gui, there are also several other classes that are used for other aspects of the user interface, such as for displaying visualizations of datasets and results from data mining algorithms. I will talk more about this later.

What are the relationships between the Graphical Interface and other components of SPMF?

I would like now to explain how the graphical interface interacts with other components of SPMF. Here is a small illustration:

As said previously, when running SPMF, the Main class starts the graphical interface by creating an instance of the MainWindow class. The MainWindow, and the graphical user interface of SPMF in general, reads and writes user preferences through another module called the PreferenceManager. The types of user preferences stored by SPMF are, for example, the last directory that the user has used for reading an input file, and whether the user prefers to open output files with the text editor of SPMF or the system’s text editor. The PreferenceManager stores these preferences permanently, using for example the Windows Registry if SPMF is run on the Windows operating system.

Besides that, the graphical user interface also interacts with the Algorithm Manager module of SPMF to obtain the list of available algorithms and information about them, such as their required parameters. This allows the graphical interface to display the list of algorithms to the user:

Now, after the user selects an algorithm to run, running the algorithm is done by calling the Command Processor module. That module verifies that the parameters provided by the user are appropriate for the selected algorithm, obtains information about how to run the algorithm from the Algorithm Manager, and then runs the algorithm. Besides, the Command Processor can also automatically convert some special file types before calling an algorithm. For example, it can automatically transform ARFF and TEXT files to the SPMF format, so that algorithms from SPMF can also be run on ARFF files and text documents with a .text extension.

What are the main classes of the Graphical Interface?

It would take a long time to describe all the classes that make up the graphical interface of SPMF. But mainly, all the classes are located in the package ca.pfv.spmf.gui, and there is at least one class for each window used by SPMF. Here is an overview:

In the middle, you can see the main classes for the different windows from the graphical interface:

  • AlgorithmExplorer: This window lets the user browse the different algorithms offered in SPMF and get detailed information about them.
  • ClusterViewer: This window is used to display clusters found by clustering algorithms such as K-Means.
  • GraphViewer: This window is used to display frequent subgraphs found by graph mining algorithms, or to display graph datasets.
  • InstanceViewer: This window is used to display the content of input data files for clustering. These files contain vectors of numbers.
  • PatternsVisualizer: This window displays the patterns found by a data mining algorithm applied by the user, using a table view. It offers various functions for searching, sorting, and filtering patterns.
  • TimeSeriesViewer: This window is used for displaying time series data visually.
  • SPMFTextEditor: This is a simple text editor provided with SPMF. It can be used to display the output files produced by the algorithms.
  • ExperimenterWindow: This window is used to run performance experiments to test algorithms. For example, it can automatically compare the performance of multiple algorithms on a dataset while varying a parameter.
  • AboutWindow: This window presents general information about SPMF.

Several windows, such as the InstanceViewer, ClusterViewer, and TimeSeriesViewer, display plots. This is done using the Plot class from the ca.pfv.spmf.gui.plot package.

How was the graphical user interface implemented?

A large part of the user interface of SPMF was coded by hand in Java. But there also exist some tools that can help build a user interface more quickly when developing Java programs. For programming SPMF, I mostly used the Eclipse IDE (Integrated Development Environment), which has a plugin called the WindowBuilder Editor that allows creating user interfaces visually. This can save quite some time when programming user interfaces. For example, here is a screenshot of that plugin for designing the TimeSeriesViewer of SPMF:

Using the WindowBuilder Editor, I can directly place different components, such as buttons, text fields, and combo boxes, using the mouse. After that, I can connect these components to events such as mouse clicks, and write Java code for handling the events.

Conclusion

Today, I have given an overview of how the user interface of SPMF is designed, its main components, and how it interacts with other components of the SPMF software. I hope it has been interesting!


SPMF 3.0: Towards even more efficiency

In this blog post, I will talk about the future of SPMF, as I do once in a while. Today, I will discuss the performance of SPMF.

Performance

A main focus of SPMF is, and has always been, efficiency in terms of memory and time. This is something that differentiates it from many other data mining software packages, which do not focus much on efficiency. For example, a few years ago, I did a performance comparison of some popular algorithms implemented in SPMF with those implemented in two other Java-based software packages, Weka and Coron, and the difference in runtime for the same algorithms was huge (the comparison can be found on that page). For example, here is the difference between the FP-Growth algorithm implemented by SPMF and by Weka on some benchmark datasets called “Chess” and “Mushrooms”:

As we can see, SPMF’s version of FP-Growth can be more than 100 times faster than Weka’s. This shows that, at that time at least, the Weka version was poorly optimized. For Coron, we found the gap to be smaller, but there was sometimes quite a big difference as well. For example, here is a performance comparison for the Eclat algorithm:

Some other people have also done performance comparisons between SPMF and other data mining software. For example, I found a paper, “dbscan: Fast Density-based Clustering with R”, where the performance of SPMF’s versions of the DBScan and Optics clustering algorithms was compared with many other libraries such as scikit, weka, elki, and pycluster. Here are the results from that paper:

As can be seen in these figures, SPMF is very fast. It is generally the second fastest and generally very close to the first. This is very good considering that clustering is not the main focus of SPMF, and that SPMF is implemented in Java while the fastest library is implemented in C++. Thus, despite being implemented in Java, SPMF can pretty much match the speed of the C++ version. In these results, we can also see that SPMF is faster than Weka (again), and that SPMF is generally faster than ELKI, although the latter is also implemented in Java and specialized in clustering.

Why is SPMF efficient?

For SPMF, from the beginning, I have paid much attention to optimizing the code as much as possible in all aspects. This has been done by making careful decisions when implementing algorithms and by using appropriate techniques. Generally, most algorithms are well optimized, especially the most popular ones. Another major reason is that SPMF is very lightweight, as it does not depend on external libraries. Besides, the structure of SPMF has always been kept simple to ensure high performance. For example, it would have been possible to use a complex software architecture with a lot of classes, interfaces, and many libraries providing functions that are not needed, but that would have impacted performance.

For the future? More efficiency!

Now, let me talk about the future of SPMF. Part of it will be yet more efficiency. Currently, I am working on a major improvement of SPMF that will further increase performance. How? One way is that I have recently developed a new set of highly optimized data structures for SPMF to replace the default ArrayList, HashMap, and HashSet classes offered by Java. Moreover, some additional specialized data structures will be added, such as cache structures and efficient bitsets, with custom iterators, comparators, etc., for higher speed and lower memory usage.
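To give a flavor of the kind of specialized structure this involves (this is an illustrative sketch, not the actual SPMF code), here is a minimal fixed-capacity bitset backed by an array of long values:

```java
/** Minimal fixed-capacity bitset backed by a long array.
 *  Illustrative sketch only; not the actual SPMF implementation. */
public class SimpleBitSet {
    private final long[] words;

    public SimpleBitSet(int capacity) {
        words = new long[(capacity + 63) / 64]; // 64 bits per long
    }

    public void set(int i)    { words[i >> 6] |=  (1L << (i & 63)); }
    public void clear(int i)  { words[i >> 6] &= ~(1L << (i & 63)); }
    public boolean get(int i) { return (words[i >> 6] & (1L << (i & 63))) != 0; }

    /** Number of bits set to 1. */
    public int cardinality() {
        int count = 0;
        for (long w : words) count += Long.bitCount(w);
        return count;
    }
}
```

Such a structure stores set membership using one bit per element instead of one object per element, which is one reason why specialized structures can save so much memory.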

I will not share the details today, as I am still polishing and testing the code, and I do not want to release unstable code. Moreover, I need to replace the data structures in existing algorithms of SPMF with the new ones, which may take several weeks as there are hundreds of algorithms. But this should be released in the next major version of SPMF, which should be called SPMF 3.0.

And there will be a considerable performance gain for some algorithms, especially in terms of memory usage. For example, just by replacing the Java HashMap and ArrayList data structures with the new optimized versions from SPMF, I quickly made a modified version of the ULB-Miner algorithm yesterday, and its memory usage was reduced by 73%. For example, here is the result on the Chess dataset for minutil = 500000:

ULB-MINER BEFORE, CHESS, minutil = 500000
Total time ~ 11471 ms
Memory ~ 99.1240234375 MB
High-utility itemsets count : 24979

ULB-MINER AFTER, CHESS, minutil = 500000
Total time ~ 21449 ms
Memory ~ 27.316810607910156 MB (73 % less memory usage)
High-utility itemsets count : 24979

Here we can see about a 73% reduction in memory usage, which is very significant. And I actually did this in about 30 minutes; I could improve it further and do more optimizations. I think that this type of optimization can make a significant performance difference for several algorithms.

If you are a researcher interested in testing the new optimized data structures of SPMF before they are released (maybe in 1 or 2 months), you may leave me a message and I could share them with you, if you are willing to provide some feedback to help me improve them before the release. From a research perspective, I think this is also quite interesting, as it can help boost the performance of a new algorithm.

That is all for today. It was just a quick update about SPMF.


SPMF’s architecture (4) The MemoryLogger

Today, I will continue the series of blog posts explaining the architecture of the SPMF data mining library, and I will talk in particular about a module in SPMF called the MemoryLogger. This module is responsible for recording the memory usage of the algorithms that are run, for statistics purposes.

The MemoryLogger module is located in the package ca.pfv.spmf.tools of SPMF. Here is a picture of the four functions offered by the MemoryLogger:

The SPMF Memory Logger

Descriptions of the functions offered by the MemoryLogger

The function getInstance() must be used to obtain access to the only instance of the MemoryLogger.

Then, after obtaining an instance of the MemoryLogger, we can call the function reset() to reset the recorded memory usage to zero.

Then, if we call the method checkMemory(), the MemoryLogger will check the current memory usage of the Java Virtual Machine. If it is greater than the previously recorded memory usage, it will keep the new value, so as to record the maximum memory usage observed so far.

Finally, we can call the method getMaxMemory() to obtain the maximum memory usage recorded until now by the memory logger, in Megabytes.

Example of how to use the MemoryLogger

Now let me show you some example code of how to use the Memory Logger tool to compare the memory usage before and after creating some very big integer arrays:

import ca.pfv.spmf.tools.MemoryLogger;

public class MemoryLoggerTest {

	public static void main(String[] args) {
		// Reset the recorded memory usage
		MemoryLogger.getInstance().reset();
		
		// Check the memory usage
		MemoryLogger.getInstance().checkMemory();
		
		// Print the maximum memory usage until now.
		System.out.println("Max memory : " + MemoryLogger.getInstance().getMaxMemory());
		
		int[][] array = new int[99999][9999];
		
		// Check the memory usage
		MemoryLogger.getInstance().checkMemory();
		
		// Print the maximum memory usage until now.
		System.out.println("Max memory : " + MemoryLogger.getInstance().getMaxMemory());
	}

}

The result of running this program is:

maximum memory usage example

This indicates that the memory usage was about 1.8 MB before creating the integer array and about 3848 MB after that.

How is the MemoryLogger used in SPMF?

The MemoryLogger is used by almost all data mining algorithms in SPMF to measure their performance. For example, when running an algorithm such as RuleGrowth, the algorithm uses the MemoryLogger to record its memory usage, and finally displays its maximum memory usage when it terminates, as shown below:

rulegrowth output

Source code of the MemoryLogger

If you are curious, here is the code of the MemoryLogger. It is very simple:
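As the listing is not reproduced here, the following is a minimal sketch of how such a singleton memory logger can be implemented, consistent with the description above (an illustration, not necessarily SPMF's exact code):

```java
/** Minimal sketch of a singleton memory logger, consistent with the
 *  description above; not necessarily SPMF's exact code. */
public class MemoryLogger {
    private static final MemoryLogger instance = new MemoryLogger();
    private double maxMemory = 0;

    // Singleton: the only instance is obtained through getInstance()
    public static MemoryLogger getInstance() { return instance; }

    /** Maximum memory usage recorded so far, in megabytes. */
    public double getMaxMemory() { return maxMemory; }

    /** Reset the recorded maximum memory usage to zero. */
    public void reset() { maxMemory = 0; }

    /** Check the current JVM memory usage and keep it if it is a new maximum. */
    public double checkMemory() {
        Runtime rt = Runtime.getRuntime();
        double currentMB = (rt.totalMemory() - rt.freeMemory()) / 1024d / 1024d;
        if (currentMB > maxMemory) {
            maxMemory = currentMB;
        }
        return currentMB;
    }
}
```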

Conclusion

In this short blog post, I have described a useful tool offered in SPMF called the MemoryLogger. In upcoming blog posts, I will continue explaining other key components of the SPMF software.


Unethical services in academia

Today, some people contacted me on LinkedIn to ask whether I have any papers for which I would sell authorship. I was quite amazed that someone would ask me that… But I have heard that this is something that happens nowadays in academia. Of course, I would never sell or buy authorship. This is unethical and very bad for academia.

Here is a screenshot:

I will not disclose the name of the person. But there are some people like that on LinkedIn, trying to earn money in that way…

This was just a short blog post to talk about this briefly.


SPMF’s architecture (3) The Preference Manager

This is the third of a series of blog posts about the architecture of the SPMF data mining library. Today, I will explain the role of another internal module of SPMF, called the Preference Manager. This module stores the preferences of the user, so that every time a user utilizes SPMF, his or her preferences are saved for the next time.

The Architecture of SPMF

Before going into details, let’s have a look at the overall architecture of SPMF, depicted in the picture below.

SPMF is basically a Java program that provides a graphical user interface as well as a command line user interface for the user. These two interfaces are used to call algorithms offered by SPMF, which are of three types: (1) data preprocessing algorithms and tools, (2) data mining algorithms, and (3) visualizations. There is also a module called the Algorithm Manager, which takes care of managing all algorithms offered in SPMF, and another module called the Command Processor which is called by the graphical or command line interface to run the algorithms. These modules have been discussed in the previous posts about the architecture of SPMF (links above).

Today, I will talk about the Preference Manager. It is a module that is designed to save information about the user preferences.

What kind of user preferences are saved by SPMF?

Here is a list of some user preferences that are saved by SPMF:

  • The last directory that has been used for reading an input file.
  • The last directory that has been used to save an output file.
  • The last directory that has been used to save results from an experiment.
  • Whether the user prefers to run algorithms in a separate process from the graphical user interface.
  • Whether the user prefers to open output files using the SPMF text editor or the system’s text editor.
  • The font size and other preferences set by the user for the SPMF text editor.
  • etc.

Where are all these preferences saved?

The Preference Manager saves the user preferences on the local computer through the java.util.prefs package. If you are using Windows, this means that the preferences of SPMF are saved in the Windows Registry.
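To illustrate, here is a small self-contained example of saving and reading a value with java.util.prefs. The node path "ca/pfv/spmf/demo" and the sample directory are hypothetical; on Windows, such values end up in the Windows Registry.

```java
import java.util.prefs.Preferences;

// A small self-contained example of the java.util.prefs mechanism used by
// the Preference Manager. The node path below is hypothetical.
class PrefsDemo {
    public static void main(String[] args) {
        // obtain a preference node for the application
        Preferences prefs = Preferences.userRoot().node("ca/pfv/spmf/demo");

        // save a preference (e.g., the last input file directory)
        prefs.put("ca.pfv.spmf.gui.input", "C:\\datasets");

        // read it back; the second argument is the default value that is
        // returned if the key does not exist
        String path = prefs.get("ca.pfv.spmf.gui.input", null);
        System.out.println("Saved input path: " + path);
    }
}
```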

How does the Preference Manager work?

The Preference Manager has several methods. The main one is called getInstance(). It is a static method that gives access to the single instance of the Preference Manager, which can then be used to call its other methods.

The Preference Manager has several methods for reading and saving various user preferences. These methods have names that start with get and set, respectively. For example, to read and write the user preference for the input file directory, the Preference Manager provides two methods called getInputFilePath() and setInputFilePath(). There are similar methods for the other preferences.
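Putting these pieces together, a simplified sketch of the Preference Manager could look like the following. Only the method names getInstance(), getInputFilePath() and setInputFilePath() and the registry key "ca.pfv.spmf.gui.input" come from this post; the rest is a reconstruction and the real class in SPMF may differ.

```java
import java.util.prefs.Preferences;

// A simplified sketch of the Preference Manager: a singleton exposing
// get/set methods for each user preference (a reconstruction, not the
// real SPMF source).
class PreferenceManager {
    // the registry key used for the last input file directory
    public static final String KEY_INPUT_FILE_PATH = "ca.pfv.spmf.gui.input";

    // the single instance of this class (singleton pattern)
    private static final PreferenceManager instance = new PreferenceManager();

    private PreferenceManager() {
    }

    // static method giving access to the unique instance
    public static PreferenceManager getInstance() {
        return instance;
    }

    // read the last input file directory (null if it was never saved)
    public String getInputFilePath() {
        return Preferences.userRoot().get(KEY_INPUT_FILE_PATH, null);
    }

    // save the last input file directory
    public void setInputFilePath(String filePath) {
        Preferences.userRoot().put(KEY_INPUT_FILE_PATH, filePath);
    }
}
```

For example, the graphical interface could call PreferenceManager.getInstance().setInputFilePath(path) after the user selects an input file, so that the same directory is proposed next time.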

Here is thus a visual representation of the class Preference Manager with its methods:

If the method getInputFilePath() is called on a Windows computer, the Preference Manager will read the value in the Windows Registry for the key “ca.pfv.spmf.gui.input”. If this key does not exist, the value null is returned.

For example, if I check in the Registry Editor of Windows (regedit) on my computer, I can see that all the preferences of SPMF are stored in the registry:

The exact location in the Registry may vary on different computers.

If you look at the code of the Preference Manager, you will see the definitions for all these registry keys.

And if you want to see the code of the methods that read and write the user preference for the input file in the Preference Manager, here it is:

As you can see, the Preference Manager class is very simple! It is simply designed to read and write various user preferences. Nevertheless, the Preference Manager plays an important role in the software, which is why I wanted to talk about it today.

Conclusion

Today, I have explained the role of the Preference Manager, a simple but important module in SPMF. Hope that this has been interesting. Next time, I will explain more about the architecture of SPMF.

==
Philippe Fournier-Viger is a distinguished professor and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.

Posted in Data Mining, Data science, Java, open-source, Pattern Mining, spmf

A Glossary of Pattern Mining

Pattern mining is a popular research area in data mining, which aims at discovering interesting and useful patterns in data. It is a field of research that has been active for over 25 years, and there are many technical terms related to it. Thus, in this blog post, I will provide a short glossary of key terms found in pattern mining papers.

  1. Antecedent: The left side of an association rule.
  2. Apriori Algorithm: Apriori is a frequent itemset mining algorithm used to identify frequent itemsets in a dataset. It is the first algorithm for that task.
  3. Association Rule: A rule that expresses the dependence between two itemsets.
  4. Association Rule Mining: A technique for discovering associations and relationships between items in a dataset.
  5. Closed Episode: A frequent episode that has no super-episode with the same support.
  6. Closed Frequent Itemsets: Frequent itemsets that have no supersets with the same support.
  7. Closed Sequential Patterns: Frequent sequential patterns that have no supersequences with the same support.
  8. Consequent: The right side of an association rule.
  9. Eclat Algorithm: A frequent itemset mining algorithm used to identify frequent itemsets in a dataset.
  10. Episode: A collection of one or more items or events that appear in a sequence.
  11. Episode Rule: A rule that expresses the dependence between two episodes, or between events.
  12. Episode Rule Mining: A process of discovering patterns of relationships between events in a sequence, which have the form of rules.
  13. Episode Mining: The process of discovering patterns that appear in a single long sequence of events with timestamps.
  14. Frequent Episode: An episode that appears in a dataset with a support greater than a given threshold.
  15. Frequent Itemset: An itemset (set of items) that appears in a dataset with a support greater than a given threshold.
  16. FP-Growth Algorithm: A frequent itemset mining algorithm used to identify frequent itemsets in a dataset.
  17. GSP Algorithm: A sequential pattern mining algorithm used to identify frequent patterns in a sequence of items. It is the first algorithm for that problem.
  18. Graph Database: A database that stores data in the form of graphs (multiple graphs).
  19. Graph Mining: The process of discovering patterns, trends, and relationships in graphs.
  20. High-Utility Itemsets: A set of itemsets with a high total profit associated with them.
  21. High-Utility Sequential Patterns: A set of sequences of items with a high total profit associated with them.
  22. Itemset Mining: The process of discovering patterns and relationships between items in a dataset.
  23. Itemset: A set of one or more items, such as the items purchased together in a transaction.
  24. Lift: A measure of the strength of an association rule.
  25. Minimum Support: A parameter used to specify the minimum number of occurrences of an itemset or pattern for it to be considered frequent.
  26. Minimum Confidence: A parameter used to specify the minimum confidence of an association rule for it to be considered valid.
  27. Maximal Episode: A frequent episode that has no frequent super-episode.
  28. Maximal Frequent Itemsets: Frequent itemsets that have no supersets that are also frequent.
  29. Maximum Gap: A parameter used to specify the maximum gap between two items in a sequence for it to be considered a valid pattern.
  30. Maximum Length: A parameter used to specify the maximum length of a pattern for it to be considered valid.
  31. Maximal Sequential Patterns: Frequent sequential patterns that have no supersequences that are also frequent.
  32. Maximum Window Size: A parameter used to specify the maximum size of a sliding window for it to be used for pattern mining.
  33. Periodicity Constraint: A parameter used to specify the minimum periodicity of an itemset or pattern for it to be considered frequent.
  34. Periodic Itemsets: A set of itemsets that occur frequently and have a consistent period of occurrence.
  35. Periodic Pattern Mining: The process of finding patterns that are regularly appearing over time in a sequence of events. This can be done using algorithms such as PFPM.
  36. Periodic Sequential Patterns: A set of sequences of items that occur frequently and have a consistent period of occurrence.
  37. Prefix Span Algorithm: A sequential pattern mining algorithm used to identify frequent patterns in a sequence of items. It is an important algorithm but faster algorithms have been developed such as CM-SPAM and CM-SPADE (2014), and others.
  38. Prefix-tree: A tree-like data structure used by algorithms such as FP-Growth to store information. The information can be transactions, itemsets or other information.
  39. Sequence Database: A collection of sequences, which can be used for tasks such as sequential pattern mining and sequential rule mining.
  40. Sequential Patterns: A set of sequences of items that occur frequently in a dataset.
  41. Sequential Pattern Mining: The process of discovering patterns and relationships between sequences of items.
  42. Sequential Rule Mining: The task of finding relationships between events or symbols in sequences that have the form of rules.
  43. Subgraph: A graph that is part of another graph.
  44. Subgraph Mining: The process of finding subgraphs that are interesting in a single graph or a graph database.
  45. Subsequence: A subset of a sequence that appears in the same order.
  46. Supersequence: A sequence that contains another sequence as a subsequence.
  47. Support: A measure of how often an itemset appears in a dataset.
  48. Temporal Sequence Mining: A process of discovering patterns in time-stamped sequences.
  49. Time-Gap Constraint: A parameter used to limit the maximum gap between two items in a sequence for it to be considered a valid pattern.
  50. Window Constraint: A parameter used to limit the size of a sliding window used to identify sequential patterns.
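To make a few of these terms concrete, here is a tiny worked example (with a made-up dataset of four transactions) that computes the support of an itemset and the confidence of an association rule:

```java
import java.util.List;
import java.util.Set;

// A tiny worked example illustrating the glossary terms "itemset",
// "support", "association rule" and "confidence" on a made-up dataset.
class SupportDemo {
    // support of an itemset: the number of transactions containing it
    static long support(List<Set<String>> db, Set<String> itemset) {
        return db.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "milk", "butter"),
                Set.of("milk"));

        long supBreadMilk = support(db, Set.of("bread", "milk")); // 2
        long supBread = support(db, Set.of("bread"));             // 3

        // confidence of the association rule {bread} -> {milk}
        double confidence = (double) supBreadMilk / supBread;     // 2/3

        System.out.println("support({bread, milk}) = " + supBreadMilk);
        System.out.println("confidence({bread} -> {milk}) = " + confidence);
    }
}
```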

==
Philippe Fournier-Viger is a professor and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.

Posted in Data Mining, Data science, Pattern Mining

SPMF’s architecture (2) The Main class and the Command Processor

In this blog post, I will continue explaining the architecture of the SPMF data mining library. In the previous post, I introduced a key component of SPMF called the Algorithm Manager, which manages all the algorithms offered in SPMF.

Today, I will move on to two other key components in the architecture of SPMF. In particular, I will focus on how SPMF can be run from both a graphical interface and a command line interface. How can these two interfaces work seamlessly with the rest of the code in SPMF? The short answer is that it is thanks to two modules called the Main class and the Command Processor, which I will explain in this article. Briefly, the Main class is the entry point for running SPMF: it detects whether SPMF is started from the command line or not, and then launches the graphical interface or the command line interface. And the Command Processor is the module that takes care of running a command (e.g. executing a data mining algorithm or launching a visualization). A command can be launched from either the command line or the graphical interface.

A brief overview of SPMF’s architecture

Before explaining this in detail, let’s briefly review the overall architecture of SPMF.

SPMF is a Java software, and it is distributed as a JAR file, like most software implemented in Java:

SPMF jar file

SPMF is designed to be used in three ways:

  • as a Java library that can be imported into other Java projects
  • as a standalone program with a simple graphical user interface
  • as a standalone program that can be run from the command line

The architecture of SPMF is presented in the figure below:

SPMF's architecture

As can be seen at the top of this figure, the SPMF software can be called by other Java software or by the user using the library API, the graphical interface or the command line interface. Then, all these interfaces rely on a class called the Algorithm Manager to obtain information about the available algorithms and how to run them. There are three types of algorithms: (1) preprocessing algorithms and tools, (2) data mining algorithms and (3) visualizations.

The Main Class

Now let’s get into the details. As I said previously, the SPMF software is packaged and distributed as a JAR file. To make a JAR file that can be executed as a program, it is necessary to choose a main class that will be the entry point of the program. The Main class plays this role. It is located in the package ca.pfv.spmf.gui of SPMF.

When the user launches the SPMF program from the command line or by double-clicking on the JAR file to start the graphical user interface, the method main() of the class Main is called.

Here is the code of the main() method:

Briefly, the method checks if some arguments have been passed to the program. If there are some arguments, it means that SPMF is executed from the command line, so the method processCommandLineArguments(args) is called to execute the command received from the command line. Otherwise, if there is no argument, it means that the user wants to launch the graphical interface. In that case, the main window of SPMF (called MainWindow in the current version of SPMF) is created and displayed to the user.
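The dispatch logic described above can be sketched as follows. This is a self-contained sketch: the two stub methods only stand in for SPMF's real command processing code and its MainWindow class, and are not the actual SPMF source.

```java
// A self-contained sketch of the dispatch logic of the Main class.
// The stub methods stand in for SPMF's real code (an assumption).
class MainSketch {
    public static void main(String[] args) {
        if (args.length > 0) {
            // arguments present: SPMF was started from the command line
            System.out.println(processCommandLineArguments(args));
        } else {
            // no arguments: the user wants the graphical interface
            System.out.println(launchGraphicalInterface());
        }
    }

    // stub standing in for SPMF's real command processing
    static String processCommandLineArguments(String[] args) {
        return "Running command: " + String.join(" ", args);
    }

    // stub standing in for creating and showing MainWindow
    static String launchGraphicalInterface() {
        return "Opening the graphical interface (MainWindow)";
    }
}
```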

This is the MainWindow:

The Command Processor

Another important module in SPMF is the Command Processor. It is a class that is shared by both the command line interface and graphical user interface. The Command Processor is used to run algorithms that the user has selected either from the command line or graphical user interface.

Every time the user calls an algorithm from either the command line or the graphical interface, the method runAlgorithm() of the Command Processor is called, as illustrated below:

The runAlgorithm() method of the Command Processor takes as parameters: (1) an algorithm name, (2) a path to an input file (or null), (3) a path to an output file (or null), and (4) an array of parameters to be passed to the algorithm. Here is the declaration of this method:
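As a rough sketch, the declaration could look like the following. The signature is reconstructed from the four parameters described above, and the body is only a stub; the real method in SPMF may differ.

```java
// A sketch of the runAlgorithm() declaration of the Command Processor,
// reconstructed from the description above. The body is a stub that
// echoes the request; the real SPMF method runs the actual algorithm.
class CommandProcessorSketch {
    public static String runAlgorithm(String algorithmName,
            String inputFilePath,  // may be null
            String outputFilePath, // may be null
            String[] parameters) {
        // (1) obtain the algorithm's description from the Algorithm Manager
        // (2) validate the file paths and parameters against the description
        // (3) run the algorithm on the input file and write the output file
        return "run " + algorithmName + " with " + parameters.length
                + " parameter(s)";
    }
}
```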

What does the Command Processor do when the runAlgorithm() method is called? It does the following:

First, the Command Processor calls the method getDescriptionOfAlgorithm() of the Algorithm Manager to obtain information about the algorithm that the user wants to run, as illustrated below.

After obtaining information about the algorithm, the Command Processor compares this information to the parameters provided by the user. If the algorithm does not exist in SPMF, if the input or output file paths are incorrect, or if the parameters provided by the user do not match the description of the algorithm, an error is thrown. This error will be displayed to the user through either the graphical or the command line interface.

After that, the Command Processor will run the algorithm by calling it based on its description. The description of an algorithm is a subclass of the class DescriptionOfAlgorithm and must have a runAlgorithm() method. The Command Processor calls this method to run the algorithm. This is illustrated below:

So far, I have explained the main ideas behind the Main class and the Command Processor. Now, I will go into a bit more detail.

The Command Processor can also automatically convert some file formats

Another feature of the Command Processor is that it can automatically convert some special file formats to the SPMF format. Thanks to this, SPMF algorithms can be run on other file formats in a way that is totally transparent to the user, and the algorithms don’t need to be modified to support other formats.

This is achieved as follows. If the Command Processor is called with a special file type, such as a file with the .ARFF or .TEXT extension, the Command Processor will call some tools to convert the file to the SPMF format. This produces a temporary file. Then, the Command Processor calls the requested algorithm on this temporary file. Finally, the Command Processor deletes the temporary file and converts the output of the algorithm back to the format requested by the user. I might explain this in more detail in a future blog post.
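This flow can be sketched as follows. It is a self-contained illustration only: the convert() method below is a placeholder and does not implement any real SPMF format conversion.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// A self-contained sketch of the transparent conversion flow: convert the
// input to a temporary file, run the algorithm on it, then clean up.
// The convert() method is a placeholder, not a real SPMF converter.
class ConversionSketch {
    // placeholder "conversion" of a file's content to another format
    static String convert(String content) {
        return content.replace(' ', ',');
    }

    public static void main(String[] args) throws IOException {
        // step 1: convert the special input format to a temporary file
        Path temporary = Files.createTempFile("converted", ".spmf");
        Files.writeString(temporary, convert("a b c"));

        // step 2: run the requested algorithm on the temporary file
        System.out.println("Processing: " + Files.readString(temporary));

        // step 3: delete the temporary file once the algorithm is done
        Files.delete(temporary);
    }
}
```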

A more accurate picture of SPMF’s architecture

After what I have explained today, we can draw a clearer picture of the architecture of SPMF, where I have added the Main class and the Command Processor:

SPMF version number

By the way, the Main class is also where the version number of SPMF is stored in the code:

Conclusion

In this blog post, I have explained more about the architecture of SPMF. In particular, I have described the role of the Main class and the Command Processor, which are key to run SPMF both from a command line interface and graphical interface.

Hope that it has been interesting. If so, please leave some comments below 🙂

==
Philippe Fournier-Viger is a distinguished professor and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.

Posted in Data Mining, Data science, open-source, spmf