Introducing PSAC-PDB: A Novel Approach to Protein Structure Analysis and Classification

In this blog post, I will give a short overview of our recent research paper about protein structure analysis and classification by my team (M. S. Nawaz, P. Fournier-Viger, Y. He and Q. Zhang).

Proteins are essential molecules for life, as they perform a variety of functions in living organisms, such as catalysis, transport, signaling, defense, and regulation. The structure of proteins determines their function and interactions with other molecules. Therefore, understanding the structure of proteins is crucial for advancing biological and medical research. In biophysics and computational biology, predicting 3D structure of protein from its amino acid sequence is an important research problem.

However, protein structure analysis and classification is a challenging task, as proteins have complex and diverse shapes that can be affected by various factors, such as environment, mutations, and interactions. Moreover, the number of protein structures available in databases such as PDB and EMDB is increasing rapidly, making it difficult to manually annotate and compare them.

Thus, we recently published a research paper, which proposes a novel method for protein structure analysis and classification. The paper is titled “PSAC-PDB: Analysis and Classification of Protein Structures” and it will appear in the journal Computers in Biology and Medicine, published by Elsevier. The reference of the paper is:

Nawaz, S. M., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier, Volume 158, May 2023, 106814

The main contribution of the paper is to propose a new framework called PSAC-PDB (Protein Structure Analysis and Classification using Protein Data Bank).

The main contribution of the paper is to propose a new framework called PSAC-PDB. A sechma of the overall method can be found below:

Briefly, PSAC-PDB is a computational method that uses a protein structure comparison tool (DALI) to find similar protein structures to a query structure in PDB, and then uses amino acid sequences, aligned amino acids, aligned secondary structure elements, and frequent amino acid patterns to perform classification. PSAC-PDB applies eleven classifiers and compares their performance using six evaluation metrics.

PSAC-PDB also uses sequential pattern mining (SPM) to discover frequent amino acid patterns that can improve the classification accuracy. SPM is a data mining technique that finds subsequences that appear frequently in a set of sequences. SPM can capture the sequential ordering and co-occurrence of amino acids in protein sequences, which can reflect their structural and functional properties. Some examples of frequent sequential patterns of Amino Acids (AA) extracted by the TKS and CM-SPAM algorithms are shown as example in the table below for three families: the S protein structure of SARS-CoV-2 (SSC2), the S protein structures of other viruses and organisms (SO), and Protein (enzyme) structures for others (O).

PSAC-PDB was tested on the case study of SARS-CoV-2 spike protein structures, which are responsible for the entry of the virus into host cells. PSAC-PDB finds 388 similar protein structures to the query structure in PDB, and divides them into three families: S protein structures of SARS-CoV-2, S protein structures of other viruses and organisms, and other protein structures. PSAC-PDB uses four datasets based on amino acid sequences, aligned amino acids, aligned secondary structure elements, and frequent amino acid patterns for classification.

Results have shown that PSAC-PDB achieves high accuracy, precision, recall, F1-score, MCC and AUC values for all three families of protein structures, than state-of-the-art approaches for genome sequence classification. Here are some of the results showing this:

But what is also interesting is that PSAC-PDB shows that using frequent amino acid patterns or aligned amino acids can improve the classification performance compared to using only amino acid sequences or aligned secondary structure elements. Thus, PSAC-PDB can benefit the research community in structural biology and bioinformatics.

For more information, please see the paper. Besides, the datasets from this paper can also be found on Github: https://github.com/saqibdola/PSAC-PDB and implementations of sequential pattern mining algorithms used in the paper can be found in the SPMF data mining software.

Posted in Bioinformatics | Leave a comment

DSSBA 2023, 2nd Special Session on Data Science for Social and Behavioral Analytics @DSAA

I am excited to announce that the 2nd Special Session on Data Science for Social and Behavioral Analytics (DSSBA 2023) will be held at the 10th IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA 2023) from October 9-13, 2023.

DSSBA 2023 is a forum for researchers and practitioners to present and discuss the latest advances on the design of efficient, scalable and effective solutions for analyzing social and behavioral data. Social and behavioral data are collected from various sources such as social networks, e-learning systems, e-commerce platforms, health care systems, etc. Analyzing these data can help us gain a better understanding of human behavior and social interactions, which can support decision making, personalized recommendation, behavior prediction, behavior change, etc.

The topics of interest for DSSBA 2023 include, but are not limited to:

  • Efficient and scalable algorithms for behavioral and social analytics
  • Evaluation of behavioral analytic models
  • Privacy-preserving techniques for behavioral and social analytics
  • Cognitive and social aspects of behavior analysis
  • Intelligent systems and services powered by behavioral and social data models
  • Applications of behavioral and social analytics in various domains such as education, health care, business, etc.
  • Pattern mining and machine learning models, etc.

We invite you to submit your original research papers to DSSBA 2023 by June 1, 2023. The submission guidelines and more information can be found on the DSSBA 2023 website. The accepted papers will be published in the DSAA 2023 proceedings.

We look forward to receiving your submissions and seeing you at DSSBA 2023!

Posted in cfp, Conference, Data Mining, Data science | Tagged , , , , , , , , , | Leave a comment

Discovering the Top-K Stable Periodic Patterns in a Sequence of Events

In this blog post, I will give a brief introduction to the TSPIN paper about how to find stable periodic patterns in a sequence of events. This algorithm was presented in this research paper:

Fournier-Viger, P., Wang Y., Yang, P., Lin, J. C.-W., Yun, U. (2021). TSPIN: Mining Top-k Stable Periodic Patterns. Applied Intelligence. [source code & data]

What is a periodic pattern? Periodic patterns are sets of values or symbols that appear repeatedly in a sequence of events at more or less regular intervals. For example, in a customer transaction database, a periodic pattern may indicate that some customers buy milk every week or bread every two days. Discovering periodic patterns can thus help to understand customer behavior, optimize inventory management, forecast sales, and so on. For example, here is an illustration of a periodic pattern {a,c} that appears every two hours in a sequence of events, where a is for apple and c is for cake:

The idea of finding periodic pattern is interesting as it can provide insights about the data. However, a major problem is that traditional algorithms for periodic pattern mining have two important limitations:

First, they use a very strict definition of periodicity that requires a pattern to always appear again within a fixed time limit. As a result, some meaningful patterns may be discarded if they are slightly irregular. For example, if a user wants to find periodic patterns that appear weekly, a traditional algorithm will discard a pattern if it does not appear for a week even though it may appear weekly for the rest of the year.

Second, several algorithms use a minimum support threshold to filter out infrequent patterns. This parameter is useful but generally users don’t know how to set it to find patterns and thus adjust it by trial and error to find a suitable value.

To overcome these limitations, the TSPIN (Top-k Stable Periodic pattern mINer) algorithm was designed. It combines two key ideas:

1) Stability: A measure of stability is used in TSPIN to evaluate how consistent a pattern is in terms of its periodicity. A stable pattern has relatively small variations in its occurrence intervals, while an unstable pattern has large variations. TSPIN finds stable periodic patterns but it is designed to be less strict than traditional algorithms so as to also find patterns that are slightly irregular.

2) Top-k: A way of ranking patterns based on their frequency and selecting only the k most frequent ones. This avoids the need to specify a minimum support threshold.

The TSPIN algorithm is an efficient algorithm that integrates several optimizations. Those are explained in the TSPIN research paper. Also that paper describes several experiments on synthetic and real datasets. The results show that TSPIN has good performance and can reveal interesting patterns in real-life data.

There are several potential applications of TSPIN. For example, example, applying TSPIN on a dataset of web clickstream data may reveal that some users visit certain websites periodically with high stability, such as news portals or social media platforms. This information could then be used for personalized recommendation or advertising.

To try TSPIN, the code and datasets are available in the SPMF pattern mining library, which is a popular Java software for pattern mining, which can also be used from other programming languages. SPMF also offers implementations of a dozen other algorithms for discovering periodic patterns and hundreds of algorithms to find other types of patterns such as sequential rules, sequential patterns, frequent itemsets and high utility itemsets.

Conclusion

In conclusion, TSPIN is an innovative algorithm for mining top-k stable periodic patterns in discrete sequences. It addresses some limitations of traditional algorithms by using a more flexible definition of periodicity and avoiding parameter tuning. It can discover meaningful patterns that reflect regular behaviors or habits in various domains.

Hope this blog post has been interesting. If you like this topic, you may also check my list of key papers on periodic pattern mining on this blog.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

Posted in Big data, Data Mining, Pattern Mining, spmf | Tagged , , , , , , , , | Leave a comment

Data disasters and more…

Today, I will not talk about a serious topic. Recently, I have been experimenting a little bit with generative AI tools for pictures during my free time. Some services can generate some pretty nice pictures based on text prompts. For fun, I have combined the concept of “data” with different types of disasters such as “volcano”, “tsunami”, thunderstorm” and other concepts. Here are a few generated pictures.

Data tsunami

Data volcano

Data thunderstorm

Data tornado

Data mountain

Data vortex

Data waterfall

Interesting?

Actually, it did not take much time to generate these pictures and it is possible to generate many other variations by using the same prompts. But if you use the above pictures as they are, please give credit to this blog and link to this page.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.

Posted in Data Mining, Data science, General, Machine Learning | Leave a comment

UDML 2023 workshop!

I’m excited to announce that I am co-organizing a new edition of the Utility-Driven Mining and Learning (UDML) workshop, which will be colocated with the IEEE International Conference on Data Mining (ICDM) 2023 in Shanghai, China.

The UDML workshop aims to bring together researchers and practitioners who are interested in developing and applying utility-driven methods for data mining and machine learning. Utility-driven methods are those that consider not only the accuracy or interestingness of the results, but also their usefulness or value for specific applications or users. For example, utility-driven methods can take into account user preferences, constraints, costs, benefits, risks, or trade-offs when mining or learning from data.

The UDML workshop will feature invited talk by a leading expert in the field, as well as contributed papers that showcase novel research ideas, methods, systems, applications, and challenges related to utility-driven mining and learning. The workshop will also provide a platform for networking and discussion among researchers and practitioners who share a common interest in this topic.

If you are working on or interested in utility-driven mining and learning, I invite you to submit your paper or poster to the UDML workshop by September 1st 2023. The submission guidelines and topics of interest can be found on the workshop website: https://www.philippe-fournier-viger.com/utility_mining_workshop_2023/

I also encourage you to attend the UDML workshop on December 4th 2023 at ICDM 2023 in Shanghai. You will have the opportunity to learn from the presentation, interact with the authors of the accepted papers and posters, and network with other participants.

I hope to see you at UDML 2023!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.

Posted in Big data, Conference, Data Mining, Data science, Pattern Mining | Tagged , , , , , , , | Leave a comment

How to Detect and Classify Metamorphic Malware with Sequential Pattern Mining (MalSPM)

Malware are malicious software that can harm computers and networks by stealing data, encrypting files, or damaging devices. Malware are a serious threat to cybersecurity, especially when they can change their appearance to evade detection by antivirus software. This is called metamorphic malware, and it is a challenging problem for malware analysis and classification.

In this blog post, I will describe a new method called MalSPM (Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining) that can detect and classify metamorphic malware based on their behavior during execution. The MalSPM method was presented in a research paper that you can read for more details:


Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022) MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsever, to appear

I will now explain what are the main features of metamorphic malware, how MalSPM analyzes them using sequential pattern mining (SPM), and what are the advantages of using MalSPM.

What are metamorphic malware?

Metamorphic malware are malware that can modify their code or structure without changing their functionality. This means that they can produce different variants of themselves that look different but behave the same. For example, a metamorphic virus can change its encryption algorithm or insert junk code into its body to avoid being recognized by signature-based antivirus software.

Metamorphic malware pose a serious challenge for malware detection and classification because they can bypass static analysis techniques that rely on code similarity or predefined patterns. Therefore, dynamic analysis techniques that monitor the behavior of malware during execution are more suitable for dealing with metamorphic malware.

How does MalSPM analyze metamorphic malware?

MalSPM is a method that uses sequential pattern mining to analyze and classify metamorphic malware based on their behavior during execution. SPM is a data mining task that consists of finding frequent subsequences in a dataset of sequences. In the case of MalSPM, SPM was applied to a dataset that contains sequences of API calls made by different malware on the Windows operating system (OS). This allows to extract patterns representing the characteristics of different families of malware. API calls are functions provided by the OS that allow applications to perform various tasks such as accessing files, creating processes, or sending network packets. API calls are an attractive and distinguishable feature for malware analysis and detection because they can reflect the actions of executable files.

MalSPM first applies SPM algorithms to find patterns indicating frequent API calls in the dataset. These patterns can be of different types such as sequential rules between API calls as well as maximal and closed sequences of API calls. These patterns represent common behaviors of different types of malware such as ransomware, trojan, and worm. For example, here are a few sequential patterns that were extracted by MalSPM from the dataset of malware API calls:

Each line in this table is a pattern. For example, the first line indicates that a frequent pattern for a malware is to call the API NtClose, followed by NtQueryvalueKey, and then followed by NtClose, and that this pattern appears in 919 malware sequences.

Then after extracting the patterns, MalSPM uses them for the classification of different malware. This is done by using the discovered patterns as feature to train classifiers. In this paper, the performance of seven classifier was compared using various metrics. Moreover, the performance of MalSPM was compared with state-of-the-art malware detection methods and it was found that MalSPM outperformed these methods.

Here is a picture that illustrates the overall process of malware detection using MalSPM.

What are the benefits of using MalSPM?

MalSPM has several benefits for malware detection and classification.

First, it can handle metamorphic malware that can change their appearance by focusing on their behavior rather than their code.

Second, it can discover common and specific behaviors of different types of malware by using SPM techniques.

Third, it can achieve high accuracy and efficiency by using effective pruning strategies and database projection. Fourth, it can provide interpretable results by using sequential rules and patterns that can explain the logic behind the classification.

Code and datasets

For more details, please see the research paper of MalSPM. The datasets can be found here. And if you want to try the algorithms for extracting sequential patterns from that paper, please see the SPMF data mining software, which offers very fast implementations of those algorithms.

Conclusion

In conclusion, MalSPM is a novel method that uses sequential pattern mining to analyze and classify metamorphic malware based on their behavior during execution. It can deal with the challenges posed by metamorphic malware and provide useful insights for cybersecurity researchers and practitioners. For more details, please see the research paper.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

Posted in Data Mining, Industry, Pattern Mining, spmf | Tagged , , , , , , , , | 1 Comment

How to propose a special issue for a journal?

Today, I will talk about how to propose a special issue for a journal.

What is a special issue?

A special issue is a collection of articles on a specific topic or theme that is published in a journal. Special issues can provide an opportunity to present the latest research, highlight emerging trends, or address gaps in the literature. If you have an idea for a special issue, you may want to apply to be a guest editor and organize it.

How to organize a special issue?

You can follow these steps:

1. Select a journal that is relevant to your topic and has a good reputation or is good according to other criteria that matter to (e.g., a high impact factor, high citation rate, etc.). You should choose a journal that is within the scope of your field and reaches a large audience. You can consult the websites of big publishers to search for journals in your field and compare their metrics and scope.

2. Consult the journal’s website for the guidelines and requirements for special issue proposals. Some journals may have a specific format, template, or online submission system for proposals. Others may ask you to email your proposal to the editor-in-chief or a designated contact person.

3. Prepare a proposal that outlines the rationale, objectives, scope, and expected outcomes of your special issue. You should also include a tentative title, a list of potential topics and keywords, a timeline for submission and publication, and a brief introduction of yourself and any co-editors. You should also demonstrate the significance, relevance, and timeliness of your special issue, and how it will benefit the readers and the field. You may also need to provide some sample papers or abstracts to demonstrate the quality and relevance of your special issue.

4. Submit your proposal to the journal and wait for their response. The journal editors will review your proposal and decide whether to accept, reject, or request revisions. They may also ask you to provide more information, such as the names and affiliations of potential authors and reviewers, or a detailed budget and funding sources.

5. If your proposal is accepted, you will be responsible for managing the editorial process of your special issue. This includes inviting authors, soliciting submissions, coordinating peer review, editing and proofreading manuscripts, and ensuring that the special issue meets the journal’s standards and deadlines. You will also need to communicate with the journal editors and staff, and follow their instructions and policies.

Applying for a special issue can be a rewarding and challenging experience. It can help you showcase your expertise, expand your network, and contribute to the advancement of your field. However, it also requires a good amount of time, effort, and commitment. Therefore, you should carefully consider your motivation, resources, and expectations before applying for a special issue.

Conclusion

That is all for that. It was just a short blog post about how to propose a special issue for a journal.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms. 

Posted in Academia | Tagged , , , | Leave a comment

ASCII Art for SPMF

Today, just for fun, I generated some ASCII arts for the SPMF data mining library using this tool: http://textkool.com/en/ascii-art-generator?hl=default&vl=default&font=Red%20Phoenix&text=Your%20text%20here%20

I think the result is not bad but I am not sure if I would use it:

   _____ _____  __  __ ______ 
  / ____|  __ \|  \/  |  ____|
 | (___ | |__) | \  / | |__   
  \___ \|  ___/| |\/| |  __|  
  ____) | |    | |  | | |     
 |_____/|_|    |_|  |_|_|     

 ___    ___           ___   
(  _`\ (  _`\ /'\_/`\(  _`\ 
| (_(_)| |_) )|     || (_(_)
`\__ \ | ,__/'| (_) ||  _)  
( )_) || |    | | | || |    
`\____)(_)    (_) (_)(_)    

  ____  ____  __  __ _____ 
 / ___||  _ \|  \/  |  ___|
 \___ \| |_) | |\/| | |__   
  ___) |  __/| |  | |  __|  
 |____/|_|   |_|  |_|_|   

   ______..______  .___  ___.  ______ 
  /      ||   _  \ |   \/   | |   ___|
  |  (---`|  |_)  ||  \  /  | |  |__   
   \   \  |   ___/ |  |\/|  | |   __|  
.---)   | |  |     |  |  |  | |  |     
|______/  | _|     |__|  |__| |__|     
                                           

Which of the above do you prefer? Let me know in the comment section, below!
For me, I think I prefer the third one.

Posted in spmf | Leave a comment

Efficiency problems with using java.util.Hashmap

Today, I will talk about some efficiency problems when using the java.util.Hashmap data structure provided by Java. This is an important topics for programmers that aim at implementing efficient software that rely on maps.

Brief review about HashMap

First, let’s review why HashMap is important and what is special about this class. Hashmaps are important because they allow us to store and retrieve data efficiently and conveniently. More precisely, they store data in the form of key-value pairs, where each key is unique. The main feature of a HashMap is to allow efficient retrieval of a value in almost constant time O(1) if we know the key. Key-values pairs are stored in a hash table using a hash function. The most important operations of a HashMap are:
put(key value) add a key-value pair to the map,
get(key) retrieve a value from the map,
remove(key) remove a key value pair from the map using its key
Moreover, some other operations are:
contains(): to check if a key is in the map

Efficiency problems

Now, I will talk about some efficiency problems when using the java.util.HashMap implementation provided by Java.

(1) Handling of primitive types. First, many people will use a HashMap to store values from variable that are primitive types such as int, long, double, float, short. For example, one could write:

int x =0;
int y =5;
HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
map.put(x,y);

This will create a new key-value pair (0,5) in a HashMap for storing integers. This works but what is the problem with this? It is that Java will have to convert values of type int to Integer and we might have to convert again from Integer to int after retrieving a value from the map. This process called boxing and unboxing in Java is done automatically but it results in the unnecessary creation of more objects and thus a higher memory consumption.

Solution: A solution is to use an alternative implementation of HashMap that allows to directly use primitive types. There exists such implementations online. And this will be the case with some new data structures proposed in the next version of the SPMF software for example. In some cases, I have observed that using a custom map for primitive types can reduce memory use by up to 50%.

(2) Rehashing. A lot of interesting things can be learnt by looking at the source code of java.util.Hashmap. One thing that we can learn is that it is implemented as an array of buckets where collisions are handled using a linked-list for each bucket. And when the load factor (the number of elements stored in the map) exceed some threshold, the map is automatically resized to twice the size. This process of doubling the size of a HashMap allows to reduce the number of collision and thus maintain fast access to the key-values in the map. However, this process requires to do rehashing, that is to recompute the hash code of each key and reinsert all elements again in the map. Luckily, this is done quite efficiently as the objects representing the entries of the HashMap are reused when it is resized rather than creating new Entry objects. However, resizing a map still has a certain cost.

Solution: To reduce the cost of rehashing when using Java.util.HashMap, it is recommended to initialize the Hashmap to a proper size using a constructor such as:

    /**
     * Constructs an empty HashMap with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor

     */
    public HashMap(int initialCapacity, float loadFactor);

For example, if we expect an hashmap to contain 10,000 elements, we could initially set the size to a value greater than 10,000 and choose a load factor accordingly so as to reduce the cost of resizing().

(3) Handling collisions. As said above, the HashMap handles collisions using a linked-list for each bucket. And when there are too many collisions, HashMap can also switch to a tree structure for a bucket to reduce the time complexity of accessing entries. But generally, we should aim at reducing collisions.

Solution: To reduce collisions, one way is to choose an appropriate hash function that will assign keys to different buckets. To do this, we can define a custom hashCode() method. For example, if we want to put objects in a map that are instances of a class called Book as keys. Then, we could define an appropriate hashCode() function in the class Book to compute hash codes. Then, the HashMap will use that function to determine where the key,value pairs are stored. A good function will reduce collisions.

(4) Inefficiency on how the Hashmap is used. A big problem about HashMap is not about the structure itself but how it is used by the programmer to store and access key-value pairs. Let me show you a first example of that problem. Here is some code to store key-value pairs in a HashMap that are of type Integer,Integer:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
// Put some value 6 in the map for the key 5
map.put(5,6);
...
// Retrieve the value for key 5 and increase it by 1.
Integer value = map.get(5);
if(value == null){
   map.put(5, 1);
}else{
   map.put(5, value+1);
}

This code first puts a key-value pair (5,6) in the map. Then, later it retrieves the value for 5 and increase it by 1, and put it again in the map as (5,7).

What is wrong with this code? Well, it is that when map.get() is called, the hash code of 5 will be computed and the value will be searched in the HashMap. Then, when map.put() is called, the hash code of 5 will be computed again and it will be searched again in the HashMap. Doing this work twice is unecessary.

Solution: A better solution would be to have a method that would directly get the value in the map and increase it by the required value so as to avoid searching twice. In the early versions of Java ( < 1.8) I did not see an efficient way of doing this with the HashMap class provided by java.util. In the most recent Java versions, this can be achieved using the computeIfPresent() method.
But this can be also achieved if we use a custom Hashmap class. This is what is done for example, in the HashMap class that will be included in the next version of the SPMF software, with a method to do this directly:

void getAndIncreaseValueBy(int key, int valueToAdd);  // Not in HashMap.

Here is a second example of inefficient programming with HashMap that I often see in the code of programmers that are not familiar with Java. The problem is to call several times the contains(…), get, and put methods. For example:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
if(map.contains(5) == false){
    map.put(5,6);
}

What is the problem here? Well, by calling contains() and put(), the program will search twice for the value 5 in the Hashmap.

Solution: In this case, the solution is simple. We can just rewrite like this:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
 map.put(5,6);

This code will do the same thing. If the key value pair (5,6) does not exist in the Map, it will be created, and if it does exist, then it will be replaced by (5,6). How do I know this? Because I have read the source code of HashMap to understand how it works 😉

A third example of inefficient programming with Hashmap is when someone needs to iterate over the key and values of a HashMap. For example, let’s say that we have a map called map. Then someone could fore example call map.entrySet().iterator() to obtain an iterator over all the entries of the Map. Or someone could call map.keySet().iterator() to obtain an iterator over all the keys of the map or map.valueSet().iterator() to obtain an iterator over all the values of the map. This can be OK if we are careful about how to use it. Some example of inefficient code is this:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
for(Integer key:  map.keySet()){
    Integer value = map.get(key);
     ...
    System.out.println(" key-value: " + key + " - " + value);
}

What is the problem in this code? It is that it iterates on the keys and then call the get() method to obtain the corresponding value of each key. This is not necessary.

Solution.To avoid calling get, we should instead just do a loop using map.entrySet().iterator().

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
for(Entry<Integer, Integer> entry:  map.entrySet()){
 System.out.println(" key-value: " + entry.getKey() + " - " + entry.getValue());
}

Using entrySet() we immediately obtain the corresponding value of each key without calling get().

Conclusion

java.util.HashMap is a very useful structure in Java but it should be used carefully. For this, I recommend to read the source code of HashMap to see how it is implemented (or at least to read the Javadoc) rather than to just use the methods that it provides without thinking. By reading the source code or documentation, we can understand how it works and write more efficient code.

Besides, the HashMap class is powerful but it also has some limitations. An example of limitation that I highlighted above is the inability to handle primitive values without auto-boxing (e.g. transforming int to Integer, or short to Short). To solve this problem, the solution is to use custom HashMap classes that handle primitive types.

Another issue with HashMap is that it provides a lot of features but sometimes all these features are not needed. Thus, sometimes a custom HashMap implementation that is lightweight would actually be more worthwhile when the goal is efficiency. For example, in the next version of the SPMF software that will be released soon, some custom HashMap classes will be provided for primitive types that are well-optimized to increase performance. I have observed that in some cases, it can reduce memory usage by over 50% for some data mining algorithms that do some intensive use of maps. But there are also drawbacks to using custom classes for HashMaps instead of the default one provided with one. One of them is that it may make the code a little harder to understand.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms. 

Posted in Java, Programming | Tagged , , , , , , , , , , | Leave a comment

SPMF’s architecture (5) The Graphical User Interface

This is the fifth post of a series of blog posts about the architecture of the SPMF data mining library. Today, I will specifically talk about the graphical user interface of SPMF, its main components and how it interacts with other components from SPMF.

An overview of the architecture of SPMF

Before, I start talking about the graphical user interface of SPMF, let’s look at the overall architecture of SPMF:

SPMF is implemented using the Java language. It can be used in three ways: (1) using a simple graphical user interface, (2) using its command line interface or (3) by integrating it into other software directly as a Java library or using some wrappers. When running the SPMF software as a standalone program, the Main class is the entry point of the software, which then calls the graphical interface or the command line interface. The user can use these interfaces to select some algorithm to be run with some parameters. Running an algorithm is then done by calling a module call the Command Processor, which obtains information about the available algorithms from another module called the Algorithm Manager.

Now for today’s topic, I will talk in more details about the graphical interface.

The Graphical Interface

The main class of the Graphical Interface in SPMF is called MainWindow and is located in the package ca.pfv.spmf.gui. The class MainWindow contains the code for the main Window of SPMF, which looks like that:

This Window is implemented using the Java Swing library for building a graphical user interface using Java.

In the package ca.pfv.spmf.gui, there are also several other classes that are used for other aspects of the user interface, such as for displaying visualizations of datasets and results from data mining algorithms. I will talk more about this later.

What are the relationships between the Graphical Interface and other components of SPMF?

I would like now to explain how the graphical interface interacts with other components of SPMF. Here is a small illustration:

As said, previously, when running SPMF, the Main class is starting the graphical interface by creating an instance of the MainWindow class. The MainWindow and graphical user interface of SPMF in general reads and write user preferences through another module called the PreferenceManager. The type of user preferences that are stored by SPMF are for example, the last directory that the user has used for reading an input file using SPMF and the preferences of using the text editor of SPMF or the system’s text editor to open output files. The PreferenceManager stores these preferences permanently, using for example the Window Registry if SPMF is run using the Windows operating system.

Besides that, the graphical user interface is also interacting with the the Algorithm Manager module of SPMF to obtain the list of available algorithms and information about them such as the required parameters. This allows the graphical interface to display the list of algorithms to the user:

Now, after the user selects and algorithm to run it, the action of the running the algorithm is done by calling the Command Processor module. That module takes care of verifying that the parameters provided by the user are appropriate for the algorithm that has been selected, obtain information about how to run the algorithm from the Algorithm Manager, and then run the algorithm. Besides, the Command Processor can also automatically convert some special file types before calling an algorithm. For example, the Command Processor can automatically transform ARFF and TEXT files to the SPMF format so that algorithms from SPMF can also be run on ARFF files and TEXT documents with a .text extension.

What are the main classes of the Graphical Interface?

It would take a long time to describe all the classes that make up the graphical interface of SPMF. But mainly, all the classes are located in the package ca.pfv.spmf.gui and there is at least one class for each window used by SPMF. Here is an overview:

In the middle, you can see the main classes for the different windows from the graphical interface:

  • AlgorithmExplorer: This windows lets the user browse the different algorithms offered in SPMF and get detailed information about them.
  • ClusterViewer: This windows is used to display clusters found by clustering algorithms such as K-Means.
  • GraphViewer: This windows is used to display frequent subgraphs found by graph mining algorithms or to display graph datasets
  • InstanceViewer: This windows is used to display the content of input data files for clustering. These files are vectors of numbers
  • PatternsVisualizer:This windows can display the patterns found by a data mining algorithm that is applied by the user, using a table view. It offers various functions such as to search, sort and filter patterns.
  • TimeSeriesViewer: This windows is used for displaying time series data visually.
  • SPMFTextEditor: This is a simple text editor that is provided with SPMF. It can be used to display the output files produced by the algorithms.
  • ExperimenterWindow: This is a window to run performance experiments to test some algorithms. For example, it allows to automatically compare the performance of multiple algorithms on a dataset while varying a parameter.
  • AboutWindow: This window presents general information about SPMF.

Several windows such as the InstanceViewer, ClusterViewer and TimeSeriesViewer are displaying plots. This is done using the Plot class from the ca.pfv.spmf.gui.plot package.

How the graphical user interface was implemented?

A large part of the user interface of SPMF was coded by hand in Java. But there also exists some tools that can help in building a user interface more quickly when developing Java programs. For programming SPMF, I mostly used the Eclipse IDE (Integrated Development Environment), and there exists a plugin called the WindowBuilder Editor, which allows to create user interfaces visually. This can save quite some time for user interface programming. For example, here is a screenshot of that plugin for designing the TimeSeriesViewer of SPMF:

Using the WindowBuilder Editor, I can directly place different components using the mouse such as buttons, text fields, and combo boxes. Then, after that, I can connect these components to events such as mouse clicks and write Java code for handling the events.

Conclusion

Today, I have given an overview about how the user interface of SPMF is designed, its main components and how it interacts with other components from the SPMF software. Hope it has been interesting!

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms.

Posted in Data Mining, Data science, Java, open-source, spmf | Tagged , , , , , , , , , | Leave a comment