What is a Closed Itemset and Why is it Useful?

In this blog post, I will explain in simple terms what a closed itemset is and give some examples. I will also mention a few algorithms that can be used to find closed itemsets, and point out that they are implemented in the SPMF data mining library in Java.

Frequent itemset mining is a popular technique for discovering interesting relationships between items in data represented as a table. The classical application of frequent itemset mining is to analyze data from a transactional database to find correlations between the items that customers purchase.

An itemset is a set of items that appear together in a transaction or a record. For example, if we have a database of customer transactions, an itemset could be {bread, butter, jam}, which means that some customers have bought bread, butter and jam together.

A frequent itemset is an itemset that appears frequently in a database, that is, its occurrence is above a minimum threshold (minsup) set by the user. For example, if minsup is set to 10%, then a frequent itemset is an itemset that appears in at least 10% of the transactions made by customers. Frequent itemsets have many applications. In the context of analyzing customer transactions, they can be used for marketing, recommendation systems, or cross-selling.

Though frequent itemset mining has many applications, a problem is that it can generate a very large number of frequent itemsets, especially when the database contains many items and transactions. This can make it difficult for users to analyze and interpret the results, and also consume a lot of memory and computational resources.

To address this problem, one can use the concept of frequent closed itemset. A frequent closed itemset is a frequent itemset that has no frequent superset with the same support (i.e., appearing in the same number of transactions). For example, if {bread, milk} has a support of 15% and {bread, milk, cheese} also has a support of 15%, then {bread, milk} is not a frequent closed itemset because it has a frequent superset with the same support.

Let’s take an example. Suppose we have a database with four customer transactions, denoted as T1, T2, T3 and T4:

T1: {a,b,c,d}
T2: {a,b,c}
T3: {a,b,d}
T4: {a,b}

where the letters a, b, c, d indicate the purchase of the items apple, bread, cake and dates.

If we set the minimum support threshold to 50% (which means that we want to find itemsets appearing in at least two transactions), a frequent itemset mining algorithm will output the following frequent itemsets:

{a}, {b}, {c}, {d}, {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {a,b,c}, {a,b,d}

But among these frequent itemsets, only three are frequent closed itemsets:

{a,b}, {a,b,c}, {a,b,d}

The reason is that these itemsets have no frequent supersets with the same support. For example, {a,b} has a support of 100%, and none of its supersets ({a,b,c}, {a,b,d} and {a,b,c,d}) has the same support. On the other hand, {a,c} is not closed because it has a superset ({a,b,c}) with the same support (50%).
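To make the closedness check concrete, here is a small illustrative Java sketch (not how real miners such as CHARM work internally) that computes the support of an itemset in the toy database above and verifies that {a,c} is not closed, because its superset {a,b,c} has the same support:

import java.util.*;

public class ClosedCheckExample {

    // Support = number of transactions that contain all the items of the itemset
    static int support(List<Set<Character>> db, Set<Character> itemset) {
        int count = 0;
        for (Set<Character> transaction : db) {
            if (transaction.containsAll(itemset)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Set<Character>> db = List.of(
                Set.of('a', 'b', 'c', 'd'),   // T1
                Set.of('a', 'b', 'c'),        // T2
                Set.of('a', 'b', 'd'),        // T3
                Set.of('a', 'b'));            // T4

        System.out.println(support(db, Set.of('a', 'c')));       // 2
        System.out.println(support(db, Set.of('a', 'b', 'c')));  // 2
        // {a,c} is not closed: its superset {a,b,c} has the same support.
    }
}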

Why are closed itemsets useful?

Closed itemsets are important because they can reduce the number of frequent itemsets presented to the user without losing information. The set of all frequent itemsets can be very large and redundant, especially when minsup is low or the database is dense. By mining closed itemsets, generally a much smaller set of itemsets is obtained, and all the other frequent itemsets can still be derived from the closed itemsets. In other words, any frequent itemset can be obtained from some closed itemset by removing items. For example, we can obtain {a} from {a,b} by removing b, or {b,d} from {a,b,d} by removing a. Therefore, by using closed itemsets, we can reduce the number of itemsets presented to the user without missing any important information.

How to find closed itemsets?

There are several algorithms that have been proposed for this task, such as Apriori-Close, CHARM, FP-Close and LCM. These algorithms are based on different strategies, such as pruning, merging, prefix trees or bitmap representations, to efficiently find all closed itemsets without missing any or generating any false positives. Some of these algorithms have also been adapted to mine maximal itemsets, which are closed itemsets that have no frequent supersets at all.

If you want to try these algorithms on your own data, you can use the SPMF data mining library in Java. SPMF is an open-source library that provides very efficient implementations of various data mining algorithms, including several for closed itemset mining. You can download SPMF from its website (http://www.philippe-fournier-viger.com/spmf/) and follow the documentation and examples to run the algorithms on your data. You can also use SPMF as a command-line tool or integrate it with other Java applications, or with programs written in other programming languages such as C#, Python and R (see the website).
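To give an idea of how this looks in practice, here is a minimal sketch that launches SPMF from a Java program through its documented command-line interface (java -jar spmf.jar run <algorithm> <input> <output> <parameters>). The algorithm name Charm_bitset and the file names are assumptions used for illustration; the exact names accepted by the command line are listed in the SPMF documentation:

import java.io.IOException;

public class RunSpmfExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumes spmf.jar and input.txt are in the current directory
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-jar", "spmf.jar",
                "run", "Charm_bitset",        // a closed itemset miner (check the docs for the exact name)
                "input.txt", "output.txt",    // input transaction database and output file
                "40%");                       // minimum support threshold
        pb.inheritIO();                       // show SPMF's console output
        int exitCode = pb.start().waitFor();
        System.out.println("SPMF finished with exit code " + exitCode);
    }
}

Of course, SPMF can also be called directly through its Java API; the example programs that ship with the library show the exact class names and parameters for each algorithm.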

A video

If you are interested in this topic, you can also watch a 50-minute video where I explain in more detail what closed itemsets are and their properties, as well as maximal itemsets and generator itemsets. This video is part of my free pattern mining course: Maximal, Closed and Generator itemsets, also available on my YouTube channel.

Hope that this blog post has been interesting 🙂

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

Posted in Data Mining, Data science, Pattern Mining

Introducing PSAC-PDB: A Novel Approach to Protein Structure Analysis and Classification

In this blog post, I will give a short overview of our recent research paper about protein structure analysis and classification by my team (M. S. Nawaz, P. Fournier-Viger, Y. He and Q. Zhang).

Proteins are essential molecules for life, as they perform a variety of functions in living organisms, such as catalysis, transport, signaling, defense, and regulation. The structure of proteins determines their function and interactions with other molecules. Therefore, understanding the structure of proteins is crucial for advancing biological and medical research. In biophysics and computational biology, predicting the 3D structure of a protein from its amino acid sequence is an important research problem.

However, protein structure analysis and classification is a challenging task, as proteins have complex and diverse shapes that can be affected by various factors, such as environment, mutations, and interactions. Moreover, the number of protein structures available in databases such as PDB and EMDB is increasing rapidly, making it difficult to manually annotate and compare them.

Thus, we recently published a research paper, which proposes a novel method for protein structure analysis and classification. The paper is titled “PSAC-PDB: Analysis and Classification of Protein Structures” and it will appear in the journal Computers in Biology and Medicine, published by Elsevier. The reference of the paper is:

Nawaz, M. S., Fournier-Viger, P., He, Y., Zhang, Q. (2023). PSAC-PDB: Analysis and Classification of Protein Structures. Computers in Biology and Medicine, Elsevier, Volume 158, May 2023, 106814

The main contribution of the paper is to propose a new framework called PSAC-PDB (Protein Structure Analysis and Classification using Protein Data Bank). A schema of the overall method can be found below:

Briefly, PSAC-PDB is a computational method that uses a protein structure comparison tool (DALI) to find similar protein structures to a query structure in PDB, and then uses amino acid sequences, aligned amino acids, aligned secondary structure elements, and frequent amino acid patterns to perform classification. PSAC-PDB applies eleven classifiers and compares their performance using six evaluation metrics.

PSAC-PDB also uses sequential pattern mining (SPM) to discover frequent amino acid patterns that can improve the classification accuracy. SPM is a data mining technique that finds subsequences that appear frequently in a set of sequences. SPM can capture the sequential ordering and co-occurrence of amino acids in protein sequences, which can reflect their structural and functional properties. Some examples of frequent sequential patterns of amino acids (AA) extracted by the TKS and CM-SPAM algorithms are shown in the table below for three families: the S protein structure of SARS-CoV-2 (SSC2), the S protein structures of other viruses and organisms (SO), and protein (enzyme) structures for others (O).

PSAC-PDB was tested on the case study of SARS-CoV-2 spike protein structures, which are responsible for the entry of the virus into host cells. PSAC-PDB finds 388 similar protein structures to the query structure in PDB, and divides them into three families: S protein structures of SARS-CoV-2, S protein structures of other viruses and organisms, and other protein structures. PSAC-PDB uses four datasets based on amino acid sequences, aligned amino acids, aligned secondary structure elements, and frequent amino acid patterns for classification.

Results have shown that PSAC-PDB achieves higher accuracy, precision, recall, F1-score, MCC and AUC values for all three families of protein structures than state-of-the-art approaches for genome sequence classification. Here are some of the results showing this:

But what is also interesting is that PSAC-PDB shows that using frequent amino acid patterns or aligned amino acids can improve the classification performance compared to using only amino acid sequences or aligned secondary structure elements. Thus, PSAC-PDB can benefit the research community in structural biology and bioinformatics.

For more information, please see the paper. Besides, the datasets from this paper can also be found on Github: https://github.com/saqibdola/PSAC-PDB and implementations of sequential pattern mining algorithms used in the paper can be found in the SPMF data mining software.

Posted in Bioinformatics

DSSBA 2023, 2nd Special Session on Data Science for Social and Behavioral Analytics @DSAA

I am excited to announce that the 2nd Special Session on Data Science for Social and Behavioral Analytics (DSSBA 2023) will be held at the 10th IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA 2023) from October 9-13, 2023.

DSSBA 2023 is a forum for researchers and practitioners to present and discuss the latest advances on the design of efficient, scalable and effective solutions for analyzing social and behavioral data. Social and behavioral data are collected from various sources such as social networks, e-learning systems, e-commerce platforms, health care systems, etc. Analyzing these data can help us gain a better understanding of human behavior and social interactions, which can support decision making, personalized recommendation, behavior prediction, behavior change, etc.

The topics of interest for DSSBA 2023 include, but are not limited to:

  • Efficient and scalable algorithms for behavioral and social analytics
  • Evaluation of behavioral analytic models
  • Privacy-preserving techniques for behavioral and social analytics
  • Cognitive and social aspects of behavior analysis
  • Intelligent systems and services powered by behavioral and social data models
  • Applications of behavioral and social analytics in various domains such as education, health care, business, etc.
  • Pattern mining and machine learning models, etc.

We invite you to submit your original research papers to DSSBA 2023 by June 1, 2023. The submission guidelines and more information can be found on the DSSBA 2023 website. The accepted papers will be published in the DSAA 2023 proceedings.

We look forward to receiving your submissions and seeing you at DSSBA 2023!

Posted in cfp, Conference, Data Mining, Data science

Discovering the Top-K Stable Periodic Patterns in a Sequence of Events

In this blog post, I will give a brief introduction to the TSPIN paper about how to find stable periodic patterns in a sequence of events. This algorithm was presented in this research paper:

Fournier-Viger, P., Wang Y., Yang, P., Lin, J. C.-W., Yun, U. (2021). TSPIN: Mining Top-k Stable Periodic Patterns. Applied Intelligence. [source code & data]

What is a periodic pattern? Periodic patterns are sets of values or symbols that appear repeatedly in a sequence of events at more or less regular intervals. For example, in a customer transaction database, a periodic pattern may indicate that some customers buy milk every week or bread every two days. Discovering periodic patterns can thus help to understand customer behavior, optimize inventory management, forecast sales, and so on. For example, here is an illustration of a periodic pattern {a,c} that appears every two hours in a sequence of events, where a is for apple and c is for cake:

The idea of finding periodic patterns is interesting, as it can provide insights about the data. However, a major problem is that traditional algorithms for periodic pattern mining have two important limitations:

First, they use a very strict definition of periodicity that requires a pattern to always appear again within a fixed time limit. As a result, some meaningful patterns may be discarded if they are slightly irregular. For example, if a user wants to find periodic patterns that appear weekly, a traditional algorithm will discard a pattern if it does not appear for a week even though it may appear weekly for the rest of the year.

Second, several algorithms use a minimum support threshold to filter out infrequent patterns. This parameter is useful, but users generally do not know how to set it and must adjust it by trial and error to find a suitable value.

To overcome these limitations, the TSPIN (Top-k Stable Periodic pattern mINer) algorithm was designed. It combines two key ideas:

1) Stability: A measure of stability is used in TSPIN to evaluate how consistent a pattern is in terms of its periodicity. A stable pattern has relatively small variations in its occurrence intervals, while an unstable pattern has large variations. TSPIN finds stable periodic patterns but is designed to be less strict than traditional algorithms, so as to also find patterns that are slightly irregular (a small sketch illustrating this idea is given after point 2).

2) Top-k: A way of ranking patterns based on their frequency and selecting only the k most frequent ones. This avoids the need to specify a minimum support threshold.
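To give an intuition of what stability means, here is a small illustrative Java sketch. It is not the measure used in TSPIN (the paper defines stability more carefully, together with periodicity and support constraints); it simply computes the gaps between successive occurrences of a pattern and reports their mean and standard deviation, so a pattern with a small deviation can be viewed as more stable:

import java.util.List;

public class PeriodicityIllustration {

    // Given the timestamps at which a pattern occurs, compute the gaps (periods)
    // between successive occurrences and print their mean and standard deviation.
    static void describePeriodicity(List<Integer> timestamps) {
        int n = timestamps.size() - 1;
        double[] periods = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
            periods[i] = timestamps.get(i + 1) - timestamps.get(i);
            sum += periods[i];
        }
        double mean = sum / n;
        double variance = 0;
        for (double p : periods) variance += (p - mean) * (p - mean);
        System.out.println("mean period = " + mean
                + ", standard deviation = " + Math.sqrt(variance / n));
    }

    public static void main(String[] args) {
        // Occurrences roughly every 2 hours: deviation is 0 -> very stable
        describePeriodicity(List.of(1, 3, 5, 7, 9, 11));
        // Irregular occurrences: large deviation -> unstable
        describePeriodicity(List.of(1, 2, 9, 10, 20, 21));
    }
}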

The TSPIN algorithm is efficient and integrates several optimizations, which are explained in the research paper. That paper also describes several experiments on synthetic and real datasets. The results show that TSPIN has good performance and can reveal interesting patterns in real-life data.

There are several potential applications of TSPIN. For example, applying TSPIN on a dataset of web clickstream data may reveal that some users visit certain websites periodically with high stability, such as news portals or social media platforms. This information could then be used for personalized recommendation or advertising.

To try TSPIN, the code and datasets are available in the SPMF pattern mining library, a popular Java software for pattern mining that can also be used from other programming languages. SPMF also offers implementations of a dozen other algorithms for discovering periodic patterns, and hundreds of algorithms for finding other types of patterns such as sequential rules, sequential patterns, frequent itemsets and high utility itemsets.

Conclusion

In conclusion, TSPIN is an innovative algorithm for mining top-k stable periodic patterns in discrete sequences. It addresses some limitations of traditional algorithms by using a more flexible definition of periodicity and avoiding parameter tuning. It can discover meaningful patterns that reflect regular behaviors or habits in various domains.

Hope this blog post has been interesting. If you like this topic, you may also check my list of key papers on periodic pattern mining on this blog.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

Posted in Big data, Data Mining, Pattern Mining, spmf

Data disasters and more…

Today, I will not talk about a serious topic. Recently, I have been experimenting a little with generative AI tools for pictures during my free time. Some services can generate pretty nice pictures based on text prompts. For fun, I have combined the concept of “data” with different types of disasters such as “volcano”, “tsunami”, “thunderstorm” and other concepts. Here are a few generated pictures.

Data tsunami

Data volcano

Data thunderstorm

Data tornado

Data mountain

Data vortex

Data waterfall

Interesting?

Actually, it did not take much time to generate these pictures and it is possible to generate many other variations by using the same prompts. But if you use the above pictures as they are, please give credit to this blog and link to this page.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.

Posted in Data Mining, Data science, General, Machine Learning

UDML 2023 workshop!

I’m excited to announce that I am co-organizing a new edition of the Utility-Driven Mining and Learning (UDML) workshop, which will be colocated with the IEEE International Conference on Data Mining (ICDM) 2023 in Shanghai, China.

The UDML workshop aims to bring together researchers and practitioners who are interested in developing and applying utility-driven methods for data mining and machine learning. Utility-driven methods are those that consider not only the accuracy or interestingness of the results, but also their usefulness or value for specific applications or users. For example, utility-driven methods can take into account user preferences, constraints, costs, benefits, risks, or trade-offs when mining or learning from data.

The UDML workshop will feature an invited talk by a leading expert in the field, as well as contributed papers that showcase novel research ideas, methods, systems, applications, and challenges related to utility-driven mining and learning. The workshop will also provide a platform for networking and discussion among researchers and practitioners who share a common interest in this topic.

If you are working on or interested in utility-driven mining and learning, I invite you to submit your paper or poster to the UDML workshop by September 1st 2023. The submission guidelines and topics of interest can be found on the workshop website: https://www.philippe-fournier-viger.com/utility_mining_workshop_2023/

I also encourage you to attend the UDML workshop on December 4th 2023 at ICDM 2023 in Shanghai. You will have the opportunity to learn from the presentations, interact with the authors of the accepted papers and posters, and network with other participants.

I hope to see you at UDML 2023!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 250 algorithms for pattern mining.

Posted in Big data, Conference, Data Mining, Data science, Pattern Mining

How to Detect and Classify Metamorphic Malware with Sequential Pattern Mining (MalSPM)

Malware are malicious software that can harm computers and networks by stealing data, encrypting files, or damaging devices. Malware are a serious threat to cybersecurity, especially when they can change their appearance to evade detection by antivirus software. This is called metamorphic malware, and it is a challenging problem for malware analysis and classification.

In this blog post, I will describe a new method called MalSPM (Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining) that can detect and classify metamorphic malware based on their behavior during execution. The MalSPM method was presented in a research paper that you can read for more details:


Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022) MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier, to appear

I will now explain what the main features of metamorphic malware are, how MalSPM analyzes them using sequential pattern mining (SPM), and what the advantages of using MalSPM are.

What are metamorphic malware?

Metamorphic malware are malware that can modify their code or structure without changing their functionality. This means that they can produce different variants of themselves that look different but behave the same. For example, a metamorphic virus can change its encryption algorithm or insert junk code into its body to avoid being recognized by signature-based antivirus software.

Metamorphic malware pose a serious challenge for malware detection and classification because they can bypass static analysis techniques that rely on code similarity or predefined patterns. Therefore, dynamic analysis techniques that monitor the behavior of malware during execution are more suitable for dealing with metamorphic malware.

How does MalSPM analyze metamorphic malware?

MalSPM is a method that uses sequential pattern mining to analyze and classify metamorphic malware based on their behavior during execution. SPM is a data mining task that consists of finding frequent subsequences in a dataset of sequences. In the case of MalSPM, SPM was applied to a dataset that contains sequences of API calls made by different malware on the Windows operating system (OS). This makes it possible to extract patterns representing the characteristics of different malware families. API calls are functions provided by the OS that allow applications to perform various tasks such as accessing files, creating processes, or sending network packets. API calls are an attractive and distinguishable feature for malware analysis and detection because they reflect the actions of executable files.

MalSPM first applies SPM algorithms to find patterns indicating frequent API calls in the dataset. These patterns can be of different types such as sequential rules between API calls as well as maximal and closed sequences of API calls. These patterns represent common behaviors of different types of malware such as ransomware, trojan, and worm. For example, here are a few sequential patterns that were extracted by MalSPM from the dataset of malware API calls:

Each line in this table is a pattern. For example, the first line indicates that a frequent pattern for a malware is to call the API NtClose, followed by NtQueryValueKey, and then NtClose again, and that this pattern appears in 919 malware sequences.
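To illustrate the kind of counting that sequential pattern mining performs here, below is a small Java sketch (not the actual MalSPM or SPMF code) that checks whether an API call sequence contains a given pattern as a subsequence, and counts in how many sequences of a toy dataset the pattern appears:

import java.util.List;

public class ApiCallPatternSupport {

    // True if 'pattern' appears in 'sequence' as a subsequence (same order, gaps allowed)
    static boolean containsSubsequence(List<String> sequence, List<String> pattern) {
        int j = 0;
        for (String call : sequence) {
            if (j < pattern.size() && call.equals(pattern.get(j))) j++;
        }
        return j == pattern.size();
    }

    public static void main(String[] args) {
        List<List<String>> dataset = List.of(
                List.of("NtClose", "NtOpenKey", "NtQueryValueKey", "NtClose"),
                List.of("NtCreateFile", "NtClose", "NtQueryValueKey", "NtWriteFile", "NtClose"),
                List.of("NtOpenKey", "NtQueryValueKey"));

        List<String> pattern = List.of("NtClose", "NtQueryValueKey", "NtClose");

        long support = dataset.stream()
                .filter(seq -> containsSubsequence(seq, pattern))
                .count();
        System.out.println("Pattern " + pattern + " appears in " + support + " sequences"); // 2
    }
}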

Then, after extracting the patterns, MalSPM uses them for the classification of different malware. This is done by using the discovered patterns as features to train classifiers. In the paper, the performance of seven classifiers was compared using various metrics. Moreover, the performance of MalSPM was compared with state-of-the-art malware detection methods, and it was found that MalSPM outperformed these methods.

Here is a picture that illustrates the overall process of malware detection using MalSPM.

What are the benefits of using MalSPM?

MalSPM has several benefits for malware detection and classification.

First, it can handle metamorphic malware that can change their appearance by focusing on their behavior rather than their code.

Second, it can discover common and specific behaviors of different types of malware by using SPM techniques.

Third, it can achieve high accuracy and efficiency by using effective pruning strategies and database projection. Fourth, it can provide interpretable results by using sequential rules and patterns that can explain the logic behind the classification.

Code and datasets

For more details, please see the research paper of MalSPM. The datasets can be found here. And if you want to try the algorithms for extracting sequential patterns from that paper, please see the SPMF data mining software, which offers very fast implementations of those algorithms.

Conclusion

In conclusion, MalSPM is a novel method that uses sequential pattern mining to analyze and classify metamorphic malware based on their behavior during execution. It can deal with the challenges posed by metamorphic malware and provide useful insights for cybersecurity researchers and practitioners. For more details, please see the research paper.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

Posted in Data Mining, Industry, Pattern Mining, spmf

How to propose a special issue for a journal?

Today, I will talk about how to propose a special issue for a journal.

What is a special issue?

A special issue is a collection of articles on a specific topic or theme that is published in a journal. Special issues can provide an opportunity to present the latest research, highlight emerging trends, or address gaps in the literature. If you have an idea for a special issue, you may want to apply to be a guest editor and organize it.

How to organize a special issue?

You can follow these steps:

1. Select a journal that is relevant to your topic and has a good reputation, or is good according to other criteria that matter to you (e.g., a high impact factor, high citation rate, etc.). You should choose a journal that is within the scope of your field and reaches a large audience. You can consult the websites of big publishers to search for journals in your field and compare their metrics and scope.

2. Consult the journal’s website for the guidelines and requirements for special issue proposals. Some journals may have a specific format, template, or online submission system for proposals. Others may ask you to email your proposal to the editor-in-chief or a designated contact person.

3. Prepare a proposal that outlines the rationale, objectives, scope, and expected outcomes of your special issue. You should also include a tentative title, a list of potential topics and keywords, a timeline for submission and publication, and a brief introduction of yourself and any co-editors. You should also demonstrate the significance, relevance, and timeliness of your special issue, and how it will benefit the readers and the field. You may also need to provide some sample papers or abstracts to demonstrate the quality and relevance of your special issue.

4. Submit your proposal to the journal and wait for their response. The journal editors will review your proposal and decide whether to accept, reject, or request revisions. They may also ask you to provide more information, such as the names and affiliations of potential authors and reviewers, or a detailed budget and funding sources.

5. If your proposal is accepted, you will be responsible for managing the editorial process of your special issue. This includes inviting authors, soliciting submissions, coordinating peer review, editing and proofreading manuscripts, and ensuring that the special issue meets the journal’s standards and deadlines. You will also need to communicate with the journal editors and staff, and follow their instructions and policies.

Applying for a special issue can be a rewarding and challenging experience. It can help you showcase your expertise, expand your network, and contribute to the advancement of your field. However, it also requires a good amount of time, effort, and commitment. Therefore, you should carefully consider your motivation, resources, and expectations before applying for a special issue.

Conclusion

That is all for today. This was just a short blog post about how to propose a special issue for a journal.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms. 

Posted in Academia

ASCII Art for SPMF

Today, just for fun, I generated some ASCII arts for the SPMF data mining library using this tool: http://textkool.com/en/ascii-art-generator?hl=default&vl=default&font=Red%20Phoenix&text=Your%20text%20here%20

I think the result is not bad but I am not sure if I would use it:

   _____ _____  __  __ ______ 
  / ____|  __ \|  \/  |  ____|
 | (___ | |__) | \  / | |__   
  \___ \|  ___/| |\/| |  __|  
  ____) | |    | |  | | |     
 |_____/|_|    |_|  |_|_|     

 ___    ___           ___   
(  _`\ (  _`\ /'\_/`\(  _`\ 
| (_(_)| |_) )|     || (_(_)
`\__ \ | ,__/'| (_) ||  _)  
( )_) || |    | | | || |    
`\____)(_)    (_) (_)(_)    

  ____  ____  __  __ _____ 
 / ___||  _ \|  \/  |  ___|
 \___ \| |_) | |\/| | |__   
  ___) |  __/| |  | |  __|  
 |____/|_|   |_|  |_|_|   

   ______..______  .___  ___.  ______ 
  /      ||   _  \ |   \/   | |   ___|
  |  (---`|  |_)  ||  \  /  | |  |__   
   \   \  |   ___/ |  |\/|  | |   __|  
.---)   | |  |     |  |  |  | |  |     
|______/  | _|     |__|  |__| |__|     
                                           

Which of the above do you prefer? Let me know in the comment section, below!
For me, I think I prefer the third one.

Posted in spmf

Efficiency problems with using java.util.HashMap

Today, I will talk about some efficiency problems when using the java.util.HashMap data structure provided by Java. This is an important topic for programmers who aim to implement efficient software that relies on maps.

Brief review about HashMap

First, let's review why HashMap is important and what is special about this class. HashMaps are important because they allow us to store and retrieve data efficiently and conveniently. More precisely, they store data in the form of key-value pairs, where each key is unique. The main feature of a HashMap is to allow retrieval of a value in almost constant time O(1) if we know the key. Key-value pairs are stored in a hash table using a hash function. The most important operations of a HashMap are:

put(key, value): add a key-value pair to the map
get(key): retrieve the value associated with a key
remove(key): remove a key-value pair from the map using its key

Another useful operation is:

containsKey(key): check whether a key is in the map
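As a quick reminder of how these operations are used with the standard java.util.HashMap:

import java.util.HashMap;

public class HashMapBasics {
    public static void main(String[] args) {
        HashMap<String, Integer> stock = new HashMap<>();
        stock.put("bread", 12);                       // add a key-value pair
        stock.put("milk", 5);
        System.out.println(stock.get("bread"));       // retrieve a value: 12
        System.out.println(stock.containsKey("jam")); // check for a key: false
        stock.remove("milk");                         // remove a key-value pair
    }
}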

Efficiency problems

Now, I will talk about some efficiency problems when using the java.util.HashMap implementation provided by Java.

(1) Handling of primitive types. First, many people will use a HashMap to store values from variables of primitive types such as int, long, double, float or short. For example, one could write:

int x =0;
int y =5;
HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
map.put(x,y);

This will create a new key-value pair (0,5) in a HashMap for storing integers. This works, but what is the problem with it? It is that Java has to convert values of type int to Integer, and we might have to convert again from Integer to int after retrieving a value from the map. This process, called boxing and unboxing, is done automatically in Java, but it results in the unnecessary creation of objects and thus higher memory consumption.

Solution: A solution is to use an alternative implementation of HashMap that can directly store primitive types. Such implementations exist online, and this will also be the case with some new data structures proposed in the next version of the SPMF software. In some cases, I have observed that using a custom map for primitive types can reduce memory use by up to 50%.
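To show the idea, here is a minimal illustrative sketch (not the implementation that will ship with SPMF) of an int-to-int map using open addressing with linear probing. It stores keys and values in plain int arrays and therefore avoids boxing entirely; for simplicity it omits resizing and deletion and assumes non-negative keys (the value -1 marks an empty slot):

// Minimal illustrative int-to-int map with open addressing (linear probing).
// Assumptions: keys are non-negative and the map never becomes completely full.
public class IntIntMap {
    private static final int EMPTY = -1;
    private final int[] keys;
    private final int[] values;

    public IntIntMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        java.util.Arrays.fill(keys, EMPTY);
    }

    // Find the slot where 'key' is stored, or the empty slot where it would go
    private int indexOf(int key) {
        int i = ((key * 0x9E3779B9) >>> 16) % keys.length;  // simple hash
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) % keys.length;                      // linear probing
        }
        return i;
    }

    public void put(int key, int value) {
        int i = indexOf(key);
        keys[i] = key;
        values[i] = value;
    }

    // Return the value associated with 'key', or 'defaultValue' if it is absent
    public int getOrDefault(int key, int defaultValue) {
        int i = indexOf(key);
        return keys[i] == key ? values[i] : defaultValue;
    }
}

The point is simply that two int arrays replace one Entry object per pair, which is where the memory savings come from.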

(2) Rehashing. A lot of interesting things can be learnt by looking at the source code of java.util.HashMap. One thing we can learn is that it is implemented as an array of buckets, where collisions are handled using a linked list for each bucket. And when the number of entries exceeds a threshold determined by the load factor (by default, 75% of the capacity), the map is automatically resized to twice its size. This doubling reduces the number of collisions and thus maintains fast access to the key-value pairs in the map. However, this requires rehashing, that is, recomputing the bucket of each key and reinserting all elements into the new array. Luckily, this is done quite efficiently, as the objects representing the entries of the HashMap are reused when it is resized rather than new Entry objects being created. However, resizing a map still has a certain cost.

Solution: To reduce the cost of rehashing when using java.util.HashMap, it is recommended to initialize the HashMap to a proper size using a constructor such as:

    /**
     * Constructs an empty HashMap with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     */
    public HashMap(int initialCapacity, float loadFactor);

For example, if we expect a HashMap to contain 10,000 elements, we can set the initial capacity (and, if needed, the load factor) so that 10,000 entries fit without triggering a resize.
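For instance, with the default load factor of 0.75, a map expected to hold about 10,000 entries can be pre-sized with the usual expectedSize / loadFactor rule of thumb:

import java.util.HashMap;

public class PresizedMapExample {
    public static void main(String[] args) {
        int expectedSize = 10_000;
        float loadFactor = 0.75f;
        // Capacity large enough that 'expectedSize' entries never trigger a resize
        int initialCapacity = (int) (expectedSize / loadFactor) + 1;
        HashMap<Integer, Integer> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedSize; i++) {
            map.put(i, i * 2);   // no rehashing happens while filling the map
        }
        System.out.println("Entries: " + map.size());
    }
}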

(3) Handling collisions. As said above, HashMap handles collisions using a linked list for each bucket. And when there are too many collisions in a bucket, HashMap can switch to a tree structure for that bucket to reduce the time complexity of accessing entries. But generally, we should aim at reducing collisions.

Solution: To reduce collisions, one way is to choose an appropriate hash function that spreads keys across different buckets. To do this, we can define a custom hashCode() method. For example, suppose we want to use instances of a class called Book as keys in a map. Then we can define an appropriate hashCode() method in the class Book (kept consistent with equals()), and the HashMap will use it to determine where the key-value pairs are stored. A good hash function reduces collisions.
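Here is a small illustrative sketch of such a Book class (the fields and the hashing choice are just an example). Note that hashCode() and equals() must always be overridden together and agree with each other:

import java.util.HashMap;
import java.util.Objects;

public class Book {
    private final String isbn;
    private final String title;

    public Book(String isbn, String title) {
        this.isbn = isbn;
        this.title = title;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Book)) return false;
        Book other = (Book) o;
        return isbn.equals(other.isbn) && title.equals(other.title);
    }

    @Override
    public int hashCode() {
        // Combines the same fields used by equals(); Objects.hash() is a convenient default
        return Objects.hash(isbn, title);
    }

    public static void main(String[] args) {
        HashMap<Book, Integer> stock = new HashMap<>();
        stock.put(new Book("978-0-13-468599-1", "Effective Java"), 3);
        // The lookup works because equals() and hashCode() identify equal books
        System.out.println(stock.get(new Book("978-0-13-468599-1", "Effective Java"))); // 3
    }
}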

(4) Inefficiency in how the HashMap is used. A big problem with HashMap is often not the structure itself but how the programmer uses it to store and access key-value pairs. Let me show a first example of that problem. Here is some code that stores key-value pairs of type Integer, Integer in a HashMap:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
// Put some value 6 in the map for the key 5
map.put(5,6);
...
// Retrieve the value for key 5 and increase it by 1.
Integer value = map.get(5);
if(value == null){
   map.put(5, 1);
}else{
   map.put(5, value+1);
}

This code first puts a key-value pair (5,6) in the map. Then, later, it retrieves the value for key 5, increases it by 1, and puts it back in the map as (5,7).

What is wrong with this code? Well, when map.get() is called, the hash code of 5 is computed and the key is searched in the HashMap. Then, when map.put() is called, the hash code of 5 is computed again and the key is searched again. Doing this work twice is unnecessary.

Solution: A better solution would be a method that directly gets the value in the map and increases it by the required amount, so as to avoid searching twice. In early versions of Java (< 1.8), I did not see an efficient way of doing this with the HashMap class provided by java.util. In recent Java versions, this can be achieved with methods such as merge(), compute() or computeIfPresent(). It can also be achieved with a custom HashMap class. This is what is done, for example, in the HashMap class that will be included in the next version of the SPMF software, with a method to do this directly:

void getAndIncreaseValueBy(int key, int valueToAdd);  // Not in HashMap.
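For reference, here is how the same increment can be written with the standard java.util.HashMap on Java 8 and later; merge() handles both the absent and present cases without separate get() and put() calls:

import java.util.HashMap;

public class IncrementExample {
    public static void main(String[] args) {
        HashMap<Integer, Integer> map = new HashMap<>();
        map.put(5, 6);
        // If key 5 is absent, store 1; otherwise add 1 to the existing value
        map.merge(5, 1, Integer::sum);
        System.out.println(map.get(5)); // 7
    }
}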

Here is a second example of inefficient programming with HashMap that I often see in the code of programmers who are not familiar with Java. The problem is to call the containsKey(), get() and put() methods several times. For example:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
if(!map.containsKey(5)){
    map.put(5,6);
}

What is the problem here? Well, by calling containsKey() and then put(), the program searches twice for the key 5 in the HashMap.

Solution: In this case, the solution is simple. We can just rewrite it like this:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
 map.put(5,6);

This code achieves the desired effect with a single lookup: if key 5 is not in the map, the pair (5,6) is created, and if it is, its value is replaced by 6. Note that this is not strictly equivalent to the original code, which left an existing value untouched; if that behavior matters, map.putIfAbsent(5, 6) performs the check and the insertion in a single call. How do I know this? Because I have read the source code of HashMap to understand how it works 😉

A third example of inefficient programming with HashMap occurs when someone needs to iterate over the keys and values of a HashMap. For example, let's say that we have a map called map. One could call map.entrySet().iterator() to obtain an iterator over all the entries of the map, map.keySet().iterator() to obtain an iterator over all the keys, or map.values().iterator() to obtain an iterator over all the values. This is fine if we are careful about how we use these views. An example of inefficient code is this:

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
...
for(Integer key:  map.keySet()){
    Integer value = map.get(key);
     ...
    System.out.println(" key-value: " + key + " - " + value);
}

What is the problem with this code? It iterates over the keys and then calls the get() method to obtain the corresponding value of each key. This is unnecessary.

Solution: To avoid calling get(), we should instead loop over map.entrySet():

HashMap<Integer,Integer> map = new HashMap<Integer,Integer>();
for(Map.Entry<Integer, Integer> entry : map.entrySet()){
    System.out.println(" key-value: " + entry.getKey() + " - " + entry.getValue());
}

Using entrySet() we immediately obtain the corresponding value of each key without calling get().

Conclusion

java.util.HashMap is a very useful structure in Java, but it should be used carefully. For this, I recommend reading the source code of HashMap to see how it is implemented (or at least reading the Javadoc) rather than just using its methods without thinking. By reading the source code or documentation, we can understand how it works and write more efficient code.

Besides, the HashMap class is powerful, but it also has some limitations. An example of a limitation highlighted above is the inability to handle primitive values without auto-boxing (e.g., transforming int to Integer, or short to Short). To solve this problem, the solution is to use custom HashMap classes that handle primitive types.

Another issue with HashMap is that it provides a lot of features, and sometimes all these features are not needed. Thus, a lightweight custom HashMap implementation can sometimes be more worthwhile when the goal is efficiency. For example, in the next version of the SPMF software that will be released soon, some well-optimized custom HashMap classes for primitive types will be provided to increase performance. I have observed that in some cases this can reduce memory usage by over 50% for data mining algorithms that make intensive use of maps. But there are also drawbacks to using custom HashMap classes instead of the default one provided with Java. One of them is that it may make the code a little harder to understand.

==
Philippe Fournier-Viger is a full professor  and the founder of the open-source data mining software SPMF, offering more than 250 data mining algorithms. 

Posted in Java, Programming