How to design memory-efficient data mining algorithms in Java?

A while ago, I had written a blog post about How to measure the memory usage of algorithms in Java. Today, I will discuss the topic of optimizing the memory usage of algorithms written in Java to design memory-efficient data mining algorithms. The reason for writing about this topic is that  memory consumption is often a challenge in Java for implementing data mining algorithms.

memory efficiency

Brief overview of Java memory management

I will first give a brief overview of Java memory management model. In Java unlike in many other language such as C++, the user generally does not have the power to finely control how the memory is managed. The user can allocate memory by create some objects or variables. However, there is no simple way to control when the memory will be freed. In Java, the Garbage Collector (GC) is the process responsible for automatically freeing the memory. Using a GC has the advantage of making programming in Java easier and to avoid memory leaks and other memory related problems. However, using a GC makes the the memory usage much less predictable. The reason is that there  is absolutely no guarantee about when the GC will free the memory used by a Java program. The GC periodically checks references to objects and when an object is not referenced anymore, it may be freed. There is a common myth that if someone calls System.gc() the garbage collector will immediately free the memory. However it is not the case.

So how to optimize memory usage in Java?

There many ways to optimize memory usage.  I will discuss a few principles for optimizing memory usage in Java below, and I will then provide a detailed example with some Java code.

1) Using memory-efficient data structures. A first principle for optimizing memory usage is to use memory efficient data structures when possible.  For example, one may consider using an array of integers instead of an ArrayList because ArrayList introduces some memory overhead. Another example: one may uses int instead of Integer.  But there is sometimes a trade-off between memory usage and execution time. Thus, one should not just think about memory when choosing a data structure but also about execution time.

2) Reducing object creation. An important thins to know in Java is that garbage collection is VERY costly. In particular, if a Java program reaches the memory limit of Java, then the program may suddenly become very slow because of the GC (see my previous blog post explaining this issue). Thus, a very good strategy to optimize Java algorithms is to design the algorithms such that variables/objects  are reused as much as possible (when it makes sense) rather than creating new variables/objects. If less variables/objects are created, then less memory will be used and the GC will have to work less, which may also improves speed For example, imagine a for loop that is repeatedly creating objects. It is sometimes possible to declare a single variable/object outside the for loop and reuse the same object. Again, whether it should be done or not depends on the overall picture. In general, one should focus on optimizations that are meaningful and not do micro-optimizations. Moreover, performing optimizations should ideally not decrease the code readability or maintainability.

A variable or object that is reused can be called a buffer object.

A detailed example.

I will now present a detailed example showing how the two above principles can be used to improve the memory efficiency of some code. The example that I will provide is abstract and can be applied to most depth-first search pattern mining algorithms.  The solution that I will present was applied data mining algorithm implementations of the SPMF data mining library written in java..

Consider the following code. A List of Integer is first created. Then a recursive method is called. This recursive methods copy the list, add an element to the list and then recursively call itself. This method is not memory efficient since every time that it is called it will create a new List object.  This can be quite slow because of object creation time. But moreover, the GC will have to free all these objects, which will also slow down the program.

static public void  main(String[] args){
		List<Integer> list = new ArrayList<Integer>();
		recursiveMethod(list);
	}


	static public void recursiveMethod(List<Integer> list) {
		// make a copy of the list
		List<Integer> copyOfList = new ArrayList<Integer>();
		copyOfList.addAll(list);
		

		// Add some new integer to the list
		int integer = ... ;
		// ...
		copyOfList.add(integer);
		
		//.... the method is called recursively in a for loop
		recursiveMethod(copyOfList);
	}

Now, let’s look at a better solution, below.  First, instead of using a List of Integer, an array of integer is used. This is already better in terms of memory since it is a more simple data structure. Second, the array of integers is used as a buffer object. Instead of repeatedly creating List objects, the same buffer is always reused.  The buffer is initialized with a large enough size (for example 500 entries).  This version of the code is much faster because (1) it is not necessary to always create objects, (2) we don’t copy list anymore, (3) a single item is copied for each recursive call!

static public void  main(String[] args){
		int[] buffer = new int[500];
		int currentSize = 0;
		recursiveMethod(buffer, currentSize);
	}


	static public void recursiveMethod(int[] buffer, int bufferSize) {
		// Add some new integer to the list
		int integer = ...;
		buffer[bufferSize++] = integer;
		
		//....  the method is called recursively in a for loop
		recursiveMethod(buffer, bufferSize);
	}

The above solution is extensively used in  algorithm implementations of the SPMF data mining library. In some cases, this allowed to reduce memory usage by half.

Conclusion

In this blog post, I have discussed the problem of designing memory-efficient algorithms in Java. I have presented a few principles and then presented a detailed example of how to optimize data mining algorithms in Java. Hope you have enjoyed that post. In future blog post, I will discuss more examples of memory optimizations, if there is enough interest on this topic!

==

That is all I wanted to write for now. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.

Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 52 data mining algorithms.

Related posts:

This entry was posted in Academia, Big data, Data Mining, Data science, Programming. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *