Choosing data structures according to what you want to do

Today, I am writing a post about programming. I want to share a simple but important idea for writing optimized code: choose data structures according to what you want to do with the data rather than what you want to store. The idea is simple, but I am writing this post because it addresses a common beginner's misconception, which is to think of data structures solely in terms of what they can store.

For example, a beginner programmer may think that s/he should use an array or a list because s/he wants to store some items in a given order, or simply because s/he wants to store a set of single values. To store two-dimensional data, a simple idea is to use a two-dimensional array, and so on. This simple reasoning is fine for implementing a program that works.

However, to write an optimized program, it is important to think further about how the data will be used. For example, consider that you need to write a program that stores a long list of integer values that is updated periodically (add and remove), and where you want to quickly find the minimum and maximum values. If the programmer thinks only about what s/he needs to store, s/he may decide to use an array. If the programmer thinks in terms of what s/he wants to do with the data, s/he may decide to use a list (an array that is dynamically resized) because add and remove operations will be performed periodically. This could be a better solution, but finding the minimum or maximum of an unsorted list still requires scanning all n values. If the programmer thinks further in terms of what s/he wants to do with the data, s/he may decide to use a red-black tree, which guarantees an O(log(n)) worst-case time cost for the four operations add, remove, minimum and maximum. This could be a much better solution!
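To make this concrete, here is a minimal sketch in Java of the red-black tree option. It relies on java.util.TreeMap, which the Java documentation describes as a red-black tree based implementation. The class name MinMaxBag and its methods are hypothetical names chosen for this illustration (they are not taken from TopKRules), and storing a count per value is just one possible way to allow duplicate integers.

```java
import java.util.TreeMap;

// Hypothetical helper for illustration: a bag of integers backed by
// java.util.TreeMap, which is implemented as a red-black tree.
class MinMaxBag {
    // value -> number of occurrences, so duplicate values are allowed
    private final TreeMap<Integer, Integer> counts = new TreeMap<>();
    private int size = 0;

    // O(log n): add one occurrence of value
    void add(int value) {
        counts.merge(value, 1, Integer::sum);
        size++;
    }

    // O(log n): remove one occurrence of value; returns false if absent
    boolean remove(int value) {
        Integer c = counts.get(value);
        if (c == null) {
            return false;
        }
        if (c == 1) {
            counts.remove(value);
        } else {
            counts.put(value, c - 1);
        }
        size--;
        return true;
    }

    // O(log n): smallest value (throws NoSuchElementException if empty)
    int min() {
        return counts.firstKey();
    }

    // O(log n): largest value (throws NoSuchElementException if empty)
    int max() {
        return counts.lastKey();
    }

    int size() {
        return size;
    }
}

class MinMaxBagDemo {
    public static void main(String[] args) {
        MinMaxBag bag = new MinMaxBag();
        for (int v : new int[] {42, 7, 19, 7, 88}) {
            bag.add(v);
        }
        System.out.println(bag.min() + " " + bag.max()); // prints "7 88"
        bag.remove(7);  // one occurrence of 7 remains
        bag.remove(88); // the maximum is removed
        System.out.println(bag.min() + " " + bag.max()); // prints "7 42"
    }
}
```

With an array or a list, the two calls to min() and max() would each need a full scan unless the data is kept sorted, and keeping it sorted makes add and remove more expensive.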

It is therefore important to take the time to find appropriate data structures if one wants to write optimized code. Also note that execution time is important, but memory usage is sometimes very important too.

To show an example of the impact that choosing appropriate data structures has on performance, I compare here three versions of TopKRules, an algorithm for mining top-k association rules in a transaction database. TopKRules needs to store a list of candidates and a list of the k best rules, and to perform add, remove, minimum and maximum operations on them. Furthermore, it needs to be able to quickly compute the intersection of two sets of integers. The next chart compares the execution times of the three versions of TopKRules on a dataset called mushrooms, as the parameter k increases and the problem becomes more difficult.

  • Version A is TopKRules implemented with lists.
  • Version B is TopKRules implemented with bitsets to quickly perform the intersection by doing the logical AND operation.
  • Version C is TopKRules implemented with bitsets, plus red-black trees for storing the candidates and the k best rules, to quickly perform add, remove, minimum and maximum.
Figure: Optimization of TopKRules

As you can see from this chart, there is quite a large improvement in performance by using appropriate data structures!
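The bitset idea used in Version B can be sketched as follows. This is only an illustration using java.util.BitSet, not the actual TopKRules code: the class name BitSetIntersection, the helper methods and the sample values are all made up for the example, and I assume that the integers are small non-negative numbers (for instance transaction identifiers), so that each integer can be used as a bit position.

```java
import java.util.BitSet;

// Illustrative sketch: set intersection with bitsets and a logical AND.
class BitSetIntersection {
    // Build a bitset where bit i is set for each id i (ids must be >= 0)
    static BitSet of(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    // Intersection by logical AND; a is copied so the inputs stay unchanged
    static BitSet intersect(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = of(1, 3, 4, 8, 10);
        BitSet b = of(3, 5, 8, 9, 10);
        BitSet common = intersect(a, b);
        System.out.println(common);               // prints "{3, 8, 10}"
        System.out.println(common.cardinality()); // prints "3"
    }
}
```

The AND operation works on whole machine words, so many elements are compared at once, instead of comparing the elements of two lists one by one.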

That’s all I wanted to write for today. Hope that you enjoyed this post. If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts.  Also, if you want to support this blog, please tweet and share it!

