This weekend, I attended the WICON 2017 conference in Tianjin, China to present a research paper about the application of data mining to analyze data from water meters installed in the City of Moncton, Canada. In this post, I will give a brief overview of the WICON 2017 conference.
About the conference
This was the 10th edition of the WICON conference (International Conference on Wireless Internet), which is organized by the EAI (European Alliance for Innovation). The conference is about networking and is attended by researchers from both computer science and engineering. This year, the conference was held at the Holiday Inn hotel in Tianjin, China, a major Chinese city of historical importance.
Tianjin, China
The conference was held in a room on the fifth floor of the hotel. There seemed to be 30 to 40 attendees at the conference opening:
The conference opening
Conference proceedings
The conference proceedings are published by Springer, which ensures some visibility and that the papers will be indexed in various publication databases. However, at the time of the conference, the proceedings were not available on USB or as a book. Only a download link was provided to the attendees the day before the conference. It was said that the reason for not providing a USB drive or book is that the proceedings would be published after the conference. Each attendee received a conference bag and a booklet containing the conference program.
Acceptance rate
Regarding the acceptance rate, it was announced that 85 papers were received from 10 countries. Of those, 15 were rejected before the review process, and 43 papers were then accepted. Thus, the acceptance rate was about 62% (43 of the 70 papers that went through the review process).
Although this is an international conference, the majority of the papers were from China. In the program, there were about two papers with a majority of authors from Canada, one from Morocco, and one from Senegal.
The program committee of the conference is quite international though, and the conference has been held in various countries including the United States, Hungary, and Portugal. But in recent years, this conference seems to have become more focused on China, as four of the last five editions were held in China.
Best paper award
Four papers were nominated for the best paper award.
According to the conference booklet, the best paper award was to be announced at the conference gala/dinner held in the evening of the first day. But it was actually announced on the morning of the second day. I did not take note of who the winners were.
Conference gala/dinner
According to the booklet, a conference gala/dinner was to be held from 19:00 to 22:00. However, the ticket said 18:00 to 20:30, which created some confusion. I asked some people at 17:50 (at the end of the technical session) and they told me to go to the hotel restaurant at 18:00. Thus, I was there from 18:00 to 19:20, sitting at a table and eating while a few people from the conference were eating at other tables. Then, at 19:00, a few more people from the conference started to arrive (those who thought the gala would start at 19:00). But there were no announcements or gala activities (at least during the time that I was sitting there), so I eventually went back to work in my hotel room.
Topics of the conference
The conference is about networks. It had various tracks. The papers were about topics such as: clustering for spectrum resource allocation, clustering applied to monitor wireless sensor networks, evaluation of multi-channel CSMA, MIMO and mmWave, Vehicular networks (VANETS), routing protocols, security and internet of things, swarm of drones for intrusion detection, analysis of mobile traffic, networking algorithms and protocols, and cloud and big data networking using fuzzy logic, haze prediction and clustering.
There were two keynotes by well-known researchers in the field of networking. One of them was about applying deep learning to route traffic, which is an interesting approach that appeared to give good results. However, some challenges remain: the examples were given on a small network (about 16 nodes), and it was said that the computing cost is very high (something that might only become practical 20 to 30 years from now). For the first keynote, I tried to recall the title, but could not find it or the abstract in the conference booklet or on the website.
Conclusion
Overall, it was interesting to attend this conference, although I must say that it is a little bit outside my field (I usually attend artificial intelligence and data mining conferences). The conference is not very big, so the opportunities for networking are limited compared to larger conferences. A good point is that the proceedings are published by Springer.
— Philippe Fournier-Viger is a full professor and the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my Twitter account @philfv to get notified about new posts.
This blog post provides an introduction to the Apriori algorithm, a classic data mining algorithm for the problem of frequent itemset mining. Although Apriori was introduced more than 20 years ago, it remains one of the most important data mining algorithms, not because it is the fastest, but because it has influenced the development of many other algorithms.
The problem of frequent itemset mining
The Apriori algorithm is designed to solve the problem of frequent itemset mining. I will first explain this problem with an example. Consider a retail store selling some products. To keep the example simple, we will consider that the retail store sells only five types of products: I = {pasta, lemon, bread, orange, cake}. We will call these products “items”.
Now, assume that the retail store has a database of customer transactions:
This database contains four transactions. Each transaction is a set of items purchased by a customer (an itemset). For example, the first transaction contains the items pasta, lemon, bread and orange, while the second transaction contains the items pasta and lemon. Moreover, note that each transaction has a name called its transaction identifier. For example, the transaction identifiers of the four transactions depicted above are T1, T2, T3 and T4, respectively.
The problem of frequent itemset mining is defined as follows. To discover frequent itemsets, the user must provide a transaction database (as in this example) and must set a parameter called the minimum support threshold (abbreviated as minsup). This parameter represents the minimum number of transactions in which an itemset must appear to be considered a frequent itemset and be shown to the user. I will explain this with a simple example.
Let’s say that the user sets the minsup parameter to two transactions (minsup = 2). This means that the user wants to find all sets of items that are purchased together in at least two transactions. Those sets of items are called frequent itemsets. Thus, for the above transaction database, the answer to this problem is the following set of frequent itemsets:
All these itemsets are considered to be frequent itemsets because they appear in at least two transactions from the transaction database.
Now let’s be a little bit more formal. How many times an itemset is bought is called the support of the itemset. For example, the support of {pasta, lemon} is said to be 3 since it appears in three transactions. Note that the support can also be expressed as a percentage. For example, the support of {pasta, lemon} could be said to be 75% since pasta and lemon appear together in 3 out of 4 transactions (75% of the transactions in the database).
Formally, when the support is expressed as a percentage, it is called a relative support, and when it is expressed as a number of transactions, it is called an absolute support.
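To make these definitions concrete, here is a minimal Python sketch of how the absolute and relative support of an itemset can be computed. The transaction database below is reconstructed here so that it is consistent with the supports discussed in this example; it is only an illustration.

```python
# Minimal sketch: computing the support of an itemset.
# The database is reconstructed to be consistent with the supports discussed in this post.
database = {
    "T1": {"pasta", "lemon", "bread", "orange"},
    "T2": {"pasta", "lemon"},
    "T3": {"pasta", "orange", "cake"},
    "T4": {"pasta", "lemon", "orange", "cake"},
}

def support(itemset, database):
    """Absolute support: the number of transactions containing the itemset."""
    return sum(1 for transaction in database.values() if itemset <= transaction)

itemset = {"pasta", "lemon"}
absolute_support = support(itemset, database)
relative_support = absolute_support / len(database)
print(absolute_support, relative_support)  # prints: 3 0.75 (i.e. 75%)
```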
Thus, the goal of frequent itemset mining is to find the sets of items that are frequently purchased in a customer transaction database (the frequent itemsets).
Applications of frequent itemset mining
Frequent itemset mining is an interesting problem because it has applications in many domains. Although the example of a retail store is used in this blog post, itemset mining is not restricted to analyzing customer transaction databases. It can be applied to all kinds of data, from biological data to text data. The concept of a transaction is quite general: a transaction can simply be viewed as a set of symbols. For example, if we want to apply frequent itemset mining to text documents, we could consider each word as an item and each sentence as a transaction. A transaction database would then be a set of sentences from a text, and a frequent itemset would be a set of words appearing in many sentences.
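For illustration, here is a small Python sketch of this idea (the text is made up; it only shows how sentences can be converted into transactions):

```python
# Small sketch: treating each sentence as a transaction (a set of words).
text = "the cat sat on the mat. the dog sat on the rug. the cat saw the dog."
sentences = [s.strip() for s in text.split(".") if s.strip()]

# Each sentence becomes a transaction containing the words it uses.
transactions = [set(sentence.split()) for sentence in sentences]
print(transactions)
# A frequent itemset here would be a set of words that appears in many sentences,
# e.g. {"the", "sat", "on"} appears in the first two sentences.
```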
The problem of frequent itemset mining is difficult
Another reason why the problem of frequent itemset mining is interesting is that it is a difficult problem. The naive approach to solve the problem of itemset mining is to count the support of all possible itemsets and then output those that are frequent. This can be done easily for a small database as in the above example. In the above example, we only consider five items (pasta, lemon, bread, orange, cake). For five items, there are 32 possible itemsets. I will show this with a picture:
In the above picture, you can see all the sets of items that can be formed by using the five items from the example. These itemsets are represented as a Hasse diagram. Among all these itemsets, the following itemsets highlighted in yellow are the frequent itemsets:
Now, a good question is: how can we write a computer program to quickly find the frequent itemsets in a database? In the example, there are only 32 possible itemsets. Thus, a simple approach is to write a program that calculates the support of each itemset by scanning the database. Then, the program would output the itemsets having a support no less than the minsup threshold to the user as the frequent itemsets. This would work, but it would be highly inefficient for large databases. The reason is the following.
In general, if a transaction database has x items, there are 2^x possible itemsets (2 to the power of x). For example, in our case, with 5 items, there are 2^5 = 32 possible itemsets. This is not a lot because the database is small. But consider a retail store having 1,000 items. Then the number of possible itemsets would be 2^1000 ≈ 1.07 × 10^301, which is huge, and it would simply not be possible to use a naive approach to find the frequent itemsets.
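To make the naive approach and its cost concrete, here is a small Python sketch that simply enumerates every possible itemset (this is only an illustration of why the naive approach does not scale):

```python
# Sketch of the naive approach: enumerate every possible itemset.
from itertools import combinations

items = ["pasta", "lemon", "bread", "orange", "cake"]

def all_itemsets(items):
    """Generate every non-empty subset of the items (2^n - 1 itemsets)."""
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield set(subset)

print(len(list(all_itemsets(items))))  # 31 non-empty itemsets for 5 items (32 with the empty set)
print(len(str(2 ** 1000)))             # with 1,000 items, the count of itemsets has 302 digits
```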
Thus, the search space for the problem of frequent itemset mining is very large, especially if there are many items and many transactions. If we want to find the frequent itemsets in a real-life database, we thus need to design a fast algorithm that does not have to test all the possible itemsets. The Apriori algorithm was designed to solve this problem.
The Apriori algorithm
The Apriori algorithm is the first algorithm for frequent itemset mining. Nowadays, there exist many algorithms that are more efficient than Apriori. However, Apriori remains an important algorithm, as it introduced several key ideas that were later used in many other pattern mining algorithms. Moreover, Apriori has been extended in many different ways and used for many applications.
Before explaining the Apriori algorithm, I will introduce two important properties.
Two important properties
The Apriori algorithm is based on two important properties for reducing the search space. The first one is called the Apriori property (also called the anti-monotonicity property). The idea is the following. Let there be two itemsets X and Y such that X is a subset of Y. The support of Y must be less than or equal to the support of X. In other words, if we have two sets of items X and Y such that X is included in Y, the number of transactions containing Y must be the same or less than the number of transactions containing X. Let me show you this with an example:
As you can see above, the itemset {pasta} is a subset of the itemset {pasta, lemon}. Thus, by the Apriori property, the support of {pasta, lemon} cannot be more than the support of {pasta}; it must be equal to or less than the support of {pasta}.
This property is very useful for reducing the search space, that is, to avoid considering all possible itemsets when searching for the frequent itemsets. Let me show you this with an illustration. First, look at the following illustration of the search space:
In the above picture, we can see that we can draw a line between the frequent itemsets (in yellow) and the infrequent itemsets (in white). This line is drawn based on the fact that all the supersets of an infrequent itemset must also be infrequent due to the Apriori property. Let me illustrate this more clearly. Consider the itemset {bread} which is infrequent in our example because its support is lower than the minsup threshold. That itemset is shown in red color below.
Then, based on the Apriori property, because bread is infrequent, all its supersets must be infrequent. Thus we know that any itemset containing bread cannot be a frequent itemset. Below, I have colored all these itemsets in red to make this more clear.
Thus, the Apriori property is very powerful. When an algorithm explores the search space, if it finds that some itemset (e.g. bread) is infrequent, we can avoid considering all itemsets that are supersets of that itemset (e.g. all itemsets containing bread).
A second important property used in the Apriori algorithm is the following. If an itemset contains a subset that is infrequent, it cannot be a frequent itemset. Let me show an example:
The property says that if we have an itemset such as {bread, lemon} that contains an infrequent subset such as {bread}, then the itemset cannot be frequent. In our example, since {bread} is infrequent, it means that {bread, lemon} is also infrequent. You may think that this property is very similar to the first property! Actually, this is true. It is just a different way of writing the same property. But it will be useful for explaining how the Apriori algorithm works.
The Apriori algorithm
I will now explain how the Apriori algorithm works with an example, as I want to explain it in an intuitive way. But first, let’s recall what the input and output of the Apriori algorithm are. The input is (1) a transaction database and (2) a minsup threshold set by the user. The output is the set of frequent itemsets. For our example, we will consider that minsup = 2 transactions.
The Apriori algorithm is applied as follows. The first step is to scan the database to calculate the support of all items (itemsets containing a single item). The result is shown below:
After obtaining the support of single items, the second step is to eliminate the infrequent itemsets. Recall that the minsup parameter is set to 2 in this example. Thus we should eliminate all itemsets having a support that is less than 2. This is illustrated below:
We thus now have four itemsets left, which are frequent itemsets. These itemsets are output to the user. Each of these itemsets contains a single item.
Next, the Apriori algorithm will find the frequent itemsets containing 2 items. To do that, the Apriori algorithm combines the frequent itemsets of size 1 (the single items) to obtain a set of candidate itemsets of size 2 (containing 2 items). This is illustrated below:
Thereafter, Apriori will determine if these candidates are frequent itemsets. This is done by first checking the second property, which says that the subsets of a frequent itemset must also be frequent. For the candidates of size 2, this is done by checking if the subsets containing 1 item are also frequent. For the candidate itemsets of size 2, this is always true, so the Apriori algorithm eliminates no candidate at this step.
Then, the next step is to scan the database to calculate the exact support of the candidate itemsets of size 2, to check if they are really frequent. The result is as follows.
Based on these support values, the Apriori algorithm next eliminates the infrequent candidate itemsets of size 2. The result is shown below:
As a result, there are only five frequent itemsets left. The Apriori algorithm will output these itemsets to the user.
Next, the Apriori algorithm will try to generate candidate itemsets of size 3. This is done by combining pairs of frequent itemsets of size 2, as follows:
Thereafter, Apriori will determine if these candidates are frequent itemsets. This is done by first checking the second property, which says that the subsets of a frequent itemset must also be frequent. Based on this property, we can eliminate some candidates. For each candidate itemset, the Apriori algorithm checks whether it has a subset of size 2 that is not frequent. Two candidates are eliminated, as shown below.
For example, in the above illustration, the itemset {lemon, orange, cake} has been eliminated because one of its subsets of size 2 is infrequent (the itemset {lemon, cake}). Thus, after performing this step, only two candidate itemsets of size 3 are left.
Then, the next step is to scan the database to calculate the exact support of the candidate itemsets of size 3, to check if they are really frequent. The result is as follows.
Based on these support values, the Apriori algorithm next eliminates the infrequent candidate itemsets of size 3 to obtain the frequent itemsets of size 3. The result is shown below:
There were no infrequent itemsets among the candidate itemsets of size 3, so no itemset was eliminated. The two candidate itemsets of size 3 are thus frequent and are output to the user.
Next, the Apriori algorithm will try to generate candidate itemsets of size 4. This is done by combining pairs of frequent itemsets of size 3, as follows:
Only one candidate itemset was generated. Thereafter, Apriori will determine if this candidate is frequent. This is done by first checking the second property, which says that the subsets of a frequent itemset must also be frequent. The Apriori algorithm checks whether there exists a subset of size 3 that is not frequent for the candidate itemset.
During the above step, the candidate itemset {pasta, lemon, orange, cake} is eliminated because it contains at least one subset of size 3 that is infrequent. For example, {pasta, lemon, cake} is infrequent.
Now, since there are no candidates left, the Apriori algorithm stops and does not need to consider larger itemsets (for example, itemsets containing five items).
The final result found by the algorithm is this set of frequent itemsets.
Thus, the Apriori algorithm has found 11 frequent itemsets. The Apriori algorithm is a level-wise algorithm, as it iteratively explores larger and larger itemsets, starting from itemsets of size 1.
Now let’s analyze the performance of the Apriori algorithm for the above example. By using the two pruning properties of the Apriori algorithm, only 18 candidate itemsets were generated. However, there were 31 possible itemsets that could be formed with the five items of this example (excluding the empty set). Thus, thanks to its pruning properties, the Apriori algorithm avoided considering 13 infrequent itemsets. This may not seem like a lot, but for real databases, these pruning properties can make Apriori quite efficient.
It can be proven that the Apriori algorithm is complete (it will find all frequent itemsets in a given database) and correct (it will not output any infrequent itemset). However, I will not show the proof here, as I want to keep this blog post simple.
Technical details
Now, a good question is how to implement the Apriori algorithm. If you want to implement the Apriori algorithm, there are more details that need to be considered. The most important one is how to combine itemsets of a given size k to generate candidates of size k+1.
Consider an example. Let’s say that we combine frequent itemsets containing 2 items to generate candidate itemsets containing 3 items. Consider that we have three itemsets of size 2: {A,B}, {A,E} and {B,E}.
A problem is that if we combine {A,B} with {A,E}, we obtain {A,B,E}. But if we combine {A,E} with {B,E}, we also obtain {A,B,E}. Thus, as shown in this example, if we combine all itemsets of size 2 with all other itemsets of size 2, we may generate the same itemset several times and this will be very inefficient.
There is a simple trick to avoid this problem. It is to sort the items in each itemset according to some order such as the alphabetical order. Then, two itemsets should only be combined if they have all the same items except the last one.
Thus, {A,B} and {A,E} can be combined since only the last item is different. But {B,E} and {A,E} cannot be combined since they differ on an item other than the last one. By using this simple strategy, we can ensure that Apriori will never generate the same itemset more than once.
I will not show the proof, to keep this blog post simple. But it is very important to use this strategy when implementing the Apriori algorithm.
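To tie everything together, here is a simplified Python sketch of the whole Apriori algorithm, combining the level-wise search, the pruning of candidates with infrequent subsets, and the sorted-combination strategy described above. It is meant only as an illustration (for an optimized implementation, see SPMF), and the example database is reconstructed to be consistent with the example used in this post.

```python
# Simplified Apriori sketch (for illustration only; see SPMF for an optimized implementation).
from itertools import combinations

def apriori(database, minsup):
    """database: a list of transactions (sets of items); minsup: absolute support threshold."""
    def support(itemset):
        return sum(1 for transaction in database if itemset <= transaction)

    # Level 1: frequent itemsets of size 1, stored as sorted tuples.
    items = sorted({item for transaction in database for item in transaction})
    frequent = [(item,) for item in items if support({item}) >= minsup]
    result = {itemset: support(set(itemset)) for itemset in frequent}

    k = 1
    while frequent:
        # Candidate generation: combine two frequent k-itemsets that share
        # all items except the last one (the sorted-combination strategy).
        candidates = []
        for a, b in combinations(frequent, 2):
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidate = a + (b[-1],)
                # Pruning: every subset of size k must also be frequent.
                if all(subset in result for subset in combinations(candidate, k)):
                    candidates.append(candidate)
        # Database scan: keep only the candidates that are really frequent.
        frequent = []
        for candidate in candidates:
            s = support(set(candidate))
            if s >= minsup:
                frequent.append(candidate)
                result[candidate] = s
        k += 1
    return result

database = [
    {"pasta", "lemon", "bread", "orange"},
    {"pasta", "lemon"},
    {"pasta", "orange", "cake"},
    {"pasta", "lemon", "orange", "cake"},
]
frequent_itemsets = apriori(database, minsup=2)
print(len(frequent_itemsets))  # 11 frequent itemsets, as in the example above
```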
How is the performance of the Apriori algorithm?
In general, the Apriori algorithm is much faster than a naive approach where we would count the support of all possible itemsets, as Apriori avoids considering many infrequent itemsets.
The performance of Apriori can be evaluated in real life in terms of various criteria such as execution time, memory consumption, and scalability (how the execution time and memory usage vary when the amount of data increases). Typically, researchers in the field of data mining perform numerous experiments to evaluate the performance of an algorithm in comparison to other algorithms. For example, here is a simple experiment that I have done to compare the performance of Apriori with other frequent itemset mining algorithms on a dataset called “Chess”.
In that experiment, I varied the minimum support threshold to see its influence on the execution time of the algorithms. As the threshold is set lower, more patterns need to be considered and the algorithms become slower. It can be seen that Apriori performs quite well but is still much slower than other algorithms such as Eclat and FPGrowth. This is normal, since the Apriori algorithm has some limitations that have been addressed in newer algorithms. For example, Apriori can generate candidate itemsets that do not exist in the database (that have a support of 0). More recent algorithms such as FPGrowth are designed to avoid this problem. Besides, note that here I only show results on a single dataset. To perform a complete performance comparison, more than a single dataset should be considered; I just show this as an example in this blog post. The experiment shown here was run with the SPMF data mining software, which offers open-source implementations of Apriori and many other pattern mining algorithms in Java.
Source code and more information about Apriori
In this blog post, I have aimed at giving a brief introduction to the Apriori algorithm. I did not discuss optimizations, but there are many optimizations that have been proposed to efficiently implement the Apriori algorithm.
If you want to know more about Apriori, you can read the original paper by Agrawal and Srikant, published in 1994:
Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 1994), pp. 487-499, Santiago, Chile, September 1994.
To try Apriori, you can obtain a fast implementation of Apriori as part of the SPMF data mining software, which is implemented in Java under the GPL3 open-source license. The source code of Apriori in SPMF is easy to understand, fast, and lightweight (no dependencies on other libraries).
On the website of SPMF, examples and datasets are provided for running the Apriori algorithm, as well as more than 100 other algorithms for pattern mining. The source code of the algorithms in SPMF has no dependencies on other libraries and can easily be integrated in other software. The SPMF software also provides a simple user interface for running algorithms:
Besides, if you want to know more about frequent itemset mining, I recommend reading my recent survey paper about itemset mining. It is easy to read and goes beyond what I have discussed in this blog post. The survey paper is more formal, gives the pseudocode of Apriori and other algorithms, and also discusses extensions of the problem of frequent itemset mining and research opportunities.
Hope you have enjoyed this blog post! 😉
—
Philippe Fournier-Viger is a professor of computer science and founder of the SPMF data mining library.
On July 20, 2017, I received an e-mail from a company called Clarivate Analytics Trademark Enforcement (legal@ip-clarivateanalytics.com) about copyright infringement related to the Journal Citation Reports, a product by Thomson Reuters.
They contacted me because, a few years ago, I created a webpage that provides a list of useful links for researchers, such as how to check if a journal is indexed by SCI, SCIE, IS, etc. It is a very simple webpage that does not host any copyrighted data, and it is very useful for researchers. You can see a screenshot below:
Copyright claim
So what is the problem with this webpage? Well, the company did not like that I linked to two websites called SCIJ***nal.org and bioxb…com that provide the impact factors of journals, which is very useful information for researchers. They asked me to remove these two links, as these websites apparently contain copyrighted data (the impact factors).
It should be clear that I do not host this data on my website, and I do not own these two websites either. Thus, I am not responsible for their content. However, I have still decided to censor the links, as you can see above, although I probably do not have to, since linking to a website is generally not considered copyright infringement.
A problem for academia
So what does this tell us about Thomson Reuters? I think that they would expect every researcher to pay to access the impact factors. But not all universities have the funding to pay for access to this data, which is very inconvenient for young researchers or people working in small universities or less developed countries. To avoid such problems, I think that as researchers, we should try to move away from impact factors provided by a company. An impact factor is basically just calculated with a simple mathematical formula using data provided by journals. There should be a way to have alternatives to these impact factors that are not calculated or owned by a company.
The Streisand effect
Besides, can they really expect to censor links to these two websites all over the Internet? In the past, many people have failed to do this, and this has backfired. For example, there is the well known Streisand effect where the fact of trying to hide something makes it even more popular.
This is all I wanted to write for today. Just wanted to share this story.
—
By the way, all the names mentioned in this blog post belong to their respective companies.
Many researchers wish to produce high quality papers and have a great research impact. But how? In a previous blog post, I have discussed how the “blue ocean strategy” can be applied to publish in top conferences/journals. In this blog post, I will discuss another important concept for producing high impact research, which is to consider the opportunity cost of research.
Opportunity cost
The concept of opportunity cost is widely used in the field of economics. I will explain this concept and then explain how it can be applied to research. Consider a situation where you must choose between several mutually exclusive choices C1, C2, … Cn. If you choose a choice Cx, then you cannot choose the other choices because they are mutually exclusive with Cx. Thus, by making a choice Cx, you may be getting some benefits, but you may miss other benefits that could have been obtained by making other choices. In other words, when we make a choice in a given situation, we not only get the rewards or benefits that go with that choice, but we also lose other benefits that we could have obtained if we had made other choices. The opportunity cost of making a choice Cx is thus the loss of benefits caused by not making the other alternative choices.
Applying the concept of opportunity cost in research
The concept of opportunity cost is simple but very useful in many domains, and considering it allows us to make better decisions. When making a choice between multiple alternatives, we should not only think about the direct benefits of that choice but also about the missed opportunities from the other alternative choices (the opportunity cost).
How does that apply to research? Well, in research, a researcher has multiple resources (time, money, students). In terms of time, when a researcher decides to spend time on a given project, he may as a result not have time to work on alternative research projects. Thus, given the limited amount of time that a researcher has, he must carefully choose between multiple research opportunities to maximize his benefits.
For example, it is tempting for several young researchers to write several papers with simple and unoriginal ideas, just to publish papers as quickly as possible. This may seem like a good idea in the short term. However, the hidden opportunity cost of doing so is that the time spent on writing these simple papers cannot be spent working on better research ideas that take more time to develop but may result in a higher impact in the long term.
Thus, from the perspective of opportunity cost, a researcher should try to carefully choose the research projects that are promising and spend more time developing these projects and making good papers out of them, rather than trying to write as many papers as possible.
From my experience, even the simplest conference papers require at least one or two weeks of work that could be spent on better research projects. The fastest paper that I have written for a conference was done in about a week, several years ago. But still, even if it only takes a week, one should not forget that additional time needs to be spent to travel, prepare a PowerPoint, and present the paper at a conference. Thus, in total, the time cost of a quick conference paper may still add up to two weeks or more. And then, as explained, the opportunity cost of writing a simple paper is that one may not have enough time to work on better or more promising research ideas. Thus, in recent years, I have shifted my focus to target better conferences and journals and focus less on smaller conferences. I write fewer papers, but these papers are of higher quality and can have a greater impact.
Of course, many other factors must be considered to publish high quality papers, such as the choice of a good research topic. But in this blog post, I wanted to highlight the concept of opportunity cost.
Conclusion
In this blog post, I have highlighted how opportunity cost is applicable to research in academia. I personally think that it is an important concept, as many researchers are tempted to publish as many papers as possible without focusing much on quality, and often without realizing that that time could be spent on better research ideas. But producing high impact research requires spending a considerable amount of time. Thus, one should carefully choose to spend more time on promising projects, rather than focusing too much on quantity. Besides time, the concept of opportunity cost is also applicable to other kinds of resources such as funding and students.
If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts
This week, I attended the PAKDD 2017 conference on Jeju Island, South Korea, from the 23rd to the 26th of May. PAKDD is the top data mining conference for the Asia-Pacific region. It is held every year in a different Pacific-Asian country. In this blog post, I will write a brief report about the conference.
Conference location
The PAKDD conference was held in the city of Seogwipo on Jeju island, a beautiful island in South Korea, which is famous for tourism, especially in Asia. Here is a map of the location.
In particular, the PAKDD conference was held at the Seogwipo KAL hotel.
The hotel was well-chosen. It is about 1 km from the city, beside the sea.
Conference proceedings
The proceedings of the PAKDD 2017 conference are published by Springer in the Lecture Notes in Artificial Intelligence series. This ensures a good visibility to the papers published in the proceedings, which are indexed in the main computer science indexes such as DBLP.
The proceedings were given on a USB drive (4 GB) rather than as a book, as many other conferences have been doing in recent years. Personally, I like to have proceedings as books, but USB drives are probably friendlier to the environment.
In general, the quality of the papers at PAKDD conferences is good. This year, 458 papers were submitted. Among those, 45 papers were accepted as long papers and 84 as short papers. Thus, the global acceptance rate was about 28%.
Below, I present various slides from the opening ceremony presentation, which provide information about the PAKDD conference this year.
1) The number of papers per category (submitted / accepted) is shown below. It is interesting to see that a large number of applications and social network papers were rejected. And for the topic of sequential data, only 1 paper out of 10 was accepted.
2) The number of accepted long and short papers at PAKDD for the last six years is presented below.
3) The decision criteria for accepting a paper at PAKDD are shown below.
4) There were 283 people registered for PAKDD this year.
5) The acceptance rate of long and short papers at PAKDD during the last six years is shown below.
6) The number of submitted vs accepted papers by country this year. We can observe that China has the largest number of papers accepted and submitted.
Day 1 – workshops and tutorials, reception
On the first day, the registration started at 8:00 AM.
It was then followed by various workshops and tutorials. I attended a workshop about Biologically Inspired Data Mining, a popular topic, which covers the application of algorithms such as neural networks, bee swarm optimization, genetic algorithms, and ant colony optimization to solve data mining problems. Evolutionary algorithms are quite interesting, as they can find good approximate solutions to data mining problems while running much faster than traditional algorithms that search for an optimal solution. There were also some tutorials that I did not attend, on information retrieval, recommender systems and tensor analysis. Besides, there were workshops on security, business process management, and sensor data analytics.
In the evening, there was a reception, which was a good opportunity for discussing with other researchers.
Day 2 – main conference, opening ceremony
There was an opening ceremony, followed by a keynote by Sang Kyun Cha from Seoul National University in Korea. The keynote was about a potential fourth industrial revolution that would occur due to the growth of AI-based services and big data technologies. This would lead to a need for more skilled workers such as engineers or “data scientists”. The talk was interesting, but personally I prefer talks that are a little bit more technical. After that, there were multiple sessions of paper presentations.
Besides the technical sessions, I also discussed with some representatives from Nvidia who were promoting a new supercomputer specially designed for training deep learning neural networks. It is called the NVIDIA DGX-1 and costs around 200,000 USD. According to the promotional material, this computer has eight Tesla P100 GPUs, each with 16 GB of memory, a total of 28,672 NVIDIA CUDA cores, and dual 20-core Intel Xeon E5-2698 v4 2.2 GHz processors. What is most interesting is that this GPU-based system is claimed to be 250 times faster than a conventional CPU-only server for deep learning. I saw that there is also a similar product by IBM called the IBM Minsky, also equipped with NVidia GPUs. This is especially interesting for those working on deep learning related topics.
NVidia DGX1
Day 3 – main conference, excursion, banquet
On the third day of the conference, there was a keynote speech by Rakesh Agrawal, a senior researcher who is one of the founders of the field of data mining. The talk was about the usage of social data. The main question addressed in this talk was whether social data from websites such as Twitter is garbage or whether it can be useful for businesses. R. Agrawal presented a project that he carried out a few years ago at Microsoft, where he analyzed Twitter data to study the opinion of people about Microsoft products. He also described work where he compared the results of the Bing and Google search engines with the results obtained when searching using social data rather than traditional search engines. The conclusion was that social data is certainly useful. R. Agrawal also gave some advice: young researchers should try to choose good research topics that are useful and can have an impact, rather than just focusing on publishing a paper as quickly as possible.
In the afternoon, there was an excursion to Seopjikoji beach, Seongsan Sunrise Peak and Seongeup Folk Village.
In the evening, there was a banquet at the Seogwipo KAL hotel, with a musical performance.
Day 4 – main conference, closing ceremony
On the fourth day, there was a keynote by Dacheng Tao, from the University of Sydney, Australia, about current challenges in artificial intelligence. It was followed by several technical sessions, a lunch and a closing ceremony.
Conclusion
The conference was quite interesting. I had the occasion to meet many interesting people from academia and also from industry (e.g. Microsoft, Yahoo, Adobe, Nvidia). PAKDD is not the largest conference in data mining, but it is quite a good conference, especially for the Asia-Pacific region, and the quality of the papers is quite high. It was announced that next year, the PAKDD 2018 conference will be held in Melbourne, Australia. I will certainly try to attend it.
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
A question that many young researchers ask is how to get papers published in top conferences and journals. There are many answers to this question. In this blog post, I will discuss a strategy for carrying out research called the “Blue Ocean Strategy”. This strategy was initially proposed in the field of marketing. But in this blog post, I will explain how it is also highly relevant in academia.
The Blue Ocean Strategy was proposed in a 2005 book by Kim, C. W. and Mauborgne, R. The idea is quite simple. Let’s say that you want to start a new business and get rich. To start your business, you need to choose a market where your business will operate. Let’s say that you decide to start selling pens. However, there are already a lot of pen manufacturers that are well-established, and thus this market is extremely competitive and profit margins are very low. Thus, it might be very difficult to become successful in this market if you just try to produce pens like every other manufacturer. It is like jumping into a shark tank!
The Blue Ocean Strategy indicates that rather than fighting for some existing market, it is better to create a new market (what is called a “blue ocean”). By creating a new market, the competition becomes irrelevant and you may easily get many new customers, rather than fighting for a small part of an existing market. Thus, instead of trying to compete with some very well-established manufacturers in a very competitive market (a “red ocean”), it is more worthwhile to create a new market (a “blue ocean”). This could be, for example, a new type of pen with some additional features.
Now let me explain how this strategy is relevant for academia.
In general, there are two main types of research projects:
a researcher tries to provide a solution to an existing research problem,
the researcher works on a new research problem.
The first case can be seen as a red ocean, since many researchers may already be working on that existing problem and it may be hard to publish something better. The second case is a blue ocean, since the researcher is the first one to work on a new topic. In that case, it can be easier to publish something, since you do not need to do better than other people: you are the first one to work on that topic.
For example, I work in the field of data mining. In this field, many researchers work on publishing faster or more memory-efficient algorithms for existing data mining problems. Although this research is needed, it can in some cases be viewed as lacking originality, and it can be very competitive to publish a faster algorithm. On the other hand, if researchers instead work on proposing new problems, then the research appears more original, and it becomes much easier to publish an algorithm, as it does not need to be more efficient than previous algorithms. Besides, from my observation, top conferences/journals often prefer papers on new problems to incremental work on existing problems.
Thus, not only is it easier to provide a solution to a new research problem, but top conferences in some fields also put a lot of value on papers that address new research problems. So why fight to be the best on an existing research problem?
Of course, there are some exceptions to this idea. If a researcher succeeds in publishing an exceptional paper in a red ocean (on an existing research problem), his impact may actually be greater, especially if the research problem is very popular. But the point is that publishing in a red ocean may be harder than in a blue ocean. And of course, not all blue oceans are equal. It is thus also important to find good ideas for new research topics (good blue oceans).
Personally, for these reasons, I generally try to work on “blue ocean” research projects.
Conclusion
In this blog post, I have discussed the “Blue Ocean Strategy” and how it can be applied in academia to help publish in top conferences/journals. Of course, there are also a lot of other things to consider to write a good paper. You can read the follow-up blog post on this topic here, where the opportunity cost of research is discussed.
If you like this blog and want to support it, please share it on social networks (Twitter, LinkedIn, etc.), write some comments, and continue reading other articles on this blog. 🙂
In the data science and data mining communities, several practitioners apply various algorithms to data without attempting to visualize the data. This is a big mistake, because visualizing the data often greatly helps to understand it. Some phenomena are obvious when the data is visualized. In this blog post, I will give a few examples to convince you that visualization can greatly help to understand data.
An example of why using statistical measures may not be enough
The first example that I will give is the Anscombe Quartet, constructed by the statistician Francis Anscombe. It is a set of four datasets consisting of (X, Y) points. These four datasets are defined as follows:
Dataset I          Dataset II         Dataset III        Dataset IV
x       y          x       y          x       y          x       y
10.0    8.04       10.0    9.14       10.0    7.46       8.0     6.58
8.0     6.95       8.0     8.14       8.0     6.77       8.0     5.76
13.0    7.58       13.0    8.74       13.0    12.74      8.0     7.71
9.0     8.81       9.0     8.77       9.0     7.11       8.0     8.84
11.0    8.33       11.0    9.26       11.0    7.81       8.0     8.47
14.0    9.96       14.0    8.10       14.0    8.84       8.0     7.04
6.0     7.24       6.0     6.13       6.0     6.08       8.0     5.25
4.0     4.26       4.0     3.10       4.0     5.39       19.0    12.50
12.0    10.84      12.0    9.13       12.0    8.15       8.0     5.56
7.0     4.82       7.0     7.26       7.0     6.42       8.0     7.91
5.0     5.68       5.0     4.74       5.0     5.73       8.0     6.89
To get a feel for the data, the first thing that many would do is to calculate some statistical measures such as the mean, variance, and standard deviation. This allows us to measure the central tendency of the data and its dispersion. If we do this for the four above datasets, we obtain:
Dataset 1: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 2: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 3: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
Dataset 4: mean of X = 9, variance of X = 11, mean of Y = 7.5, variance of Y = 4.125
So these datasets appear quite similar: they have exactly the same values for all the above statistical measures. How about calculating the correlation between X and Y for each dataset to see how the points are correlated? Again, we obtain the same value for all four datasets (a correlation of about 0.816).
OK, so these datasets are very similar, aren’t they? Let’s try something else. Let’s calculate the regression line of each dataset (that is, the linear equation that best fits the data points).
Dataset 1: y = 3.00 + 0.500x
Dataset 2: y = 3.00 + 0.500x
Dataset 3: y = 3.00 + 0.500x
Dataset 4: y = 3.00 + 0.500x
Again the same! Should we stop here and conclude that these datasets are the same?
This would be a big mistake because actually, these four datasets are quite different! If we visualize these four datasets with a scatter plot, we obtain the following:
Visualization of the four datasets (credit: Wikipedia CC BY-SA 3.0)
This shows that these datasets are actually quite different. The lesson from this example is that by visualizing the data, differences sometimes become quite obvious.
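If you want to reproduce these statistics and scatter plots yourself, here is a small Python sketch (it assumes that numpy and matplotlib are installed; the data is taken from the table above):

```python
# Small sketch: computing the statistics and plotting the Anscombe Quartet.
import numpy as np
import matplotlib.pyplot as plt

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fig, axes = plt.subplots(2, 2)
for ax, (name, (x, y)) in zip(axes.flat, datasets.items()):
    x, y = np.array(x), np.array(y)
    slope, intercept = np.polyfit(x, y, 1)  # regression line y = intercept + slope * x
    correlation = np.corrcoef(x, y)[0, 1]
    print(name, x.mean(), x.var(ddof=1), round(y.mean(), 2),
          round(y.var(ddof=1), 3), round(correlation, 3))
    ax.scatter(x, y)
    ax.plot(x, intercept + slope * x)
    ax.set_title("Dataset " + name)
plt.show()  # the same statistics, but four very different scatter plots
```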
Visualizing the relationship between two attributes
Simple visualization techniques like scatter plots are also very useful for quickly analyzing the relationship between pairs of attributes in a dataset. For example, by looking at the two following scatter plots, we can quickly see that the first one shows a positive correlation between the X and Y axes (when values on the X axis are greater, values on the Y axis are generally also greater), while the second one shows a negative correlation (when values on the X axis are greater, values on the Y axis are generally smaller).
(a) positive correlation (b) negative correlation (Credit: Data Mining Concepts and Techniques, Han & Kamber)
If we plot two attributes on the X and Y axes of a scatter plot and there is no correlation between the attributes, it may result in something similar to the following figures:
No correlation between the X and Y axis (Credit: Data Mining Concepts and Techniques, Han & Kamber)
These examples again show that visualizing data can help to quickly understand the data.
Visualizing outliers
Visualization techniques can also be used to quickly identify outliers in the data. For example in the following chart, the data point on top can be quickly identified as an outlier (an abnormal value).
Identifying outliers using a scatter plot
Visualizing clusters
In data mining, several clustering algorithms have been proposed to identify clusters of similar values in the data. These clusters can also often be discovered visually for low-dimensional data. For example, in the following data, it is quite apparent that there are two main clusters (groups of similar values), without applying any algorithms.
Data containing two obvious clusters
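As a small illustration, here is a Python sketch that generates two clusters and one outlier as synthetic data, and displays them with a scatter plot (assuming numpy and matplotlib are installed):

```python
# Small sketch: two synthetic clusters and one outlier, visualized with a scatter plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
cluster1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))  # first group of similar points
cluster2 = rng.normal(loc=[6.0, 5.0], scale=0.5, size=(50, 2))  # second group of similar points
outlier = np.array([[4.0, 12.0]])                               # an abnormal point, far from both groups

data = np.vstack([cluster1, cluster2, outlier])
plt.scatter(data[:, 0], data[:, 1])
plt.xlabel("attribute X")
plt.ylabel("attribute Y")
plt.show()  # the two clusters and the outlier are immediately visible
```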
Conclusion
In this blog post, I have shown a few simple examples of how visualization can help to quickly see patterns in the data without actually applying any fancy models or performing calculations. I have also shown that statistical measures can actually be quite misleading if no visualization is done, with the classic example of the Francis Anscombe Quartet.
In this blog post, the examples are mostly done using scatter plots with 2 attributes at a time, to keep things simple. But there exist many other types of visualizations.
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
In this blog post, I will give an introduction to sequential pattern mining, an important data mining task with a wide range of applications from text analysis to market basket analysis. This blog post is aimed to be a short introduction. If you want to read a more detailed introduction to sequential pattern mining, you can read a survey paper that I recently wrote on this topic.
What is sequential pattern mining?
Data mining consists of extracting information from data stored in databases to understand the data and/or take decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and pattern mining. Pattern mining consists of discovering interesting, useful, and unexpected patterns in databases. Various types of patterns can be discovered in databases, such as frequent itemsets, associations, subgraphs, sequential rules, and periodic patterns.
The task of sequential pattern mining is a data mining task specialized for analyzing sequential data, to discover sequential patterns. More precisely, it consists of discovering interesting subsequences in a set of sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications because data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning, market basket analysis, texts, and webpage click-stream analysis.
I will now explain the task of sequential pattern mining with an example. Consider the following sequence database, representing the purchases made by customers in a retail store.
This database contains four sequences. Each sequence represents the items purchased by a customer at different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this database, the first sequence (SID 1) indicates that a customer bought some items a and b together, then purchased an item c, then purchased items f and g together, then purchased an item g, and then finally purchased an item e.
Traditionally, sequential pattern mining is used to find subsequences that appear often in a sequence database, i.e. that are common to several sequences. Those subsequences are called the frequent sequential patterns. For example, in the context of our example, sequential pattern mining can be used to find the sequences of items frequently bought by customers. This can be useful for understanding the behavior of customers in order to make marketing decisions.
To do sequential pattern mining, a user must provide a sequence database and specify a parameter called the minimum support threshold. This parameter indicates the minimum number of sequences in which a pattern must appear to be considered frequent and be shown to the user. For example, if a user sets the minimum support threshold to 2 sequences, the task of sequential pattern mining consists of finding all subsequences appearing in at least 2 sequences of the input database. In the example database, many subsequences meet this requirement. Some of these sequential patterns are shown in the table below, where the number of sequences containing each pattern (called the support) is indicated in the right column of the table.
Note that the pattern <{a}, {f, g}> could also be put in this table, as well as the patterns <{f, g}, {e}>, <{a}, {f, g}, {e}> and <{b}, {f, g}, {e}>, each with a support of 2. (Note added 2020/03)
For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences, respectively. In other words, these patterns appear in 3 and 2 sequences of the input database, respectively. The pattern <{a}> appears in sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3. These patterns are interesting as they represent behavior common to several customers. Of course, this is a toy example: sequential pattern mining can actually be applied to databases containing hundreds of thousands of sequences.
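Here is a minimal Python sketch of how the support of a sequential pattern can be counted. The sequence database below is reconstructed so that it is consistent with the supports discussed in this example; a pattern is contained in a sequence if each of its itemsets is included in a later and later itemset of that sequence.

```python
# Minimal sketch: counting the support of a sequential pattern.
# The sequence database is reconstructed to be consistent with the supports in this post.
database = [
    [{"a", "b"}, {"c"}, {"f", "g"}, {"g"}, {"e"}],      # sequence 1
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],   # sequence 2
    [{"a"}, {"b"}, {"f", "g"}, {"e"}],                  # sequence 3
    [{"b"}, {"f", "g"}],                                # sequence 4
]

def contains(sequence, pattern):
    """True if the pattern (a list of itemsets) occurs in the sequence in the same order."""
    position = 0
    for itemset in pattern:
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True

def support(pattern, database):
    """The number of sequences of the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))

print(support([{"a"}], database))         # 3
print(support([{"a"}, {"g"}], database))  # 2
```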
Another example of an application of sequential pattern mining is text analysis. In this context, a set of sentences from a text can be viewed as a sequence database, and the goal of sequential pattern mining is then to find subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in this context. If you want to know more about this application, you can read this blog post, where sequential patterns are discovered in a Sherlock Holmes novel.
Can sequential pattern mining be applied to time series?
Besides sequences, sequential pattern mining can also be applied to time series (e.g. stock data), when discretization is performed as a pre-processing step. For example, the figure below shows a time series (an ordered list of numbers) on the left. On the right, a sequence (a sequence of symbols) is shown representing the same data, after applying a transformation. Various transformations can be done to transform a time series into a sequence, such as the popular SAX transformation. After performing the transformation, any sequential pattern mining algorithm can be applied.
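As a simple illustration of such a discretization (much simpler than the real SAX transformation, which also averages segments and uses Gaussian breakpoints), here is a small Python sketch that maps a time series to a sequence of symbols using equal-width bins:

```python
# Small sketch: discretizing a time series into a sequence of symbols with equal-width bins.
time_series = [2.0, 2.5, 3.9, 4.1, 8.0, 8.2, 7.5, 3.0, 2.1]
symbols = "abcd"  # 4 levels: 'a' for the lowest values, 'd' for the highest values

low, high = min(time_series), max(time_series)
width = (high - low) / len(symbols)

sequence = [symbols[min(int((value - low) / width), len(symbols) - 1)]
            for value in time_series]
print("".join(sequence))  # prints "aabbdddaa"
```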
Where can I get Sequential pattern mining implementations?
To try sequential pattern mining with your datasets, you may try the open-source SPMF data mining software, which provides implementations of numerous sequential pattern mining algorithms: https://www.philippe-fournier-viger.com/spmf/
It provides implementations of several algorithms for sequential pattern mining, as well as several variations of the problem such as discovering maximal sequential patterns, closed sequential patterns and sequential rules. Sequential rules are especially useful for the purpose of performing predictions, as they also include the concept of confidence.
What are the current best algorithms for sequential pattern mining?
There exist several sequential pattern mining algorithms. Some of the classic algorithms for this problem are PrefixSpan, Spade, SPAM, and GSP. However, in the last decade, several novel and more efficient algorithms have been proposed, such as CM-SPADE and CM-SPAM (2014), and FCloSM and FGenSM (2017), to name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining, such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining).
In this blog post, I have given a brief overview of sequential pattern mining, a very useful set of techniques for analyzing sequential data. If you want to know more about this topic, you may read the following recent survey paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms for sequential pattern mining, extensions, research challenges and opportunities.
Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
In this blog post, I will discuss an important question for young researchers: is it better to try to write more papers or to write fewer but better papers? In other words, what is more important: quantity or quality in research?
To answer this question, I will first explain why quantity and quality are important, and then I will argue that a good trade-off needs to be found.
Quantity
There are several reasons why quantity is important:
Quantity shows that someone is productiveand can have a consistent research output. For example, if someone has published 4 papers each year during the last four years, it approximately shows what can be expected from that researcher in terms of productivity for each year. However, if a researcher has an irregular research output such as zero papers during a few years, it may raise questions about the reasons why that researcher did not writepapers. Thus writing more show to other people that you are more active.
Quantity is correlated with research impact. Even though, writing morepapers does not means that the papers are better, some studies have shown a strong correlation between the number of papers and the influence of researchers in their field. Some of reasons may be that (1) writing morepapers improve your visibility in your field and your chances of being cited, (2) if you are more successful, you may obtain more resources such as grants and funding, which help you to writemorepapers, and (3) writing more may improve your writing skills and help you to writemore and betterpapers.
Quantity is used to calculate various metrics to evaluate the performance of researchers. In various countries and institutions, metrics such as the number of papers and the number of citations are used to evaluate research performance. Although these metrics are imperfect, they are often used because they allow a researcher to be evaluated quickly without reading each of their publications. Metrics such as the number of citations are also used on some websites such as Google Scholar to rank articles.
Quality
The quality of papers is important for several reasons:
Quality shows that you can do excellent research. It is often very hard to publish in top-level journals or conferences. For example, some conferences have an acceptance rate of 5% or even less, which means that out of 1000 submitted papers, only about 50 are accepted. If you can get some papers into top journals and conferences, it shows that you are among the best researchers in your field. On the contrary, if someone only publishes papers in weak and unknown journals and conferences, it will raise doubts about the quality of the research and about their ability to do research. Publishing in unknown conferences or journals can be seen as something negative that may even decrease the value of a CV.
Quality is also correlated with research impact. A paper published in a top conference or journal has more visibility and therefore a better chance of being cited by other researchers. On the contrary, papers published in small or unknown conferences are more likely to go uncited.
A trade-off
So what is the best approach? In my opinion, both quantity and quality are important. It is especially important for young researchers to write several papers to kickstart their career, fill their CV to apply for grants, and obtain their diplomas. But having some quality papers is also necessary. A few good papers in top journals and conferences can be worth much more than many papers in weak conferences. For example, in my field, a paper in a conference like KDD or ICDM could be worth more than 5 or 10 papers in smaller conferences. But the danger of putting too much emphasis on quality is that the research output may become very low if the papers are not accepted. Thus, I believe that the best approach is a trade-off: (1) once in a while, write some very high-quality papers and try to get them published in top journals and conferences, and (2) at other times, write papers for easier journals and conferences to increase overall productivity and get some papers published.
A researcher should also be able to evaluate whether a given research project is suitable for a high-level conference or journal based on the merit of the research, and whether the research needs to be published quickly (for very competitive topics). Thus, a researcher should decide, for each paper, whether it should be submitted to a high-level conference/journal or to an easier venue.
However, there should always be a minimum quality requirement for papers, since publishing bad or very weak papers can harm your CV. Thus, even when aiming for quantity, one should ensure that a minimum quality requirement is met. For example, since my early days as a researcher, I have set a minimum quality requirement that all my papers be published by a well-known publisher (ACM, IEEE, Springer or Elsevier) and be indexed in DBLP (an index for computer science). For me, this is the minimum; I will often aim at a good or excellent conference/journal depending on the project.
Hope that you have enjoyed this post. If you like it, you can continue reading this blog and follow me on Twitter (@philfv).
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
Many researchers use Microsoft Word for writing research papers. However, Microsoft Word has several problems and limitations. In this blog post, I will discuss the use of LaTeX as an alternative to Microsoft Word for writing research papers.
What is LaTeX?
LaTeX is a document preparation system, proposed in the 1980s. It is used to create documents such as research papers, books, or even slides for presentations.
The key difference between LaTeX and software like Microsoft Word is that Microsoft Word lets you directly edit your document and immediately see the result, while using LaTeX is a bit like programming. To write a research paper using LaTeX, you write a text file with the .tex extension using a formatting language that roughly indicates how your paper should look. Then, you run the LaTeX engine to generate a PDF file of your research paper. The following figure illustrates this process:
In the above example, I have created a very simple LaTeX document (Example.tex) and then I have generated the corresponding PDF for visualization (Example.pdf).
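To give a concrete idea, here is a minimal sketch of what such a .tex file could contain (the file name and content below are illustrative placeholders, not the exact Example.tex from the figure):

% Example.tex -- a minimal LaTeX document (illustrative sketch)
\documentclass{article}
\begin{document}
\section{Introduction}
This is a very simple document. Running a LaTeX engine on this file
produces a PDF (for example, Example.pdf).
\end{document}

Compiling it with a command such as pdflatex Example.tex produces the corresponding PDF file.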
Why use LaTeX?
There are several reasons why many researchers prefer LaTeX to Microsoft Word for writing research papers. I will explain some of them, and then I will also discuss some problems with using LaTeX.
Reason 1: LaTeX papers generally look better
LaTeX papers often look better than papers written using Microsoft Word. This is especially true for fields like computer science, mathematics and engineering, where mathematical equations are used. To illustrate this point, I will show you some screenshots of a paper that I wrote for the ADMA 2012 conference a few years ago. For this paper, I made two versions: one using the Springer LNCS LaTeX template and the other using the Springer LNCS Microsoft Word template.
This is the first page of the paper.
The first page is quite similar in both versions. The main difference is the font, since LaTeX uses its own default font. Personally, I prefer the default LaTeX font. Now let's compare how the mathematical equations appear in LaTeX and Word.
Here, we can see that the mathematical symbols are more beautiful using LaTeX. For example, the set union and subset inclusion operators are, in my opinion, quite ugly in Microsoft Word: the set union operator in Word looks too much like the letter "U". In this example, the mathematical equations are quite simple, but LaTeX really shines when displaying more complex mathematical content, for example matrices.
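To give a rough idea of how such notation is written, here is a generic snippet (these are not the exact equations from my paper, just an example of the syntax):

% A generic example of mathematical notation in LaTeX
\documentclass{article}
\begin{document}
Let $X$ and $Y$ be two itemsets. Then $X \cup Y \subseteq I$, where $I$
is the set of all items; the union and subset symbols are rendered with
LaTeX's mathematical fonts.
\end{document}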
Now let's look at another paragraph of text from the paper to further compare the appearance of the Word and LaTeX versions:
In the above picture, it can be argued that both the LaTeX and Word papers look quite similar. For me, the big difference is again the font: the Springer Word template uses Times New Roman, while LaTeX has its own default font, which I prefer. Also, I think that the URLs look better in LaTeX when using the url package.
Reason 2: LaTeX is available for all platforms
The LaTeX system is free and available for most operating systems, and documents will look the same on all operating systems.
To install LaTeX on your computer, you need to install a LaTeX distribution such as MiKTeX (https://miktex.org/). After installing LaTeX, you can start working on LaTeX documents using a plain text editor such as Notepad. However, it is more convenient to also install a LaTeX editor such as TeXworks or WinShell. Personally, I use TeXworks. This is a screenshot of my working environment using TeXworks:
I open my LaTeX document in the left window, and the right window displays the PDF generated by LaTeX. Thus, I can edit the LaTeX code of my documents on the left and see the result on the right.
If you want to try LaTeX without installing it on your computer, you can use an online LaTeX editor such as ShareLaTeX (http://www.sharelatex.org) or Overleaf. With these editors, it is not necessary to install LaTeX on your computer. I personally sometimes use ShareLaTeX, as it also has some functions for collaboration (history, chat, etc.), which is very useful when working on a research paper with other people.
Reason 3: LaTeX offers many packages
Besides the basic functionality of LaTeX, you can install hundreds of packages to add more features. If you use MiKTeX, for example, there is a tool called the MiKTeX package manager that lets you choose and install packages. There are packages for almost everything, from displaying algorithms to drawing chessboards. For example, here is some algorithm pseudocode that I have written in one of my recent papers using a LaTeX package called algorithm2e:
As you can see, the presentation of the algorithm is quite nice. Doing the same in Word would be very difficult; for example, it would be quite hard to add a vertical line for the "for" loop using Microsoft Word.
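To give an idea of the syntax, here is a small generic pseudocode sketch using algorithm2e (this is not the algorithm from my paper; the vlined option is what draws the vertical lines mentioned above):

% A generic algorithm2e example (not the algorithm from my paper)
\documentclass{article}
\usepackage[ruled,vlined]{algorithm2e}
\begin{document}
\begin{algorithm}
  \KwIn{a transaction database $D$ and a threshold $minsup$}
  \KwOut{the set $F$ of frequent items}
  $F \leftarrow \emptyset$\;
  \ForEach{item $i$ appearing in $D$}{
    \If{the support of $i$ is no less than $minsup$}{
      add $i$ to $F$\;
    }
  }
  \Return{$F$}\;
  \caption{A toy frequent item search}
\end{algorithm}
\end{document}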
Reason 4: You don’t need to worry about how your document will look
When writing a LaTeX document, you don’t need to worry about how your final document will look. For example, you don’t need to worry about where the figures and tables will appear or where the page breaks will be. All of this is handled by the LaTeX engine during the compilation of your document. When writing a document, you only need to use some basic formatting instructions, such as indicating where a new section starts. This lets you focus on writing.
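As a small hypothetical illustration, a figure can be inserted without deciding exactly where it will end up on the page; LaTeX chooses the final placement (the file name results.png is just a placeholder):

% LaTeX decides where the floating figure is placed
\documentclass{article}
\usepackage{graphicx}
\begin{document}
\section{Experimental Results}
Figure~\ref{fig:results} shows the results of the experiment.
\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.7\linewidth]{results.png} % placeholder file name
  \caption{Results of the experiment.}
  \label{fig:results}
\end{figure}
\end{document}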
Reason 5: LaTeX can generate and update your bibliography automatically
Another reason for using LaTeX is that it can generate the bibliography of a document automatically. There are different ways of writing a bibliography using LaTeX. One of the most common ways is to use a .bib file, which provides a list of references that can be used in your document. You can then cite these references in your .tex document using the \cite{} command, and the bibliography will be generated automatically.
I will illustrate this with an example:
A) I have created a LaTeX document (a .tex file) in which I cite a paper called “efim” using the LaTeX command \cite{efim}.
B) I have created a corresponding .bib file that provides bibliographical information about the “efim” paper.
C) I have generated the PDF file using the .tex file and the .bib file. As you can see, the \cite{} command has been replaced by the reference number 25, and the corresponding entry 25 has been automatically generated in the correct format for this paper and added to the bibliography.
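Here is a minimal sketch of how such a setup could look (the .bib entry is shown with placeholder fields rather than the real bibliographic data):

% paper.tex -- a minimal document that cites one reference (sketch)
\documentclass{article}
\begin{document}
The EFIM algorithm~\cite{efim} is discussed in this paper.
\bibliographystyle{plain}  % the style controls how references are formatted
\bibliography{references}  % reads the file references.bib
\end{document}

% --- references.bib (placeholder fields) ---
% @inproceedings{efim,
%   author    = {...},
%   title     = {...},
%   booktitle = {...},
%   year      = {...}
% }

To resolve the citation numbers, the typical workflow is to run the LaTeX engine, then BibTeX, then the LaTeX engine again.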
The bibliography generation feature of LaTeX can save researchers a lot of time, especially for documents containing many references such as theses, books, and journal papers.
Moreover, once you have created a .bib file, you can reuse it in many different papers. It is also very easy to change the style of your bibliography. For example, switching from the APA style to the IEEE style can be done almost automatically, which saves a lot of time.
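For instance, with standard BibTeX styles, this is often a one-line change (assuming the corresponding style files, such as IEEEtran, are installed with your distribution):

% Switching the reference formatting style
% \bibliographystyle{apalike}   % an APA-like author-year style
\bibliographystyle{IEEEtran}    % the IEEE numeric style
\bibliography{references}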
Microsoft Word provides a basic tool for generating a bibliography, but it offers far fewer features than LaTeX.
Reason 6: LaTeX works very well for large documents
LaTeX also provides many features that are useful for large documents such as Ph.D. theses and books. These features include generating tables of contents and lists of figures, and dividing a document into several files. Some of these features also exist in Microsoft Word but are not as flexible as in LaTeX. I personally wrote both my M.Sc. and Ph.D. theses using LaTeX and saved a lot of time by doing so. I simply downloaded the LaTeX style file from my university and used it in my LaTeX document; after that, my whole thesis was properly formatted according to the university style without much effort.
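As a rough sketch (the chapter file names are placeholders), a thesis-style document can be organized like this:

% thesis.tex -- a sketch of a large document split into several files
\documentclass{report}
\begin{document}
\tableofcontents   % generated automatically
\listoffigures     % generated automatically
\include{introduction}   % contents of introduction.tex
\include{relatedwork}    % contents of relatedwork.tex
\include{conclusion}     % contents of conclusion.tex
\end{document}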
Problems of LaTeX
Now, let’s talk about the disadvantages or problems of using LaTeX. The first problem is the learning curve. LaTeX is not that difficult to learn, but it is harder than Word: you need to learn various commands for preparing LaTeX documents, and some errors are not easy to debug. However, the good news is that there are good places to ask questions and obtain answers when encountering problems with LaTeX, such as TeX StackExchange (http://tex.stackexchange.com/). There are also free books, such as The Not So Short Introduction to LaTeX, that are quite good for learning LaTeX and that I use as a reference. Despite the learning curve, I think learning LaTeX is an excellent investment for researchers. Moreover, some journals actually only accept LaTeX papers.
The second problem with LaTeX is that it is not really necessary for writing simple documents. LaTeX is best used for large documents, documents with complex layouts, or special needs such as displaying mathematical equations and algorithms. I personally use LaTeX only for writing research papers; for other things, I use Microsoft Word. Some people also use LaTeX for preparing slides with packages such as beamer, instead of using PowerPoint. This can be useful for a presentation with a lot of mathematical equations.
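For example, a minimal beamer presentation could look like the following generic sketch:

% slides.tex -- a minimal beamer presentation (generic sketch)
\documentclass{beamer}
\begin{document}
\begin{frame}{A slide with an equation}
  A well-known identity: $\sum_{i=1}^{\infty} \frac{1}{i^2} = \frac{\pi^2}{6}$.
\end{frame}
\end{document}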
Conclusion
In this blog post, I have discussed the use of LaTeX for writing research papers. I hope that you have enjoyed it.
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.