In this blog post, I will discuss one of my current projects. I have recently been working with my colleague Chun-Wei Lin on launching a new journal, titled “Data Science and Pattern Recognition”.
This is a new open-access journal, with a focus on data science, big data, and pattern recognition. The journal will be published by the Taiwanese publisher Ubiquitous International, which already publishes several SCI journals. In the next year, we will be pushing hard to make this journal a high level journal. For this purpose, we have worked hard on creating an outstanding editorial board with world-class researchers in the fields of data science and pattern recognition.
It is our goal to make the journal EI and SCI indexed in the coming years. Currently, there is no publication fee for publishing in this journal, to help the journal get established. But there will likely be one in the future, since it is an open-access journal. Thus, now is a good time to submit your papers!
The first issue will be published around March 2017 with 6 papers, and there will be 4 issues per year, for a total of approximately 24 papers per year. In the first issue, I will publish a survey paper on pattern mining that I have been working on recently, among several other good papers.
If you have any questions about the journal, you can have a look at the website, let me know directly, or use the contact e-mail on the journal website.
— Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
In this blog post, I will discuss the job of university professor, and why I have chosen to become one. This post is especially aimed at students who are considering working in academia after their Ph.D.
What is the job of a university professor?
When I was an undergraduate student, I did not know much about what the professors at my university were really doing besides teaching. In general, there are three main tasks that a professor must do:
Teaching. This consists of teaching courses to students. This is a very important task, since it is the reason for having professors in universities. The number of courses that a professor teaches every year can vary greatly depending on the university and the rank of the professor. Some universities are known to put more emphasis on research, while others put more emphasis on teaching. When I was working as a professor in Canada, I first started with about 220 hours (5 courses) / year, which did not give me much free time for doing research. Then, after receiving some funding and having several graduate students in my team, my teaching load was reduced to about 3 courses / year to let me do more research. Now in China, in 2016, I have taught about 100 hours.
Research. The second task that a professor must do is to carry out innovative research that leads to the publication of research papers. Why does a professor do research? Some key reasons are to attract top talent to the university, and to ensure that professors stay active and keep their knowledge up to date. Now, what is the difference between the research that a professor does and the research that a graduate student does? The main difference is that a professor is expected to carry out a long-term research program and have several students carrying out research in his team. It is a bit like going from the job of an employee to being a team leader or business manager, in the sense that the professor has to manage not only his own research projects but generally also the research of a team, and have a clear plan for the years to come. Many of the research-related tasks that a professor does are related to the management of his team. For example, a professor typically has to apply for funding from the government, or find projects with companies to supply funds to his research team. This may involve writing numerous grant applications and attending many meetings.
Service to the community. The third task that a professor must do is to provide service to the community. This can be various activities at the university level, national level, or international level. For example, when I was a professor in Canada, I was involved in a programming competition for high schools, and in recruiting students for our university. I also helped to organize a LAN party for students every year, and other activities with students. Another task that I was doing was to evaluate the applications of master's degree students applying to our program at the university. This latter task was consuming many hours of my time every week. At the national and international level, my service to the community has included tasks such as reviewing articles for conferences and journals, being the editor of a journal, and being the founder of an open-source data mining software.
Thus, to be a professor, one should enjoy doing these three tasks. If one only enjoys teaching but does not enjoy research, then perhaps it is best to become a lecturer (a professor not carrying out research, in the North-American system of education). Or if one only enjoys research but does not enjoy teaching, then it is probably best to become a researcher rather than a professor, and work in the industry. But there are some exceptions. Nonetheless, a professor should ideally enjoy doing all these tasks, and should do them well.
What are the advantages/disadvantages of being a professor?
As with any other job, there are advantages and disadvantages to choosing this one. Let's analyze this:
Salary: Depending on the country and university, the salary of a university professor can be quite high. But it is generally lower than working as a researcher in the industry. Thus, the motivation should not only be about money.
Schedule: One of the greatest advantages and disadvantages of being a professor is the work schedule. A professor may be extremely busy. He may have to work during weekends and evenings, and often more than 10 hours a day to keep up with research, teaching and other tasks. The first few years of being a professor can be really hard in terms of schedule. The reason is that a new professor typically has to teach, prepare many new courses, apply for funding, and set up his research team. This is quite different from the life of a Ph.D. student, who can often concentrate only on research. After a few years, the job of a professor becomes easier. However, although the schedule of a professor is very busy, what I like about it is the freedom in how to organize my time. A professor may typically decide to work at home if he is not teaching, and may decide to wake up late but finish working late at night. This is different from working in a company.
Traveling: Another thing that I like about being a professor is the opportunity for traveling. A professor typically has to attend international conferences to present his research and meet other researchers, and may also visit other universities for collaborations. This is interesting from the research perspective. But it is also interesting from a personal perspective. Of course, it depends on the funding. Not every professor has funding for attending international conferences and traveling.
Being your own boss: Another advantage of being a professor is that a professor generally has a lot of freedom in choosing the topic of his research. A professor may decide to work on topics that he likes rather than work on the projects of other people (as someone would typically do when working in a company). I often think of being a professor as running a business. The professor must decide on the research directions and manage his team. I enjoy this freedom of being able to work on the topics that I like, and also to publish the results of my research freely as open-source programs that can benefit anyone.
What is required to become a university professor?
In terms of academic skills, good universities require a Ph.D. degree. But having a Ph.D. is often not enough. It is also generally required to have several publications in good conferences and journals. Now, if we analyze this in more detail, a professor should have the following skills:
Teaching: A professor must be able to explain concepts simply, clearly, and concisely, so that students can learn efficiently. A professor must also be able to make classes enjoyable rather than boring, and prepare courses, assignments, and exams well.
Carrying out research: A professor must be good at doing research. This includes being able to find interesting research questions and to find innovative solutions to these problems.
Writing: A professor must also be good at writing. This is important, as a professor must write journal papers, conference papers, grant applications, and many other documents. Being good at writing is related to being good at teaching, since writing often requires explaining concepts in simple ways so that the reader can understand (just like teaching).
Managing a team / sociability: A professor should be able to manage a team and also, ideally, establish collaborations with other researchers. Thus, some management and social skills are required.
Being good in his field: A professor must also be good in his field of study. For example, in the field of computer science, a professor should ideally also be a good programmer. This is different from being a good researcher, as a good programmer is not necessarily a good researcher, and a good researcher is not necessarily a good programmer.
Being a workaholic 😉: Actually, not everybody can work, or likes to work, 10 hours or more every day. Not every professor works very hard, but becoming a professor still generally requires a huge amount of work, and the first years on the job can be quite hard. When I was an undergraduate student, I saw it as a challenge to see how far I would be able to go in academia, and since I like working, the amount of work did not put me off. I have worked very hard over the last 10 years. For example, during my studies, I typically took just a few days off during a whole year, and worked every day from morning until evening. For me, it has been worth it.
It might seem like a long list of skills to have. But actually, all these skills can be developed over time. When I was an undergraduate student, for example, I did not know how to write well. I remember that the first paper that I attempted to write by myself during my master's degree was quite terrible. Over the years, during my master's degree and Ph.D., I greatly improved my writing skills by writing many papers. I also greatly improved my research and teaching skills. In terms of teaching, it requires some practice and dedication to become a good teacher.
Conclusion
So why did I choose to become a professor? The short answer is that I really like research and the freedom of doing my own research. I also like being able to manage my own schedule, traveling, and teaching. Would I move to the industry someday? No. Even if I could probably earn more in the industry, I am happy doing what I am doing in academia.
If you have enjoyed reading this blog post, you can continue reading this blog, which has many other posts related to academia and data mining.
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.
In this blog post, I will introduce the popular data mining task of clustering (also called cluster analysis). I will explain the goal of clustering, and then introduce the popular K-Means algorithm with an example. Moreover, I will briefly explain how an open-source Java implementation of K-Means, offered in the SPMF data mining library, can be used.
What is clustering?
Clustering is one of the most fundamental data mining tasks.
Consider that you have a database containing several records (which we will call instances). For example, consider the following database describing several bank customers.
AGE NATIONALITY INCOME GENDER ACCOUNT_BALANCE ...
25 Canada 3000 M 50
42 Brazil 4000 F 500
...
In this example, each instance (customer) is described according to some attributes such as age, nationality, income, gender, and account_balance.
The goal of clustering is to automatically find groups of instances (e.g. customers) that are similar in a database. For example, for the database of bank customers, clustering the customers consists of automatically grouping customers that have a similar profile. For example, a clustering algorithm could perhaps find that many rich female customers have a similar profile, or that many young Canadians with a low income have a similar profile (this is just an example).
Thus, the goal of clustering is to find groups of instances having similar characteristics in a database. These groups of instances found by clustering are called clusters. In other words, the goal of clustering is to find clusters of similar instances.
Let's have a look at a second example before giving a more formal definition of what clustering is. This time, we will consider a database of 2D points. Consider the following database containing 31 two-dimensional points.
Instance 1: (1, 1)        Instance 17: (16, 16)
Instance 2: (0, 1)        Instance 18: (11.5, 8)
Instance 3: (1, 0)        Instance 19: (13, 10)
Instance 4: (11, 12)      Instance 20: (12, 13)
Instance 5: (11, 13)      Instance 21: (14, 12.5)
Instance 6: (13, 13)      Instance 22: (14.5, 12.5)
Instance 7: (12, 8.5)     Instance 23: (14.5, 11.5)
Instance 8: (13, 8)       Instance 24: (15, 10.5)
Instance 9: (13, 9)       Instance 25: (15, 9.5)
Instance 10: (13, 7)      Instance 26: (12, 9.5)
Instance 11: (11, 7)      Instance 27: (10.5, 11)
Instance 12: (8, 2)       Instance 28: (10, 10.5)
Instance 13: (9, 2)       Instance 29: (9, 3)
Instance 14: (10, 1)      Instance 30: (9, 4)
Instance 15: (7, 13)      Instance 31: (9, 5)
Instance 16: (5, 9)
Each instance (point) in this database is described by two attributes: the X and Y coordinates. This database can be represented visually using an XY chart:
It can be observed in these figures that some points are quite close to each other, while others are quite far apart. In this context, the goal of clustering is to find groups of points that are similar (close to each other). By applying a clustering algorithm such as K-Means, three clusters could be found (represented by the blue, red and green colors):
Intuitively, these clusters somewhat make sense as they are groups of points that are close to each other.
Clustering has many applications. It can be used, for example, to cluster similar documents into categories (e.g. automatically classifying news articles on a website).
Clustering algorithms
To perform clustering, it is necessary to apply a clustering algorithm. There exist hundreds of clustering algorithms for finding hidden clusters in data. Different algorithms have different advantages and disadvantages, and can be more or less suitable for different types of data. Some of the most popular algorithms are, for example, K-Means and DBScan.
In the following paragraphs, I will briefly explain how the K-Means algorithm works.
The K-Means algorithm
The K-Means algorithm was proposed in 1967 by MacQueen. This algorithm has two main parameters: (1) a database, and (2) a positive integer K representing the number of clusters to be extracted from the database.
The K-Means algorithm consists of the following steps:
(1) The algorithm reads the database in memory. The database contains several instances.
(2) The algorithm initializes K empty clusters. Each cluster has a prototype, which is a randomly generated instance.
(3) Each instance in the database is assigned to the cluster having the closest prototype.
(4) Then, the prototype of each cluster is recomputed as the average of all the instances in that cluster.
(5) Then, Steps 3 and 4 are repeated until the clusters become stable (a minimal code sketch of these steps is given below).
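To make these steps more concrete, here is a minimal Java sketch of K-Means. This is an illustration, not the SPMF implementation; also, for simplicity, the initial prototypes are sampled from the data points (a common variant), whereas the description above generates random instances.

import java.util.Random;

public class KMeansSketch {

    // Returns the cluster index assigned to each instance.
    public static int[] kMeans(double[][] points, int k, Random random) {
        int n = points.length, dim = points[0].length;
        // (2) Initialize K clusters; each prototype is a randomly chosen instance
        double[][] prototypes = new double[k][];
        for (int c = 0; c < k; c++) {
            prototypes[c] = points[random.nextInt(n)].clone();
        }
        int[] assignment = new int[n];
        boolean changed = true;
        // (5) Repeat steps 3 and 4 until the clusters become stable
        while (changed) {
            changed = false;
            // (3) Assign each instance to the cluster with the closest prototype
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDistance = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = euclideanDistance(points[i], prototypes[c]);
                    if (d < bestDistance) { bestDistance = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // (4) Recompute each prototype as the average of its instances
            for (int c = 0; c < k; c++) {
                double[] sum = new double[dim];
                int count = 0;
                for (int i = 0; i < n; i++) {
                    if (assignment[i] == c) {
                        for (int j = 0; j < dim; j++) sum[j] += points[i][j];
                        count++;
                    }
                }
                for (int j = 0; count > 0 && j < dim; j++) {
                    prototypes[c][j] = sum[j] / count;
                }
            }
        }
        return assignment;
    }

    // Euclidean distance between two instances
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) sum += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(sum);
    }
}

For example, calling kMeans(points, 3, new Random()) on the 31 points above would produce three clusters similar to those shown in the figures that follow.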
Now, let me illustrate these steps with an example so that they become clear. By applying K-Means on the database of 2D points: (1) K-Means will first load the database of 31 points in memory. Then, assume that the user has set K = 3 to generate 3 clusters. (2) K-Means will initially create three empty clusters, which will have some random points as prototypes. Let's say that these three points (prototypes) are (2,2), (16, 4) and (11, 0). We can represent this visually as follows (where the prototypes are represented by the X symbol):
Then (3) K-means will assign each of the 31 points to the cluster having the closest prototype. The result will be as follows:
Then (4) the prototype of each cluster is recomputed as the average of the points that it contains. The new prototypes are approximately (1.75, 2.75), (9.1, 5.2) and (12.9, 10.7) for the three clusters of points.
Then (3), each point is assigned to the cluster having the closest prototype. The result is the clusters shown in the picture below. As you can see, some points have moved from one cluster to another.
Then, step (4) is applied again to recompute the prototype of each cluster as the average of its points. The new prototypes are (0.7, 0.7), (12.4,10.8) and (8.7,4.1).
Then, step (3) is applied again. K-Means assigns each point to the cluster having the closest prototype. Since the clusters did not change after this step, the K-Means algorithm stops, and the final result is the following three clusters, here displayed with colors:
An open-source Java implementation of K-Means
If you want to try the K-Means algorithm with the above example by yourself, a Java implementation of K-Means is provided in the SPMF library.
To run the example, you will need to have Java 1.8 or higher installed on your computer.
(1) The first step is to download the database of points, which can be obtained here:
(2) Then, you will need to download the release version of SPMF, which is a file called spmf.jar, and can be obtained from the download page of SPMF.
(3) Then, double-click on the file spmf.jar to launch SPMF. If it does not work, it means that there is a problem with the Java installation on your computer. If it works correctly, the user interface of SPMF should open:
(A) Choose the KMeans algorithm in the combo box.
(B) Choose the input file as “inputDBScan2.txt”.
(C) Set the output file as “text.txt”. The output of the algorithm will be written to that file.
(D) Set the parameter K to 3.
(E) To calculate how close the instances are to each other, we need to specify a distance function. For this example, we should use the Euclidean distance, as it is the typical distance function for comparing 2D points.
(F) Select the “cluster viewer” to see the results as a chart.
(G) Click the “Run algorithm” button to launch the K-Means algorithm.
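Note that SPMF can also be run from the command line without the graphical interface, using the syntax java -jar spmf.jar run <algorithm> <input> <output> <parameters> described in the SPMF documentation. For the example above, the command should look like the following (a sketch; check the documentation page of the KMeans algorithm for the exact parameter order and the spelling of the distance function name):

java -jar spmf.jar run KMeans inputDBScan2.txt text.txt 3 euclidian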
The result will be displayed in a window, such as this:
Note that it is quite possible that the result that you obtain is not exactly the same. The reason is that K-Means is a randomized algorithm. In other words, K-Means utilizes random numbers. Thus, if K-Means is run several times, it may not always generate the same result.
Why is the K-Means algorithm popular?
Having presented the K-Means algorithm, let's now briefly discuss its characteristics in more detail. Why is K-Means popular? The main reason is that it is a very simple algorithm. It is easy to implement and also easy to understand.
A drawback of the K-Means algorithm is that the final result depends on how the initial prototypes are randomly selected. If we are lucky, we may find some nice clusters, as in the above example, that somewhat make sense. However, if the initial points are not chosen in an appropriate way, the result may not be meaningful. There exist various solutions to this problem, such as running K-Means several times and then choosing the best set of clusters.
Another limitation of K-Means is that the user must explicitly specify the number of clusters to be found (the K parameter). But finding the best value for the K parameter may require trying several values.
To address these limitations, various extensions of K-Means have been proposed. Besides, many other clustering algorithms have been proposed.
Conclusion
This blog post has given an overview of the task of clustering, a fundamental data mining task. The goal was to give a simple introduction. The post has explained how the K-Means algorithm works and has shown how to use the SPMF implementation of K-Means. In a future blog post, I will explain the DBScan clustering algorithm, which is also offered in SPMF.
For the interested reader, a good introduction to clustering can be found in the book “Introduction to data mining” by Tan, Steinbach and Kumar. The first chapter about clustering is freely available here.
By the way, you can also try a demo version of K-Means in your browser for teaching purposes with the K-Means demo tool.
Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
This blog post briefly explains how time series data mining can be performed with the Java open-source data mining library SPMF (v.2.06). It first explains what a time series is, and then discusses how data mining can be performed on time series.
What is a time series? What is the difference with sequences?
There are two main types of sequential data considered in data mining: sequences and time series.
A time series is an ordered list of numbers. For example, the figure below shows a time series. This time series could represent, for example, temperature readings on different days, collected with sensors. Time series are also used to represent many other types of data, such as stock market data.
A time series
On the other hand, a sequence is an ordered list of nominal values (symbols). For example, a sequence is shown below.
a, b, c, e, g, h, a, b, c, e
This sequence indicates that an item (symbol) a was followed by an item b, which was followed by an item c, then e, g, h, a, b, c, and e, in that order. Sequences also have many real-life applications. They can be used, for example, to represent sentences in texts (sequences of words), sequences of items purchased by customers in retail stores, and sequences of webpages visited by users.
How to analyze time series?
Both time series and sequences can be analyzed using data mining techniques to discover interesting patterns for understanding the data or for decision-making. Since time series are a type of numeric data, and sequences are symbolic data, the traditional techniques for analyzing time series and sequences are quite different. However, it is possible to convert time series to sequences by discretizing the time series (transforming the numbers into symbols). Then, techniques for analyzing sequences can also be applied to analyze the time series.
There exist several ways of transforming time series to sequences. One of the most popular techniques is the SAX algorithm (Lin et al., 2007). But many other techniques and variations exist. The SPMF open-source data mining library discussed in this blog post offers a fast Java implementation of the SAX algorithm. Let's see how it works.
Discretizing a time series
Consider the time series shown in the previous figure. It contains n = 11 data points. To apply the SAX algorithm, it is necessary to specify a number of symbols v and a number of data points w. Let's say that the number of symbols v is 4 and that the number of data points w is 8. The SAX algorithm will perform two steps.
First, it will convert the time series from 11 data points to 8 data points. To do this, it will split the time series into 8 segments and replace each segment by its average. This is called the piecewise aggregate approximation (PAA) of the time series. Visually, this gives the following result.
Transforming a time series to its PAA representation
As can be seen above, transforming a time series to its PAA representation is a way of reducing the dimensionality of a time series by reducing the number of data points.
Second, the SAX algorithm transforms the PAA representation of the time series into symbols. Each data point is replaced by a symbol. The number of symbols is selected by the user (here, we assume v = 4 symbols for this example). Now, the main question is: how are the symbols chosen? The main idea in SAX is to assume that values follow a normal distribution, and to choose the symbols to represent various intervals of values such that each interval is equally probable under the normal distribution (for more details, see the paper of Lin et al., 2007). Applying the SAX algorithm implementation of SPMF on the time series produces the following result.
Four symbols are created: a = [-Infinity, 4.50], b = [4.50, 6.2], c = [6.2, 7.90], d = [7.90, Infinity]
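These breakpoints follow from the equal-probability idea: if the values are assumed to follow a normal distribution with mean \mu and standard deviation \sigma, the v - 1 breakpoints \beta_1 < \dots < \beta_{v-1} separating the v symbols are chosen so that each interval is equally probable:

\Phi\!\left(\frac{\beta_i - \mu}{\sigma}\right) = \frac{i}{v}, \qquad i = 1, \dots, v - 1

where \Phi is the cumulative distribution function of the standard normal distribution. (This is a sketch of the idea; see Lin et al., 2007 for the exact procedure.)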
Then the result is a sequence of symbols: a, a, c, d, d, c, c, b. Let’s see this more visually in the following figure.
SAX representation of the time series
This sequence is the symbolic representation of the time series. After a time series has been converted to a sequence, it is possible to apply traditional pattern mining algorithms to it.
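To illustrate the two steps, here is a minimal Java sketch of the transformation. This is an illustration, not the SPMF implementation; the breakpoints are hardcoded from the example above, whereas a real implementation derives them from the normal distribution.

public class SaxSketch {

    // Step 1: piecewise aggregate approximation with w segments (assumes w <= n)
    public static double[] paa(double[] series, int w) {
        int n = series.length;
        double[] result = new double[w];
        for (int i = 0; i < w; i++) {
            // Simple integer split of the n points into w segments
            int start = i * n / w, end = (i + 1) * n / w;
            double sum = 0;
            for (int j = start; j < end; j++) sum += series[j];
            result[i] = sum / (end - start);
        }
        return result;
    }

    // Step 2: replace each PAA value by a symbol, using sorted breakpoints
    public static char[] toSymbols(double[] paa, double[] breakpoints) {
        char[] symbols = new char[paa.length];
        for (int i = 0; i < paa.length; i++) {
            char symbol = 'a';
            for (double breakpoint : breakpoints) {
                if (paa[i] > breakpoint) symbol++; else break;
            }
            symbols[i] = symbol;
        }
        return symbols;
    }
}

With the breakpoints of the example, toSymbols(paa(series, 8), new double[]{4.50, 6.2, 7.90}) would produce a sequence of symbols such as a, a, c, d, d, c, c, b.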
In the SPMF library, several algorithms are provided for performing data mining on sequences of symbols. However, most of them require several sequences of symbols, for example to find patterns common to several sequences (sequential pattern mining). If one has a single time series, this is not a problem, as it is also possible to split a time series into several sequences.
Splitting a time series
In SPMF, there are currently two ways of splitting a time series: (1) splitting it into segments having a given number of data points, or (2) splitting it into a given number of segments. For example, if we split the above time series into two segments, we would obtain:
Splitting a time series
Then, these two time series could be converted to sequences of symbols using the SAX algorithm, in order to apply traditional pattern mining algorithms, such as sequential pattern mining and sequential rule mining algorithms, and find patterns that are common to these sequences. Of course, this is just a toy example, so there are no interesting patterns to be found in these time series. 😉
Removing noise in a time series using the moving average
There exist various ways of removing noise in time series. One of the simplest is to calculate the moving average of the time series. To do this, the user must specify a window size z (a number of data points). Then, each data point of the time series is replaced by the average of the z previous points. For example, if we use again the same time series and set z to 3 points, the result is:
Moving average of a time series
As can be seen in this example, applying the moving average makes the time series smoother.
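A minimal Java sketch of this smoothing operation is shown below (an illustration; SPMF's exact handling of the first z - 1 points, where a full window is not yet available, may differ):

public static double[] movingAverage(double[] series, int z) {
    double[] result = new double[series.length];
    for (int i = 0; i < series.length; i++) {
        // Average the window of up to z points ending at position i
        int start = Math.max(0, i - z + 1);
        double sum = 0;
        for (int j = start; j <= i; j++) sum += series[j];
        result[i] = sum / (i - start + 1);
    }
    return result;
}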
Other possibilities: time series clustering and visualization
This blog post has given an overview of the main operations for time series mining in SPMF. More details can be found in the documentation on the SPMF website.
Note that it is also possible to apply clustering algorithms such as K-Means, also provided in SPMF, to automatically group similar time series (time series clustering).
Besides, a time series viewer is also integrated in SPMF for visualizing time series and performing basic operations such as zooming in, zooming out, and printing.
SPMF time series viewer
This time series viewer will also be improved in future versions of SPMF, with additional features for manipulating time series. Other features will also be added for time series mining. Any code submission related to time series is also welcome.
For those familiar with SPMF, all the time series operations are grouped inside a new category called “time series mining” in the user interface of SPMF:
Time series operations in SPMF 2.06
Conclusion
In this blog post, we have introduced time series data mining with the SPMF Java open-source data mining library. This gives a brief overview of the possibilities for time series mining. Of course, there exist numerous other algorithms for time series mining. But this already provides many possibilities.
If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also,if you want to support this blog, please tweet and share it!
References
Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 107–144 (2007)
This brief blog post discusses what not to do when applying for a M.Sc. or Ph.D. position in a research lab. The aim of this post is to give advice to those applying for such positions.
I had previously discussed this topic in another blog post, where I explained that it is important to send personalized e-mails to potential supervisors rather than sending the same e-mail to many professors. I will thus not explain that again.
In this post, I will rather emphasize another important aspect, which is to give a good impression of yourself to other people. I will discuss this using an e-mail that I received today:
From *****@***.sa
Subject: Apply For Scholars ship Ph.d
Sir Philippe Fournier IHOPE TO APPLY MY TO YOR PROGRAM OF DO PH.D ON COMPUTER SCIENCE ,I HAVE READ THAT YOU OFFER PH.D PROGRAM ON COMPUTER SCINCE AS GENERAL MY TOPIC IS WEP APPLICATION SECIRTY ALL THAT .I HAVE BE IN PAKISTAN FOR SEVEN YERS IDID MASTER THERE SO MY PROJECT WAS WEB SIDE FOR SINDH UNIVERSITY FROM THAT DATE 2010-7-21 IDID TWO PAPER PUBLICATION .WITH ALL OF INTERESTED TO WORK WITH YOU HOPE TO APPLY MY ON YOUR PROGRAMS. THANKS FOR YOR TIMEM
A person sending this type of e-mail has zero chance of getting a position in my team. Why?
It is poorly written. There are many typos and English errors. If this person cannot take the time to write an e-mail properly, then it gives the impression that this person is careless, and would do a bad job.
The person did not take the time to spell my full name correctly. Not spelling the name properly shows a lack of respect or shows carelessness. This is something to absolutely avoid.
The person asks to work on web security. I have never published a single paper on that topic. Thus, I would not hire that person. This person should contact a professor working on such topics.
The applicant does not provide his CV and gives very little information about himself. He mentions that he has two publications. But this does not mean anything if I don't know where they have been published. There are so many bad conferences and journals. An applicant should always send his CV, to avoid sending e-mails back and forth to obtain the information. I will often not answer e-mails with no CV attached, just because it wastes time to ask for a CV that should have been provided. Besides, when a CV is provided, it should be detailed enough. It is even better when a student can provide transcripts showing his grades from previous studies.
The applicant does not really explain why he wants to work with me or how he has found my profile. For a master's degree, this is not so important. But when applying for a Ph.D., it is expected that the applicant will choose his supervisor for a reason, such as common research interests (as I have discussed above).
That is all I wanted to write for today.
If you like this blog, you can subscribe to the RSS Feed or my Twitter account (https://twitter.com/philfv) to get notified about future blog posts. Also, if you want to support this blog, please tweet and share it!
This tutorial will explain how to analyze text documents to discover complex and hidden relationships between words. We will illustrate this with a Sherlock Holmes novel. Moreover, we will explain how hidden patterns in text can be used to recognize the author of a text.
The Java open-source SPMF data mining library will be used in this tutorial. It is a library designed to discover patterns in various types of data, including sequences, and it can also be used as a standalone software. Handling text documents is a new feature of the most recent release of SPMF (v.2.01).
Obtaining a text document to analyze
The first step of this tutorial is to obtain a text document to analyze. A simple way of obtaining text documents is to go to the website of Project Gutenberg, which offers numerous public domain books. I have chosen the novel “Sherlock Holmes: The Hound of the Baskervilles” by Arthur Conan Doyle. For the purpose of this tutorial, the book can be downloaded here as a single text file: SHERLOCK.text. Note that I have quickly edited the book to remove unnecessary information, such as the table of contents, which is not relevant for our analysis. Moreover, I have renamed the file so that it has the extension “.text”, so that SPMF recognizes it as a text document.
Downloading and running the SPMF software
The first task that we will perform is to find the k most frequent sequences of words in the text. We will first download the SPMF software from the SPMF website by going to the download page. On that webpage, there are detailed instructions explaining how the software can be installed. But for the purpose of this tutorial, we will directly download spmf.jar, which is the version of the library that can be used as a standalone program with a user interface.
Now, assuming that you have Java installed on your computer, you can double-click on spmf.jar to launch the software. This will open a window like this:
Discovering hidden patterns in the text document
Now, we will use the software to discover hidden patterns in the Sherlock Holmes novel. There are many algorithms that could be applied to find patterns. We will choose the TKS algorithm, which is an algorithm for finding the k most frequent subsequences in a set of sequences. In our case, a sequence is a sentence. Thus, we will find the k most frequent sequences of words in the novel. Technically, this type of pattern is called a skip-gram. Discovering the most frequent skip-grams is done as follows.
A) Finding the K most frequent sequences of words (skip-grams)
We will choose the TKS algorithm
We will choose the file SHERLOCK.text as input file
We will enter the name test.txt as output file for storing the result
We will set the parameter k of this algorithm to 10 to find the 10 most frequent sequences of words.
We will click the “Run algorithm” button.
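Note that this run can also be launched from the command line, without the graphical interface, using SPMF's documented syntax java -jar spmf.jar run <algorithm> <input> <output> <parameters>. For the example above, the command should look like this (a sketch; the documentation page of the TKS algorithm gives the exact parameter list):

java -jar spmf.jar run TKS SHERLOCK.text test.txt 10

The other algorithms used below, such as CM-SPAM and ERMiner, can be launched in the same way with their own parameters.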
The result is a text file containing the 10 most frequent patterns
The first line, for example, indicates that the word “the” is followed by “of” in 762 sentences of the novel. The second line indicates that the word “in” appears in 773 sentences. The third line indicates that the word “the” is followed by “the” in 869 sentences. And so on. We will next change the parameters to find consecutive sequences of words.
B) Finding the K most frequent consecutive sequences of words (ngrams)
The above patterns were not so interesting, because most of them are very short. To find more interesting patterns, we will set the minimum pattern length to 4. Moreover, another problem is that some patterns, such as “the the”, contain gaps between words. Thus, we will also specify that we do not want gaps between words, by setting the max gap constraint to 1. Moreover, we will increase the number of patterns to 25. This is done as follows:
We set the number of patterns to 25
We set the minimum length of patterns to 4 words
We require that there is no gap between words (max gap = 1)
We will click the “Run algorithm” button.
The result is the following patterns:
Now this is much more interesting. It shows some sequences of words that the author of Sherlock Holmes tends to use repeatedly. The most frequent 4-word sequence is “in front of us”, which appears 13 times in this story.
It would be possible to further adjust the parameters to find other types of patterns. For example, using SPMF, it is also possible to find all patterns having a frequency higher than a threshold. This can be done, for example, with the CM-SPAM algorithm. Let's try this.
C) Finding all sequences of words appearing frequently
We will use the CM-SPAM algorithm to find all patterns of at least 2 words that appear in at least 1 % of the sentences in the text. This is done as follows:
We choose the CM-SPAM algorithm
We set the minimum frequency to 1 % of the sentences in the text
We require that patterns contain at least two words
We will click the “Run algorithm” button.
The result is 113 patterns. Here are a few of them:
Here, there are some quite interesting patterns. For example, we can see that “Sherlock Holmes” is a frequent pattern appearing 31 times in the text, and that “sir Charles” is actually more frequent than “Sherlock Holmes”. Other patterns are also interesting and give some insights into the writing habits of the author of this novel.
Now let’s try another type of patterns.
D) Finding sequential rules between words
We will now try to find sequential rules. A sequential rule X -> Y is a sequential relationship between two unordered sets of words appearing in the same sentence. For example, we can apply the ERMiner algorithm to discover sequential rules between words and see what kind of results can be obtained. This is done as follows.
We choose the ERMiner algorithm
We set the minimum frequency to 1 % of the sentences in the text
We require that rules have a confidence of at least 80% (a rule X -> Y has a confidence of 80% if the unordered set of words X is followed by the unordered set of words Y at least 80% of the times when X appears in a sentence; this is written formally after this list)
We will click the “Run algorithm” button.
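Formally, the confidence used here can be written as:

\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{sup}(X \Rightarrow Y)}{\mathrm{sup}(X)}

where sup(X ⇒ Y) is the number of sentences in which the words of X are followed by the words of Y, and sup(X) is the number of sentences containing the words of X.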
The result is a set of three rules.
The first rule indicates that 96% of the time, when “Sherlock” appears in a sentence, it is followed by “Holmes”, and that “Sherlock Holmes” appeared 31 times in total in the text.
For this example, I have chosen the parameters so as not to obtain too many rules. But it is possible to change the parameters to obtain more rules, for example by changing the minimum confidence requirement.
Applications of discovering patterns in texts
Here we have shown how various types of patterns can be easily extracted from text files using the SPMF software. The goal was to give an overview of some types of patterns that can be extracted. There are also other algorithms offered in SPMF, which could be used to find other types of patterns.
Now, let's talk about the applications of finding patterns in text. One popular application is called “authorship attribution”. It consists of extracting patterns from a text to try to learn about the writing style of an author. Then, the patterns can be used to automatically guess the author of an anonymous text.
For example, if we have a set of texts written by various authors, it is possible to extract the most frequent patterns in each text to build a signature representing each author's writing style. Then, to guess the author of an anonymous text, we can compare the patterns found in the anonymous text with the signatures of each author to find the most similar signature. Several papers have been published on this topic. Besides using words for authorship attribution, it is also possible to analyze the part-of-speech tags in a text. This requires first transforming a text into sequences of part-of-speech tags. I will not show how to do this in this tutorial. But it is the topic of a few papers that I have recently published with my student. We have also done this with the SPMF software. If you are curious and want to know more about this, you may look at my following paper:
There are also other possible applications of finding patterns in text such as plagiarism detection.
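Returning to the authorship attribution idea above, here is a toy Java sketch of the general approach (a hypothetical illustration, not a published method): each author's signature is the set of frequent patterns mined from his or her texts, and an anonymous text is attributed to the author whose signature is the most similar, here using the Jaccard similarity.

import java.util.*;

public class AuthorshipSketch {

    // Jaccard similarity between two sets of patterns
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Attribute an anonymous text to the author with the most similar signature
    static String guessAuthor(Map<String, Set<String>> signatures,
                              Set<String> anonymousPatterns) {
        String bestAuthor = null;
        double bestSimilarity = -1.0;
        for (Map.Entry<String, Set<String>> entry : signatures.entrySet()) {
            double similarity = jaccard(entry.getValue(), anonymousPatterns);
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                bestAuthor = entry.getKey();
            }
        }
        return bestAuthor;
    }
}

In practice, the pattern sets would be produced by an algorithm such as TKS or CM-SPAM, and a weighted similarity measure taking pattern frequencies into account would likely work better than plain Jaccard.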
Conclusion
In this blog post, I have shown how open-source software can be used to easily find patterns in text. The SPMF library can be used as a standalone program or can be called from other Java programs. It offers many algorithms with several parameters to find various types of patterns. I hope that you have enjoyed this tutorial. If you have any comments, please leave them below.
== Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
One year and a half ago, I was working as a professor at a university in Canada. But I took the decision not to renew my contract, and to move to China. At that time, some people may have thought that I was crazy to leave my job in Canada, since it was an excellent job, and I also had a house and a car. Thus, why go somewhere else? However, as of today, I can tell you that moving to China has been one of the best decisions that I have ever made for my career. In this blog post, I will tell you my story and explain why I moved there. I will also compare the working conditions that I had in Canada with those that I have now in China.
Before moving to China
After finishing my Ph.D. in 2010, I worked as a post-doctoral researcher in Taiwan for a year. Then, I came back to Canada and worked as a faculty member for about 4 years. However, in Canada, faculty positions are very rare. When I was in Canada, I was hoping to move to another faculty position closer to my hometown to be closer to my family, but it was almost impossible, since there were only about five faculty positions related to my research area in computer science that I could apply for every year, for the whole country! Thus, getting a faculty position in Canada is extremely difficult and competitive. There are tons of people applying and very few positions available.
I had several interviews at various universities in Canada. But getting a faculty position at another university in Canada was hard for various reasons. Sometimes a job is announced but the committee already has someone in mind, or may prefer some other candidate for various strange reasons. For example, the last interview that I had in Canada, about two years ago, was at a university in Quebec, and basically, they hired someone else who had almost no research experience, due to some “political reasons”. Just to give you a sense of how biased that hiring process was, here is a comparison of the candidate that was hired and me:
Total number of citations: < 150 (the selected candidate) vs. 1031 (me)
Number of citations (most cited paper): < 20 (the selected candidate) vs. 134 (me)
Number of citations (last year): < 30 (the selected candidate) vs. > 300 (me)
Number of papers (this year): 4 (the selected candidate) vs. > 40 (me)
So who would you hire? Anyway, I just show this as an example of how the hiring process is not always fair. Actually, this could have happened anywhere in the world. But when there are very few jobs available, as in Canada, it makes it even harder to find a position. However, it does not bother me, since this has led me to try something else and move to China, which has been one of the best decisions for my career!
Before explaining what happened after this, let me make it clear that I did not leave my previous job in Canada because I did not like it. Actually, I had the chance to work at a great university in Canada, where I made many friends and also had some wonderful students. I had my first opportunity to work as a professor there, and it was a hard decision to leave. However, to go further in my career as a researcher, I wanted to move to a bigger university.
Moving to China
Thus, at the end of June 2015, I decided to apply for a faculty position at a top university in China. I quickly passed the interview and started to work there a few months later, after quickly selling my house and my car in Canada. So now, let's talk about what you probably want to know: how does my current job in China compare to my previous position in Canada?
Well, I must first say that I moved to one of the top 10 universities in China, which is also one of the top 100 universities in the world for computer science. Thus, the level of the students there is quite high, and it is also an excellent research environment. But let's analyze this in more detail.
In terms of research funding:
In Canada, it has become extremely difficult to receive research funding, due to budget cuts in research and the lack of major investment in education by the government. To give you an idea, the main research funding agency, called NSERC, could only give me $100,000 CAD over five years, and I was considered lucky to have this funding. But this is barely enough to pay one graduate student and attend one or two conferences per year.
In China, on the other hand, the Chinese government offers incredible research funding opportunities. Of course, not everyone is equally funded. The smaller universities do not receive as much funding as the top universities. But there are some very good research programs to support researchers, and especially the top researchers. In my case, I applied for a program to recruit talents run by the NSFC (National Science Foundation of China). I was awarded 4,000,000 RMB in research funding (about $800,000 CAD) for five years. Thus, I now receive about eight times more funding for my research than what I received in Canada. This of course makes a huge difference for my research. I can buy expensive equipment that I needed, such as a big data cluster, hire a post-doc, pay many students, and perhaps even eventually hire a professional programmer to support my research. Besides, after getting this grant for young talents, I was automatically promoted to Full Professor, and will soon become the director of a research center and get my own lab. This is a huge improvement for my career compared to what I had in Canada.
Now let’s compare the salary:
In Canada, I had a decent salary for a university professor.
In China, my base salary is already about 15% higher than what I received in Canada. This is partly due to the fact that I work at a top university located in a rich city (Shenzhen), and that I also received a major pay increase after receiving the young talent funding. However, besides the salary, it is possible to receive many bonuses in China that can increase your salary through various programs. Just to give you an example, in the city of Shenzhen, there is a program called the Peacock program that can provide more than 2,000,000 RMB (about $400,000 CAD) over five years for living expenses for excellent researchers working in that city. I will not say how much I earn. But including these special programs, I can say that my salary is now about twice what I earned in Canada.
In terms of living expenses, living in China is of course much less expensive than living in Canada. And the income tax is more or less similar, depending on the province in Canada. In the bigger cities in China, renting an apartment can be expensive. However, everything else is cheap. Thus, the overall cost of living is much lower than in Canada.
In terms of life in general, of course, life is different in China than in Canada, in many ways. There are always some advantages and disadvantages to living in any country around the world, as nothing is perfect anywhere. But I really enjoy my life in China. And since I greatly enjoy Chinese culture (and speak some Chinese), this is great for me. The city where I work is a very modern city that is very safe (I would never be worried about walking late at night). In terms of work environment, I am also very satisfied. I have great colleagues, and everyone is friendly. It is overall very exciting to work there, and I expect that it will greatly improve my research in the next few years.
Also, it is quite inspiring to work at and contribute to a top university and a city that are currently expanding very quickly. To give you an idea, the population of the city has almost doubled in the last fifteen years, reaching more than 10 million people, and 18 million when including the surrounding areas. A few new subway lines have also opened in the last few years. There are also many possibilities for projects with the industry and the government in such a large city.
Conclusion
In this blog post, I wanted to discuss a little bit the reasons why I decided to move to China, and why I consider it one of the best decisions that I have ever made for my career, as I think it may be interesting for other researchers.
By the way, if you are a data mining researcher and are looking for a faculty position in China, you may leave me a message. My research center is looking to hire at least one professor with a data mining background.
== Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
This week, I have been attending the DEXA 2016 and DAWAK 2016 conferences in Porto, Portugal, from the 4th to the 8th of September 2016, to present three papers. In this blog post, I will give a brief report about these conferences.
About these conferences
The DEXA conference is a well-established international conference related to databases and expert systems. This year was the 27th edition of the conference. It is a conference held in Europe every year, but it still attracts a considerable number of researchers from other continents.
DEXA is a kind of multi-conference. It actually consists of 6 smaller conferences that are organized together. Below, I provide a description of each of those sub-conferences and indicate their acceptance rates.
DEXA 2016 (27th Intern. Conf. on Database and Expert Systems Applications).
Acceptance rate: 39 / 137 = 28% for full papers, and another 29 / 137 = 21% for short papers
DaWaK 2016 (18th Intern. Conf. on Big Data Analytics and Knowledge Discovery)
Acceptance rate: 25 / 73 = 34%
EGOVIS 2016 (5th Intern. Conf. on Electronic Government and the Information Systems Perspective)
Acceptance rate: not disclosed in the proceedings, 22 papers published
ITBAM 2016 (7th Intern. Conf. on Information Technology in Bio- and Medical Informatics)
Acceptance rate: 9 / 26 = 36% for full papers, and another 11 / 26 = 42% for short papers
TrustBus 2016 (13th Intern. Conf. on Trust, Privacy, and Security in Digital Business)
Acceptance rate: 25 / 73 = 43%
EC-Web 2016 (17th Intern. Conf. on Electronic Commerce and Web Technologies)
Thus, the DEXA conference is more popular than DAWAK and the other sub-conferences, and is also more competitive in terms of acceptance rate.
Proceedings
The proceedings of each of the first five sub-conferences are published by Springer in the Lecture Notes in Computer Science series, which is quite nice, as it ensures that the papers are indexed by major indexes in computer science such as DBLP. The proceedings of the conferences were given on a USB drive.
badge and proceedings of DEXA
The conference location
The conference was locally organized by the Instituto Superior de Engenharia do Porto (ISEP) in Porto, Portugal. The location was great, as Porto is a beautiful European city with a long history. The old town of Porto is especially beautiful. Moreover, visiting Porto is quite inexpensive.
First day of the conference
The first day of the conference started at 10:30 AM and consisted mostly of paper presentations. The main topics of the papers during the first day of DEXA were temporal databases, high-utility itemset mining, periodic pattern mining, privacy-preserving data mining, and clustering. In particular, I had two paper presentations related to itemset mining:
a paper about discovering high utility itemsets with multiple thresholds.
Besides, there was a keynote entitled “From Natural Language to Automated Reasoning” by Bruno Buchberger from Austria, a famous researcher in the field of symbolic computation. The keynote was about using formal automated reasoners (e.g. math theorem provers) based on logic to analyze texts. For example, the speaker proposed to extract formal logic formulas from tweets, to then understand their meaning using automated reasoners and a knowledge base provided by the user. This was a quite unusual perspective on tweet analysis, since nowadays researchers in natural language processing prefer using statistical approaches to analyze texts rather than approaches relying on logic and a knowledge base. This gave rise to some discussion during the question period after the keynote.
DEXA 2016 first keynote speech
During the evening, there was also a reception in a garden inside the institute where the conference was held.
Second day of the conference
On the second day, I attended DAWAK. In the morning, there were several paper presentations. I presented a paper about recent high-utility itemset mining. The idea is to discover itemsets (sets of items) that have recently been profitable in customer transactions, and then use this insight for marketing decisions. There was also an interesting paper presentation about big data itemset mining by the student Martin Kirchgessner from France.
Then, there was an interesting keynote about the price of data, by Gottfried Vossen from Germany. The talk started by discussing the fact that companies are collecting more and more rich data about people. Usually, many people give personal data for free to use services such as Facebook or Gmail. There also exist several marketplaces where companies can buy data, such as the Microsoft Azure Marketplace, and different pricing models for data. For example, one could have different pricing models to sell more or less detailed views of the same data. There also exist repositories of public data. Another issue is what happens to someone's data when they die. In the future, a new way of buying products could be to pay for the data describing the design of an object, and then print it ourselves using 3D printers or other tools. Other issues related to the sale of data are DRM, selling second-hand data, etc. Overall, it was not a technical presentation, but it discussed an important topic nowadays: the value of data in a society that relies more and more on technology.
Third day of the conference
On the third day of the conference, there were more paper presentations, and also a keynote that I missed. In the evening, there was a nice banquet at a wine cellar named Taylor. We had the pleasure of visiting the cellar, enjoying a nice dinner, and listening to a Portuguese band at the end of the evening.
Conclusion
This was overall a very interesting conference. I had the opportunity to discuss with some excellent researchers, especially from Europe, including some that I had met at other conferences. There were also some papers quite related to my sub-field of research in data mining. DEXA may not be a first-tier conference, but it is a very decent conference, and I would submit more papers to it in the future.
== Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.
This week, I have attended the IEA AIE 2016 conference, held in Morioka, Japan, from the 2nd to the 4th of August 2016. In this blog post, I will briefly discuss the conference.
About the conference
IEA AIE 2016 (29th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems) is an artificial intelligence conference with a focus on applications of artificial intelligence. But this conference also accepts theoretical papers, as well as data science papers.
First day of the conference
Two keynote speeches were given on the first day of the conference. For the second keynote, I saw about 50 people attending. Many attendees came from Asia, which is understandable since the conference was held in Japan.
The paper presentations on the first day were about topics such as knowledge-based systems, the Semantic Web, social networks (clustering, relationship prediction), neural networks, evolutionary algorithms and heuristic search, computer vision, and adaptive control.
Second day
On the second day of the conference, there was a great keynote talk by Prof. Jie Lu from Australia about recommender systems. She first introduced the main recommendation approaches (content-based filtering, collaborative filtering, knowledge-based recommendation, and hybrid approaches) and some of the typical problems that recommender systems face (the cold-start problem, the sparsity problem, etc.). She then talked about her recent work on extensions of the recommendation problem, such as group recommendation (e.g. recommending a restaurant that will globally satisfy, or not disappoint, a group of people), trust-based recommendation (e.g. a system that recommends products to you based on what friends that you trust have liked, or the friends of your friends), fuzzy recommender systems (recommender systems that consider that each item can belong to more than one category), and cross-domain recommendation (e.g. if you like reading books about Kung Fu, you may also like watching movies about Kung Fu).
After that, there were several paper presentations. The main topics were the Semantic Web, social networks, data science, neural networks, evolutionary algorithms, heuristic search, soft computing, and multi-agent systems.
In the evening, there was a great banquet at the Morioka Grand Hotel, a nice hotel located on a mountain, on the outskirts of the city.
Moreover, during the dinner, there was a live music band:
Third day
On the third day, there was an interesting keynote on robotics by Prof. Hiroshi Okuno from Waseda University / the University of Tokyo. His team has proposed an open-source software called Hark for robot audition. Robot audition refers to the general process by which a robot can process sounds from its environment. The software, which is the result of years of research, has been used in robots equipped with arrays of microphones. By using the Hark library, robots can listen to multiple persons talking to the robot at the same time, localize where the sounds came from, and isolate sounds, among many other capabilities.
It was followed by paper presentations on topics such as data science (KNN, SVM, itemset mining, clustering), decision support systems, medical diagnosis and bio-informatics, natural language processing, and sentiment analysis. I presented a paper about high utility itemset mining, introducing a new algorithm called FHM+.
Location
The conference was located in Morioka, a not very large city in Japan. However, the timing of the conference was perfect. It was held during the Sansa Odori festival, one of the most famous festivals in Japan. Thus, during the evenings, it was possible to watch the Sansa parade, where people wearing traditional costumes were playing Taiko drums and dancing in the streets.
Conclusion
The conference was quite interesting. Since it is a quite general conference, I did not meet many people close to my research area. But I met some interesting people, including some top researchers. During the conference, it was announced that IEA AIE 2017 will be held in Arras (close to Paris, France).
==
Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms. If you like this blog, you can tweet about it and/or subscribe to my twitter account @philfv to get notified about new posts.