The Data Blog

How to find a good thesis topic in Machine Learning?

Posted on 2020-11-04 by Philippe Fournier-Viger

In this blog post, I will talk about how to find a good thesis topic on machine learning. This is an important question for many students that are required to select a topic for their research and want to work on machine learning. Choosing a good research topic is a critical step in the research process to ensure the success of the research and for publishing good papers.

What is a good research topic?

A good research topic is a research problem that is:
(1) novel,
(2) challenging to solve (cannot just be solved by applying existing techniques) but not too challenging or long,
(3) useful (otherwise, there is no reason to do the research), and
(4) a problem that is interesting for other researchers or has applications.

It should also be clear that a research topic is not equal to doing a programming or software development project. Just solving a software development problem is generally not research. Research is about solving a novel and difficult problem that requires to develop some inovative solution.

How to find a good research topic on machine learning?

To find a good research topic, it is important to know what other researchers have done in recent years. Thus, to select a topic, one should first read papers in good journals and conferences to see what other researchers have been doing. By reading recent papers, one can try to think about the limitations of these studies and what could be improved, or what other researchers have not done yet (because there is no point to do the same thing again). Reading the literature requires some time and is not so easy to do but is very important to choose a good topic. For the young students, it is recommended to ask advices from their supervisor during this step. The supervisor should be able to suggest some good papers to read as starting point and to validate the research topic ideas. When reading paper, one can pay attention to the related work section and conclusion, which sometimes highlight some limitations of previous studies and can give some ideas for research.

Generally, it is a good to work on some research area that other researchers are interested in rather than working on some obscure problems that have few applications.

After finding some idea that is novel, it is also important to keep searching for papers to make sure that the idea is really novel and no one has done it before. If someone did it already, it is better to find this issue as early as possible to avoid wasting time on pursuing a research idea that has been done before. When searching for other papers, it is also important to find the right keywords for searching. Some research topic may have already been studied but with a different name. Thus, when searching for papers, it is important to try various keywords and to keep searching to make sure that the literature has been checked carefully.

How to describe your research topic?

A common misconception about choosing a research topic is to think that just choosing a title is good enough for choosing a research topic. But it is not. A research topic should be defined clearly and with more details.

For example, a topic title like ‘machine learning for image processing’ is too broad and does not mean much about what one would like to do. What kind of machine learning technique? What kind of image processing task? And what is the originality? All of this could be explained more clearly with a more detailed description.

To clearly define your research topic, I recommend to write some text explaining:
– the title,
– why the problem is important?
– what are the limitations of previous studies that this research will address?
– why that research problem is challenging?
– and give a sketch of some possibilities for solving the problem.

If you can answer the above questions, then it means that you have carefully thought about your research topic.

You may also ask some senior researchers to look at your research topic to confirm that it is a good topic.

Conclusion

In this blog post, I have discussed the problem of searching for a research topic in machine learning. Hope this has been helpful. If you have some comments, please leave them in the comment section below.

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 170data mining algorithms.

Posted in Academia, artificial intelligence, Machine Learning | Tagged artificial intelligence, machine learning, master, mtech, phd, Research, thesis, thesis topic, university | 1 Comment

Brief Report about the PKDD 2020 conference

Posted on 2020-10-22 by Philippe Fournier-Viger

In this blog post, I will talk about the ECML PKDD 2020 conference, that was held from the 14th to 18th September 2020. This post will be a little bit brief because I did not attend the whole conference but just a few presentations.

What is PKDD?

The PKDD conference is the number one data mining and machine learning conference in europe. This year, it was the 31st edition of this conference. The PKDD conference proceedings are published by Springer in the Lecture Notes in Computer Sciences (LNCS) series, which gives good visibility to the papers. Moreover, it is noteworthy that workshop papers are also published in Springer LNCS volumes.

Due to the coronavirus pandemic, the conference was held online but was supposed to be held in Ghent Belgium.

Videos

Many of the papers and videos have been made available online on the website of the conference: https://slideslive.com/ecmlpkdd2020/main-track-research-track

I have been watching a few of them, and it has been very interesting, as papers of this conference are high quality papers.

Program

The PKDD 2020 conference has 5 keynotes, an applied data science track, a research paper track, industry track, demo track, workshops, tutorials, and a journal paper track.

Opening ceremony

I will here report important information that was presented during the PKDD 2020 opening ceremony.

There was about 1000 persons involved in the program committee for reviewing papers, and about 1000 attendees. It was explained that hundreds of people were recruited this year to join the program committee due to the increase of papers in machine learning.

In the research track, this year 687 papers were submitted. From that, 131 were accepted. Thus, the acceptance rate was 19.1 %. Here is a few slides about the research track:

For the Applied Data Science track, 235 papers were submitted, and 65 accepted. Thus, the acceptance rate was 28 %, which is quite higher than the research track. Here is the number of papers by topic:

For the demo track, 23 papers were submitted, and 10 were accepted, for an acceptance rate of 43 %. Some information about this track:

Here are the statistics about the papers submitted to the DMKD or Machine learning journals for the journal track :

For the industry track, the acceptance rate was about 50%:

This is about the diversity of authors in terms of regions:

Here are the best data mining papers:

And this was the best applied data science paper:

Pattern mining papers

As I am interested by the topic of pattern mining, I have made a list of the main papers on this topic published in the PKDD 2020 conference:

Maximum Margin Separations in Finite Closure Systems
Florian Seiffarth (University of Bonn); Tamas Horvath (University Bonn); Stefan Wrobel (Fraunhofer IAIS & Univ. of Bonn)
Discovering outstanding subgroup lists for numeric targets using MDL
Hugo Manuel Proença (LIACS); Peter Grünwald (CWI); Thomas Bäck (LIACS); Matthijs van Leeuwen (LIACS)
A Relaxation-based Approach for Mining Diverse Closed Patterns
Arnold Hien et al.
OMBA: User-Guided Product Representations for Online Market Basket Analysis
Amila Silva (The University of Melbourne); Ling Luo (The University of Melbourne); Shanika Karunasekera (The University of Melbourne); Christopher Leckie (The University of Melbourne)

Conclusion

That is all for my report about the PKDD 2020 conference. The report is not very long because of my busy schedule. Hence, I only watched a few presentations from the conference. Hope this report has still been interesting.

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 170data mining algorithms.

Posted in artificial intelligence, Big data, Conference, Data Mining | Tagged ai, artificial intelligence, conference, data mining, europe, machine learning, pkdd, pkdd2020 | 4 Comments

Why it takes so long for a journal paper to be reviewed?

Posted on 2020-10-22 by Philippe Fournier-Viger

Today, I will talk about the review process of journal papers and why it sometimes take so much time for a paper to be reviewed. It is a question that is often asked by young researchers, who are sometimes impatiently waiting for their journal papers to be published to graduate. I will first give a brief overview of the review process of journal papers and then explain what can go wrong.

An overview of the review process

The main steps of the review process are:

STEP1: Initial screening and assignment to an editor: After submitting a paper to a journal, there will typically be a first screening of the paper performed by an assistant working for the journal. This is done to check the format of the paper and if there are some other issues such as plagiarism or if the paper has been submitted twice. If the paper does not pass this screening, it may be rejected directly. Otherwise, it will be send to an editor.
STEP2: Inviting reviewers: The editor will then look at the manuscript and invite some appropriate reviewers to review the paper. The reviewers will either accept or decline to review the paper. Usually, a minimum number of reviewers is required. Thus, if there is not enough reviewers, the editor will receive a notification and will have to invite more reviewers until the minimum number of reviewers is attained. To speedup that process, an editor may pre-select a list of alternate reviewers that may be automatically invited when a person decline to review.
STEP3: Reviewing the paper: The third step which is partly done in parallel to the second step is that reviewers will review the paper and submit their reviews. The editor will give them a deadline that is will vary depending on the journal. For example, some journal will give 1 or 2 months.
STEP4: The decision: When all the reviews are completed, the editor will receive a notification. Then the editor will read the reviews and submit his decision such as to accept, reject, or request minor or minor revisions of the paper. It is also possible that the editor does not wait for all reviews to take a decision. For example, if the editor needs 4 reviews but two of them are “reject”, he will likely reject the paper without waiting for the 4 reviews.
STEP5: Sending the notification: The decision will be sent to the authors. Then, if a revision is required, the authors will submit the revision and the process will start again from STEP1. If the paper is accepted, it will go in production. And if the paper is rejected, it will be the end.

What can go wrong?

Several things can go wrong an delay the review process:

The editor is busy and waits to invite reviewers: A first problem is that after the editor receives the paper, he may not invite reviewers right away because he is busy with other things. Some editors are quite fast (for example, I always assign reviewers whithin 24 hours) but others will take a few weeks (this happened to me a few times).
Many reviewers decline to review the paper: A second problem is that it is sometimes difficult for the editor to find suitable reviewers for a paper that agree to review. The reason is that some reviewers are busy (e.g. they already accepted to review other papers or have other things to do), have a conflict of interest or are just not interested in reading the paper. Thus, a reviewer may decline to review the paper. Hence, the editor may need to invite morereviewers to review the paper. Sometimes more than a dozen potential reviewers may have to be invited before enough reviewers will accept. For the papers that are on very specialized or unpopular topics, it may be more difficult.
Some potential reviewers take too much time to accept or decline to review: When a researcher receives an invitation to review from an editor, he will typically answer quickly. However, in some cases, the researcher does not answer and the editor may give up to 14 days before retracting the invitation. If the potential reviewer does not answer, the editor need to find a replacement. Thus, the task of finding enough reviewers will in some case take more than a month.
Reviewers are late: This is another major problem. Although reviewers often have 1 month or more to review a paper, they are sometimes very busy and will submit their review not just late but very late. For example, some reviewers are late by more than 30 days… and this is not very rare. In such situation, the editor can either decide to wait or send more reminders to the reviewer or remove him and take decision without him, or start again to invite some new reviewers (which may take more time!)
The editor take time to handle the reviews: Even after the reviewers have submitted their reviews, the editor may still be busy and wait before submitting his decision. In the worst case, I have seen some editor taking a few weeks for submiting their decision.
A combination of the above problems: And of course, all the above problems may appear at the same time. If someone is really unlucky, the editor may take a lot of time before inviting reviewers, many reviewers may decline to review and they may take much time to do so. Then, the reviewers may be late or even never submit their reviews. Hence, the editor may have to find some new reviewers again and they could be late again, and then the editor may be late to handle the decision.
Problems during the second round of review: Besides the above problems, some other problems may occur after the first round of review. For example, during the second round of review, the editor will typically invite the same reviewers from the first round. However, reviewers from the first round may not accept to review the paper again. This happens quite frequently because the reviewers are busy or for other reasons. In that case, the editor may have to invite some new reviewers, which may take extra time. And these new reviewers may ask for more changes in the paper, which may further delay the overall process.

Conclusion

In this blog post, I talked about some common reasons why the reviewing process may be long for some journal papers. As an author, there is not much that one can do to make this process faster. Hope that it has been interesting. If you have any comments or questions, you may post them below in the comment section.

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 170data mining algorithms.

Posted in Academia, General | Tagged academia, journal | 5 Comments

(video) Top-K Cross-Level High Utility Itemset Mining

Posted on 2020-10-18 by Philippe Fournier-Viger

Today, I will share a video of our upcoming paper presentation about top-k cross-level high utility itemset mining that we will present at the UDML 2020 workshop at ICDM 2020.

In this paper, we present a novel algorithm named TKC for discovering cross-level high utility itemsets (CLHUIs) in a database of transactions while considering a taxonomy of items. A taxonomy means that items are organized into categories and sub-categories. Moreover, to make it easier to find interesting patterns, we let the user directly specify the number k of patterns to be found. The TKC algorithm returns the top-k cross-level high utility itemsets that have the highest utility.

Here is the video (MP4 format, 20 minutes):

VIDEO LINK: https://www.philippe-fournier-viger.com/tkc_the_video.mp4

And this is the reference of the paper, including the PPT presentation:

Nouioua, M., Wang, Y., Fournier-Viger, P., Lin, J.-C., Wu, J. M.-T. (2020). TKC: Mining Top-K Cross-Level High Utility Itemsets. Proc. 3rd International Workshop on Utility-Driven Mining (UDML 2020), in conjunction with the ICDM 2020 conference, IEEE ICDM workshop proceedings, to appear. [ppt]

The datasets and source code will be made available soon on the SPMF data mining library, wihch offers more than 170 algorithms for pattern mining.

Besides, if you are interested by this topic, you can also check another recent paper on this topic by our team. The paper below presents the CLH-Miner algorithm for cross-level high utility itemset mining. It was used as basis to develop the TKC algorithm.

Fournier-Viger, P., Yang, Y., Lin, J. C.-W., Luna, J. M., Ventura, S. (2020). Mining Cross-Level High Utility Itemsets. Proc. 33rd Intern. Conf. on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA AIE 2020), Springer LNAI, pp. 858-871. [ppt]

Hope you will enjoy this video. I will post more videos soon about recent papers. And also, we am currently preparing the source code and datasets to release them soon.

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 170data mining algorithms.

Posted in Pattern Mining, Utility Mining, Video | 1 Comment

A brief report about the IEA AIE 2020 conference

Posted on 2020-09-22 by Philippe Fournier-Viger

In this blog post, I will write a brief report about the 33rd International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA AIE 2020), held in Kitakyushu, Japan from the 22nd to 25th September 2020.

What is IEA AIE 2020?

The IEA AIE 2020 is a well-established academic conference that has been running for 33 years. It is about artificial intelligence, and it attracts many papers about the applications of intelligent systems.

I have personally attended this conference many times in the past (IEA AIE 2009, IEA AIE 2010, IEA AIE 2011, IEA AIE 2014, IEA AIE 2016, IEA AIE 2018, IEA AIE 2019), and have 13 papers in its proceedings.

This year, I am attending IEA AIE 2020 as author but also as program chair. The organizers this year are:

Proceedings of IEA AIE 2020

This year, 119 papers have been submitted to the conference. Each paper has been reviewed by at least 3 reviewers. The Program Committee who evaluated the papers is composed of 82 researchers from 36 countries.

A total of 62 full papers were accepted and 17 short papers. All of those were published in the Springer proceedings. Moreover, an additional 9 poster papers were published in a separated poster proceedings with ISBN.

The papers covered many applications to real-life problems such as: •engineering, science, •industry, •automation •robotics, •business and finance, •medicine and biomedicine, •bioinformatics, •cyberspace, •human-machine interaction.

This year, in the new format of the IEA AIE conference, some special sessions were organized. A special session is like a special track about a specific topic of interests. All papers from the special track are published in the main conference proceedings. Here is the information about the two tracks:

Best paper awards

Moreover, this year, four types of awards are announced during the conference:

Best paper award: Dolly Sapra, Andy D. Pimentel: Constrained Evolutionary Piecemeal Training to Design Convolutional Neural Networks.
Best theory paper award: Fan Zhang et al.
A new integer linear programming formulation to the inverse QSAR QSPR for acyclic chemical compounds
Best application paper award: Wei Zhang and Chris Challis
Automatic identification of account sharing for video streaming services
Best special session paper award: Wei Song, Lu Liu, Chaomin Huang
TKU-CE: Cross-Entropy Method for Mining Top-K High Utility Itemsets

These awards were selected by a committee based on review scores and a discussion of the top papers. To ensure that the process is fair, papers from the organization committee members were excluded from receiving awards.

A partly virtual conference

Due to the coronavirus pandemy around the world, the conference was held virtually and also on site in Japan at the same time. This required some special organizations from the local organizers and was very well done. I was happy to saw friends in the conference.

Day 1 – Opening session, keynotes and regular papers

In the opening session of the conference, the conference was presented. Each organizers gave some words about the conference. Then, there was two keynote speeches : one by Prof. Tao Wu about healthcare, and the other by Prof. Ee-Peng Lim about AI for social goods.

The talk of Prof. Lim was very interesting as he talked about two projects that can have a positive implications for the society. The first one was about a probabilistic model of the labor market in Singapore. The second one was about an application that can let users take picture of their food to keep track of what they are eating. The system FoodAI can be tested on this website: https://foodai.org/ Here is a few slides from this presentation.

In his conclusion, Prof. Lim also talked about three challenges for the development and adoption of proposed models.

The keynote was followed by several paper presentations on various topics.

Day 2 – regular papers + keynote talks

On the second day, there was more paper presentations and also two keynote talks (one by Prof. Bo Huang and one by Prof. Enrique Herrera Viedma).

The keynote of Prof. Enrique Herrera Viedma was about group decision making, that is how a group of expert can reach an agreement to take decisions.

Generally, group decision making is reached through a concensus reaching process which requires discussion between experts and involve multi stage negotiation. Here are a few slides, describing the main process:

He explained that nowadays group decision making is done in a new context, with social networks and Web 2.0 tools.

Then, he discussed in more details about properties of social networks and how sentiment analysis can play a role in decision-making models. Here are a few of the important properties of social networks:

Sentiment analysis can be used in group decision making to understand how a user feels about a particular topic, and in particular the preferences of experts about different alternatives. Here are some details:

Here is an overview of the proposed group decision making based framework

Then, there was more details but I will not report on everything.

An upcoming special issue in the Applied Intelligence journal

Another great thing this year at IEA AIE is that there will be a special issue in the Applied Intelligence journal (Springer, Q2). The best papers of the conference will be invited for an extension in the special issue. Details will be announced after the conference.

Next year… IEA AIE 2021… in Kuala Lampur

It was announced that the IEA AIE 2021 conference will be held next year in Kuala Lampur (Malaysia).

The website of IEA AIE 2021 is already online at http://ieaaie2021.wordpress.com/ Here are the key dates related to this conference:

Day 3 and 4 – More paper presentations

On the third and fourth days, there was more paper presentations.

Pattern mining papers

This year, there was 7 pattern miningpapers, which shows that it is a popular topic at this conference. These papers cover topics such as periodic itemset mining, and high utility itemset mining. Since this is a topic of interest for me and to several readers of this blog, here is the list of some of the papers:

TKU-CE: Cross-Entropy Method for Mining Top-K High Utility Itemsets
Wei Song, Lu Liu and Chaomin Huang
Mining Cross-Level High Utility Itemsets
Philippe Fournier-Viger, Ying Wang, Jerry Chun-Wei Lin, Jose Maria Luna and Sebastian Ventura [ppt]
Maintenance of Prelarge High Average-Utility Patterns in Incremental Databases
Jimmy Ming-Tai Wu, Qian Teng, Jerry Chun-Wei Lin, Philippe Fournier-Viger and Chien-Fu Cheng
Efficient Mining of Pareto-front High Expected Utility Patterns
Usman Ahmed, Jerry Chun-Wei Lin, Jimmy Ming-Tai Wu, Youcef Djenouri, Gautam Srivastava and Suresh Kumar Mukhiya
TKE: Mining Top-K Frequent Episodes
Philippe Fournier-Viger, Yanjun Yang, Peng Yang, Jerry Chun-Wei Lin and Unil Yun
A Fast Algorithm for Mining Closed Inter-Transaction Patterns
Thanh-Ngo Nguyen, Loan T.T. Nguyen, Bay Vo and Ngoc-Thanh Nguyen

Conclusion

I have enjoyed the conference. Next year, IEA AIE 2021 will be in Malaysia, and then IEA AIE 2022 in Japan.

—
Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in artificial intelligence, Conference, Data Mining | Tagged 2020, ai, artificial intelligence, conference, data mining, ieaaie, japan, springer | 4 Comments

The Imposter Syndrome in Academia

Posted on 2020-08-27 by Philippe Fournier-Viger

In this blog post, I will talk about something that many students or researchers have or are experiencing in academia, which is called the imposter syndrome. It is the feeling of not being worthy of having achieved some success or being in a given position. For example, a new PhD student accepted in a top university may feel that he was just lucky and did not really get accepted because of his skills or efforts. A professor may similarly feel that he received funding but that it is undeserved. The imposter syndrome is something very common in academia. Many people have experienced it at some point in their career.

Personally, when I was first admitted in the master degree in computer science more than 15 years ago, I felt that there was still some gaps in my knowledge. For example, I thought that I had not learnt enough about some topics in computer science or mathematics during the bachelor degree. Although I was a reasonably good programmer, it appeared to me that some other students were better. Moreover, another question that I had when starting the master degree was: Even if I am a good student, will I be successful at research? This is a question that many students have because doing research is something new at that stage.

Then, during the master and Ph.D degree, I published several papers on e-learning and started to attend academic conferences. But when attending the conferences, I felt sometimes that my knowledge of the field was not so deep compared to that of many experts there.

Later, I changed my research direction towards data mining and became very good in some research areas there. However, I still felt that I did not know enough about some hot topics like big data.

The examples above are situations that could be viewed as some form of imposter syndrome.

Now, I would like to talk more about this.

Is the imposter syndrome something bad?

Yes, if it discourage you. No, if it motivates you to work harder and to improve yourself. Personally, when I perceive that I have some weaknesses, I will work harder to try to overcome them, and in the end, it will be positive. Thus, whether the imposter syndrome is something negative of positive depends on your attitude towards it.

How to overcome the imposter syndrome?

A good start is to recognize that you have several skills and to think about your strengths. Moreover, you should remember that although some other people may appear to be better at some things, you are better at other things. For example, another professor may seem to be better at teaching than you are but you may be a better researcher, or a student may seem better programmer than you, but you are better at writing research papers. And in any case, you can work out on your weaknesses to improve yourself.

Another important thing is to not be scared that people “unmask you” and discover that you are an “imposter“. Remember that no one is perfect and you should not be shy to admit htat you have weaknesses. You can then ask for help or questions to other people because this will help to improve yourself. For example, it is OK to ask a question about something that you do not understand during a research seminar.

Related to this, I will tell you another story. I remember some friend of mine that was scared of telling his supervisor that his programming skills were weak during his PhD studies. He did not tell his supervisor during his whole Ph.D but he was stressed that the supervisor may find out about it. In such case, I think that he should have been honest with the supervisor (and that is what I told him at that time). If he had done that, perhaps that the supervisor could have gave him some suggestions to improve his skills and my friend would have felt less stressful. But my friend found another solution. He instead worked hard and asked for help from many other students, and finally improved himself.

How long the imposter syndrome last?

There is no answer that is suitable for everyone. Some people overcome that syndrome by receiving some recognition from other people such as some award, a prize or obtaining a degree. But sometimes, the imposter syndrome stays there for a long time. For example, I have read some story about a tenured professor in a top level university that mentioned that he felt the imposter syndrome until he retired. After completing a paper, he was always thinking that he could maybe not find good ideas anymore for his next research projects.

Conclusion

In this blog post, I talked about the imposter syndrome and told you a few stories about it. The imposter syndrome is something very common at all levels from students to professors. The important is to know that you are not alone that you have strengths, and to think about this in a positive way to help you grow and improve yourself rather than discourage you. Don’t be afraid that people “unmask you” but instead ask questions, and work on improving yourself.

—
Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in Academia, Research | Tagged imposter syndrome, Research | 1 Comment

Big problem on my website on IONOS webhosting!

Posted on 2020-08-25 by Philippe Fournier-Viger

Hi all,

A bad news is that the database of this blog was reverted to 3 years ago due to some technical problem. I have used 1and1 IONOS as hosting service for my websites for the last 10 years, but now it seems that the database for the blog was overwritten with an old backup because everything is as it was 3 years ago in January 2017. How could it have happened?

I have contacted 1and1 IONOS to try to fix the issue, but they denied that it is their fault and did not have any backup older than 7 days. And my own backup is a little bit old… This is unfortunate. Thus, I think that maybe all blog posts of the last three years are lost (maybe 50+ posts). Anyway, this kind of things happen, and I will continue the blog again soon…

But this time, I will not trust the 1and1 hosting service and I will do my own backups regularly.

I am now trying to recover old posts through the Internet Wayback Machine and the cache of web search engines… I have recovered a dozen posts already and I will continue but it will take some time.

Update: After several hours, I think that I have recovered most of the missing blog posts… but maybe there are some broken links. At least, most of the posts are not lost.

Philippe

Posted in Uncategorized | 10 Comments

An Introduction to Data Mining

Posted on 2020-08-22 by Philippe Fournier-Viger

In this blog post, I will introduce the topic of data mining. The goal is to give a general overview of what is data mining.

What is data mining?

Data mining is a field of research that has emerged in the 1990s, and is very popular today, sometimes under different names such as “big data” and “data science“, which have a similar meaning. To give a short definition of data mining, it can be defined as a set of techniques for automatically analyzing data to discover interesting knowledge or pasterns in the data.

The reasons why data mining has become popular is that storing data electronically has become very cheap and that transferring data can now be done very quickly thanks to the fast computer networks that we have today. Thus, many organizations now have huge amounts of data stored in databases, that needs to be analyzed.

Having a lot of data in databases is great. However, to really benefit from this data, it is necessary to analyze the data to understand it. Having data that we cannot understand or draw meaningful conclusions from it is useless. So how to analyze the data stored in large databases? Traditionally, data has been analyzed by hand to discover interesting knowledge. However, this is time-consuming, prone to error, doing this may miss some important information, and it is just not realistic to do this on large databases. To address this problem, automatic techniques have been designed to analyze data and extract interesting patterns, trends or other useful information. This is the purpose of data mining.

In general, data mining techniques are designed either to explain or understand the past (e.g. why a plane has crashed) or predict the future (e.g. predict if there will be an earthquake tomorrow at a given location).

Data mining techniques are used to take decisions based on facts rather than intuition.

What is the process for analyzing data?

To perform data mining, a process consisting of seven steps is usually followed. This process is often called the “Knowledge Discovery in Database” (KDD) process.

Data cleaning: This step consists of cleaning the data by removing noise or other inconsistencies that could be a problem for analyzing the data.
Data integration: This step consists of integrating data from various sources to prepare the data that needs to be analyzed. For example, if the data is stored in multiple databases or file, it may be necessary to integrate the data into a single file or database to analyze it.
Data selection: This step consists of selecting the relevant data for the analysis to be performed.
Data transformation: This step consists of transforming the data to a proper format that can be analyzed using data mining techniques. For example, some data mining techniques require that all numerical values are normalized.
Data mining: This step consists of applying some data mining techniques (algorithms) to analyze the data and discover interesting patterns or extract interesting knowledge from this data.
Evaluating the knowledge that has been discovered: This step consists of evaluating the knowledge that has been extracted from the data. This can be done in terms of objective and/or subjective measures.
Visualization: Finally, the last step is to visualize the knowledge that has been extracted from the data.

Of course, there can be variations of the above process. For example, some data mining software are interactive and some of these steps may be performed several times or concurrently.

What are the applications of data mining?

There is a wide range of data mining techniques (algorithms), which can be applied in all kinds of domains where data has to be analyzed. Some example of data mining applications are:

fraud detection,
stock market price prediction,
analyzing the behavior of customers in terms of what they buy

In general data mining techniques are chosen based on:

the type of data to be analyzed,
the type of knowledge or patterns to be extracted from the data,
how the knowledge will be used.

What are the relationships between data mining and other research fields?

Actually, data mining is an interdisciplinary field of research partly overlapping with several other fields such as: database systems, algorithmic, computer science, machine learning, data visualization, image and signal processing and statistics.

There are some differences between data mining and statistics although both are related and share many concepts. Traditionally, descriptive statistics has been more focused on describing the data using measures, while inferential statistics has put more emphasis on hypothesis testing to draw significant conclusion from the data or create models. On the other hand, data mining is often more focused on the end result rather than statistical significance. Several data mining techniques do not really care about statistical tests or significance, as long as some measures such as profitability, accuracy have good values. Another difference is that data mining is mostly interested by automatic analysis of the data, and often by technologies that can scales to large amount of data. Data mining techniques are sometimes called “statistical learning” by statisticians. Thus, these topics are quite close.

What are the main data mining software?

To perform data mining, there are many software programs available. Some of them are general purpose tools offering many algorithms of different kinds, while other are more specialized. Also, some software programs are commercial, while other are open-source.

I am personally, the founder of the SPMF open-source data mining library, which is free and open-source, and specialized in discovering patterns in data. But there are many other popular software such as Weka, Knime, RapidMiner, and the R language, to name a few.

Data mining techniques can be applied to various types of data

Data mining software are typically designed to be applied on various types of data. Below, I give a brief overview of various types of data typically encountered, and which can be analyzed using data mining techniques.

Relational databases: This is the typical type of databases found in organizations and companies. The data is organized in tables. While, traditional languages for querying databases such as SQL allow to quickly find information in databases, data mining allow to find more complex patterns in data such as trends, anomalies and association between values.
Customer transaction databases: This is another very common type of data, found in retail stores. It consists of transactions made by customers. For example, a transaction could be that a customer has bought bread and milk with some oranges on a given day. Analyzing this data is very useful to understand customer behavior and adapt marketing or sale strategies.
Temporal data: Another popular type of data is temporal data, that is data where the time dimension is considered. A sequence is an ordered list of symbols. Sequences are found in many domains, e.g. a sequence of webpages visited by some person, a sequence of proteins in bioinformatics or sequences of products bought by customers. Another popular type of temporal data is time series. A time series is an ordered list of numerical values such as stock-market prices.
Spatial data: Spatial data can also be analyzed. This include for example forestry data, ecological data, data about infrastructures such as roads and the water distribution system.
Spatio-temporal data: This is data that has both a spatial and a temporal dimension. For example, this can be meteorological data, data about crowd movements or the migration of birds.
Text data: Text data is widely studied in the field of data mining. Some of the main challenges is that text data is generally unstructured. Text documents often do no have a clear structure, or are not organized in predefined manner. Some example of applications to text data are (1) sentiment analysis, and (2) authorship attribution (guessing who is the author of an anonymous text).
Web data: This is data from websites. It is basically a set of documents (webpages) with links, thus forming a graph. Some examples of data mining tasks on web data are: (1) predicting the next webpage that someone will visit, (2) automatically grouping webpages by topics into categories, and (3) analyzing the time spent on webpages.
Graph data: Another common type of data is graphs. It is found for example in social networks (e.g. graph of friends) and chemistry (e.g. chemical molecules).
Heterogeneous data. This is some data that combines several types of data, that may be stored in different format.
Data streams: A data stream is a high-speed and non-stop stream of data that is potentially infinite (e.g. satellite data, video cameras, environmental data). The main challenge with data stream is that the data cannot be stored on a computer and must thus be analyzed in real-time using appropriate techniques. Some typical data mining tasks on streams are to detect changes and trends.

What types of patterns can be found in data?

As previously discussed, the goal of data mining is to extract interesting patterns from data. The main types of patterns that can be extracted from data are the following (of course, this is not an exhaustive list):

Clusters: Clustering algorithms are often applied to automatically group similar instances or objects in clusters (groups). The goal is to summarize the data to better understand the data or take decision. For example, clustering techniques such as K-Means can be used to automatically groups customers having a similar behavior.
Classification models: Classification algorithms aims at extracting models that can be used to classify new instances or objects into several categories (classes). For example, classification algorithms such as Naive Bayes, neural networks and decision trees can be used to build models that can predict if a customer will pay back his debt or not, or predict if a student will pass or fail a course. Models can also be extracted to perform prediction about the future (e.g. sequence prediction).
Patterns and associations: Several techniques are developed to extract frequent patterns or associations between values in database. For example, frequent itemset mining algorithms can be applied to discover what are the products frequently purchased together by customers of a retail store. Some other types of patterns are for example, sequential patterns, sequential rules, periodic patterns, episode mining and frequent subgraphs.
Anomalies/outliers: The goal is to detect things that are abnormal in data (outliers or anomalies). Some applications are for example: (1) detecting hackers attacking a computer system, (2) identifying potential terrorists based on suspicious behavior, and (3) detecting fraud on the stock market.
Trends, regularities: Techniques can also be applied to find trends and regularities in data. Some applications are for example to (1) study patterns in the stock-market to predict stock prices and take investment decisions, (2) discovering regularities to predict earthquake aftershocks, (3) find cycles in the behavior of a system, (4) discover the sequence of events that lead to a system failure.

In general, the goal of data mining is to find interesting patterns. As previously mentioned, what is interesting can be measured either in terms of objective or subjective measures. An objective measure is for example the occurrence frequency of a pattern (whether it appears often or not), while a subjective measure is whether a given pattern is interesting for a specific person. In general, a pattern could be said to be interesting if: (1) it easy to understand, (2) it is valid for new data (not just for previous data); (3) it is useful, (4) it is novel or unexpected (it is not something that we know already).

Conclusion

In this blog post, I have given a broad overview of what is data mining. This blog post was quite general. I have actually written it because I am teaching a course on data mining and this will be some of the content of the first lecture. If you have enjoyed reading, you may subscribe to my Twitter feed (@philfv) to get notified about future blog posts.

—
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 120 data mining algorithms.

Posted in Big data, Data Mining, Data science | Tagged artificial intelligence, big data, data mining, data science, machine learning | 23 Comments

Unethical Reviewers in Academia!

Posted on 2020-08-15 by Philippe Fournier-Viger

In this blog post, I will talk about a common problem in academia, which is the unethical behavior of some reviewers that ask authors to cite several of their papers.

It is quite common that some reviewer will ask authors to cite his papers to increase his citation count. I have encountered this problem many times for my own papers when submiting to journals. Sometimes the reviewer will try to hide his identify by asking to cite four or five papers and include one or two from himself among those. But sometimes, it is very obvious as the reviewer will directly ask to cite many papers and they will all be from the same author. For example, just a few weeks ago, I received a notification for one of my papers where the reviewer wrote:

The related work needs improvement: Please add the following works:
…. title of paper 1 …
…. title of paper 2 ..
…. title of paper 3 ….
…. title of paper 4 …

That reviewer asked to cite four papers by the same person. In that case, it is very easy to guess who is the reviewer. In some cases, I have even seen two reviewers of the same papers both asking the author to cite their papers. Each of them was asking to cite about five of their papers. This was completely ridiculous and gave a very bad impressionabout the review process. This unethical behavior is quite common. If you submit many papers to journals, you will sooner or later encounter this problem, even for top 20 % journals.

Why it happens? The reason is that many universities consider citation count as an important metric for performance evaluation. Thus, some authors will try to artificially increase their citation count by forcing other authors to cite their papers.

So what are the solutions?

Authors facing this problem will often accept to cite the papers from the reviewer because they are afraid that the reviewer will reject the paper if they don’t. This is understandable. However, if the authors accept, this will encourage the reviewer to continue this unethical behavior for other papers. Thus, the best solution is to send an e-mail to the editor to let them know about it. This is what I do when I am in this situation. If you let the editor knows, the editor will normally take this into account and may even take some punitive actions like removing the reviewer from the journal.
To avoid this problem before it happens, some editors will read carefully the reviews and delete unethical requests by reviewers. However, this does not always happen because editors are often very busy and may not spend the time to read all comments made by reviewers. But it is good that some journal such as IEEE Access will put a disclaimer in the notification to inform authors that they are not required to cite papers that are not relevant to the article. This is a good way of preventing this problem.
Reviewers should only ask to cite papers that are relevant to the paper and will contribute to improving the quality of the paper. To avoid conflict of interests, a reviewer can suggest to cite a paper rather than tell authors that they must cite paper. This is more acceptable.

Conclusion

In this blog post, I have talked about some unethical behavior that many people have encountered when submiting to journals, and sometimes also for conferences. The reason why I wrote this blog post is that I have encountered this situation for two of my papers in the last two months and I have become quite tired to see this happen in academia.

If it also happened to you, please leave a comment below with your story. I will be happy to read it!

—
Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

Posted in Academia, Research | Tagged academia, ethics, review, reviewer | 1 Comment

How to find a good thesis topic in Machine Learning?

Brief Report about the PKDD 2020 conference

Why it takes so long for a journal paper to be reviewed?

(video) Top-K Cross-Level High Utility Itemset Mining

More problems on IONOS web hosting… 4 days of downtime!

A brief report about the IEA AIE 2020 conference

The Imposter Syndrome in Academia

Big problem on my website on IONOS webhosting!

An Introduction to Data Mining

Unethical Reviewers in Academia!

Archives

Categories

Recent Posts

Recent Comments

Number of visitors:

Related posts:

Related posts:

Related posts:

Related posts:

Related posts:

Related posts:

Related posts:

Related posts:

Related posts:

Related posts:

Archives

Categories

Recent Posts

Recent Comments

Tag cloud

Number of visitors: