Expensive Academic Conferences – the case of ICDM

I was recently thinking of attending IEEE ICDM 2018 (International Conference on Data Mining) in Singapore next month. It is a top 5 data mining conference. According to my schedule, I could attend it for 2 days, and since Singapore is close to China, it is convenient to go there. However, I was quite surprised by how expensive the registration fee of this conference has become. As of today, the “standard registration fee (by 28 October)” is roughly 1360 USD or 9300 CNY.

Registration fees from the ICDM 2018 website

This is actually the most expensive conference that I have ever considered attending. Most conferences that I have attended have been in the 300-700 USD range, about half the ICDM fee or less. But is it an outlier? To see more clearly, I decided to compare the standard registration fee of ICDM 2018 with those of previous editions of ICDM:

  • ICDM 2018: 1360 USD (11% increase from 2017)
  • ICDM 2017: 1220 USD (13% increase from 2015)
  • ICDM 2015: 1080 USD (28% increase from 2013)
  • ICDM 2013: 844 USD (68% increase from 2011)
  • ICDM 2011: about 500 USD

This is quite interesting. It shows a steady increase in the registration fee of the ICDM conference over the years. The fee has increased so much that it is now about 2.7 times what it was in 2011, seven years ago!
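The percentages in the list above can be rechecked with a few lines of Python. This is only a small sketch using the fees quoted in this post. The function name `percent_increases` is my own, and since the 2011 fee is approximate, the result for 2013 may differ from the list by about one point.

```python
# Standard registration fees (USD) as quoted in this post.
# The 2011 value is approximate ("about 500 USD").
fees = {2011: 500, 2013: 844, 2015: 1080, 2017: 1220, 2018: 1360}


def percent_increases(fees_by_year):
    """Return {year: percent increase over the previous listed edition}."""
    years = sorted(fees_by_year)
    return {
        later: 100.0 * (fees_by_year[later] - fees_by_year[earlier]) / fees_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }


increases = percent_increases(fees)
overall_ratio = fees[2018] / fees[2011]  # how many times the 2011 fee

for year, pct in increases.items():
    print(f"ICDM {year}: {pct:.1f}% increase over the previous listed edition")
print(f"The 2018 fee is {overall_ratio:.2f} times the 2011 fee")
```

Note that the listed editions are not all consecutive years, so each percentage covers one or two years.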

Why is it so expensive?

One could argue that the reason is the location of the conference. But the increase has been steady over the years no matter where the conference was organized. Moreover, such big conferences often have thousands of attendees, and usually many sponsors. I recently attended the KDD 2018 conference, which was also expensive, but less so than ICDM. There were more than 3000 attendees, and if I remember correctly, the conference received more than 1 million dollars in sponsorship.

So where does all this money go? A good part goes to renting a convention center, publishing the proceedings and other aspects such as providing scholarships to students. But many conferences also make a considerable profit. Some conferences are not for profit, while other conferences pay the local organizers or the association organizing the conference. I am not sure how the money is used in the case of ICDM or IEEE and what they do with the profits, as I could not find the information. But I believe that such big conferences can generate a huge amount of money. From discussions with organizers of smaller conferences (200 attendees) that have much lower registration fees and less sponsorship, I know that some conferences can still make a profit of 20,000 USD.

As for IEEE, ICDM is not its only conference in the 1000 USD range. Some other flagship conferences like IEEE ICC (about communication) also have fees greater than 1000 USD. In the field of data mining, the KDD conference is also quite expensive, although still less so than ICDM. In some ways, many people want to attend these conferences, so they are willing to pay these high fees.

Consequences of high registration fees

The consequence of such high registration fees is that some people may not have enough money to attend, and that a lot of money is spent by researchers. And in many cases, that money comes from research projects funded by the government. Thus, one could argue that this money could be used in better ways.

Personally, I was thinking of attending ICDM, but when I saw that I would have to pay almost 1400 USD for two days, I decided that it was not reasonable to spend that much money. I have enough research funding, but I still do not want to waste the money provided by the government for supporting research. Thus, this year, I will use the money for other things rather than going to ICDM.

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 100 algorithms for pattern mining.

Posted in Academia, Big data, Conference, Data Mining

Periodic patterns in Web log time series

Recently, I have analysed trends about visitors on this blog. I have made two observations. First, there are about 500 to 1000 visitors per day. For this, I want to thank you all for reading and commenting on the blog. Second, if we look carefully at the number of visitors per day, it forms a time series, and we can clearly see a pattern that repeats itself every week. Below is a picture of this time series for January 2018.

[Image: periodic visitor accesses]

As you can see, there is a clear pattern every week. Toward the beginning of the week, on Monday and Tuesday, the number of visitors increases, while around Friday it starts to decrease. Finally, on Saturday and Sunday, there is a considerable decrease, and then the number increases again on Monday. This pattern repeats itself every week. We can see it visually, but such patterns could also be detected using time series analysis techniques such as an autocorrelation plot. Besides, it would be easy to predict this time series using time series forecasting models.
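To illustrate the autocorrelation idea, here is a minimal sketch using NumPy. The visitor counts are synthetic numbers that I made up to mimic a weekly shape; they are not the real statistics of this blog. The helper `autocorrelation` is a simple sample estimate written for this example, not a function from a library.

```python
import numpy as np


def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    # Products of values that are `lag` days apart, normalized by the total variance.
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Synthetic daily visitor counts (an assumption, not real data): four weeks,
# Monday to Sunday, with fewer visitors on the weekend, plus a little noise.
rng = np.random.default_rng(0)
weekly_shape = np.array([900, 950, 920, 900, 800, 550, 500], dtype=float)
visitors = np.tile(weekly_shape, 4) + rng.normal(0, 20, size=28)

# Autocorrelation for lags of 1 to 14 days.
acf = {lag: autocorrelation(visitors, lag) for lag in range(1, 15)}
best_lag = max(acf, key=acf.get)
print(f"Strongest autocorrelation at a lag of {best_lag} days")
```

With a weekly pattern like this one, the autocorrelation should peak at a lag of 7 days, which is what an autocorrelation plot would show as a spike.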

We can also see a relationship with the concept of periodic patterns that I have previously discussed in this blog. A periodic pattern is a pattern that repeats itself regularly over time. That is all for today. I just wanted to share this interesting finding.


Posted in Data science, Time series, Web

Upcoming book: High Utility Itemset Mining: Theory, Algorithms and Applications

I am happy to announce that the draft of the book about high utility pattern mining has been finalized and submitted to the publisher (Springer). It should thus be published in the very near future.

The book contains 12 chapters written by several top researchers from the field of pattern mining, for a total of 350 pages. The title is “High Utility Itemset Mining: Theory, Algorithms and Applications”. It discusses high utility itemset mining and other related topics. Here is the table of contents:

Editors: Philippe Fournier-Viger, Jerry Chun-Wei Lin, Bay Vo, Roger Nkambou, Vincent S. Tseng.

  • Chapter 1: A Survey of High Utility Itemset Mining
    Philippe Fournier-Viger, Jerry Chun-Wei Lin, Tin Truong Chi, Roger Nkambou
    This chapter gives an introduction of more than 39 pages to high utility pattern mining, designed to provide a quick overview of the field and the main results.
  • Chapter 2: A Comparative Study of Top-K High Utility Itemset Mining Methods
    Srikumar Krishnamoorthy
    This chapter gives an in-depth discussion of top-k high utility itemset mining, including a very detailed comparison of the state-of-the-art algorithms.
  • Chapter 3: A Survey of High Utility Pattern Mining Algorithms for Big Data
    Morteza Zihayat, Mehdi Kargar, Jaroslaw Szlichta
    This chapter reviews algorithms for mining high utility patterns in big data.
  • Chapter 4: A Survey of High Utility Sequential Pattern Mining
    Tin Truong Chi, Philippe Fournier-Viger
    This chapter provides a survey of  high utility sequential pattern mining. It contains several new theoretical results and a very detailed comparison of upper-bounds and algorithms.
  • Chapter 5: Efficient Algorithms for High Utility Itemset Mining without Candidate Generation
    Jun-Feng Qu, Mengchi Liu, Philippe Fournier-Viger
    This chapter presents the HUI-Miner algorithm and a novel extension called HUI-Miner*, which improves its performance in many situations.
  • Chapter 6: High Utility Association Rule Mining
    Loan T.T. Nguyen, Thang Mai, Bay Vo
    This chapter discusses another important topic, the discovery of high utility association rules.
  • Chapter 7: Mining High-utility Irregular Itemsets
    Supachai Laoviboon, Komate Amphawan
    This chapter considers the time dimension in high utility itemset mining to find regular patterns.
  • Chapter 8: A Survey of Privacy Preserving Utility Mining
    Duy-Tai Dinh, Van-Nam Huynh, Bac Le, Philippe Fournier-Viger, Ut Huynh, Quang-Minh Nguyen
    This chapter provides an overview of techniques for hiding high utility patterns for privacy purposes.
  • Chapter 9: Extracting Potentially High Profit Product Feature Groups by Using High Utility Pattern Mining and Aspect based Sentiment Analysis
    Seyfullah Demir, Oznur Alkan, Firat Cekinel, Pinar Karagoz
    This chapter presents an interesting application of high utility pattern mining related to sentiment analysis.
  • Chapter 10: Metaheuristics for Frequent and High-Utility Itemset Mining
    Youcef Djenouri, Philippe Fournier-Viger, Asma Belhadi, Jerry Chun-Wei Lin
    This chapter provides a survey of evolutionary and swarm intelligence algorithms for high utility itemset mining.
  • Chapter 11: Mining Compact High Utility Itemsets without Candidate Generation
    Cheng-Wei Wu, Philippe Fournier-Viger, Jia-Yuan Gu, Vincent S. Tseng
    This chapter presents algorithms for mining closed and maximal high utility itemsets. It includes a novel strategy for identifying maximal patterns when using a depth-first search.
  • Chapter 12: Visualization and Visual Analytic Techniques for Patterns
    Wolfgang Jentner and Daniel A. Keim.
    This chapter discusses the problem of visualizing the patterns found.

This will be a very good book with many great contributions, and I am excited that it will be published soon. I will keep you updated on this blog as we get closer to the release.


Posted in Big data, Data Mining, Data science, Pattern Mining, Utility Mining

What I don’t like about academia

In this blog post, I will talk about academia. There are numerous things that I like about academia, and I really enjoy working in academia. But for this blog post, I will try to talk about what I don’t like in academia to give a different perspective.

Even when we like something very much, there are always some things that we don’t like. So, here we go. Here is a list of some things that I more or less dislike in academia:

  • A sometimes excessive pressure to publish. There is sometimes great pressure on researchers to produce many publications in a given time frame, which may come from various sources. It is in part necessary as it increases productivity and ensures that researchers do not become lazy. But a drawback is that some researchers may be less willing to take risks or may focus on short-term projects rather than on more difficult but more rewarding projects.
  • Conflicts of interest at various levels. A researcher should avoid conflicts of interest. However, not everyone does, and this is a problem. A few years ago, for example, I was a program committee member of a conference and discovered that a reviewer had reviewed his own paper. I reported this issue to the conference organizers and that person was kicked out of the program committee. Another example is journal reviewers who always ask in their reviews that we cite their papers, even if they are not relevant to our paper, just to increase their citation count. In my field, there is one reviewer who is especially known for doing this, as several researchers have talked to me about him. This is not good behavior and I usually report it to the journal editor, but since reviewers work for free, there is typically no consequence for such people. A third example is that some researchers give preferential treatment to their friends. For example, I once attended a conference where three of the awards were handed to collaborators of the conference organizer. Although these papers may be good, it remains suspicious. Another example is from when I was applying for jobs in Canada, several years ago. At that time, I was one of the two remaining candidates for a professor position, but in the end the other, much less experienced researcher was chosen, due to a likely conflict of interest.
  • Predatory journals and conferences. There are many journals of very low quality that only publish to earn money. These journals usually have a very broad scope, are published by unknown publishers and sometimes appear not to review papers. They also often send spam to promote their journals. This is a problem, and I obviously dislike such journals.
  • Unethical publications by some researchers. I have discovered and reported several journal papers that contained plagiarism. These papers have generally been retracted, as they should be. But in some cases, unethical behavior is not so easy to detect. For example, I have read some papers where I thought that the results were fake, but there was not enough evidence to prove it. It certainly happens that some researchers publish fake results, which is bad for academia.
  • Publishers that sometimes are too greedy. It is well known that some publishers charge very high fees to universities and individuals to publish and/or access research publications. This is somewhat unfortunate because research is often funded by a government, done by researchers and reviewed for free by reviewers, while publishers are those earning money. It would be difficult to change this as popular publishers are well established and there is pressure to keep this system. On the other hand, this publication system is not that bad. Actually, the good publishers filter out many bad papers and ensure minimum quality levels for papers, which is important.
  • Insufficient funding for research in some countries. Currently, I have a lot of funding, so I cannot complain about insufficient funding. But in some other countries, funding is quite rare and often insufficient for researchers in academia. This was the case when I was working in Canada. To apply for national funding from NSERC, we had to write a budget requesting large amounts of money, but one was considered lucky to get even a fraction of it. Thus, not much money was available for students, for attending conferences, for publications, and for buying equipment. Besides, there are not enough professors at several universities in countries like Canada.
  • Reviewers that do not do their job well. As researchers, our work is evaluated by other researchers to determine if it should be published in a given conference proceedings or journal. Generally, reviewers do a good job and do it for free, which is very much appreciated. However, in some cases, reviewers don’t do their job correctly. For example, it once happened to me that a reviewer rejected my paper because he thought the problem could be solved in a simpler way. But the solution proposed by the reviewer in his review was wrong. Having said that, a reviewer often misunderstands a paper because it is not well written. Thus, such situations are often to be blamed on authors rather than reviewers. And often when a paper is rejected, there are multiple problems in the paper.
  • Unprofessional behavior. In some cases, some researchers have highly unprofessional behavior. This was for example the case for the ADMA 2015 conference, which was canceled without notifying authors, after papers had been submitted. The website just went offline and organizers just ignored emails.
  • Bad paper presentations. I have attended many international conferences. Sometimes paper presentations are good. But sometimes they are not good. There are several easily avoidable mistakes that a presenter should not make, such as turning his back to the audience, exceeding the time limit, and not being prepared.

This is all for today! I just wanted to share some things that I don’t like about academia. But actually, I really like academia. You can share your own perspective on academia in the comments below, or perhaps share solutions on how to improve academia. 😉

==
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

Posted in Academia, General, Research

News about the data mining blog

This data mining blog was created more than five years ago and has had considerable success, with more than 800,000 views. For this, I want to thank all the readers. Today, I will announce some important news related to this blog.

Translation of the blog

The first piece of news is that the blog will be translated to make it more accessible in other languages. Since I work in China and there is a very large Chinese data mining community, I have recently added a Chinese translation of the data mining blog. It can be accessed by clicking the following link in the menu of this website.

[Image: chinese blog]

In the Chinese version of the data mining blog, not all blog posts will be translated, only the most important ones. Currently, four posts have been translated. I have published two and the others will be published in the following weeks.

[Image: chinese data mining]

I am also considering adding a French translation since I am a native French speaker. Other languages such as Vietnamese and Spanish could also be added if volunteers are willing to help me translate.

Video tutorials about data mining and big data

The second piece of news is that I am currently experimenting with software to record lectures and publish them online as HTML5 videos. In the near future, I will start publishing various videos about data mining. This will include some lectures that I have given, as well as some tutorials for my SPMF data mining software. I will also record some video tutorials to present some classical data mining algorithms. Moreover, I will discuss why recording videos can be useful to promote research, in a future blog post.

Conclusion

In this blog post, I have given some news about future plans for the blog. Thanks again for reading and commenting. I am also looking for contributors. If you would like to contribute as a guest author or translator, just let me know.


Posted in Academia, Big data, Data Mining, General

Report about the DEXA 2018 and DAWAK 2018 conferences

This week, I am attending the DEXA 2018 (29th International Conference on Database and Expert Systems Applications) and the DAWAK 2018 (20th Intern. Conf. on Data Warehousing and Knowledge Discovery) conferences from the 3rd to 6th September in Regensburg, Germany.

Those two conferences are well-established European conferences dedicated mainly to research on databases and data mining. These conferences are always co-located. It is not the first time that I have attended these conferences. I previously attended DEXA 2016 and DAWAK 2016 in Portugal.

These conferences are not in the top 5 of their fields but are still quite interesting, usually with some good papers. The proceedings of the conferences are published by Springer in the LNCS (Lecture Notes in Computer Science) series, which ensures that the papers are indexed by various academic databases.

Acceptance rates

For DEXA 2018, 160 papers were submitted, 35 were accepted as full papers (22%), and 40 as short papers (25%).

For DAWAK 2018, 76 papers were submitted, 13 were accepted as full papers (17%), and 16 as short papers (21%).
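For readers who want to check these rates, here is a small sketch that recomputes them from the counts given above. The function name `acceptance_rates` and the "overall" rate (full and short papers together) are my own additions for illustration; the conferences did not report an overall rate in this form.

```python
# Submission and acceptance counts as reported in this post.
counts = {
    "DEXA 2018": {"submitted": 160, "full": 35, "short": 40},
    "DAWAK 2018": {"submitted": 76, "full": 13, "short": 16},
}


def acceptance_rates(entry):
    """Return the full, short and combined acceptance rates in percent."""
    submitted = entry["submitted"]
    return {
        "full": 100.0 * entry["full"] / submitted,
        "short": 100.0 * entry["short"] / submitted,
        "overall": 100.0 * (entry["full"] + entry["short"]) / submitted,
    }


rates = {name: acceptance_rates(entry) for name, entry in counts.items()}
for name, r in rates.items():
    print(f"{name}: full {r['full']:.1f}%, short {r['short']:.1f}%, overall {r['overall']:.1f}%")
```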

Location

The conference is held at University of Regensburg, in Regensburg, a relatively small town with a long history, about 1 hour from Munich. It is a UNESCO world heritage site. The university:

[Image: dexa 2018 location]

A picture of the old town:

[Image: regensburg dexa]

Why do I attend these conferences?

This year, my team and collaborators have four papers at these conferences, on topics related to high utility itemset mining, periodic pattern mining and privacy preserving data mining:

  • Fournier-Viger, P., Zhang, Y., Lin, J. C.-W., Fujita, H., Koh, Y.-S. (2018). Mining Local High Utility Itemsets. Proc. 29th International Conference on Database and Expert Systems Applications (DEXA 2018), Springer, to appear.
  • Fournier-Viger, P., Li, Z., Lin, J. C.-W., Fujita, H., Kiran, U. (2018). Discovering Periodic Patterns Common to Multiple Sequences. 20th Intern. Conf. on Data Warehousing and Knowledge Discovery (DAWAK 2018), Springer, to appear.
  • Lin, J. C.-W., Zhang, Y. Y., Fournier-Viger, P., … (2018) A heuristic Algorithm for Hiding Sensitive Itemsets. 29th International Conference on Database and Expert Systems Applications (DEXA 2018), Springer, to appear.
  • Lin, J. C.-W., Fournier-Viger, P, Liu, Q., Djenouri, Y., Zhang, J. (2018) Anonymization of Multiple and Personalized Sensitive Attributes. 20th Intern. Conf. on Data Warehousing and Knowledge Discovery (DAWAK 2018), Springer, to appear.

The first two papers are projects of my master's degree students, who will also attend the conference. Besides, I will also chair some sessions of both conferences.

Another reason for attending this conference is that it is a European conference. Thus, I can meet some European researchers that I usually do not meet at conferences in Asia.

Day 1

I first registered. The process was quick. We received the proceedings of the conference on a USB drive, and a conference bag.

[Image: dexa 2018 proceedings]

I attended several talks from both the DEXA 2018 and DAWAK 2018 conferences on the first day. Here is a picture of a lecture room.

[Image: dexa 2018 lecture]

There was also an interesting keynote talk about database modelling.

[Image: dexa keynote]

In the evening, a reception was held at the old town hall.

Day 2

The second day had several more presentations. In the morning, I was the chair of the session on classification and clustering. A new algorithm that enhances the K-Means clustering algorithm was proposed, which has the ability to handle noise. An interesting presentation by Franz Coenen proposed an approach where data is encrypted and then transmitted to a distant server offering data mining services such as clustering. Thanks to the encryption techniques, privacy can then be ensured. In the morning, there was also a keynote about “smart aging”. I did not attend it though, because I instead had a good discussion with collaborators.

Day 3 – Keynote on spatial trajectory analysis

There was a keynote about “Spatial Trajectory Analytics: Past, Present and Future” by Xiaofang Zhou. It is a timely topic as nowadays we have a lot of trajectory data in various applications.

[Image: dexa trajectory data keynote]

What is trajectory data? It is the traces of moving objects. Each object can be described using time, spatial positions and other attributes. An example of trajectory data is the movement of cars. Such trajectory data can be obtained from the GPS of cars. Another example is the trajectory of mobile phones. Trajectory data is not easy to analyze because it only samples the movement of an object. Besides, trajectories are influenced by the environment (e.g. a road may be blocked). Other challenges are that data may be inaccurate and some data points may be redundant.

[Image: trajectory data]
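To make this description more concrete, here is a minimal sketch of how a trajectory could be represented in Python. It is not taken from the keynote. The class `TrajectoryPoint`, the function `drop_redundant_points` and the sample coordinates are all hypothetical, and I assume here that a point is "redundant" when it has exactly the same position as the previously kept point (e.g. a car waiting at a red light).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrajectoryPoint:
    """One sample of a moving object: a time and a spatial position."""
    timestamp: float  # seconds since an arbitrary reference time
    latitude: float
    longitude: float


def drop_redundant_points(points: List[TrajectoryPoint]) -> List[TrajectoryPoint]:
    """Sort points by time and drop those that repeat the last kept position."""
    kept: List[TrajectoryPoint] = []
    for p in sorted(points, key=lambda q: q.timestamp):
        if kept and (p.latitude, p.longitude) == (kept[-1].latitude, kept[-1].longitude):
            continue  # same position as before: treated as redundant (assumption)
        kept.append(p)
    return kept


# Hypothetical GPS samples of a car that stops for a while and then moves on.
car = [
    TrajectoryPoint(0.0, 49.0130, 12.1010),
    TrajectoryPoint(10.0, 49.0140, 12.1020),
    TrajectoryPoint(20.0, 49.0140, 12.1020),  # car is stopped
    TrajectoryPoint(30.0, 49.0140, 12.1020),  # car is still stopped
    TrajectoryPoint(40.0, 49.0150, 12.1030),
]
cleaned = drop_redundant_points(car)
print(f"{len(car)} points before cleaning, {len(cleaned)} after")
```

Real systems would likely use a distance threshold rather than exact equality, since GPS positions are noisy.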

Trajectory data can be used in many useful ways such as route planning, point-of-interest recommendation, environment monitoring, urban planning, and resource tracking and scheduling. Trajectory data can also be combined with other types of data.

[Image: trajectory data applications]

But how to process trajectory data? Basically, we need to monitor the objects to collect the trajectories, store them in databases (which may provide various views, queries, privacy support, and indexing), and then the data can be analyzed (e.g. using techniques such as clustering, sequential pattern mining or periodic pattern mining). Here is a proposed architecture of a trajectory analysis system:

[Image: trajectory data analysis]

This is a first book written by the presenter in 2011 about spatial trajectory mining:

[Image: trajectory data book]

Here are some important topics in trajectory analysis:

[Image: trajectory analysis research topics]

Then, the presenter discussed some specific applications of trajectory data analysis. Overall, it was an interesting introduction to the topic.

Day 3 – Banquet

In the evening, attendees were invited to a tour of a palace, and then to a banquet in a German restaurant.

[Image: dexa dawak banquet]

Day 4

On the last day, there were more paper presentations and another keynote.

Next year

DAWAK 2019 and DEXA 2019 will be hosted in Linz, Austria from the 26th to the 29th August 2019.

Best paper award

The best paper award was given to the paper “Sequence-based Approaches to Course Recommender Systems” by Osmar Zaiane et al. It presents a system to recommend undergraduate courses to students. This system applies algorithms for sequential pattern mining and sequence prediction, among others, to select relevant courses.

Conclusion

Overall, the quality of papers was relatively high, and I was able to meet several researchers related to my research. It was thus a good conference to attend.


Posted in Big data, Conference, Data Mining, Data science

China’s lead in mobile payment and services

In this blog post, I will talk about the wide adoption of mobile payment and mobile services in China. I have been working in China for several years and I am still quite amazed by everything that can be done with a cellphone there.

China’s mobile payment systems

A fundamental difference with many western countries is that mobile payment is widely used in China and that virtually everything can be paid with a cellphone, from buying something from a street vendor to paying a bill in a restaurant, or transferring money to a friend.

There are two main mobile payment systems in China, called WeChat (by Tencent) and Alipay (by Alibaba). To use a mobile payment system, one needs to download an application on one's cellphone, validate one's identity, and generally link the application to a bank account for transferring money to the virtual wallet. This can be done in just a few minutes. I will describe the main functions of these applications below.

The core function of WeChat is messaging. It allows users to maintain a list of friends, send messages, and make voice/video calls. But WeChat can also be used for mobile payments. The main payment features are:

  • Transfer money from a bank account to the virtual WeChat wallet to refill it.
  • Send money to a friend.
  • Send money to or receive money from someone else by scanning a QR code on that person's cellphone or letting that person scan your QR code.
  • Pay a bill at a store. This requires scanning the QR code of the store with the cellphone and then entering the amount of money and a password. Then, the store owner receives the money. Another way is to let the store owner scan your QR code to withdraw money from your account.
  • Pay for a wide variety of services such as:
    • Pay utility bills such as water and electricity
    • Pay the bills of your cellphone
    • Order food to be delivered to your door
    • Order food at the restaurant by viewing the menu on the cellphone and selecting items
    • Order a taxi or ride
    • Rent a public bike by scanning the QR code of the bike
    • Order cinema tickets
    • Reserve airplane tickets, train tickets or a hotel room
    • Use your cellphone as a ticket in the bus/subway if the cellphone has NFC technology
    • Buy products from online retail stores
    • Send money to charity
    • and many others

The other main payment system is Alipay. Unlike WeChat, Alipay is not a messaging application. It is designed for mobile payment and is actually more popular than WeChat. It offers mostly the same functions. Besides, some other functions that I did not mention above are:

  • Pay a credit card
  • Split the bill between friends at a restaurant
  • Buy games
  • Buy lottery tickets

The WeChat and Alipay mobile payment systems are used every day by hundreds of millions of people. I know many people in China who basically use them to pay for everything in their daily life, and don’t use cash anymore. Actually, mobile payment is often the preferred method of payment in stores. For example, I recently bought some milk tea at a store and the employee asked me to pay with Alipay instead of cash because he did not have change.

This is quite different from many western countries where mobile payment is rarely used. For example, Business Insider (https://www.businessinsider.com/alipay-wechat-pay-china-mobile-payments-street-vendors-musicians-2018-5/) reported in May 2018 that the mobile payment market in China is valued at 16 trillion, while in the US, it is only 112 billion. In other words, the mobile payment market is more than 140 times larger in China than in the US.

What is the reason for the wide adoption of mobile payment in China? 

There are several reasons:

  • Cellphone plans are very cheap. Thus, many people have a cellphone with a data plan.
  • Using these payment systems is very simple. To pay, one scans a QR code or lets someone scan one's own QR code, and then enters a password to authorize the payment. It can be done for any kind of transaction, between individuals or at a store. Anyone can receive or send money.
  • There is almost no fee to pay when using these payment systems. For example, the only fee that WeChat charges is 0.1% when transferring money back from a virtual wallet to a bank account (if the amount exceeds 1000 RMB, which is about 150 USD). These fees are almost nothing compared to the processing fees of credit cards or debit cards in many western countries.
  • Creating a mobile payment account is simple and basically just requires linking the account to a bank account. This is much easier than getting a credit card, since mobile payment systems are not used to borrow money.
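To give an idea of how small the fee mentioned in the list above is, here is a minimal sketch of the calculation. The 0.1% rate and the 1000 RMB threshold come from this post. The function name `withdrawal_fee` is hypothetical, and I assume that only the part of the amount above the threshold is charged, which may not match the exact rule applied by WeChat.

```python
from decimal import Decimal

FEE_RATE = Decimal("0.001")       # 0.1%, as quoted in this post
FREE_THRESHOLD = Decimal("1000")  # RMB, as quoted in this post


def withdrawal_fee(amount_rmb: Decimal) -> Decimal:
    """Fee for moving money from the virtual wallet back to a bank account.

    Assumption: only the part of the amount above the threshold is charged.
    """
    chargeable = max(amount_rmb - FREE_THRESHOLD, Decimal("0"))
    return (chargeable * FEE_RATE).quantize(Decimal("0.01"))


for amount in (Decimal("500"), Decimal("5000")):
    print(f"Transferring {amount} RMB back to the bank costs {withdrawal_fee(amount)} RMB")
```

Under this assumption, transferring 5000 RMB back to a bank account would cost only a few RMB.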

Impact on innovation and adoption of mobile services

The fact that mobile payments are widely used in China has started to transform many aspects of daily life. For example, at a restaurant, it is possible to scan a QR code on a table to see the menu and then order food, which will then be delivered to the table. Another example is scanning the QR code of a bike on the street to unlock the bike, pay to use it, and then leave it anywhere after using it. A third example is going to a restaurant with friends and then splitting the bill or quickly transferring money between phones, or using the phone to pay in the bus or subway. A fourth example is paying at a vending machine by scanning a QR code.

The wide usage of mobile payment creates huge opportunities for the development of innovative mobile services in China that cannot be offered on a large scale in other countries. Thus, I believe that this is a key advantage that helps drive innovation in China for mobile services.

Conclusion

In this blog post, I discussed the adoption of mobile payment and mobile services in China. I hope that it has been interesting! If you have any comments, please write them in the comment section below.


Posted in General, Industry, Mobile technology

Report about the KDD 2018 conference

This week, I am participating in KDD 2018 (24th ACM SIGKDD Intern. Conference on Knowledge Discovery and Data Mining), in London, UK, from the 19th to the 23rd of August 2018.

The KDD conference is an international conference, established 24 years ago. It is the top conference in the field of data mining / data science / big data. The proceedings are published by ACM. This year, more than 3,000 people registered for the conference, which is huge. Many researchers from industry are attending the conference. KDD was held at the Excel convention center:

[Image: kdd london excel]

Day 1 – Registration

On the first day, registration started at 7:00 AM and the tutorials started at 8:00 AM. I arrived at around 7:45 and had to wait about 20 minutes in line to register. The problem was that there were hundreds of people who all wanted to register at the same time and maybe only six volunteers to serve them. Thus, several people arrived late at the tutorials. However, I don’t blame the organizers, because this is hard to avoid for such a big conference, and this year set an attendance record. Here are the waiting lines:

The  registration desk:

We received a conference bag containing a USB stick with the proceedings, a pen, a notebook and various promotional materials from businesses.
kdd registration

An app called Whova was offered for our cellphones. This app lets attendees see everyone attending the conference, create discussion groups and view the conference schedule. These three features are respectively shown in the three screenshots below.

Day 1 – Tutorials

After registration, I attended a tutorial about data mining in online retail stores, organized by JD.com (Jingdong). I also attended a tutorial on fact checking in the afternoon and part of the workshop on explainable models for healthcare. Actually, there were more than 10 tutorials at the same time, and many seemed interesting, but I could not attend all of them!

Day 2 – The 1st International Workshop on Utility-Driven Mining (UDM 2018)

This year, I co-organized the first workshop on utility mining (UDM 2018). The workshop is about finding patterns in databases that have a high utility or importance (e.g. high profit). The workshop program included a keynote by Nitesh Chawla about decision-making, which was unfortunately cancelled due to unexpected events. However, we still had a great workshop with seven paper presentations. The published papers can be found on the workshop page.  Among these papers, here is a brief description of some interesting ideas:

All the presentations have been recorded and will be made available online in the future. Besides, a special issue in a Springer journal is being organized for the best papers, and a Springer book is planned for the proceedings of the workshop.

Day 2 – Opening ceremony

The opening ceremony was also on the second day. Here are some pictures of the venue and some interesting slides about the conference.

kdd 2018 opening

kdd research track

Some statistics about the “Applied data science” track:

kdd 5

kdd7

Day 2 – poster session

On the evening of the second day, there was also a poster session, which is always a good opportunity to meet new researchers and discuss research.

kdd poster session 1

Day 2 – Evening with Jingdong

I was invited to a special private event, an evening with JD.com (Jingdong). JD.com is one of the top online retail companies in China, and also one of the biggest technology companies in the world. There was a panel with several high-profile researchers such as Philip S. Yu, Jiawei Han, Jian Pei, and Christos Faloutsos, as well as presentations of research and products at JD.com. Moreover, there was live music, a dinner and drinks. I had good discussions with people from JD.com, which allowed me to establish several new contacts there. I am quite impressed with what they are doing. A few pictures from that event:

jd.com at kdd

kdd 2018 jd

kdd jingdong event

kdd jingdong 2018

kdd 2018 jingdong

Day 3 – Deep learning with Keras, hands-on tutorial by Google

I attended this event to see what is going on in the deep learning area, but I was disappointed. I arrived 30 minutes early to reserve a seat and then realized that we had to download about 5 GB of material for the tutorial onto our laptops. This was not possible on a 40 k/s Wi-Fi connection. I expected that I could at least watch a live demo or follow the tutorial on the screen, but that was not the case. The presenter basically just talked for five minutes with maybe 5 slides, and then let everyone work by themselves (which we could have done by ourselves at home). After 1 hour, it was clear that the presenter would not do any live demo or explain much on the screen, so I left. Here is a picture from that tutorial:

kdd keras tutorial

Day 3 – Panel on what is a data scientist 

I attended a panel discussion about what a data scientist is, with panelists from both academia and industry.

kdd panel

I noted a few key points of that discussion, which I report below. Hopefully, my notes are accurate 😉

  • Hamit Hamotcu (Analytics center):
    • There is some confusion and many different titles. Some people even have the same title in the same company but very different backgrounds, such as a bachelor's degree in economics with an online degree in data science versus a PhD in machine learning. These people are certainly not doing the same thing.
  • Ravi Kumar (Google):
    • Data scientist may be a casual term that is more general than other terms like machine learning expert.
  • Kjersten Moody (State Farm Insurance):
    • Data analyst is more about reporting.
  • Narendra Mulani (Accenture):
    • At Accenture, “data scientist” is defined using a set of competencies.  If Accenture recruits someone having a software engineering, optimization, or machine learning background, Accenture will then help him develop his competencies with time and training to turn him into a data scientist.
    • He would like curricula to be standardized across universities.
  • Claudia Perlich (Two Sigma):
    • She does not want to debate what is or is not a data scientist. She is more interested in the “stuff” getting done.
  • Jeanette Wing (Columbia University):
    • It is not good to have a plethora of titles that are meaningless. There should be some effort to standardize the titles.  There should be some dialogue between industry and academia to achieve that.
    • “To be honest, I don’t know what is the difference between data scientist, data analyst, data engineer, etc. ” But in their program, they have computer science courses (machine learning, computer systems, distributed systems etc.) and others about statistics.  Data science is more than an agglomeration of computer science and statistics.  Students in their program must do a project with real data.
    • There are now over 200 programs in data science in the US. We should have some minimum requirements for the skillset of a data scientist.  The ACM is interested in coming up with a standardized curriculum.
  • Some people from Spotify in the audience:
    • What is the difference between “Research scientist” and “Data scientist”?

Location of KDD 2019 and KDD 2020 

It was announced that KDD 2019 will be held in the city of Anchorage (Alaska), USA. Then, KDD 2020 will be held in San Diego, USA. In other words, the next two KDD conferences will be in the USA. Personally, I would have preferred that they be held in different countries.

Day 4

On the fourth day, there were again several activities and talks. In the afternoon, I attended the presentation of a company called Yixue, which has an intelligent e-learning system for students in China. Their system is quite impressive.

Then, in the evening, I attended the banquet. It was a buffet. It was reasonably good, but the choice of food was quite limited. Most importantly, I had some good discussions with other researchers.

Then, after the banquet, I went to a cocktail reception organized at a nearby hotel by Element AI, a leading artificial intelligence company from Montreal, Canada. This was a great event.

Day 5

On the fifth day, there were more talks. I also visited the exhibition of company products again. Then, this was the end of the conference.

Conclusion

Overall, this was a great conference. What I like about KDD is that many companies attend. For those in academia, it is good to see what is happening in industry, and for those from industry, it is good to learn about the latest research from academia. Besides, KDD is so big that it is possible to talk with many researchers.

Hope you have enjoyed reading this post. In about 1 week, I will be going to the DEXA and DAWAK conferences. I will also write blog posts about these conferences. Then, later this autumn, I should attend the ICDM conference.

==
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

Posted in Big data, Conference, Data Mining, Data science

A Model for Football Pass Prediction (source code + dataset)

In this blog post, I will discuss the data challenge of the Machine Learning for Sport Analytics workshop (MLSA 2018) at PKDD 2018. The challenge consisted of predicting the receivers of football passes (pass prediction).

football pass prediction

I will first briefly describe the data and then give an overview of my model called FPP (Football Pass Predictor) that was accepted as a paper in the workshop.

The dataset

The football pass prediction dataset consists of records describing thousands of football passes made during fifteen football matches of a Belgian team against other teams. Each record is a football pass. It gives the X, Y positions of the 14 players of each team (but at any time, not all players are on the field), the timestamps at which the pass started and ended, and the player who sent the pass and the one who received it. One limitation of the data is that all records of the fifteen matches are shuffled, so a pass cannot be analyzed within its context in the overall football game. Besides, it is unclear if the X, Y positions were recorded at the time that the pass started or ended. The names of teams and players are not provided either, nor is whether a team is playing on the left or right side of the field (although this information can be inferred from player positions).

The goal

The goal of the challenge is to predict which player will receive each pass. However, no evaluation criteria were proposed for the challenge. Moreover, the organizers did not split the data into training and testing sets to evaluate solutions. Thus, I decided to simply use accuracy as the evaluation measure. The accuracy is the number of correct predictions divided by the total number of predictions (records). Moreover, I also considered the accuracy when two predictions are made instead of one.
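To make the evaluation measure concrete, here is a minimal Python sketch of the accuracy for one guess and for two guesses. It is only an illustration: the function name `top_k_accuracy` and the toy data are assumptions of this example and are not taken from the FPP source code (which is written in Java).

```python
def top_k_accuracy(ranked_predictions, actual_receivers, k=1):
    """Fraction of passes whose true receiver is among the first k guesses."""
    if not actual_receivers:
        raise ValueError("at least one pass record is required")
    hits = sum(
        1
        for ranked, actual in zip(ranked_predictions, actual_receivers)
        if actual in ranked[:k]
    )
    return hits / len(actual_receivers)


# Hypothetical toy data: for each pass, the guessed receivers (best guess
# first) and the player id who actually received the ball.
ranked_predictions = [[7, 3], [2, 9], [5, 1], [4, 8]]
actual_receivers = [7, 9, 6, 4]

top1 = top_k_accuracy(ranked_predictions, actual_receivers, k=1)
top2 = top_k_accuracy(ranked_predictions, actual_receivers, k=2)
print(top1, top2)
```

With two guesses, a prediction counts as correct if either guess is the true receiver, so the two-guess accuracy can never be lower than the one-guess accuracy.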

The Football Pass Predictor model

Since the dataset is quite simple and I did not have much time, I designed a simple model to solve the problem of pass prediction. The model consists of a set of heuristics. After defining each heuristic, I fine-tuned its parameters by hand to achieve a high accuracy. If I had had more time, I would have used a genetic algorithm to automatically tune the parameters. I tried many heuristics and kept the ones that increased accuracy. I will give an overview of the model below.

The model is based on the observation that few passes are intercepted (less than 15%). Thus, it makes sense to only predict that passes will succeed. The model uses four heuristics:

  • A player generally prefers to send the ball to the closest player of the same team.
  • A player is less likely to send the ball to a player if this player is close to a player from the opposite team.
  • A player is less likely to send the ball to a player if this player is close to two players from the opposite team.
  • A player generally prefers to send the ball forward rather than backward.

Using these heuristics, the proposed model called FPP (Football Pass Predictor) can achieve more than 33% accuracy for one guess, and more than 50% for two guesses. This is considerably more than a random prediction model, which achieves about 8%.
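The four heuristics can be sketched as a scoring function that ranks the teammates of the sender. The Python code below is only a rough illustration under my own assumptions: the weights, the radius and the function names are hypothetical, and the way the heuristics are combined (adding penalties to the distance) is a simplification, not the tuned model from the paper or the Java source code.

```python
import math

# Hypothetical parameters, chosen only for illustration. The values that
# were tuned by hand for FPP are not reproduced here.
OPPONENT_RADIUS = 5.0         # how close an opponent must be to "mark" a player
ONE_OPPONENT_PENALTY = 10.0   # heuristic 2
TWO_OPPONENTS_PENALTY = 20.0  # heuristic 3
BACKWARD_PENALTY = 5.0        # heuristic 4


def score(sender, candidate, opponents, attack_direction):
    """Lower is better. Positions are (x, y); attack_direction is +1 or -1 along X."""
    s = math.dist(sender, candidate)  # heuristic 1: prefer the closest teammate
    near = sum(1 for o in opponents if math.dist(candidate, o) <= OPPONENT_RADIUS)
    if near == 1:
        s += ONE_OPPONENT_PENALTY
    elif near >= 2:
        s += TWO_OPPONENTS_PENALTY
    if (candidate[0] - sender[0]) * attack_direction < 0:
        s += BACKWARD_PENALTY  # the pass would go backward
    return s


def rank_receivers(sender, teammates, opponents, attack_direction=1):
    """Return teammate ids sorted from most to least likely receiver."""
    return sorted(
        teammates,
        key=lambda pid: score(sender, teammates[pid], opponents, attack_direction),
    )


# Toy example: player 2 is closest but marked, player 3 is free,
# and player 4 is behind the sender.
teammates = {2: (10, 0), 3: (12, 8), 4: (-12, 0)}
opponents = [(11, 1), (30, 30)]
ranking = rank_receivers((0, 0), teammates, opponents)
print(ranking)
```

The first element of the ranking is the single guess, and the first two elements are the two guesses used for the second accuracy figure.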

I also tried more complex heuristics, such as checking whether a player of the opposite team is between the sender and a potential receiver by calculating angles, but this did not improve accuracy.
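For readers curious about what such an angle-based check could look like, here is one possible sketch in Python. The threshold `max_angle` and the function names are assumptions of this example, and this is not necessarily how the check was implemented in my experiments.

```python
import math


def angle_at_sender(sender, receiver, opponent):
    """Angle in degrees between the sender->receiver and sender->opponent directions."""
    rx, ry = receiver[0] - sender[0], receiver[1] - sender[1]
    ox, oy = opponent[0] - sender[0], opponent[1] - sender[1]
    cross = rx * oy - ry * ox
    dot = rx * ox + ry * oy
    return abs(math.degrees(math.atan2(cross, dot)))


def blocks_pass(sender, receiver, opponent, max_angle=10.0):
    """True if the opponent is nearer to the sender than the receiver is
    and lies within max_angle degrees of the pass line (hypothetical rule)."""
    closer = math.dist(sender, opponent) < math.dist(sender, receiver)
    return closer and angle_at_sender(sender, receiver, opponent) <= max_angle


on_the_line = blocks_pass((0, 0), (20, 0), (10, 1))   # almost on the pass line
off_the_line = blocks_pass((0, 0), (20, 0), (10, 8))  # far from the pass line
behind = blocks_pass((0, 0), (20, 0), (30, 0))        # behind the receiver
print(on_the_line, off_the_line, behind)
```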

Paper, source code, and dataset

The workshop paper describing the FPP model in more detail:

Fournier-Viger, P., Liu, T., Lin, J. C.-W. (2018). Football Pass Prediction using Player Locations. Proc. of the 5th Machine Learning and Data Mining for Sports Analytics (MLSA 2018), in conjunction with the PKDD 2018 conference, CEUR, 6 pages.

The source code of the proposed FPP model can be downloaded from my website (it includes the dataset, which was originally obtained from the workshop website): http://www.philippe-fournier-viger.com/foot2018/ The model is implemented in Java and released under the GPL 3 open source license.

Besides, a simple video presentation of the paper can be found here (HTML5 video for playback on various devices).

Conclusion

In this blog post, I discussed the problem of football pass prediction and presented the FPP (Football Pass Predictor) model, which is simple but achieves quite high accuracy.  It would certainly be possible to further improve the model.

If you have  comments, please post them in the comment section below!

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

Posted in Big data, Data Mining, Data science

The future of pattern mining

In this blog post, I will talk about the future of research on pattern mining. I will also discuss some lessons learnt from the decades of research in this field and talk about research opportunities.

patterns

What is the state of research on pattern mining?

Over the last decades, many things have been discovered in pattern mining. The field has become more mature. For example, algorithms for pattern mining generally follow the same general approaches, established more than a decade ago. The main types of algorithms in pattern mining are Apriori-based algorithms, pattern growth algorithms and vertical algorithms. The proposal of these fundamental approaches has facilitated the development of new algorithms.
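As an illustration of the first of these approaches, here is a small textbook-style sketch of the Apriori level-wise search for frequent itemsets, written in Python. It is deliberately unoptimized, the toy transactions are made up, and it is not the implementation found in SPMF.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            level[frozenset([item])] = s
    frequent = dict(level)

    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: all (k-1)-subsets of a frequent itemset must be frequent.
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1)):
                s = support(cand)
                if s >= min_support:
                    next_level[cand] = s
        frequent.update(next_level)
        level = next_level
        k += 1
    return frequent


# Made-up toy database of five transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(transactions, min_support=3)
print(result)
```

Pattern growth and vertical algorithms explore the same search space but avoid repeatedly scanning the database to count candidates, which is what makes them faster in practice.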

However, although traditional pattern mining problems such as frequent itemset mining have been well studied, novel pattern mining problems are constantly proposed, and these problems often have unique challenges that require new tailored solutions. For example, this is the case for subgraph mining, where an algorithm must be able to deal with the problem of subgraph isomorphism checking, which does not exist in traditional pattern mining problems such as itemset mining. Another example is the design of efficient algorithms for novel architectures such as cloud systems, parallel systems, GPUs, and FPGAs, which requires rethinking traditional algorithms and their data structures.

A second observation about the state of research on pattern mining is that not all research areas of pattern mining have been explored equally. For example, some topics such as frequent itemset mining and association rule mining have received a lot of attention, while other problems such as sequential rule mining and periodic pattern mining have been much less explored. In my opinion, this is not because the latter problems are less useful but perhaps because the problem of frequent itemset mining is simpler.

A third observation is that the field of pattern mining seems to have been less popular over the last decade. This is certainly true, but it is not something to worry about because there are countless research problems that have not been solved in this field. Besides, all fields of computer science follow trends that are cyclic. This is the case, for example, for research on artificial intelligence, which currently receives a lot of attention but was met with disinterest and a lack of funding opportunities during specific periods in the last decades (the “AI winters”). Besides, although pattern mining may seem to be less studied than before, some subfields of pattern mining are actually becoming more and more popular. For example, this is the case for high utility pattern mining, which has been growing steadily for the last 15 years. Here is a plot of the number of papers per year on utility mining (a figure prepared by Gan et al., 2018):

This figure clearly shows a growing interest in the topic of utility pattern mining. Besides, quality papers in the field of pattern mining are still published in top conferences and journals.

What lessons can we learn?

Several lessons can be learnt. The first one is that too much research has, in my opinion, focused on improving the performance of algorithms in the last decades, while neglecting the applications of these algorithms. Don’t get me wrong. Performance is very important, as one does not want to wait several hours to find patterns. However, considering the usefulness of the discovered patterns helps ensure that these algorithms will actually be used in real applications. If researchers thought more about the usefulness of patterns, I think that this could help grow the field of pattern mining further.

There are several pattern mining problems that have not been applied in real life. Why? A first reason is that the assumptions of some of these problems are unrealistic or too simple.

For researchers working on pattern mining, I think that potential applications should always be considered first. Working on problems that have many potential applications or are more useful should be preferred. Thus, a key lesson is to not forget the user and the applications. If possible, discussions with potential users should be carried out to learn about their needs. In general, a principle is that the more specialized a problem is, the less likely it is to be used in real life. For example, if someone proposed a very specialized problem such as “mining recent high utility episode patterns in uncertain data streams when considering a sliding window and a gap constraint”, it would certainly be less likely to be useful than the more general problem of “mining high utility episodes”.

A second reason why many algorithms are not used in real life is that many researchers do not provide their source code or applications. Sometimes, it is because the authors cannot share them due to restrictions from their institutions or collaborators. And sometimes, it is simply because researchers are worried that someone could design a better algorithm. There are also other reasons such as the lack of time to release the algorithms. But sharing the source code of algorithms could greatly help other researchers and people interested in using the algorithms. I previously wrote a detailed blog post about why researchers should share their implementations.

Research opportunities

Given the state of research on pattern mining discussed above, there are actually many research opportunities, such as:

  • Proposing faster and more memory efficient algorithms,
  • Proposing algorithms that have more features or are more user-friendly (e.g. interactive algorithms, visualization, or algorithms that let the user specify additional useful constraints),
  • Proposing new pattern mining tasks that have novel challenges,
  • Proposing new applications of existing algorithms,
  • Proposing variations of existing problems (e.g. mining patterns in big data, using parallel architectures, etc.)

I personally think that pattern mining is a good research area because it is challenging and many things can be done.

Conclusion

This is what I wanted to talk about for today. I hope you have enjoyed this blog post. If you have any other ideas or comments, please leave them in the comment section.

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

Posted in Big data, Data Mining, Data science