A few years ago, I decided to give the MLDM 2016 conference a try, as I had never attended it before. It was not a bad conference, although quite small, and the registration fee was quite expensive (about 650 euros). I submitted a paper to MLDM because at that time the proceedings were still published by Springer (update: in 2019, they were no longer published by Springer) and the timing was good.
The conference itself was not bad, but like several other attendees, I was disappointed by the MLDM conference location, which was supposed to be New York but was instead in Newark, New Jersey!
Why is this a problem? The problem is that Newark is about 45 minutes by train from New York. Moreover, the location of the MLDM conference was one of the worst among all conferences that I have attended. The Ramada Hotel was located in the middle of highways, and there was basically nowhere to walk around. To go to New York, we had to take a shuttle back to the Newark airport and then a 40-minute train to New York.
Because of the misleading information on the MLDM website about the conference being held in New York, some attendees even booked flights to JFK or LaGuardia airports, which are in New York, and then had to travel about an hour by train out of New York to reach Newark for the MLDM conference. Some of them were quite frustrated by the location.
The real location of MLDM 2019 is Newark
A few years later, one could expect that things had changed. I did not submit a paper, but I decided to check. On February 28th, I had a look at the MLDM 2019 webpage.
The deadline for submitting papers had passed. But the conference was again advertised as being in New York City, and there were even some pictures of New York on the website.
On March 14th 2019, I checked again. I clicked on the location section of the MLDM conference website, and the conference was still advertised as being held in New York City (see below), while the exact conference location was listed as *** not available ***.
On April 19th 2019, I checked again. The deadline for submitting papers had long passed. The website had been updated, and if we look carefully, it says that the MLDM conference will be held in Newark, New Jersey rather than New York. Thus, again, the conference will not be in New York! It will be held in the same Ramada Hotel as in 2016, in Newark.
It is important to note that the Newark location is mentioned in only one place on the website, while “New York City” is written everywhere else, and there are many pictures of New York. Thus, if someone does not read carefully, it is very easy to be misled into thinking that the conference is in New York. Besides, since the Newark location is only announced after the paper submission deadline, authors are already somewhat committed to attending the conference while still expecting it to be in New York.
It should be announced **before the deadline** that the conference is in Newark.
This pattern of announcing that the conference is in New York before moving it to Newark seems to have repeated itself every year since 2016. I could not find the websites of MLDM 2017 and MLDM 2018 because they are offline, but the proceedings of MLDM 2017 and MLDM 2018 claim that those editions were in New York. However, they were probably also in Newark, just like MLDM 2016.
Conclusion
Although I have not submitted papers to MLDM since 2016, I have written this blog post because I think that it should be clearly announced that the MLDM conference is in Newark rather than New York. This would avoid disappointing attendees who submit a paper expecting the conference to be held in New York.
Update 2020-1: Another piece of misleading information about MLDM 2019 was that the proceedings were supposed to be published by Springer in the LNAI series (as stated, for example, in the CFP posted on WikiCFP). But at the conference, some authors found that the proceedings were **not** published by Springer but by a small publisher called Ibai (see some authors commenting on this in the comment section below). In case it gets removed, here is a screenshot of the call for papers:
This is a video presentation of the Apriori algorithm for discovering frequent itemsets in data. Frequent itemset mining is one of the most popular data mining tasks.
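For readers who are not familiar with it, here is a minimal Python sketch of my own (not the optimized Java implementation offered in SPMF) illustrating Apriori's level-wise candidate generation and pruning:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori: return all itemsets appearing in at least minsup transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    result = set(frequent)

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets to build k-itemset candidates
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: a candidate is kept only if all its (k-1)-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= minsup}
        result |= frequent
        k += 1
    return result

# Toy database of five customer transactions, minimum support = 3
database = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset in sorted(apriori(database, minsup=3), key=len):
    print(set(itemset))
```

On this toy database, the itemsets {a}, {b}, {c}, {a, b}, {a, c} and {b, c} meet the minimum support of 3, while {a, b, c} appears in only two transactions and is pruned.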
This year, I am attending the PAKDD 2019 conference (23rd Pacific-Asia Conference on Knowledge Discovery and Data Mining) in Macau, China, from the 14th to the 17th of April 2019. In this blog post, I will provide information about the conference.
About the PAKDD conference
PAKDD is one of the most important international conferences on data mining, especially for Asia and the Pacific area. I have attended this conference several times in recent years and have written reports about the PAKDD 2014, PAKDD 2015, PAKDD 2017 and PAKDD 2018 conferences.
The proceedings of PAKDD are published in the Springer Lecture Notes in Artificial Intelligence (LNAI) series, which ensures good visibility for the papers. Until the end of May 2019, the proceedings of PAKDD 2019 can be downloaded for free.
This year, PAKDD 2019 received a record 567 submissions from 46 countries. 25 papers were rejected because they did not follow the guidelines of the conference. The other papers were each reviewed by at least 3 reviewers, and 137 papers were accepted. Thus, the acceptance rate is 24.1%.
Location
The PAKDD conference was held at The Parisian, a 5-star hotel in Macau, China. Macau is a very nice city, located in the south of China. It has nice weather, and some of its major industries are casinos and tourism. Macau was administered by Portugal before being returned to China. As a result, there is a certain Portuguese influence in Macau.
The Parisian Hotel, Macau
Day 0: Registration
On the first day, I arrived at the hotel and registered. The staff was very friendly. Below are some pictures of the registration area, the conference bag and the materials. The bag is good-looking and contains the proceedings on a USB drive, the program, as well as some delicious local food as a gift.
The PAKDD 2019 conference bag
The conference material and gifts
The PAKDD 2019 registration desk
Day 1: Tutorial on IoT Big Data Stream Mining
In the morning, I attended the IoT Big Data Stream Mining tutorial by Joao Gama, Albert Bifet, and Latifur Khan.
IoT Big Data Stream tutorial
It was first discussed that IoT is a very important topic nowadays. According to Google Trends, IoT (Internet of Things) has become more popular than “Big Data”.
IoT Applications
In traditional data mining, we often assume that we have a static dataset to train a model. A key difference between traditional data mining and analyzing IoT data is that the data may not be a static dataset but a stream of data coming from multiple devices. A data stream is a “continuous flow of data generated at high speed from a dynamic, time-changing environment”. When dealing with a stream, we need to build a model that is updated in real time, fits in a limited amount of memory, and can make anytime predictions. Various tasks can be performed on data streams, such as classification, clustering, regression and pattern mining. A key idea in stream mining is to extract summaries of the stream, because all the data of a stream cannot be stored in memory. The goal is then to provide approximate predictions based on these summaries, together with an estimation of the error. It is also possible to not look at all the data but to take some data samples, and to estimate the error based on the sample size.
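As a toy illustration of the sampling idea (my own sketch, not code from the tutorial), reservoir sampling maintains a uniform random sample of a stream using a fixed amount of memory, no matter how long the stream is:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream, using O(k) memory."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)           # fill the reservoir with the first k items
        else:
            j = random.randint(1, n)         # item n is kept with probability k/n
            if j <= k:
                reservoir[j - 1] = item      # replace a random element of the sample
    return reservoir

# Example: a 5-element summary of a stream of one million sensor readings
print(reservoir_sample(range(1_000_000), k=5))
```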
If you are interested in this topic, the slides of this tutorial can be found here.
Day 1: Welcome reception
After the workshops and tutorials, there was a welcome reception in the evening at the Galaxy Hotel, with drinks and food. It was a good opportunity to discuss with other researchers. I met several researchers that I knew, as well as several people that I did not know.
The PAKDD 2019 Welcome Reception
Day 2: Conference Opening
The second day started with the conference opening, where a traditional lion dance was first performed.
Then, the organizers talked. It was announced that there were more than 300 participants at the conference this year.
The PC chair gave information about the conference. Here are some pictures of some slides:
Then, there was a keynote about relational AI by Dr. Jennifer L. Neville. It was about the analysis of graphs or networks, such as social networks.
In the evening, no activities were planned, so I went with other researchers to eat at a restaurant in the Taipa area.
Day 3: Keynote on Talent Analytics
In the morning, there was a keynote by Prof. Hui Xiong about “Talent Analytics: Prospects and Opportunities”. The talk was about how to identify and manage talents, which is very important for companies.
A talent is an “experienced professional with deep knowledge”. This is in contrast with personnel who do simple, standardized work, have simple knowledge, and may in the future be replaced by machines. Talents are team players, and elite talents also have leadership. Leadership means having a vision of the current situation and of what will happen in the next five years, and being able to manage a team and manage risks. In terms of team management, it is important to find talents for the right positions and to manage the team well.
The presenter explained that intelligent talent management (ITM) means using data with an objective, taking decisions based on data, offering specific solutions to complex scenarios, and being able to make recommendations and predictions. Some example tasks are predicting when talents will leave, intelligent recruitment, intelligent talent development, management, organization, and risk control. Doing this well requires both big data technical knowledge and human resource management knowledge.
Then, there were paper presentations.
Day 3: Excursion and banquet
In the afternoon, there was a 4-hour city tour of the Ruins of St. Paul's, Senado Square, the A-Ma Temple and the Golden Lotus Square. Here are a few pictures.
Finally, the conference banquet was held in the evening. Several awards were announced.
Ee-Peng Lim received the Distinguished Contributions Award
Shengrui Wang et al. received the Best Application Paper Award
The Best Student Paper Award went to Heng-Yi Li et al.
The Best Paper Award went to Yinghua Zhang
And there was some music and a show during the banquet:
Day 4: Keynote Talk on Big Data Privacy
In the morning, there was a keynote talk by Josep Domingo-Ferrer about how to reconcile privacy with data analytics. He explained what big data anonymization is, the limitations of state-of-the-art techniques, how to empower subjects, users and controllers, and opportunities for research.
It was first discussed that several novels have anticipated the problem of data privacy, and that nowadays many countries have adopted laws to protect data. A few principles are proposed for handling data: (1) only collect the data that is needed and keep it only as long as necessary, (2) obtain specific and explicit consent from the user, (3) limit the collected data to a specific purpose, (4) keep the process open and transparent, (5) allow data to be erased or rectified, (6) protect data from security threats, (7) ensure accountability, and (8) build privacy into the design of the system.
But it is sometimes complicated to comply with these principles, as they can appear to conflict with the use of big data.
A solution is data anonymization. After we anonymize data, it may be easier to use the data for secondary uses. Thus a challenge is to create these anonymized big data sets.
Statistical disclosure control is a set of techniques to anonymize data. It is used to reduce the risk that data is re-identified. A goal is often to anonymize the data to reduce the risks of disclosure while preserving the usefulness of the data (utility).
On the other hand, privacy-first models ensure that the anonymized data meets some minimum requirements. One of the most famous approaches is called “k-anonymity”.
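To make the idea concrete, here is a small Python sketch of my own (not from the talk): a table satisfies k-anonymity if every combination of quasi-identifier values (such as a generalized ZIP code and age range) is shared by at least k records:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in >= k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Example: ZIP code and age have been generalized so that individuals blend into groups
table = [
    {"zip": "755**", "age": "20-30", "disease": "flu"},
    {"zip": "755**", "age": "20-30", "disease": "cold"},
    {"zip": "518**", "age": "30-40", "disease": "flu"},
    {"zip": "518**", "age": "30-40", "disease": "asthma"},
]
print(is_k_anonymous(table, ["zip", "age"], k=2))  # True: each group has 2 records
```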
Other approaches are “differential privacy” techniques.
One challenge related to privacy for big data is to ensure privacy in dynamic data (data streams). For big data, there are also methods that anonymize data locally (e.g., by adding noise or generalization) before sending it to the controller.
Some limitations of state-of-the-art techniques are as follows:
There was then some discussion of proposals for privacy-preserving big data analytics. I will not report all the details. Here are the conclusions of the talk:
Day 4: Afternoon
In the afternoon, there was the PAKDD Most Influential Paper Award presentation on the Extreme Support Vector Machine by Prof. Qing He, as well as the PAKDD 2019 Challenge Award presentation.
Conclusion
Overall, this was an excellent conference. It was well organized, I met many researchers, and I listened to several interesting talks. I am looking forward to PAKDD 2020 next year in Singapore.
Update: I have also written reports following this conference about PAKDD 2020 and PAKDD 2024.
This week, I attended the 7th China Information Technology Expo (CITE 2019), which was held at the Shenzhen Convention and Exhibition Center in the city of Shenzhen, China, from the 9th to the 11th of April 2019. In this blog post, I will give a brief overview of this fair, where various companies were showing their new products and services.
The event is organized as a fair where companies have booths, separated by themes: (1) smart home, smart city and smart terminals, (2) new displays, (3) intelligent manufacturing and 3D printing, (4) robots and intelligent systems, (5) artificial intelligence and intelligent hardware, (6) IoT, blockchain and cybersecurity, (7) automotive electronics, batteries and new energy, and (8) basic electronics, components, equipment and materials.
There were numerous Chinese companies as well as some international companies, and it was quite interesting to see the various products on display. The CITE 2019 fair is reasonably big, but not as big as some other technology fairs in China such as the BIG DATA Expo.
Below, I show some selected pictures from the CITE 2019 fair:
Robot for cleaning windows
There were many specialized machines
Curved displays
A robot fish was swimming
Robots for assembly lines
Another assembly line robot
More displays, including on transparent glass
LED displays
More machines
Multiplayer virtual reality games
3D printers were also on display
There were many types of robots for kids and the home
Realistic-looking robots that can move
8K displays
Flexible displays
Some of the booths at CITE 2019
Conclusion
This was just a short blog post to give a glimpse of this event. I think it is quite interesting to attend such events to see what is happening in the industry. I hope you have enjoyed reading this blog post about CITE 2019. If you want to get notified about the next blog posts, you can follow me on Twitter at @philfv.
Today, I will list a few useful mailing lists related to data mining and big data. Subscribing to these mailing lists is useful for PhD students and researchers, as many jobs, conferences, special issues and other opportunities are advertised there. They are also a good place to post your own announcements, such as job openings and calls for papers.
Five years ago, I analyzed the source code of the SPMF data mining software using an open-source tool called CodeAnalyzer (http://sourceforge.net/projects/codeanalyze-gpl/). This provided some interesting insights about the structure of the project, especially in terms of lines of code and the code-to-comment ratio. In 2013, for SPMF 0.93, the results were as follows:
| Metric | Value |
| --- | --- |
| Total Files | 280 |
| Total Lines | 53165 |
| Avg Line Length | 32 |
| Code Lines | 25455 |
| Comment Lines | 23208 |
| Whitespace Lines | 5803 |
| Code/(Comment+Whitespace) Ratio | 0.88 |
| Code/Comment Ratio | 1.10 |
| Code/Whitespace Ratio | 4.39 |
| Code/Total Lines Ratio | 0.48 |
| Code Lines Per File | 90 |
| Comment Lines Per File | 82 |
| Whitespace Lines Per File | 20 |
Today, in 2018, I decided to analyze the code of SPMF again to get an overview of how it has evolved over the last few years. Here are the results for the current version of SPMF (2.35):
| Metric | Value |
| --- | --- |
| Total Files | 1385 |
| Total Lines | 238938 |
| Avg Line Length | 32 |
| Code Lines | 118117 |
| Comment Lines | 91241 |
| Whitespace Lines | 32797 |
| Code/(Comment+Whitespace) Ratio | 0.95 |
| Code/Comment Ratio | 1.29 |
| Code/Whitespace Ratio | 3.60 |
| Code/Total Lines Ratio | 0.49 |
| Code Lines Per File | 85 |
| Comment Lines Per File | 65 |
| Whitespace Lines Per File | 23 |
Many numbers remain more or less the same. But it is quite amazing to see that the number of lines of code has increased from 25,455 to 118,117. The project is thus about four times larger now. This is in part due to contributions from many people in recent years, while at the beginning the software was mainly developed by me. The total number of lines may still not seem very big for a software project. However, most of the code is quite optimized and implements complex algorithms, so many of these lines took quite a lot of time to write.
The number of comment lines has also increased, from 23,208 to 91,241. But the ratio of code to comment lines has slightly increased, so perhaps some more comments should be added.
What is next for SPMF? Currently, I am preparing to release a new version of SPMF, which will include about 10 new algorithms. It should be released in about 1 or 2 weeks, as I need to finish other things first.
That is all for today! If you have comments or questions, please post them in the comment section below.
— Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.
In this blog post, I talk about how to improve the quality of your research papers. This is an important topic, as most researchers aim to publish papers in top-level conferences and journals for various reasons, such as graduating, obtaining a promotion or securing funding.
Write fewer papers. Focus on quality instead of quantity. Take more time for all steps of the research process: collecting data, developing a solution, running experiments, and writing the paper.
Work on a hot topic or a new research problem that can have an impact. To publish in top conferences and journals, it helps to work on a popular or recent research problem. Your literature review should be up to date with recent and relevant references. If all your references are more than 5 years old, the reviewers may think that the problem is old and unimportant. Choosing a good research topic also means working on something that is useful and can have an impact. Thus, take the time to choose a good research problem before starting your work.
Improve your writing skills. For top conferences and journals, papers must be well written. Often, this can make the difference between a paper being accepted and rejected. Hence, spend more time polishing your paper. Read your paper several times to make sure that there are no obvious errors. You may also ask someone else to proofread your paper. And you may want to spend more time reading and practicing your English.
Apply your research to real data or collaborate with industry. In some fields like computer science, it is possible to publish a paper without a real application. But if you put extra effort into showing a real application and obtaining data from industry, it may make your paper more convincing.
Collaborate with excellent researchers. Try to work with researchers who frequently publish in top conferences and journals. They will often find flaws in your project and paper that could be avoided and give you feedback to improve your research. Moreover, they may help improve your writing style. Thus, choose a good research team and establish relationships with good researchers and invite them to collaborate.
Submit to the top conferences and journals. Many people do not submit to the top conferences and journals because they are afraid that their papers will be rejected. However, even if it is rejected, you will still usually get valuable feedback from experts that can help to improve your research, and if you are lucky, your paper may be accepted. A good strategy is to first submit to the top journals and conferences and then if it does not work, to submit to lower level conferences and journals.
Read and analyze the structure of top papers in your field. Try to find some well-written papers in your field and then try to replicate the structure (how the content is organized) in your paper. This will help to improve the structure of your paper. The structure of the paper is very important. A paper should be organized in a logical way.
Make sure your research problem is challenging and the solution is well justified. As I said, it is important to choose a good research problem. But it is also important to provide an innovative solution that is not trivial. In other words, you must solve an important and difficult problem where the solution is not obvious. You must also write the paper well to explain this to the reader. If the reviewers think that the solution is obvious or not well justified, the paper may be rejected.
Write with a target conference or journal in mind. It is generally better to know where you will submit the paper before you write it. Then, you can better tailor the paper to your audience. You should also select a conference or journal that is appropriate for your research topic.
Don’t procrastinate. For conference papers, write your paper well in advance so that you have enough time to write a good paper.
That is my advice. If you have other advice or comments, please share them in the comment section below. I will be happy to read them.
— Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.
This is a video presentation of the paper “Mining Partially-Ordered Sequential Rules Common to Multiple Sequences” about discovering sequential rules in sequences using the RuleGrowth algorithm.
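As a rough illustration of the semantics of such rules (a naive sketch of my own, not the RuleGrowth algorithm, which uses a much more efficient pattern-growth search), a partially-ordered rule X => Y occurs in a sequence if all items of X appear before all items of Y, regardless of the order within X and within Y; its support and confidence can then be computed as follows:

```python
def rule_occurs(sequence, X, Y):
    """True if all items of X appear in the sequence before all items of Y."""
    remaining_x = set(X)
    for i, itemset in enumerate(sequence):
        remaining_x -= itemset
        if not remaining_x:                      # X is fully seen at position i
            remaining_y = set(Y)
            for later in sequence[i + 1:]:       # Y must appear strictly afterwards
                remaining_y -= later
                if not remaining_y:
                    return True
            return False
    return False

def support_and_confidence(sequences, X, Y):
    """Support = fraction of sequences where the rule occurs;
    confidence = that count divided by the number of sequences containing X."""
    contains_x = sum(1 for s in sequences if set(X) <= set().union(*s))
    occurs = sum(1 for s in sequences if rule_occurs(s, X, Y))
    support = occurs / len(sequences)
    confidence = occurs / contains_x if contains_x else 0.0
    return support, confidence

# Three sequences of itemsets; the rule {a, b} => {d} occurs in the first two
db = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"d"}],
    [{"b"}, {"a"}],
]
print(support_and_confidence(db, X={"a", "b"}, Y={"d"}))  # (0.666..., 0.666...)
```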
This week, I attended the 2018 International Workshop on Mining of Massive Data and IoT (2018 年大数据与物联网挖掘国际研讨会), organized by Fujian Normal University in the city of Fuzhou, China, from the 18th to the 20th of December 2018.
I attended the workshop to give a talk, meet other researchers, and listen to their talks. There were several invited experts from Canada, as well as from China. Below, I provide a brief report about the workshop, which was held at the Ramada Hotel in Fuzhou.
Talks
There were 11 long talks, given by the invited experts. The opening ceremony was chaired by Prof. Shengrui Wang and featured the dean, Prof. Gongde Guo.
Prof. Jian-Yun Nie from the University of Montreal (Canada) talked about information retrieval from big data. Information retrieval is about how to search for documents using queries (e.g., when we use a search engine). In traditional information retrieval, documents and queries are represented as vectors, and the relevance of documents is estimated by a similarity function. Prof. Nie talked about using deep learning to learn representations of content and matching for information retrieval.
Prof. Sylvain Giroux from the University of Sherbrooke (Canada) gave a talk about transforming homes into smart homes that provide cognitive assistance to cognitively impaired people. He presented several projects, including a system called COOK that is designed to help people cook using a modified oven equipped with sensors and communication abilities. He also showed another project using the HoloLens to build a 3D map of all objects in a house and tag them with semantics (an ontology).
Prof. Guangxia Xu from Chongqing University of Posts and Telecommunications gave a talk about data security and privacy in intelligent environments.
Prof. Philippe Fournier-Viger (me) then gave a talk about high-utility pattern mining. It consists of discovering important patterns in symbolic data (for example, identifying the sets of items purchased by customers that yield a lot of money). I also presented the SPMF software that I founded, which offers more than 150 data mining algorithms.
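To give a rough idea of what “utility” means here (a toy sketch of my own, not SPMF's optimized Java implementations), the utility of an itemset in a transaction is the sum of quantity times unit profit of its items, and a high-utility itemset is one whose total utility in the database reaches a user-defined minimum threshold:

```python
def itemset_utility(itemset, database, unit_profits):
    """Sum, over the transactions containing the whole itemset,
    of quantity * unit profit for each of its items."""
    total = 0
    for transaction in database:                 # transaction maps item -> quantity bought
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profits[item] for item in itemset)
    return total

# Example: unit profits and three customer transactions
profits = {"bread": 1, "cheese": 5, "wine": 12}
db = [
    {"bread": 2, "cheese": 1},
    {"bread": 1, "cheese": 2, "wine": 1},
    {"wine": 2},
]
min_utility = 20
u = itemset_utility({"cheese", "wine"}, db, profits)
print(u, u >= min_utility)  # 22 True -> {cheese, wine} is a high-utility itemset
```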
Then, there was a talk by Dr. Shu Wu about using deep learning in context recommender systems. That talk was followed by a very interesting talk by Prof. Djemel Ziou of University of Sherbrooke (Canada) about his various projects related to image processing, object recognition, and virtual reality. In particular, Prof. Ziou talked about a project to evaluate the color of light pollution from pictures.
Then, another interesting talk was by Dr. Yue Jiang from Fujian Normal University. She presented two measures called K2 and K2* to calculate sequence similarity in the context of bioinformatics. The designed approach is alignment-free and can be computed very efficiently.
On the second day, there were more talks. A talk by Prof. Hui Wang from Fujian Normal University was about detecting fraud in the food industry. This is a complex topic, which requires advanced techniques such as mass spectrometry. It was explained that some products such as olive oil are often not authentic, with up to 20% of olive oil looking suspicious. Traditionally, food tests were performed in a lab, but nowadays handheld devices using infrared light have been developed to quickly perform food tests anywhere.
Then, there was a talk by Prof. Hui-Huang Tsu about elderly home care and sensor data analytics. He highlighted privacy issues related to the use of sensors in smart homes.
There was a talk by Prof. Wing W.Y. Ng about image retrieval and a talk by Prof. Shengrui Wang about regime switch analysis in time series.
Conclusion
This was an interesting event. I had the opportunity to talk with several other researchers with common interests. The event was well-organized.
— Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.
This is a video presentation of the paper “Mining Correlated High-Utility Itemsets Using the bond Measure” about correlated high utility pattern mining using FCHM.
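As a quick illustration of the bond measure (a toy sketch of my own, not the FCHM implementation available in SPMF), bond(X) is the number of transactions containing all the items of X divided by the number of transactions containing at least one item of X; correlated high-utility itemset mining keeps only high-utility itemsets whose bond reaches a user-defined threshold:

```python
def bond(itemset, database):
    """bond(X) = (transactions containing all of X) / (transactions containing any of X)."""
    conjunctive = sum(1 for t in database if set(itemset) <= set(t))
    disjunctive = sum(1 for t in database if set(itemset) & set(t))
    return conjunctive / disjunctive if disjunctive else 0.0

# Example: 'a' and 'b' appear together in 2 of the 3 transactions containing either of them
db = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]
print(bond({"a", "b"}, db))  # 2 / 3 = 0.666...
```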