China International Big Data Industry Expo 2019 (a brief report)

This week, I am attending the 2019 China International Big Data Industry Expo (CIBD 2019), held in Guiyang, China, from May 26 to 29. I will report on the event on this blog.

Why is this event important?

The China International Big Data Industry Expo is a huge event, and the biggest big data event in China. This year, 448 companies are participating, including over 150 foreign companies such as SAS and Microsoft, and major Chinese companies such as Tencent and Huawei. The exhibition space covers more than 60,000 square meters, and more than 1,700 foreign visitors from 38 countries are attending. In previous years, many leaders of the Chinese industry, such as Pony Ma and Jack Ma, have also given talks at this expo.

Why am I attending?

It is an excellent event for connecting with the industry, seeing the trends and recent innovations related to big data, and learning about new government policies. I attended CIBD 2018 last year (my report about CIBD 2018 is here), and I thought it was a great event. This year, I am attending as a VIP guest.

Why is it held in the city of Guiyang?

I will explain briefly. Guiyang is located in the province of Guizhou in China. Historically, Guizhou is not one of the richest provinces, in part due to its location somewhat far from the coast. However, key features of the region are its large water and electricity supply, its cool weather, and its location in a geologically stable area. All these factors are highly desirable for setting up large data centers for storing big data. For this reason, it has been selected as a key city for the development of the big data industry in China. Huge government incentives are in place to transform Guiyang into the Chinese city of big data. As a result, it has grown very quickly in recent years. Numerous large international and Chinese companies, such as Apple, Tencent, and Alibaba, have data centers in Guiyang. It is said that more than 1,600 big data companies now operate in Guiyang, generating a yearly revenue of more than 15 billion USD. The GDP of the city is also growing very quickly (it increased by 10 percent last year!). It is thus a very interesting place for everything related to big data. The Big Data Expo is held every year in Guiyang around the end of May.

Location of Guiyang in China

Theme: Data creates values

This year, the expo has a special theme on the applications of big data. Besides the exhibition, 49 forums and several talks, conferences, and other activities are held. Some of the topics to be discussed are big data, AI, self-driving cars, security, data science, 5G, intelligent manufacturing, blockchains, and smart cities.

Some announcements are also expected about new policies in Guiyang to attract talent, and about the growth of Shubo Avenue, a new district in Guiyang for big data companies and projects that is receiving major investment.

2019 China International Big Data Fusion and AI Global Competition

On the afternoon of May 25th, I attended this competition, which was held at the Empark Hotel and sponsored by Intel. The format of the competition is quite interesting, with a panel of 9 judges evaluating the competitors, followed by an award ceremony. The judges included Prof. Jian Pei, King Wang (Tencent Cloud), and others. Each competing team had 8 minutes to present its project and answer questions from the judges. The event was very well organized, offering simultaneous translation from Chinese to English, which made it accessible to non-Chinese speakers.

The first team was from Israel, a company called Keepod. They mentioned that 4 billion people do not have access to personal computing (excluding mobile devices), and that the solution is not to buy a computer for each of them. Instead, they propose to distribute to each student an encrypted USB drive that contains data and applications, so that many people can share the same computers by simply plugging their USB drive into a computer to work and taking it with them when they leave. The project is used in Cameroon and other countries.

The second team was the startup iSpace, which relies on AI. They develop an advanced recovery control and fault analysis system for rockets. The system appeared interesting, but the presenter spoke very fast and sometimes changed slides very quickly. In my opinion, they should have made the presentation more succinct rather than trying to show too much in a short time. But the technology looks great.

The third team was a company from Beijing working on AR (augmented reality). They mentioned that the resolution of AR glasses is important. They have developed prototypes of advanced AR glasses, which can have various applications, including military ones. They focus on the hardware and optics.

The next company was Braid (不来赛德) from Shenzhen, which relies on AI for industrial projects. They use knowledge and concept graphs, deep learning, and other technologies. Some of their projects are related to analyzing transaction data from stock markets.

The next company was TrueMicro. They work on low-power chips. One of their products is a computing stick called Movidius. One application is traffic monitoring. They also develop chips based on the RISC-V architecture for AI. They supply chips for some Huawei servers and 5G base stations. They also provide ASICs and FPGAs.

The next team was Pzartech from Israel. Its goal is to provide solutions to reduce the downtime of complex mechanical systems such as airplane engines. Indeed, if an airplane has a problem, it cannot fly until the problem is fixed, and money is lost. The proposed solution uses image processing, deep learning, and semi-synthetic data generation. A technician repairing an engine takes a picture of a part with his cellphone to find information about it, such as its name, which greatly helps to fix the system more quickly. It is basically an object recognition problem supported by the cloud.

The next company, named CranCloud, works on IoT with 5G technology. It works on base stations, providing integrated solutions rather than only chips.

The next company is related to AI for smart security checks. They mostly offer solutions for the AI-based analysis of pictures or videos, such as analyzing pictures from security checks at airports, or pictures of parcels sent through the mail. They use labelled data from train or subway security checks. They aim to detect forbidden objects such as lithium batteries.

The next company is about computer vision with AI. They mentioned that there are many applications of computer vision and discussed some of them, such as intelligent security checks and the intelligent kitchen. They propose an algorithm platform named Extreme Vision for vision recognition, which offers more than 500 algorithms. Some applications are fire detection or detecting that construction workers are not wearing helmets. One of the judges mentioned that there are already many AI vision companies. The presenter explained that they provide a platform to facilitate the development of AI vision solutions.

The last company is a Shanghai-based company, also working on deep learning technology for image processing and related topics, which collaborates with Huawei, Xiaomi, Toyota, and Apple. They have a transportation big data platform and analyze data from vehicles to improve self-driving cars, among other projects. They also have technology to analyze industrial parts. Their business model is to sell licenses for their software.

The judges then provided some general comments. One of the comments was that many teams were focusing on computer vision with AI, that solutions for this type of problem have become quite mature, and that it is perhaps important to focus on specific applications, such as security checks, for this type of project. Moreover, a judge was also happy to see more fundamental research, such as work on chips. There were also several other comments.

The awards were then presented. Keepod, Braid, and Pzartech received “access” awards. Three companies received an “innovation award”, including the Extreme Vision platform company from Shenzhen. Finally, the top three winners were announced: the third prize went to TrueMicro, the second prize to CranCloud, and the first prize to iSpace. I perhaps missed a few details about the awards, so this may not be totally accurate.

Opening ceremony

The opening ceremony was held on May 26th at the Guiyang International Eco Conference Center.

Several leaders of the Chinese government were present, such as:

  • Wang Chen, Member of the Political Bureau of the CPC Central Committee and Vice Chairman of the Standing Committee of the National People’s Congress
  • Miao Wei, Minister of Industry and Information Technology
  • Guo Zhenhua, Deputy Secretary-General of the Standing Committee of the National People’s Congress
  • Yang Xiaowei, Deputy Director of the National Internet Information Office
  • Rongfa, Vice-Director of the State Administration of Taxation
  • Xian Zude, of the State Statistics Bureau
  • Wang Mingyu, Vice-Governor of Liaoning Province

as well as representatives and CEOs from many companies.

A letter from Chinese President Xi Jinping supporting the expo was read,

as well as a letter from the Secretary-General of the United Nations:

It was mentioned during the ceremony that some goals are to support big data companies and the recruitment of talent, to explore how big data can support the industry, to ensure the security of data, to build core technology, and to design regulations about how data is handled.

Paul M. Romer, winner of the 2018 Nobel Prize in Economics, gave a talk. He talked about the concept of cyber sovereignty, that is, that each country should be able to regulate the Internet. He mentioned that in some countries like the USA, what is good for firms often takes precedence over what is good for society. The most common business model is targeted advertising, and users often don’t know about the data they are giving away. He also talked about other things, such as applying big data to road networks to improve people’s lives.

There was then a talk by Whitfield Diffie, the famous cryptography specialist and Turing Award winner. He first made a comparison with 5,000 years ago, noting that we are now moving our culture to the cloud. He mentioned that computers were designed for big data, as the properties of big data, such as variety, have always been there. For big data, we need computers to store and process data. He defined artificial intelligence as using computers to do things that people used to do, such as playing chess and Go, translation, and autonomous driving. For him, the most important aspect of AI is to leverage huge amounts of data to think about things that people cannot think about. He also talked about cyber-security. Information should not be corrupted (integrity), and we need to know its source (authenticity). Confidentiality of data is important and depends on authenticity (consider, for example, phishing websites). Big data can be used to reduce security, but AI can provide new techniques for controlling computers that may improve security. Big data security depends on the control of the input data, the mining process, and the results. Big data will be everywhere in our society, and its security is crucial.

He showed a quote from the Chinese president:

There was then a talk by Prof. Gao Wen from Peking University. He said that a fourth industrial revolution may happen within 10 years, in which artificial intelligence would play the key role. He talked about weak vs. strong AI and said that the technology we have today is weak AI. He also mentioned that, in terms of trends, AI has evolved from coding in the 1970s, to expert systems with rules in the 1980s, to deep neural networks trained using big data today. He mentioned that having data is key to doing big data research, so working with companies is good for academia. He thinks that the computers we use today will not be able to achieve strong AI; for example, the brain consumes much less energy than a computer, so we should aim to reach that efficiency. An advantage of China is its large amount of data. Open-source platforms are great for advancing technology, and China would benefit from developing its own open-source platforms and from having more AI specialists.

The Industry Expo

The Interactive Art Exhibition

There was also an interactive art exhibition where people could interact with art using technology. Here are a few pictures.

The Big Data Concert

In the evening, I was invited to a great Big Data concert by the Symphony Orchestra of Guiyang.

The Belt and Road Forum

I attended the “Belt and Road” Big Data Innovation Entrepreneurship Forum. There were several guests.

From the industry, Johannes Vizethum from Advisory Allies gave a talk about the applications of AI with big data, and about how AI can benefit the industry. He discussed several use cases, including an augmented reality system to help repair cars; this system can recognize parts of a car using image processing. Another system can evaluate the fatigue level or productivity of workers from video cameras. Another use case is intelligent tools for construction sites. Here are a few interesting slides related to AI:

The next speaker was Michael Eagleton, and his presentation was called “Together we build prosperity”. He now lives in Shenzhen and is involved in several businesses, including his own Shenzhen Xinshunao Co. Ltd. He first talked about what big data is, and how the Internet is important in our daily lives. He showed statistics indicating that 97.2% of organizations are investing in AI and big data, although it is not clear whether these statistics are from China, the USA, or elsewhere. He indicated that, according to Wikibon, the big data and analytics market is worth 49 billion USD. Moreover, he showed statistics from Statista indicating that the big data market is expected to grow by 20%. He also cited Forbes/IBM, which says that data science and analytics jobs will reach 2.7 million worldwide by 2020, that there is a gap between supply and demand, and that more talent is needed on the job market. According to Domo, every person will generate 1.7 MB of data per second in 2020. He mentioned that automated analytics will be crucial in the future, and that all countries should collaborate to build the future.

The next speaker was Philip Beck, and his talk was “Expand your business with Big Data”. He is an “angel investor” who has worked a lot on marketing and who has been living in China for more than a decade. He mentioned that, for a business, some customers are more valuable (e.g., they spend more money) than others, and that with big data, we can understand what characterizes the most valuable customers. He mentioned how mobile payment systems like Alipay and WeChat Pay are widely used in China, and that the data collected from these systems can be used for marketing (e.g., sending targeted advertisements to specific customers).

The next speaker was David Kovacs, director of CDSI Startup Campus Global. He first mentioned the strong relations between China and Europe. He also mentioned that the regulations and technology in China are suitable for innovation, and that there is a lot of support from the government.

Then, there was a talk by Marian Danko, founder and CEO of weHustle and TECOM Conf Startup Grind. The latter provides a platform connecting entrepreneurs so they can share knowledge and experience and support each other. They also organize events for entrepreneurs to meet each other.

Banquet

About the rapid growth of Guiyang

Here are some interesting facts about the growth of Guiyang (pictures from the Guiyang Today newspaper).

It is also interesting to see the goals for the growth of the city over the next 15 years, which were announced during the fifth plenary session of the tenth Guiyang Municipal Committee of the CPC:

This is an ongoing event. I will keep updating this blog post.

Conclusion

In this blog post, I have talked about the CIBD 2019 expo. I hope you have enjoyed it! If you have comments, please share them in the comment section below. I will be happy to read them.


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

The review speed of academic journals

Today, I will talk about the review speed of academic journals. Review speed is an important criterion for selecting a journal to submit a paper to when researchers face time constraints. For example, it is common for students to need to publish a paper quickly to graduate, or for a professor to want to get research published quickly to meet the requirements of a performance evaluation, or simply to be the first to publish some new ideas.

Before discussing this topic in more detail, it is important to know that many aspects should be considered when selecting a journal for submitting a paper, such as:

  • the reputation of the journal in your field (a good journal will give more visibility to your work),
  • metrics (e.g., the impact factor, whether the journal is indexed by major publication databases such as EI and SCI, and its ranking),
  • whether the topic of the journal is appropriate for your paper and whether the journal has previously published papers related to your topic,
  • the review speed,
  • the cost of publishing a paper in the journal (is it free? or is there a fee?),
  • whether the requirements of the journal are suitable for your paper (in terms of article format, maximum number of pages, etc.).

Now, let’s talk in more detail about review speed. The time required to process a paper can vary greatly from one journal to another, and also across fields. For some journals, the turnaround time is very quick, and an author may get a first decision in just a few weeks, while for other journals, it may take several months or even a year.

Personally, I have published more than 70 journal papers, and in some cases I have waited up to two years:

This is quite long for a paper. But it can be worse. In an extreme case, some people had their paper published after 10 years:

What influences the review time?

There are several factors that can influence the review time such as:

  • whether the editor can quickly find reviewers to evaluate the paper,
  • whether the reviewers disagree (in that case, more reviews may be needed),
  • whether reviewers are late or do not submit their reviews at all (in that case, other reviewers may have to be found),
  • how much time the journal gives to reviewers (some journals may give a few months),
  • whether multiple rounds of reviews are needed.

How to check the review time of a journal?

Given that review time is important, how can we know the average review time of a journal? There are several ways. First, one may contact the editor to ask about it, or discuss with colleagues or other people who have previously published in that journal. Second, some journals publish their average review times on their websites, which is very useful information. For example, many Elsevier journals indicate their average review time. As an example, I show below the processing times for four journals related to applied artificial intelligence, namely Knowledge-Based Systems, Information Sciences, Engineering Applications of Artificial Intelligence, and Advanced Engineering Informatics.

Review speed of the KBS (Knowledge-Based Systems) journal
https://journalinsights.elsevier.com/journals/0950-7051/review_speed

Review speed of the IS (Information Sciences) journal
https://journalinsights.elsevier.com/journals/0020-0255/oapt

Review speed of the EAAI (Engineering Applications of Artificial Intelligence) journal
https://journalinsights.elsevier.com/journals/0020-0255/oapt

Review speed of the AEI (Advanced Engineering Informatics) journal
https://journalinsights.elsevier.com/journals/1474-0346/oapt

As can be seen above, review times vary from one journal to another. Besides, it should be noted that these are averages. It is quite possible to receive reviews more quickly if everything goes well, or more slowly if there are problems with the review process. In the above charts, it can be seen that Knowledge-Based Systems is slower than Advanced Engineering Informatics. This is, in my opinion, understandable, as KBS is perhaps a more famous journal and may receive more papers. Generally, less famous journals may have faster processing times, but this is not always true.

I hope that this blog post has been interesting. If you have comments, questions, or would like to add something to this discussion, please post a message in the comment section below!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

(video) Mining Frequent Itemsets with the Apriori algorithm

This is a video presentation of the Apriori algorithm for discovering frequent itemsets in data.

The Java source code of the Apriori algorithm and datasets for evaluating its performance are available in the SPMF software.
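
For readers who prefer code to video, here is a minimal, illustrative Java sketch of the level-wise Apriori idea: count single items, then repeatedly join frequent itemsets into larger candidates and keep those that meet the minimum support. This is a toy example written for this post, not the optimized implementation offered in SPMF; the class name and structure below are my own.

```java
import java.util.*;

// Toy, illustrative sketch of the level-wise Apriori idea (not the SPMF code):
// finds all itemsets appearing in at least minSupCount transactions.
public class AprioriSketch {

    public static Map<Set<String>, Integer> apriori(List<Set<String>> transactions, int minSupCount) {
        Map<Set<String>, Integer> frequent = new LinkedHashMap<>();

        // Level 1: count single items and keep the frequent ones
        Map<Set<String>, Integer> current = new HashMap<>();
        for (Set<String> t : transactions)
            for (String item : t)
                current.merge(new TreeSet<>(Set.of(item)), 1, Integer::sum);
        current.values().removeIf(count -> count < minSupCount);

        // Level k+1: join frequent k-itemsets, count the candidates, keep the frequent ones
        while (!current.isEmpty()) {
            frequent.putAll(current);
            List<Set<String>> keys = new ArrayList<>(current.keySet());
            Map<Set<String>, Integer> next = new HashMap<>();
            Set<Set<String>> tested = new HashSet<>();
            for (int i = 0; i < keys.size(); i++) {
                for (int j = i + 1; j < keys.size(); j++) {
                    Set<String> candidate = new TreeSet<>(keys.get(i));
                    candidate.addAll(keys.get(j));
                    // The union must have exactly one more item, and be tested only once
                    if (candidate.size() != keys.get(i).size() + 1 || !tested.add(candidate)) continue;
                    // Count support by scanning the database (subset-based pruning omitted for brevity)
                    int count = 0;
                    for (Set<String> t : transactions)
                        if (t.containsAll(candidate)) count++;
                    if (count >= minSupCount) next.put(candidate, count);
                }
            }
            current = next;
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("a", "b", "c"),
            Set.of("a", "b"),
            Set.of("a", "c"),
            Set.of("b", "c"));
        // Print all itemsets appearing in at least 2 transactions, with their support counts
        apriori(db, 2).forEach((itemset, count) ->
            System.out.println(itemset + " support=" + count));
    }
}
```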

If you want to know more about itemset mining, you can read my survey of itemset mining, which provides a good introduction to the topic.

That is all for today. More data mining videos will be posted soon!

==
Philippe Fournier-Viger is a professor, data mining researcher and the founder of the SPMF data mining software, which includes more than 150 algorithms for pattern mining.

MLDM 2019… still not in New York!

A few years ago, I decided to give the MLDM 2016 conference a try, as I had never attended it. It was not a bad conference, although it was quite small and the registration was quite expensive (about 650 euros). I submitted a paper to MLDM because its proceedings are published by Springer and the timing was good.

The conference itself was not bad but, like several other attendees, I was disappointed by the MLDM conference location, which was supposed to be New York, but was instead Newark, New Jersey!

Why is this a problem? The problem is that Newark is about 45 minutes by train from New York. Moreover, the location of the MLDM conference was one of the worst among all the conferences I have attended. The Ramada Hotel was located in the middle of highways, and there was basically nowhere to walk around. To go to New York, we had to take a shuttle back to Newark airport and then a 40-minute train to New York.

Because of the misleading information on the MLDM website about the conference being held in New York, some attendees even booked flights to JFK or LaGuardia airport, which are in New York, and then had to travel about an hour by train out of New York to reach Newark for the MLDM conference. Some of those people were quite frustrated by the location.

The real location of MLDM 2019 is in Newark

A few years later, one could expect that things would have changed. I did not submit a paper, but I decided to check. On February 28th, I had a look at the webpage of MLDM 2019.

The deadline for submitting papers had passed, but the conference was again advertised as being held in New York City. There were even some pictures of New York on the website.

On March 14th, 2019, I checked again. I clicked on the location section of the MLDM conference website, and it was still advertised as being held in New York City (see below), while the exact conference location was marked as *** not available ***.

On April 19th, 2019, I checked again. The deadline for submitting papers had long passed. The website had been updated and, if we look carefully, it says that the MLDM conference will be held in Newark, New Jersey rather than New York. Thus, once again, the conference will not be in New York! It will be held in the same Ramada Hotel as in 2016, in Newark.

It is important to note that the Newark location is mentioned in only one place on the website, while “New York City” is written everywhere else, and there are many pictures of New York. Thus, if someone does not read carefully, it is very easy to be misled into thinking that the conference is in New York. Besides, since the Newark location was announced after the paper submission deadline, authors who expected the conference to still be in New York were already somewhat committed to attending.

In my opinion, it should be announced **before the deadline** that the conference is in Newark.

This pattern of announcing that the conference is in New York and then holding it in Newark seems to have repeated itself every year since 2016. I could not find the websites of MLDM 2017 and MLDM 2018 because they are offline, but the proceedings of MLDM 2017 and MLDM 2018 claim that those editions were in New York. They were probably also in Newark.

Conclusion

Although I have not submitted papers to MLDM since 2016, I have written this blog post because I think it should be clearly announced that the MLDM conference is located in Newark rather than New York. This would avoid disappointing attendees who submitted a paper expecting the conference to be held in New York.

How to write a research paper?

How can you write good research papers and publish them in excellent conferences? I have written a series of blog posts to answer these questions and more. To make this information easy to find, here is the list of those blog posts:

Hope that the information is useful. 🙂 If you have any comments, please share them below!

—-
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

The PAKDD 2019 conference (a brief report)

This year, I am attending the PAKDD 2019 conference (23rd Pacific Asia Conference on Knowledge Discovery and Data Mining), in Macau, China, from the 14th to the 17th April 2019. In this blog post, I will provide information about the conference.

About the PAKDD conference

PAKDD is one of the most important international conferences on data mining, especially for Asia and the Pacific area. I have attended this conference several times in recent years, and I have written reports about the PAKDD 2014, PAKDD 2015, PAKDD 2017, and PAKDD 2018 conferences.

The proceedings of PAKDD are published in the Springer Lecture Notes in Artificial Intelligence (LNAI) series, which ensures good visibility for the papers. Until the end of May 2019, the proceedings of PAKDD 2019 can be downloaded for free.

This year, PAKDD 2019 received a record 567 submissions from 46 countries. 25 papers were rejected because they did not follow the guidelines of the conference. The remaining papers were each reviewed by at least 3 reviewers, and 137 papers were accepted. Thus, the acceptance rate is 24.1%.

Location

The conference was held at The Parisian, a 5-star hotel in Macau, China. Macau is a very nice city located in the south of China. It has nice weather, and some of its major industries are casinos and tourism. Macau was once occupied by Portugal before being returned to China. As a result, there is a certain Portuguese influence in Macau.

The Parisian Hotel, Macau

Day 0: Registration

On the first day, I arrived at the hotel and registered. The staff was very friendly. Below are some pictures of the registration area, the conference bags, and the materials. The bag is good-looking and contains the proceedings on a USB drive, the program, and some delicious local food as a gift.

The PAKDD 2019 conference bag
The conference material and gift!
The PAKDD 2019 Registration Desk

Day 1 : Tutorial: IoT BigData Stream Mining

In the morning, I attended the IoT Big Data Stream Mining tutorial by Joao Gama, Albert Bifet, and Latifur Khan.

IoT Big Data Stream tutorial

It was first discussed that IoT is a very important topic nowadays. According to Google Trends, IoT (Internet of Things) has become more popular than “Big Data”.

IoT Applications

In traditional data mining, we often assume that we have a dataset to train a model. A key difference between traditional data mining and analyzing IoT data is that the data may not form a static dataset but rather a stream of data coming from multiple devices. A data stream is a “continuous flow of data generated at high speed from a dynamic, time-changing environment”. When dealing with a stream, we need to build a model that is updated in real time and can fit in a limited amount of memory, so as to be able to make predictions at any time. Various tasks can be performed on data streams, such as classification, clustering, regression, and pattern mining. A key idea in stream mining is to extract summaries of the stream, because all the data of a stream cannot be stored in memory. The goal is then to provide approximate predictions based on these summaries and to provide an estimate of the error. It is also possible not to look at all the data but to take some data samples, and to estimate the error based on the sample size.
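
To make the sampling idea more concrete, here is a small Java sketch of reservoir sampling, a classic way to maintain a fixed-size uniform random sample of a stream using bounded memory. It is my own illustrative example of the standard algorithm, not code from the tutorial.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Minimal reservoir sampling sketch: keeps a uniform random sample of fixed
// size k from a stream of unknown length, using O(k) memory.
public class ReservoirSample<T> {
    private final int k;                  // sample size
    private final List<T> reservoir;      // the current sample
    private long seen = 0;                // number of stream elements seen so far

    public ReservoirSample(int k) {
        this.k = k;
        this.reservoir = new ArrayList<>(k);
    }

    // Process one element of the stream.
    public void offer(T element) {
        seen++;
        if (reservoir.size() < k) {
            reservoir.add(element);                            // fill the reservoir first
        } else {
            long j = ThreadLocalRandom.current().nextLong(seen); // uniform in [0, seen)
            if (j < k) reservoir.set((int) j, element);          // keep element with probability k / seen
        }
    }

    public List<T> sample() {
        return new ArrayList<>(reservoir);
    }

    public static void main(String[] args) {
        ReservoirSample<Integer> rs = new ReservoirSample<>(5);
        for (int i = 0; i < 1_000_000; i++) rs.offer(i);  // simulate a fast stream
        System.out.println(rs.sample());                   // 5 elements sampled uniformly
    }
}
```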

If you are interested in this topic, the slides of this tutorial can be found here.

Day 1: Welcome reception

After the workshops and tutorials, there was a welcome reception in the evening at the Galaxy Hotel. There were drinks and food. It was a good opportunity to talk with other researchers. I met several researchers whom I already knew, as well as several people whom I did not know.

The PAKDD 2019 Welcome Reception

Day 2: Conference Opening

The second day started with the conference opening, where a traditional lion dance was first performed.

Then, the organizers gave their speeches. It was announced that there were more than 300 participants at the conference this year.

The PC chair gave information about the conference. Here are some pictures of some slides:

Then, there was a keynote about relational AI by Dr. Jennifer L. Neville. It was about the analysis of graphs or networks, such as social networks.

Then, there were several research paper presentations for the rest of the day. We presented a paper about high utility itemset mining called “Efficiently Finding High Utility-Frequent Itemsets using Cut off and Suffix Utility”.

In the evening, no activities were planned, so I went with other researchers to eat at a restaurant in the Taipa area.

Day 3: Keynote on Talent Analytics

In the morning, there was a keynote by Prof. Hui Xiong about “Talent Analytics: Prospects and Opportunities”. The talk was about how to identify and manage talent, which is very important for companies.

A talent is an “experienced professional with deep knowledge”. This is in contrast with personnel who do simple, standardized work, have simple knowledge, and may in the future be replaced by machines. Talents are team players, and elite talents also have leadership. Leadership means having a vision of the current situation and of what will happen in the next five years, and being able to manage a team and manage risks. In terms of team management, it is important to find talents for the right positions and to manage the team well.

The presenter explained that intelligent talent management (ITM) means using data with an objective, making decisions based on data, offering specific solutions to complex scenarios, and being able to make recommendations and predictions. Some example tasks are predicting when talents will leave, intelligent recruitment, and intelligent talent development, management, organization, and risk control. Doing this well requires both big data technical knowledge and human resource management knowledge.

Then, there were paper presentations.

Day 3: Excursion and banquet

In the afternoon, there was a 4-hour city tour of the Ruins of St. Paul's, Senado Square, the A-Ma Temple, and Lotus Flower Square. Here are a few pictures.

Finally, the conference banquet was held in the evening. Several awards were announced.

Ee-Peng Lim received the Distinguished Contributions Award
Shengrui Wang et al. received the Best Application Paper Award
The Best Student Paper Award went to Heng-Yi Li et al.
The Best Paper Award went to Yinghua Zhang

And there was some music and a show during the banquet:

Day 4: Keynote Talk on Big Data Privacy

In the morning, there was a keynote talk by Josep Domingo-Ferrer about how to reconcile privacy with data analytics. He explained what big data anonymization is, the limitations of state-of-the-art techniques, how to empower subjects, users, and controllers, and opportunities for research.

It was first discussed that several novels have anticipated the problem of data privacy, and that nowadays many countries have adopted laws to protect data. A few principles are proposed for handling data: (1) only collect data that is needed and keep it only as long as necessary, (2) let the user give specific and explicit consent, (3) limit the collected data to a specific purpose, (4) the process should be open and transparent, (5) the ability to erase or rectify data, (6) protect data from security threats, (7) accountability, and (8) privacy should be part of the design of the system.

But it is sometimes complicated to comply with these principles, which can seem to be in conflict with the use of big data.

A solution is data anonymization. After data has been anonymized, it may be easier to use it for secondary purposes. Thus, a challenge is to create these anonymized big datasets.

Statistical disclosure control is a set of techniques to anonymize data. It is used to reduce the risk that data is re-identified. A goal is often to anonymize the data to reduce the risks of disclosure while preserving the usefulness of the data (utility).

On the other hand, privacy-first models ensure that the anonymized data meets some minimum requirements. One of the most famous approaches is called “k-anonymity”.

Other approaches are “differential privacy” techniques.

One of the challenges related to privacy for big data is to ensure privacy for dynamic data (data streams). For big data, there are methods that anonymize data locally (e.g., by adding noise or by generalization) before sending it to the controller.
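
To make the idea of adding noise a bit more concrete, here is a small illustrative Java sketch of the Laplace mechanism, a standard building block of differential privacy, in which a numeric query answer is released with Laplace noise whose scale is the query's sensitivity divided by epsilon. This is my own toy example, not code or material from the talk.

```java
import java.util.concurrent.ThreadLocalRandom;

// Toy sketch of the Laplace mechanism of differential privacy: a numeric
// query answer is released with noise drawn from Laplace(0, sensitivity / epsilon).
public class LaplaceMechanism {

    // Sample from Laplace(0, scale) by inverse transform sampling.
    static double laplaceNoise(double scale) {
        // u is uniform in (-0.5, 0.5); the endpoint -0.5 is excluded to avoid log(0)
        double u;
        do {
            u = ThreadLocalRandom.current().nextDouble() - 0.5;
        } while (u == -0.5);
        return -scale * Math.signum(u) * Math.log(1 - 2 * Math.abs(u));
    }

    // Release a query answer with epsilon-differential privacy,
    // given the query's global sensitivity.
    static double privateRelease(double trueAnswer, double sensitivity, double epsilon) {
        return trueAnswer + laplaceNoise(sensitivity / epsilon);
    }

    public static void main(String[] args) {
        // Example: a counting query (sensitivity 1) answered with epsilon = 0.5
        double trueCount = 1234;
        System.out.println("Noisy count: " + privateRelease(trueCount, 1.0, 0.5));
    }
}
```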

Some limitations of state-of-the-art techniques are as follows:

There was then some discussion of some proposals for privacy preserving big data analytics. I will not report all the details. The conclusions of the talk:

Day 4 – afternoon

In the afternoon, there was the PAKDD Most Influential Paper Award presentation on the Extreme Support Vector Machine by Prof. Qing He, as well as the PAKDD 2019 Challenge Award presentation.

Conclusion

Overall, this was an excellent conference. It was well organized, and I met many researchers and listened to several interesting talks. I am looking forward to PAKDD 2020 next year in Singapore.


Philippe Fournier-Viger is a computer science professor and founder of the SPMF open-source data mining library, which offers more than 170 algorithms for analyzing data, implemented in Java.

The 7th China Information Technology Expo – CITE 2019 (a brief report)

This week, I attended the 7th China Information Technology Expo (CITE 2019), which was held at the Shenzhen Convention and Exhibition Center in Shenzhen, China, from the 9th to the 11th of April 2019. In this blog post, I will give a brief overview of this fair, where various companies were showing their new products and services.

China Information Technology Expo 2019

The event is organized as a fair, where companies have booths grouped by theme: (1) smart home, smart city, and smart terminals, (2) new displays, (3) intelligent manufacturing and 3D printing, (4) robots and intelligent systems, (5) artificial intelligence and intelligent hardware, (6) IoT, blockchain, and cybersecurity, (7) automotive electronics, batteries, and new energy, and (8) basic electronics, components, equipment, and materials.

There were numerous Chinese companies as well as some international companies, and it was quite interesting to see the various products on display. The CITE 2019 fair is reasonably big, but not as big as some other technology fairs in China, such as the Big Data Expo.

Below, I show some selected pictures from the CITE 2019 fair:

Robot for cleaning windows
There were many specialised machines
Curved displays
A robot fish was swimming
Robots for assembly lines
Another assembly line robot
More displays including on transparent glasses
LED displays
More machines
Multiplayer virtual reality games
3D printers were also on display
There were many types of robots for kids and the home
Realistic looking robots that can move
8K displays
Flexible displays
Some of the booths at CITE 2019

Conclusion

This was just a short blog post to give a glimpse of this event. I think it is quite interesting to attend such events to see what is happening in the industry. I hope you have enjoyed reading this blog post about CITE 2019. If you want to get notified about upcoming blog posts, you can follow me on Twitter at @philfv.


Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.

Interview with Prof. Rage Uday Kiran about Data Mining

Today, I have the pleasure of interviewing Rage Uday Kiran, a researcher at the National Institute of Informatics in Tokyo, Japan. R. Uday Kiran is an Indian researcher who has been working in Japan for several years. He has been active mainly in the field of data mining and is a well-known researcher on the topic of discovering patterns in databases. He has taken the time to answer several questions for this interview.

1) Could you please give a brief overview of your most important contributions?

Frequent itemset mining is an important model in data mining. Its mining algorithms discover all itemsets in the data that satisfy the user-specified minimum support (minSup) constraint. The minSup controls the minimum number of transactions that an itemset must cover within the data. Since only a single minSup threshold is used for the entire data, the model implicitly assumes that all items within the data have uniform frequency. However, this is seldom the case in many real-world applications. In many applications, some items appear very frequently within the data, while others rarely appear. If the frequencies of items vary greatly, then we encounter the following two problems:

  • If minSup is set too high, we miss those itemsets that involve rare items in the data.
  • In order to find the itemsets that involve both frequent and rare items, we have to set minSup very low. However, this may cause a combinatorial explosion, producing too many itemsets, because those frequent items associate with one another in all possible ways and many of them are meaningless depending upon the user and/or application requirements.

This dilemma is known as the rare item problem. During my PhD, I tried to address this problem by developing frequent itemset models based on multiple minimum supports.
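
To illustrate the multiple minimum supports idea, here is a toy Java sketch in which each item has its own minimum item support (MIS) and, following a common convention in this line of work, an itemset is checked against the smallest MIS among its items. This is only an illustrative check written for this post, not one of Prof. Kiran's algorithms.

```java
import java.util.*;

// Toy illustration of the multiple minimum supports idea: each item has its
// own minimum item support (MIS), and an itemset is considered frequent if
// its support count reaches the smallest MIS among its items.
public class MultipleMinSupDemo {

    static boolean isFrequent(Set<String> itemset,
                              List<Set<String>> transactions,
                              Map<String, Integer> mis) {
        // Support count: number of transactions containing the whole itemset
        long support = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        // Threshold: the smallest MIS among the itemset's items
        int threshold = itemset.stream().mapToInt(mis::get).min().orElse(Integer.MAX_VALUE);
        return support >= threshold;
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("bread", "milk"),
            Set.of("bread", "milk", "caviar"),
            Set.of("bread"),
            Set.of("milk"));

        // Frequent items get a higher MIS; the rare item gets a low one
        Map<String, Integer> mis = Map.of("bread", 2, "milk", 2, "caviar", 1);

        // {bread, milk}: support 2 >= min(2, 2) -> frequent
        System.out.println(isFrequent(Set.of("bread", "milk"), db, mis));
        // {milk, caviar}: support 1 >= MIS(caviar) = 1 -> found despite caviar being rare,
        // whereas a single minSup of 2 would have missed it
        System.out.println(isFrequent(Set.of("milk", "caviar"), db, mis));
    }
}
```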

Periodic itemsets are an important class of regularities that exist within the data. Most previous studies have tried to find periodic itemsets based on an implicit assumption that all transactions within the data occur at a fixed time interval. However, in many real-world applications, transactions occur irregularly within the data. For the past few years, I have been developing models to discover different types of periodic itemsets in irregular time series/temporal databases.

2) What do you think are the key problems that remain to be solved in the field of pattern mining?

1. The rare item problem is still a major problem that needs to be addressed in many pattern mining models.

2.  Non-support measures, such as occupancy, have to be investigated to assess the interestingness of an itemset.

3. Tuning is a common practice in pattern mining, so disk-based algorithms have to be investigated to lower the operational cost.

3) What do you expect to achieve in the next 5 years?

In the near future, IoT devices will become the main source of data. The data generated by these IoT devices is often large (petabytes of data) and typically has spatiotemporal characteristics. In the next few years, I would like to develop models that can extract useful information from spatiotemporal databases. In addition, I would like to investigate parallel and disk-based algorithms to find useful information in very large databases efficiently.

4) Do you think that it is important to collaborate with the industry? What are the keys to a successful collaboration?

Yes. I firmly believe it is important for an academic to collaborate with people from industry. Industrial collaboration helps an academic to know the limitations of current research on a particular topic, thereby enabling them to develop models and algorithms that cater to industrial requirements. Mutual trust, regular discussions, and openness are crucial factors for a successful collaboration.

5) What is the current state of data mining and artificial intelligence technology in Japan?

In my opinion, this is the hardest question to answer. The Japanese government has initiated a project called Society 5.0, which envisions a human-centered society that balances economic advancement with the resolution of social problems through a system that highly integrates cyberspace and physical space. In this context, most researchers in Japan are working on developing parallel deep neural network algorithms that can analyze real-world data effectively. In my lab at the University of Tokyo, researchers are working on language translation using deep neural networks.

6) Which conferences do you like to attend? Why?

I generally wish to attend top international conferences (e.g., KDD, CIKM, PAKDD, SSDBM, EDBT, DASFAA, and DEXA). The reasons are as follows: (1) to learn about the hot research problems being addressed by researchers, (2) to interact with the speakers/authors to gain an in-depth understanding of the topics of interest, and (3) to collaborate with fellow researchers working on similar topics.

7) Do you have some advice for young researchers?

Have an open mind. Read as many research papers as possible, and ensure that you cover many topics. Try to grasp the implicit and explicit assumptions made by the authors in every research paper. Manage your time carefully. Try to collaborate with the senior researchers and students in your lab.

Thank you for participating in this interview!

The best data mining mailing lists for researchers

Today, I will list a few useful mailing lists related to data mining and big data. Subscribing to these mailing lists is useful for PhD students and researchers, as many jobs, conferences, special issues, and other opportunities are advertised on them. They are also good places to post your own announcements for jobs, calls for papers, etc.

Here is the list:

If you think that I have missed some important mailing lists, please share them in the comment section, and I will update the page. Thanks for reading!


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.

How to write an academic book?

Have you ever wanted to write an academic book, or wondered what the steps to write one are? In this blog post, I will give an overview of the steps for writing an academic book and mention some lessons learned while writing my recent book on high utility pattern mining.

how to write an academic book?

Step 1. Think about a good book idea.
The first step for writing a book is to think about the topic of the book and who the target audience will be. The topic should be something that will be interesting for an audience. If a book focuses on a topic that is too narrow or targets a small audience, the impact may be smaller than if a more general topic is chosen or a larger audience is targeted.

One should also think about the content of the book, evaluate how much time it would take to write the book, and think about the benefits of writing the book versus spending that time doing something else. It is also important to determine the type of book. There are three main types of academic books:

  • First, one may publish a textbook, reference book, or handbook. Such a book must be carefully planned and written in a structured way. The aim is to write a book that can be used for teaching or used as a reference by researchers and practitioners. Because such a book must be well organized, all chapters are often written by the same authors.
  • Second, one may publish an edited book, which is a collection of chapters written by different authors. In that case, the editors typically write one or two chapters and then ask other authors to write the remaining chapters. This is sometimes done by publishing a “call for chapters” online, which invites potential authors to submit a chapter proposal. The editors then evaluate the proposals and select some chapters for the book. Editing such a book is generally less time-consuming than writing a whole book oneself, because the editors do not need to write all the chapters. However, a drawback is that chapters may contain redundancy and have different writing styles. Thus, the book may be less consistent than a book written entirely by the same authors. Another common type of edited book is the conference or workshop proceedings.
  • Third, one may publish one's Ph.D. thesis as a book if the thesis is well written. In that case, one should be careful to choose a good publisher, because several predatory publishers offer to publish theses with very low quality control, while taking all the copyrights, and then sell the theses at very expensive prices.

Step 2. Submit a book proposal
After finding a good idea for a book, the next step is to choose a publisher. Ideally, one should choose a famous publisher or a publisher that has a good reputation. This will give credibility to the book, and will help to convince potential authors to write chapters for the book if it is an edited book.

After choosing a publisher, one should write a book proposal and send it to the publisher. Several publishers have specific forms for submitting a book proposal, which can be found on their websites or by contacting the publisher. A book proposal will request various information, such as: (1) information about the authors or editors, (2) some sample chapters (if some have been written), (3) whether there are similar books on the market, (4) who the primary and secondary audiences will be, (5) information about the conference or workshop if it is a proceedings book, (6) how many pages, illustrations, and figures the book will contain, (7) the expected completion date, and (8) a short summary of the book idea and the chapter titles.

The book proposal will be evaluated by the publisher and, if it is accepted, the publisher will ask you to sign a contract. One should read the contract carefully and then sign it if it is satisfactory.

Step 3. Write the book
Then, the next step is to write the book, which is generally the most time-consuming part. In the case of a book written entirely by the same authors, this can require a few months. For an edited book, it can take much less time, but the editors must still find authors to write the chapters and perhaps also write a few chapters themselves.

After the book has been written, it should be checked carefully for errors and consistency. A good idea is to ask peers to check the book to see if something needs to be improved. For an edited book, a review process can be organized by recruiting reviewers to review each chapter. The editors should also spend some time putting all the chapters together and combining them into a book. This can take quite a lot of time, especially if the authors did not respect the required format. For this reason, it is important to give the authors very clear instructions about the format of their chapters before they start writing.

Step 4. Submit the book to the publisher
After the book is written, it is submitted to the publisher. The publisher will check the content and the format and may offer other services such as creating a book index or revising the English. A publisher may take a month or two to process a book before publishing it.

Step 5. Promote the book
After writing a book, it is important to promote it appropriately on the web, on social media, or at academic conferences. This will help ensure that the book is successful. Of course, if one chooses a good publisher, the book will get more visibility.

Lessons learned
This year, I published an edited book on high utility pattern mining with Springer. I followed all the above steps to edit that book. I first submitted a book proposal to Springer, which was accepted. Then, I signed the contract and posted a call for chapters. I received several chapter proposals and also asked other researchers to write chapters. The writing part took a bit of time because, although I edited the book, I still participated in the writing of six of the twelve chapters. Moreover, I also asked various people to review the chapters. Then, it took me about 2 weeks to put all the chapters together and fix the formatting issues. Overall, the whole process was spread over about a year and a half, but I spent perhaps 1 or 2 months of my own time on it. Would I do it again? Yes, because I think it is good for my career, and I have some other ideas for books.

The most important lesson I learned is to give clearer instructions to authors, to reduce formatting problems and other issues that arise when putting all the chapters together.

Conclusion
In this blog post, I have discussed how to write an academic book. Hope you have learned something! Please share your comments below. Thanks for reading!


Philippe Fournier-Viger is a full professor working in China and founder of the SPMF open source data mining software.