Today, I want to show you a new interactive demo of the KNN (K-Nearest Neighbors) algorithm that I have added to my website. It is designed for teaching purposes, to illustrate how the K-Nearest Neighbors algorithm works.
In section 1 of the webpage, you can enter some data, that is, a list of records or instances to be used by KNN to make predictions. The first line is the list of attributes. Then, each following line is a record, which is a list of values separated by single spaces. The values can be categorical or numerical.
Then, the value of K can be selected in section 2 of the webpage. For the purpose of teaching, values of K are restricted to be between 1 and 100.
Then, you can provide an instance to classify in section 3 of the webpage. The instance to classify is a list of attribute values, but one of them must be replaced by ?, meaning that we want to predict this attribute value using KNN.
Finally, by clicking the Run KNN button, the results are displayed like this:
It indicates the K most similar instances, the calculated distances between those instances and the instance to classify, and the predicted attribute value.
It is possible to run the demo with different values of K and different data to observe the result, which can be good for learning.
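To make the steps above more concrete, here is a minimal Python sketch of the same idea. It is not the code of the demo. In particular, the distance function is an assumption: numerical values are compared with a squared difference and categorical values with a simple 0/1 mismatch, and the prediction is a majority vote among the K nearest records. The function names and the small dataset are hypothetical and only there for illustration.

```python
import math
from collections import Counter

def parse(text):
    """First line = attribute names, each following line = one record."""
    rows = [line.split() for line in text.strip().splitlines()]
    return rows[0], rows[1:]

def as_number(value):
    """Return the value as a float, or None if it is categorical."""
    try:
        return float(value)
    except ValueError:
        return None

def distance(record, query, target):
    """Assumed mixed distance: squared difference for numbers,
    0/1 mismatch for categories, skipping the attribute to predict."""
    total = 0.0
    for i, (a, b) in enumerate(zip(record, query)):
        if i == target:
            continue
        x, y = as_number(a), as_number(b)
        if x is not None and y is not None:
            total += (x - y) ** 2
        else:
            total += 0.0 if a == b else 1.0
    return math.sqrt(total)

def knn_predict(records, query, k):
    """Return the predicted value and the K nearest (record, distance) pairs."""
    target = query.index("?")  # position of the attribute to predict
    scored = [(r, distance(r, query, target)) for r in records]
    neighbors = sorted(scored, key=lambda pair: pair[1])[:k]
    votes = Counter(r[target] for r, _ in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Hypothetical example data in the format described above.
data = """
height weight size
158 58 M
160 59 M
163 61 M
165 62 L
168 66 L
170 68 L
"""
attributes, records = parse(data)
prediction, neighbors = knn_predict(records, "161 61 ?".split(), k=3)
for record, dist in neighbors:
    print(record, round(dist, 3))
print("Predicted", attributes[-1], "=", prediction)
```

Changing `k` or the query in this sketch and running it again is a quick way to see how the set of neighbors and the vote change.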
Conclusion
This tool is for teaching purposes. If you want to try a more efficient implementation in Java, you could try the one from the SPMF data mining software, which is free and open source.
The K-Means demo first lets you enter a list of two-dimensional data points in the range [0,10] or generate 100 random data points:
Then the user can choose the value of K, adjust other settings, and run the K-Means algorithm.
The result is then displayed for each iteration, step by step. Each cluster is represented by a different color. The SSE (Sum of Squared Errors) is displayed, and the centroids of the clusters are illustrated by the + symbol. For example, this is the result on the provided example dataset:
Because K-Means is a randomized algorithm, if we run it again the result may be different:
Now, let me show you the feature of generating random points. If I click the button for generating a random dataset and run K-Means, the result may look like this:
And again, because K-Means is randomized, I may execute it again on the same random dataset and get a different result:
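For readers who prefer code, the process shown by the demo can be sketched in a few lines of Python. This is only an illustration and not the code of the demo. It assumes that the initial centroids are K data points picked at random (the demo may initialize them differently) and that the SSE is the sum of squared Euclidean distances between each point and the centroid of its cluster. The `seed` parameter is a hypothetical addition, used here only to make the randomness reproducible.

```python
import random

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def sse(points, centroids, labels):
    """Sum of squared errors of the points to their cluster centroid."""
    return sum(squared_distance(p, centroids[c]) for p, c in zip(points, labels))

def kmeans(points, k, seed=None, max_iter=100):
    """Basic K-Means. Returns the centroids, the labels, and the SSE per iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random data points as start
    labels, history = None, []
    for _ in range(max_iter):
        # Step 1: assign each point to its closest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no change: the algorithm has converged
            break
        labels = new_labels
        # Step 2: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
        history.append(sse(points, centroids, labels))
    return centroids, labels, history

# 100 random points in [0,10] x [0,10], as in the random dataset option.
generator = random.Random(0)
points = [(generator.uniform(0, 10), generator.uniform(0, 10)) for _ in range(100)]

# Two runs on the same data with different random starts may give different results.
centroids_a, labels_a, history_a = kmeans(points, k=3, seed=1)
centroids_b, labels_b, history_b = kmeans(points, k=3, seed=2)
print("Run A: SSE per iteration =", [round(v, 2) for v in history_a])
print("Run B: SSE per iteration =", [round(v, 2) for v in history_b])
```

In this sketch, the SSE recorded after each iteration never increases within a run, but two runs with different seeds can stop at different final values, which is the behavior that the demo illustrates.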
I think that this simple tool can be useful for showing students how the K-Means algorithm works. You may try it. It is simple to use and lets you visualize the result and the clustering process. Hope that it will be useful!
Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.
Hi all, I am just writing a quick message to say that I have not written on the blog recently due to my very busy schedule. However, everything is going well. I will be back on the blog with more content soon, and I will start to add more videos to the YouTube channel soon. Also, I am working on the next version of SPMF. 😉
Today, I will talk about the MLDM 2023 conference (International Conference on Machine Learning and Data Mining). Several years ago, I attended MLDM 2016 (report here) because at that time it was published by Springer. But I was unhappy that the MLDM conference was advertised as being held in New York, while it was actually held in a small hotel 40 minutes away in another city called Newark, close to nothing. And that conference was really expensive at around 650 euros… I have thus never attended that conference again, but I have observed that during the following years, it was still advertised as being in New York, while being held in that same hotel in Newark every year (see my blog post about MLDM 2019). Thus, I never submitted a paper to that conference again, and I also observed that it is not published by Springer anymore (it might be because of that). This for me is another reason not to publish there anymore.
Why am I talking about MLDM 2023? Because this year I received a comment on my blog reporting that the conference was not even held and that people went to Newark and found that there was no conference at all! Here is the comment that reports this:
And a related tweet:
I did not verify whether this comment is true, but I believe that it is, given that this MLDM conference has repeatedly misled people about the location of the conference.
Update 2023-10: And here is another comment that I just received on this blog from someone who also had a similar bad experience with MLDM 2023:
By looking at the Internet Archive's Wayback Machine, I can see that the conference website of MLDM 2023 still advertised the conference as being in New York:
But when we click on “Location”, as usual, we find that it is not in New York City but instead in the city of Newark:
For those who are not familiar with the map of the US, Newark and New York are two cities in two different states:
Thus, these two cities should not be confused!
This blog post is just to give an update about this MLDM conference.
If you have any information about this MLDM 2023 conference and whether it was really held or not, or if you have any other interesting experience to share, please leave a comment below.
Today, I would like to talk about something that happens sometimes in academic journals, which is fake reviews. While many reviewers spend time writing reviews that provide a fair evaluation of papers, some reviewers behave very unethically and submit fake reviews.
This recently happened for a paper that I submitted to a journal. I will not say the name of the journal, but I will mention that it is a Q1 journal (a top 25% journal) of Elsevier that is quite highly ranked.
We submitted a paper that was rejected and received two reviews. The first review contained somewhat minor criticisms that I can accept. The second review, however, is long but only contains some very general criticisms that do not mention anything related to our paper. Upon reading that latter review a few times, I thought that it was very strange because that review sounds so generic. Here is the review:
As you can see, this review is very generic. It could be applied to almost any paper. Besides, the review contains some unusual choices of words, such as in point (6), where the reviewer calls our paper “an essay”.
Thus, I searched on Google, and found that this exact review appears on a website (https://www.qeios.com/read/LUCUU6) for another paper:
Upon seeing this, it is clear that the reviewer submitted a fake review. Thus, I sent an email to the editor to complain about that fake review, and hopefully, this will be taken into account and that reviewer will be punished.
This is very unethical behavior. And it leads to the question: Why would a reviewer do this? It is likely because the reviewer wants to review more papers and to do so as quickly as possible, as the number of reviews is displayed on some websites such as Publons. However, this shows that the reviewer is selfish, as he does not care about the authors who receive such fake reviews and the consequences of these fake reviews, such as the paper being rejected after waiting several months. Hopefully, the editor will punish that reviewer.
This was a blog post to talk about this issue, which probably happens more often than we think. I think I have seen this at least one other time in the last few years. Besides, there are many other unethical behaviors that can be observed in academic journals, such as reviewers who ask authors to cite many of their papers to boost their citations. I have seen this several times, and every time I have reported the situation to the editor.
Hope that this blog post has been interesting. If you want to share your story, please post in the comments below.
Today, I would like to introduce an upcoming feature that will be released in the next version of SPMF (v. 2.60). It is a tool called the Memory Viewer. This tool is very simple yet useful for investigating the performance of algorithms. Here is a preview of how it works.
To launch the Memory Viewer in the user interface of SPMF, we need to select the algorithm “Memory Viewer” and then click “Run algorithm”.
This will open a separate window for monitoring the current memory usage of the Java Virtual Machine (JVM) in real time. This window is shown here on the right:
That window displays the current memory usage of the JVM and updates it every second. It displays the last 100 seconds. After opening it, you can run a data mining algorithm and the memory usage will then be updated throughout the algorithm’s execution like this:
This is a very simple tool, but I think it is quite useful to get some insights about the performance of the algorithms that are running.
Note that this tool will only monitor the performance of algorithms that are running in the same JVM as SPMF. Thus, if you select the option “Run in a separated process” of SPMF to run an algorithm in a separate JVM, the memory will not be monitored.
Update: Thanks for the feedback. I have also added a slider to let the user change the refresh rate of the Memory Viewer:
That is all for today. I just wanted to show you an upcoming feature, as I am currently working on the next release of SPMF. If you have some suggestions, please leave them in the comment section below.
Related to this, if you want to try the EFIM algorithm, the original source code and datasets for testing are available in the SPMF open-source software.
And the video also gives a quick overview of those additional topics:
Episode rule mining
Top-K episode mining
High-utility episode mining
Hope you will enjoy the video!
By the way, to try episode mining, you can also check out the SPMF software, which is open-source and free. It offers efficient implementations of many episode mining algorithms and also provides example datasets. You can use this software in your research (I am the founder, by the way).
Also, if you want to know more about episode mining, you can check our survey paper:
Ouarem, O., Nouioua, F., Fournier-Viger, P. (2023). A Survey of Episode Mining. WIREs Data Mining and Knowledge Discovery, Wiley, to appear.
I often say that as an invited speaker for a conference or as a teacher for a course, we need to be ready for the unexpected and be prepared for every situation that could happen. This means, for example, bringing special cables or adapters that may be needed to give a talk in a new location, having at least two copies of our presentation on different media (e.g. laptop, USB, or email), and arriving early to avoid being late.
Today is such a day where the unexpected happened. I was a keynote speaker yesterday at an AI Innovation Think Tank forum in Shanghai, and was supposed to fly immediately afterwards to another city (Changchun) in the evening to give another talk the next day. Long story short, the flight was delayed from 8 PM to 10 PM, and then to 4 AM before being cancelled. Thus, I only slept a few hours, and had to deal with many problems to obtain refunds from the airline company. And given that I would clearly be unable to attend the conference on time, I contacted the organizers early so that we could arrange for my talk to be given online. I also recorded a video of my talk in the morning and sent it to the organizers, so that they could play it if the network connection was bad for whatever reason. This is something that was not requested but can truly make a difference, as I have often seen online talks at conferences where we could barely hear the speaker due to a poor internet connection, and I don’t want this to happen!
Then, as I still had to fly from the airport, I had to give my keynote talk from there, and find a quiet place to do it before boarding another flight to return home. Thus, I went to the airport early to find a suitable place. The internet connection was very good, and I set myself up on a cart in a quiet place.
Also, it helps that I carry with me a portable RODE shotgun microphone that I can use to give a professional sound to my talks while on the go. This type of microphone is very good for an environment like an airport, as it focuses on the sound that is directly in front of the microphone and mostly ignores surrounding noise.
I also carry with me an excellent pair of headphones.
And sometimes, I also carry a tripod, a portable light, and a noise filter for my microphone as well (but not this time). But here are some pictures of different accessories that I sometimes use with a portable tripod in different situations:
I also like to carry with me a laptop stand for working on the go:
And something very useful is to have a mouse. But not just any mouse. I personally highly recommend the Logitech MX Anywhere. It is a portable computer mouse that can work on basically any kind of material, even on glass, clothing, … anything!
This is perfect when travelling. You can be sure that the mouse can be used anywhere.
So this was just a short blog post to say that it is always better to be ready for the unexpected 🙂 If you have similar stories of unexpected things that happened to you, please share them with me in the comments below.
By the way, I did not write on the blog for a little while as I had a lot of things going on recently. Now, it is better. I will post more in the coming weeks.
This is a short blog post to talk about two common errors in pattern mining research papers.
1) The first error is:
“mining frequent itemsets from a database”
“mining patterns from a stream”
“mining patterns over a database”
“mining patterns over a data stream”
In English, we don’t mine something from something else or over something else. We mine something in something else. So it should be “mining frequent itemsets in a database” and “mining patterns in a stream”.