Sneak peek at the new user interface of SPMF (part 3)

Today, I would like to talk to you about another upcoming feature of the next version of SPMF (2.60), which will be released soon. It is a Workflow Editor that will allow the user to select multiple algorithms from the user interface and run them one after the other, such that each algorithm takes as input the output of the preceding algorithm. This will address one limitation of SPMF, namely that the user can currently only run one algorithm at a time from the user interface.
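
To make the idea more concrete, here is a rough sketch (not the actual Workflow Editor code) of what such a two-step workflow amounts to when done by hand from Java: the first step runs an SPMF algorithm through its usual command-line interface, and the second step consumes the file produced by the first step. The file names and the 0.8 parameter are only placeholders mirroring the example below.

import java.awt.Desktop;
import java.io.File;

public class ManualWorkflowSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: run an SPMF algorithm (here Eclat with minsup = 0.8) on an input file,
        // writing its result to eclat_output.txt
        // (assumes spmf.jar and input.txt are in the working directory)
        Process p = new ProcessBuilder("java", "-jar", "spmf.jar",
                "run", "Eclat", "input.txt", "eclat_output.txt", "0.8")
                .inheritIO().start();
        p.waitFor();

        // Step 2: the next step of the workflow takes the output of the previous step as input;
        // here it is simply opened with the default application, like the text editor step shown below
        Desktop.getDesktop().open(new File("eclat_output.txt"));
    }
}

The Workflow Editor automates this kind of chaining directly from the graphical interface, without writing any code.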

Here is a brief overview of this new feature. The user interface looks like this:

On the left, there is a space to visualize a workflow consisting of multiple algorithms, and on the right, details are displayed about the currently selected algorithm.

To use it, first, we click on “Add an algorithm”. This will create a new node for an algorithm like this:

Then, on the right, we need to select the algorithm and set its parameters. For example, I will choose Eclat and set its minimum support parameter to 0.8:

Then, as you can observe on the left, two orange boxes have appeared that represent the input and output of the algorithm. I can click on the input box and then choose an input file:

Then, I could also set the output file name in the same way. After that, I can add another algorithm to be run after Eclat. I can click again on “Add an algorithm” and choose another algorithm; here, I will choose the one that opens the output file with the system text editor:

This means that after running Eclat, its output file will be opened by the system text editor.

Now, the workflow has two algorithms. I can click on another button called “Run” (which I did not show until now) to execute the workflow, and information about the execution will be displayed in a console:

This is just a preview of a new feature of SPMF called the Workflow Editor. It already works, but there are still a few bugs that need to be fixed before it can be released. The user interface may change in the final release, and if you have any suggestions, please leave your comments below!


ChatGPT, LLMs and homework

Today, I want to talk briefly about ChatGPT and similar large language models (LLMs) and how they are used by students in universities. From what I observe, many students are using LLMs in universities nowadays. Among them, some students use LLMs to get ideas and suggestions on their work. But other students instead use LLMs to avoid working: they quickly generate reports and essays, and have the LLM write code for their assignments. These students often believe that text generated by LLMs cannot be detected by teachers.

But this is false. From my experience, it is quite easy to know which documents submitted by students have been generated by an LLM, because of three main factors:

  • First, there is the writing style. Text written by LLMs is often written too well, which raises suspicions. The teacher might then look more closely at the document to see if there are other problems.
  • Second, texts generated by LLMs may look real, but when a teacher looks at them closely, he or she can find that they often contain fake information and other inconsistencies, which reveal that the content is fabricated. For example, I know a professor at another university who asked students to write project reports in a course, and then found that several reports contained a reference section with research papers that did not exist. It was then obvious that the text was generated by an LLM and that the fake bibliography was a so-called “hallucination” of the LLM. Such signs are clear indicators that an LLM was used.
  • Third, text generated by LLMs will often not follow the requirements. For example, a student may use an LLM to generate a very convincing essay, but that essay may still fail to meet all the homework’s requirements. Thus, the student may still lose points for not following the requirements.

Thus, what I want to say is that students using LLMs to do their homework are taking risks, as LLMs can easily generate fake, inconsistent and incorrect content, which may also fail to meet the requirements.

This was just a short blog post to talk about this topic. Hope it has been interesting. Please share your perspective, opinions, or comments in the comment section below.


When ChatGPT is used to write papers…

Today, I want to share with you something funny but also alarming: some papers published in academic journals contain text indicating that parts were apparently written by LLMs.

The first example is the paper “The three-dimensional porous mesh structure of Cu-based metal-organic-framework – aramid cellulose separator enhances the electrochemical performance of lithium metal anode batteries” in the Elsevier journal Surfaces and Interfaces. The first sentence of the introduction is: “Certainly, here is a possible introduction for your topic:”

It is quite surprising that the authors and reviewers did not see this!

A second example of such a problem is the case report “Successful management of an iatrogenic portal vein and hepatic artery injury in a 4-month-old female patient: A case report and literature review” published in the open-access Elsevier journal Radiology Case Reports:

Again, it is surprising that this passed through the review process unnoticed by the editor, reviewers or authors.

Have you found other similar cases? If so, please share them in the comment section!


Sneak peek at the new user interface of SPMF (part 2)

Today, I will continue to show you some upcoming features of SPMF 2.60, on which work is ongoing. This new version of SPMF should be released in the coming weeks. The new feature that I will talk about today is the Timeline Viewer. It is a powerful tool for visualizing temporal data. Let me now show you this in more detail.

The Timeline Viewer can first display event sequences, which are files taken as input by episode mining algorithms, among others. For example, we can use the Timeline Viewer to see a visual representation of this input file:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|11
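
In this format, each line of the form “items|timestamp” indicates that the listed items occur together at the given timestamp (for example, “1 2|3” means that apple and orange occur at time 3), while the “@ITEM=” lines map item identifiers to readable names. As an illustration, here is a minimal Java sketch (not the actual SPMF code) that reads such a file and prints each event with its item names:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventSequenceReaderSketch {
    public static void main(String[] args) throws IOException {
        Map<String, String> itemNames = new HashMap<>(); // item identifier -> readable name
        for (String line : Files.readAllLines(Paths.get("contextEMMA.txt"))) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            } else if (line.startsWith("@ITEM=")) {
                // e.g. "@ITEM=1=apple" maps item 1 to the name "apple"
                String[] parts = line.substring("@ITEM=".length()).split("=", 2);
                itemNames.put(parts[0], parts[1]);
            } else if (line.startsWith("@")) {
                // other metadata such as @CONVERTED_FROM_TEXT
                continue;
            } else {
                // e.g. "1 2|3" means that items 1 and 2 occur at timestamp 3
                String[] itemsAndTime = line.split("\\|");
                List<String> names = new ArrayList<>();
                for (String item : itemsAndTime[0].split(" ")) {
                    names.add(itemNames.getOrDefault(item, item));
                }
                System.out.println("time " + itemsAndTime[1] + ": " + names);
            }
        }
    }
}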

To do that, we first select the input file “contextEMMA.txt” in SPMF (1) and then click on the new “View dataset” button (2):

This opens a table representation of the dataset:

Then, we click on the “View with Timeline Viewer” button (3) to see the visual representation:

The Timeline Viewer provides several options, such as exporting to an image, changing the tick interval and the minimum and maximum timestamps, and applying a scaling ratio to the X axis. Moreover, the Timeline Viewer has a built-in custom algorithm to automatically determine the best parameters to ensure a good visualization. Here are some of the options available:

The second feature of the Timeline Viewer is to view time-interval datasets, such as those taken as input by the FastTIRP and VertTIRP algorithms (to be released in SPMF 2.60). To use this feature, we again select an input file (1) and click on the “View dataset” button (2):

Then, we obtain a table representation of the dataset and click on the “View with Timeline Viewer” button (3) to see the visual representation:

The result is like this:

At the bottom, we have the timeline. On the left side, we can see the sequence IDs (S0, S1, S2, S3…), and the time intervals from each sequence are depicted using a different color for easier visualization. We can also adjust various parameters to customize the visualization and export the picture as a PNG file.

Here is another example with a smaller data file containing three time-interval sequences:

OK, so that’s all for today. I just wanted to give you a preview of upcoming features in SPMF. Hope that it is interesting. There are still some bugs to be fixed and other improvements to be made, so this feature may still change a bit before it is released.

By the way, the Timeline Viewer is completely built from scratch to ensure efficiency (which is an important design goal of SPMF). Building a timeline viewer was quite challenging: there are many special cases to consider and tricky aspects to handle to ensure a good visualization.

If you have any comments or suggestions about this feature or what you would like to have in SPMF, please leave a comment below or send me a message.


Philippe Fournier-Viger is a distinguished professor working in China and founder of the SPMF open source data mining software.


Sneak peek at the new user interface of SPMF (part 1)

I am currently working on the next version of SPMF, which will be called 2.60. There will be several improvements to the user interface of SPMF. Here is an overview of some of the improvements to give you a sneak peek at what is coming. Note that more changes may still occur before the next version is released ;-P

The new VIEW button is one of the most important new features of the upcoming SPMF 2.60. It provides many different views for various types of input files. For example, if we open an input file for high utility itemset mining, the view looks like this:

There are also many other viewers integrated in the new version of SPMF, covering all the main types of data available in SPMF.
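
As a reminder, an input file for high utility itemset mining in SPMF is a plain text file where each line describes one transaction in three sections separated by colons: the items, the transaction utility, and the utility of each item. For instance, in the small made-up example below, the first line is a transaction containing items 1, 3 and 4 with utilities 5, 6 and 3, for a transaction utility of 14:

1 3 4:14:5 6 3
2 3:7:4 3
1 2 3 5:20:2 8 6 4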

Hope that this is interesting. This is just to give you a preview of what is coming in SPMF. Of course, this might still be a little different when it is released, as I am still thinking about other possible improvements.



UDML 2024 Accepted papers

Today, I want to talk to you about the upcoming UDML 2024 workshop at the PAKDD 2024 conference. This year is the 6th edition of the UDML workshop. I am happy to say that this year we received a record number of submissions (23 submissions), which shows that the workshop and the research direction of utility mining and learning are doing well.

Given the number of submissions, the selection process was quite competitive, and some papers could not be accepted even though they were actually very good.

The list of the 10 accepted papers is as follows:

This will certainly be a very interesting workshop at PAKDD this year.




SPMF 2.60 is coming soon!

Today, I want to talk a little bit about the next version of SPMF that is coming very soon. Here are some highlights of the upcoming features:

1) A Memory Viewer to help monitor the performance of algorithms in real time:

Also, the popular MemoryLogger class of SPMF has been improved to provide the option of saving all recorded memory values to a file, when it is set in recording mode and a file path is provided. This is done using two new methods, “startRecordingMode” and “stopRecordingMode”. While in recording mode, the MemoryLogger writes the memory usage to the file every time an algorithm calls the checkMemory method, and the recording can be stopped by calling the stopRecordingMode method (a short usage sketch is given after this list).

2) A tool to generate cluster datasets using different data distributions, such as the Normal and Uniform distributions. Here are some screenshots of it:

3) A simple tool to visualize transaction datasets. This tool is simple but can be useful for quickly exploring a dataset and seeing its content. It provides various information. This is an early version, and more features will be considered.

The tool has two visualization features: viewing the frequency distribution of transactions according to their lengths, as well as the frequency distribution of items according to their support:

4) A simple tool to visualize sequence datasets. This is similar to the above tool but for sequence datasets.

5) A new tool to visualize the frequency distribution of patterns found by an algorithm. To use this feature, when running an algorithm, select the “Pattern viewer” for opening the output file. Then, select the support (#SUP) and click “View”. This will open a new window that displays the frequency distribution of support values, as shown below. This feature also works with other measures besides the support, such as the confidence and the utility.

6) A tool to compute statistics about graph database files in SPMF format. This is a feature that was missing in previous versions of SPMF but is quite useful when working with graph datasets.

7) Several new data mining algorithm implementations. Some that are ready are FastTIRP, VertTIRP, Krimp, and SLIM. Others are still being integrated.

8) A new set of highly efficient data structures implemented using primitive types, to further improve the performance of data mining algorithms by replacing the standard Java collection classes. Some of these structures are visible in the picture below. Using them can improve the performance of algorithm implementations. It actually took weeks of work to develop these classes and make them compatible with comparators and other features expected from collections in the Java language.
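
Going back to point 1) about the MemoryLogger, here is a minimal usage sketch of the new recording mode. Note that the exact method signatures are an assumption on my part based on the description above (in particular, I assume that startRecordingMode takes the path of the file where the values will be saved):

import ca.pfv.spmf.tools.MemoryLogger;

public class MemoryRecordingExample {
    public static void main(String[] args) {
        MemoryLogger logger = MemoryLogger.getInstance();
        logger.reset();
        // Assumed signature: enable the recording mode and provide the output file path
        logger.startRecordingMode("memory_usage.txt");

        // ... run an algorithm here; each call to checkMemory() records the current memory usage
        logger.checkMemory();

        // Stop the recording mode; the recorded values have been written to the file
        logger.stopRecordingMode();
        System.out.println("Peak memory usage: " + logger.getMaxMemory() + " MB");
    }
}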

Conclusion

This is just to give you an overview of the upcoming version of SPMF. I hope to release it in the next week or two. By the way, if anyone has implemented some algorithms and would like them to be included in SPMF, please send me an e-mail at philfv AT qq DOT com.




The importance of using standard terminology in research papers

Today, I will talk about the importance of using standard terminology in research papers in computer science. The idea to talk about this on the blog came after reading an interesting letter about research on optimization called “Metaphor‑based metaheuristics, a call for action: the elephant in the room” by Aranha et al. (DOI: 10.1007/s11721-021-00202-9).

This paper explains that in the field of optimization, there has been a growing list of articles in the last decade proposing seemingly new approaches for optimization, explained using a wide range of metaphors, some related to animals (e.g. bats, grey wolves, termites, spiders), natural phenomena (e.g. invasive weeds, the big bang, river erosion), and many other weird sources of inspiration (e.g. how musicians play music together, how interior design is carried out, and the political behavior of countries).

A key issue pointed out by the authors and other researchers is that many metaphor-based optimization algorithms introduce new terminology that is unnecessary to explain the new algorithms, as they could be explained more simply using the existing terminology. For example, it was shown by Camacho-Villalon (DOI: 10.1007/s11721-019-00165-y) that some optimization algorithms, such as Intelligent Water Drops (IWD) optimization, are nothing but a special case of Ant Colony Optimization (ACO). However, the terminology is changed: the pheromone of ACO is called soil in IWD, ants are water drops, and so on. Another example is black hole optimization, which was shown to be a special case of particle swarm optimization.

The main problem with authors proposing seemingly new algorithms using non-standard terminology is, as Aranha et al. explain: “(i) creating confusion in the literature, (ii) hindering our understanding of the existing metaphor-based metaheuristics, and (iii) making extremely difficult to compare metaheuristics both theoretically and experimentally.”

This problem has become quite big in optimization research, with several papers proposing new metaphors that are unrealistic or unnecessary to explain small modifications to existing algorithms, so as to publish more papers with little innovation. However, this problem also appears in other fields of computer science where researchers use non-standard terminology in their papers. As a result, it often becomes difficult to verify where an idea truly came from, some work may be duplicated, and finding other papers related to an idea can become quite difficult (if several papers use different terminology).

This is why it is important to always use standard terminology when writing a new paper, and also to clearly indicate the relationship with previous papers and give credit where credit is due. This helps the research community by making it easier to find papers and to understand the relationships between them.

Hope that this has been an interesting blog post. If you have time, you may read the papers mentioned above. They are quite interesting and highlight this issue.




UDML 2024 @ PAKDD 2024 (deadline extended)

This is a short blog post to let you know that the deadline for submitting your papers to the UDML 2024 workshop at the PAKDD 2024 conference has been extended to the 7th of February.

Website: https://www.philippe-fournier-viger.com/utility_mining_workshop_2024/

Note that this year, all accepted papers from UDML 2024 will be invited for an extended version in a special issue of the Expert Systems journal.

So this is a very good opportunity for your papers at PAKDD!

And happy new year 2024!




Your social network on DBLP as a graph

Today, I discovered an interesting function of DBLP, which is to draw your social network as a graph (assuming that you have a DBLP page). Using this feature is simple: open your DBLP webpage, and then click here at the bottom of the page:

Then, your social network will be displayed (it can take a little while). For example, this is mine:

What is interesting is that it shows not only the direct co-authorship links, but also some transitive links, thus highlighting some potential connections that one could create through one’s current network.

In the above picture, the graph is quite dense since I have 390 co-authors on DBLP.

By observing this graph, we can also see some strange structures like this one:

This structure seems too perfect (all the authors are connected to each other). Thus, I investigated why. The reason is simple: it is a paper that I participated in, which had 8 authors, most of whom were not from computer science. Thus, most of the authors of that paper had only one paper on DBLP, which was that same paper.

There is also a dense cluster here:

which is mostly European researchers.

I just wanted to share this interesting function with you in this blog post, as I discovered it today (but it might have been available for a while!).
