In this blog post, I will discuss the vision of the Semantic Web that was proposed in the early 2000s, and why it failed. Then, I will explain how that vision has largely been replaced today by data mining and machine learning techniques.
What is the Semantic Web?
The concept of the Semantic Web was proposed as an improved version of the World Wide Web. The goal was to create a Web where intelligent agents could understand the content of webpages to provide useful services to humans or interact with other intelligent agents.
There was, however, a big challenge to achieving this goal: most of the data on the Web is unstructured text. It was thus considered difficult to design software that could understand and process this data to perform meaningful tasks.
Hence, the vision of the Semantic Web proposed in the early 2000s was to use various languages to add metadata to webpages, which would then allow machines to understand the content of those pages and reason about it.
Several languages were designed, such as RDF and OWL (with its OWL Lite, OWL DL, and OWL Full variants), as well as query languages like SPARQL. The knowledge described using these languages is organized into ontologies. The idea of an ontology is to describe the concepts occurring in documents at a high level, such as car, truck, and computer, and then to link these concepts to webpages or other resources. Based on these ontologies, a software program could use a reasoning engine to reason about the knowledge in webpages and perform various tasks, such as finding all car dealers in a city that sell second-hand blue cars.
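To make this concrete, here is a toy sketch of the core idea: knowledge stored as subject-predicate-object triples (the RDF data model) plus a tiny bit of reasoning over a class hierarchy, answering the kind of query a SPARQL engine would run against real RDF data. All the data, class names, and predicate names below are invented for illustration; they do not come from any real ontology.

```python
# Knowledge as subject-predicate-object triples (the RDF data model, simplified).
# All names here are made up for the sketch.
triples = {
    ("UsedCar", "subClassOf", "Car"),
    ("Car", "subClassOf", "Vehicle"),
    ("dealerA", "type", "CarDealer"),
    ("dealerA", "locatedIn", "Shenzhen"),
    ("dealerA", "sells", "car1"),
    ("car1", "type", "UsedCar"),
    ("car1", "color", "blue"),
}

def subclasses_of(cls):
    """All classes that are (transitively) subclasses of cls, including cls itself."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == "subClassOf" and o in result and s not in result:
                result.add(s)
                changed = True
    return result

def instances_of(cls):
    """Instances of cls, taking the subclass hierarchy into account."""
    classes = subclasses_of(cls)
    return {s for s, p, o in triples if p == "type" and o in classes}

# "Find dealers in Shenzhen that sell a blue second-hand car" -- the kind of
# question the Semantic Web vision expected agents to answer automatically.
blue_used_cars = {c for c in instances_of("UsedCar")
                  if (c, "color", "blue") in triples}
dealers = {d for d, p, o in triples
           if p == "sells" and o in blue_used_cars
           and (d, "locatedIn", "Shenzhen") in triples}
print(dealers)  # {'dealerA'}
```

Note that even this toy reasoner must repeatedly scan the whole triple set; real inference engines for OWL do far more work, which hints at the scalability problems discussed below.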
The fundamental problems of the Semantic Web
So what was wrong with this vision of the Semantic Web? Many things:
- The languages for encoding metadata were too complex. Encoding metadata was time-consuming and error-prone, and the proposed languages for adding metadata to webpages and resources were difficult to use. Despite the availability of some authoring tools, describing knowledge was not easy. I learned to use OWL and RDF during my studies, and it was complicated because OWL is based on formal logic. Learning OWL thus required training, and it is very easy to misuse the language if you do not understand the semantics of its operators. It was therefore unrealistic to expect such a complicated language to be used at a large scale on the Web.
- The reasoning engines based on logic were slow and could not scale to the size of the Web. Languages like OWL are based on logic, in particular description logics. Why? The idea was that this would make it possible to use inference engines to reason logically about the knowledge found in webpages. However, most of these inference engines are very slow. For my master's thesis in the early 2000s, reasoning on an OWL file with just a few hundred concepts using state-of-the-art inference engines was already slow. This clearly could not scale to a Web with billions of webpages.
- The languages were very restrictive. Another problem is that since some of these languages were based on logic, they were very restrictive. Describing very simple knowledge worked fine, but modeling anything complicated properly was very hard, and often the language simply would not allow one to express it.
- Metadata is not neutral, and can easily be tweaked to "game" the system. Adding metadata to describe objects can work in a controlled environment, such as describing books in a library. On the Web, however, malicious actors can game the system by writing incorrect metadata. For example, a website could write misleading metadata to achieve a higher ranking in search engines. For this reason, relying on metadata attached to webpages cannot work, and it is actually why most search engines today do not rely much on metadata to index documents.
- Metadata quickly becomes obsolete and must constantly be updated.
- Metadata interoperability between many websites or institutions is hard. The idea of describing webpages using common concepts to enable reasoning may sound great. But a major problem is that the various websites would then have to agree to use the same concepts, which is very hard to achieve. In real life, what would instead happen is that many people would describe their webpages in inconsistent ways, and intelligent agents would not be able to reason over these webpages as a whole.
For these reasons, the Semantic Web was never achieved as envisioned (that is, by describing webpages with metadata and reasoning over them with logic-based inference engines).
What has replaced that vision of the Semantic Web?
In the last decades, we have seen the emergence of data mining (also called big data or data science) and machine learning. Using data mining techniques, it is now possible to extract knowledge directly from text. In other words, it has become largely unnecessary to write metadata and knowledge by hand using complicated authoring tools and languages.
Moreover, using predictive data mining and machine learning techniques, it has become possible to automatically perform complex tasks on text documents without even extracting knowledge from them. For example, there is no need to specify an ontology or metadata about a document to translate it from one language to another (although this requires training data from other documents). Thus, the focus has shifted from reasoning with logic to using machine learning and data mining techniques.
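To illustrate the contrast with the ontology sketch above, here is a minimal data-driven alternative: a classifier that learns to categorize documents directly from example texts, with no hand-written metadata at all. This is a toy Naive Bayes-style word-count model on invented training data, not a production pipeline.

```python
# A toy illustration of learning from text instead of writing metadata by hand.
# The training examples and labels are invented for this sketch.
from collections import Counter

train = [
    ("buy this cheap used car now", "cars"),
    ("blue car for sale second hand", "cars"),
    ("new laptop with fast processor", "computers"),
    ("computer memory and processor upgrade", "computers"),
]

# Count how often each word appears under each label.
counts = {}
for text, label in train:
    counts.setdefault(label, Counter()).update(text.split())

def classify(text):
    """Pick the label whose word counts best match the document (add-one smoothing)."""
    def score(label):
        total = sum(counts[label].values())
        s = 1.0
        for w in text.split():
            s *= (counts[label][w] + 1) / (total + 1)
        return s
    return max(counts, key=score)

print(classify("used blue car"))         # cars
print(classify("fast computer laptop"))  # computers
```

The key point is that no one had to encode what a "car" or a "computer" is in a formal language; the model picks up the regularities from the data itself.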
It must be said, though, that the languages and tools developed for the Semantic Web have had some success, but at a much smaller scale than the Web. For example, they have been used internally by some companies. Research on logic, ontologies, and related concepts also remains active, with various applications and challenges that remain to be studied. But the main point of this post is that the vision of using these technologies at the scale of the Web to create the Semantic Web did not happen. Some of them can nonetheless be useful at a smaller scale (e.g., reasoning about books in a library).
So this is all I wanted to discuss for today. I hope it has been interesting 😉 If you want to read more, there are many other articles on this blog, and you can also follow this blog on Twitter @philfv .
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data mining software SPMF, offering more than 145 data mining algorithms.