The Semantic Web vision seems to be gaining a foothold rapidly, and abundant material is available to help understand it.
However, I believe the problems of how applications deal with data today are not adequately highlighted. Often the benefit of a changed implementation is not appealing until the shortcomings of the current approach are made clear.
Before we get to details, let us think for a moment about what content is and what its characteristics are.
Content and Its Limitations
Content is a bunch of bytes whose meaning is subjective. Content closely tied to an implicit context is of limited usefulness, especially if the context by which the content gains meaning resides only in a human user. This means that machines have no option but to treat any data without contextual information as a bunch of raw bytes; they are helpless to do anything ‘smart’ with it.
So how do we invest content with some context so that machines can make better sense of it? The answer is metadata: data about data, whose sole purpose is to explain a little more about another piece of data.
Metadata can be, and is often, used by both humans and machines in interpreting information.
Current Means of Capturing Metadata
Now there are many means of capturing this metadata. Documents store information in custom properties of the file or in the file name or in a title within the document. The storage format of the documents could vary; popular examples include Microsoft Word .doc, Excel .xls, Adobe Acrobat .pdf and so on. Each of these is a unique format.
Databases capture this information via the DB Schema. Each table and column has a domain specific name. Each column has a data-type assigned to it that constrains the kind of values each field can accept. Most programming languages typically capture information in terms of data types for variables and classes that are domain abstractions. Data formats like XML have richer metadata expression constructs like XML Schema.
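As a concrete illustration of the point above, the schema a database holds is itself queryable metadata. A minimal sketch using Python's built-in sqlite3 module and a hypothetical `customer` table:

```python
import sqlite3

# In-memory database with a small, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT
    )
""")

# The schema is metadata the engine exposes: domain-specific names
# paired with the data types that constrain each column's values.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]
print(columns)  # [('id', 'INTEGER'), ('name', 'TEXT'), ('email', 'TEXT')]
```

XML Schema, class definitions, and file properties play the same role in their respective worlds; only the mechanism differs.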
Which leads us to the question: if there are already multiple ways to capture metadata, why invent something like the SemWeb at all? Why require data to be captured in RDF and metadata via OWL?
Schema without flow
Typically the schema information is not captured in a consistent format. Databases have it implicitly captured in how the tables and columns are named and typed. Data that flows between different layers of an application stores metadata in its own format. OO languages capture it in classes and the members they contain.
Every hop of data between two layers of an application, say from an ASP.NET front-end to a MySQL server, will have to undergo a translation. A single class can be, and often is, stored in multiple tables of an RDBMS.
Developers who build OO applications that need persistence are familiar with the problem of Object Relational Impedance Mismatch. The world view of an OO language run-time and that of an RDBMS persistence engine are distinctly different.
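The mismatch can be sketched in a few lines. Here is a hypothetical `Customer` class whose single object, because it carries a multi-valued field, must be spread across two relational tables (names and schema are illustrative assumptions, not any real ORM's mapping):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Customer:
    name: str
    emails: list  # one object, but a multi-valued field in the OO world view

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE customer_email (customer_id INTEGER, email TEXT)")

def save(c: Customer) -> int:
    # ...which the relational world view splits across two tables.
    cur = conn.execute("INSERT INTO customer (name) VALUES (?)", (c.name,))
    cid = cur.lastrowid
    conn.executemany(
        "INSERT INTO customer_email (customer_id, email) VALUES (?, ?)",
        [(cid, e) for e in c.emails],
    )
    return cid

cid = save(Customer("Ada", ["ada@example.org", "ada@work.example"]))
rows = conn.execute(
    "SELECT email FROM customer_email WHERE customer_id = ?", (cid,)
).fetchall()
```

The knowledge of how the class maps to the tables lives only in the `save` function; neither the class nor the schema carries it.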
Extrapolating this scenario we find that schema information has no flow.
It's worse than you think
There is another trouble with schema/metadata as it is produced and consumed today: its scarcity. Schema information is usually demanded at the point of ingestion from external sources, and it is even carried across the various application layers, albeit with some translation and sometimes with loss of fidelity.
But this discipline of expecting content to retain schema is not maintained across the entire production pipeline. Especially at the last mile of the content life cycle, at the actual point of consumption, content often loses its grammar.
Imagine every application that you use, say the Office suite of products, any rich client application, most web applications: all concentrate on displaying the data with some visual cues on what each field means.
The metadata used when storing this information, or when applying business rules to the data, is seldom revealed to you. It is only via error messages and other exceptions that one gets to learn the assumptions an application makes about the data.
Applications that export data do provide some metadata, but not with the same expressiveness with which it is stored and manipulated within the application.
This is quite ironic in a way: precisely where you would expect content to carry all of its original context is where it turns into raw bytes.
We need to carry the context, the metadata, of content until the point of consumption. Not as something we display to the user, but as an underlay to the content: something right beneath the surface that can be leveraged with simple tools.
Does metadata percolate today?
Yes, it does. Consider how microformats are embedded within HTML. This allows simple domain-specific metadata to be scraped by interested applications and custom actions to be taken on the fly.
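A small sketch of that scraping, assuming a fragment marked up with the classic hCard microformat, where class names like `fn` and `org` carry the meaning (the fragment and scraper are illustrative, built only on Python's standard html.parser):

```python
from html.parser import HTMLParser

# A hypothetical fragment using hCard class names as metadata.
html = ('<div class="vcard"><span class="fn">Ada Lovelace</span>'
        '<span class="org">Analytical Engines Ltd</span></div>')

class HCardScraper(HTMLParser):
    """Collects the text of elements whose class is a known hCard field."""
    FIELDS = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.current = None
        self.card = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in self.FIELDS:
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.card[self.current] = data
            self.current = None

scraper = HCardScraper()
scraper.feed(html)
print(scraper.card)  # {'fn': 'Ada Lovelace', 'org': 'Analytical Engines Ltd'}
```

The same page that renders for a human yields structured fields to a machine, which is exactly the underlay argued for above.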
And consider the textbook use case of the content negotiation dance between the browser and the web server.
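That dance is driven entirely by metadata in HTTP headers: the client declares which representations it can consume, and the server labels its reply with a Content-Type. A minimal sketch of the client side, using Python's urllib against a hypothetical URL (no request is actually sent here):

```python
from urllib.request import Request

# The Accept header is metadata about the client's capabilities;
# the server uses it to choose a representation of the same resource.
req = Request(
    "http://example.org/report",  # hypothetical resource
    headers={"Accept": "application/json, text/html;q=0.8"},
)
print(req.get_header("Accept"))
```

The content itself never changes identity; the metadata negotiates its form.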
These examples clearly indicate that it is possible to use web technologies to provide a significant amount of metadata along with the content.
So what needs to change?
We need to have metadata available around the content that user interfaces display. Either as something hidden, right below the surface, or via APIs that can easily be accessed.
Metadata needs to be available in a standardized format, or at least be translatable into one. Use of an industry-standard schema is preferable, but in its absence, having some consistent format is better than none.
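This is where RDF's appeal lies: it reduces all metadata, whatever the domain, to one uniform shape, the (subject, predicate, object) triple. A library-free sketch with hypothetical resource URIs (the predicates shown are Dublin Core terms, a common choice for document metadata):

```python
# Every fact, regardless of domain, has the same three-part shape.
triples = [
    ("http://example.org/doc/42", "http://purl.org/dc/terms/title",   "Q3 Sales Report"),
    ("http://example.org/doc/42", "http://purl.org/dc/terms/creator", "Ada Lovelace"),
    ("http://example.org/doc/42", "http://purl.org/dc/terms/format",  "application/pdf"),
]

def describe(subject):
    """Every property of any resource is reachable the same generic way."""
    return {p: o for s, p, o in triples if s == subject}

print(describe("http://example.org/doc/42"))
```

Because the shape never varies, a consumer needs no per-application parser; the metadata flows between layers without translation, which is precisely the "flow" the schema formats above lack.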
Without this foundation of standardized and ubiquitous metadata, the dream of a global web of data that can be processed intelligently will continue to remain a dream, and data will continue to sit in silos, forever withholding its insights from us.