Thursday, December 20, 2007

SEMANTIC INTEGRATION

Let’s consider the following scenario to understand what Semantic Integration is and why it is needed.

Scenario: I have to build a site that lists flights ordered by cost, based on user-selected criteria such as class, flight duration, departure time and route. Because the underlying data changes continuously, I have to retrieve the data dynamically from different airline providers and analyze it to answer each user query. At the same time, I have to keep integrating new airline providers.

Big Question: Is the data warehousing approach a suitable solution to the above problem? Consider the effort required to keep a warehouse consistent while dealing with dynamic, heterogeneous data coming from databases, XML, HTML and so on, since airline companies can provide information in different ways. Also estimate the effort required to achieve the scalability needed for integrating new airlines: you may end up changing the data extraction and transformation logic, and often even the application itself. And do we have to retrieve data from all airlines every time? We have to minimize network traffic and computation as much as possible.

Doesn’t it seem that the warehousing approach is not the best solution to the above problems? Consider the following architecture:



Figure 1: Architecture facilitating semantic integration of heterogeneous information sources (from http://wwwis.win.tue.nl/~houben/respub/wiiw01.pdf)

The figure above shows an architecture for Semantic Integration, which is based on the meaning of the information in the data rather than on its structure. It has the following main components:
Resource Broker:

  • Translates queries to the target format (e.g., SQL).
  • Converts ontological terms to terms used by the resource (e.g., database schema).
  • Translates responses back into the ontological terms used in the query.
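
The three Resource Broker duties above can be illustrated with a minimal sketch. It assumes a single airline whose data sits in a SQL table; the table name, the column names and the ontology terms (origin, destination, price, cabinClass) are all hypothetical and chosen only for illustration, and the cited architecture does not prescribe this code.

```python
import sqlite3

# Hypothetical mapping from shared ontology terms to ONE airline's column names.
TERM_TO_COLUMN = {
    "origin": "dep_airport",
    "destination": "arr_airport",
    "price": "fare_usd",
    "cabinClass": "cls",
}


class ResourceBroker:
    """Wraps one data source and exposes it in ontology terms."""

    def __init__(self, connection, table):
        self.connection = connection
        self.table = table

    def to_sql(self, wanted_terms, filters):
        """Translate an ontology-level query into SQL for this source."""
        columns = ", ".join(TERM_TO_COLUMN[t] for t in wanted_terms)
        where = " AND ".join(f"{TERM_TO_COLUMN[t]} = ?" for t in filters)
        sql = f"SELECT {columns} FROM {self.table}"
        if where:
            sql += f" WHERE {where}"
        return sql, list(filters.values())

    def query(self, wanted_terms, filters):
        """Run the translated query and return rows keyed by ontology terms."""
        sql, params = self.to_sql(wanted_terms, filters)
        rows = self.connection.execute(sql, params).fetchall()
        # Translate the response back into the terms used in the query.
        return [dict(zip(wanted_terms, row)) for row in rows]


def demo_broker():
    """Build a broker over an in-memory table with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flights "
        "(dep_airport TEXT, arr_airport TEXT, fare_usd REAL, cls TEXT)"
    )
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?, ?, ?)",
        [
            ("AMS", "JFK", 420.0, "economy"),
            ("AMS", "JFK", 1800.0, "business"),
            ("LHR", "JFK", 390.0, "economy"),
        ],
    )
    return ResourceBroker(conn, "flights")
```

A caller only ever sees ontology terms: `demo_broker().query(["destination", "price"], {"origin": "AMS"})` asks in shared vocabulary and gets dictionaries keyed by that same vocabulary, while the airline-specific column names stay inside the broker.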


Mediator/Domain Broker:

  • Accepts queries from users (via the query generator) or other applications.
  • Partitions the query into sub-queries.
  • Distributes the sub-queries to the appropriate Resource Brokers only where needed, keeping network traffic to a minimum.
  • Merges the results from the various Resource Brokers and passes the combined response back to the requestor.

Here the Resource Description Framework Schema (RDFS) is used to represent all schema-level metadata (both domain and infrastructure), and the Resource Description Framework (RDF) is used to represent all instance information. RDFS imposes a standard that provides consistency and better interoperability.
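
As a rough sketch of the Mediator steps listed above (accept, partition, distribute, merge), the following assumes each Resource Broker advertises which routes it serves, so the mediator can skip brokers that cannot answer. The class names, the `routes` attribute and the sample data are hypothetical; `StaticBroker` is only a stand-in for a broker that would really query a live source.

```python
class StaticBroker:
    """Stand-in for a Resource Broker; a real one would query a live source."""

    def __init__(self, name, routes, flights):
        self.name = name
        self.routes = routes    # set of (origin, destination) pairs served
        self.flights = flights  # list of dicts already in ontology terms
        self.calls = 0          # counts how often this broker was contacted

    def query(self, origin, destination):
        self.calls += 1
        return [
            f for f in self.flights
            if f["origin"] == origin and f["destination"] == destination
        ]


class Mediator:
    """Accepts a query, fans it out to relevant brokers, merges the answers."""

    def __init__(self, brokers):
        self.brokers = brokers

    def cheapest_first(self, origin, destination, cabin_class=None):
        # Partition: one sub-query per broker that can serve the route.
        relevant = [
            b for b in self.brokers if (origin, destination) in b.routes
        ]
        # Distribute the sub-queries and merge the partial results.
        merged = []
        for broker in relevant:
            for flight in broker.query(origin, destination):
                merged.append({**flight, "airline": broker.name})
        if cabin_class is not None:
            merged = [f for f in merged if f["cabinClass"] == cabin_class]
        # Return the combined response ordered by cost.
        return sorted(merged, key=lambda f: f["price"])
```

Adding a new airline in this sketch means adding one more broker to the list; the mediator itself does not change, and brokers that do not serve the requested route are never contacted.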

Concept Model

  • Interlinks metadata to yield more meaningful information.
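
To make the idea of interlinked metadata a little more concrete, here is a small sketch that uses plain Python tuples as stand-ins for RDF triples rather than a real RDF library. `rdfs:subPropertyOf` is a genuine RDFS term; every other name (the `airA:`, `airB:` and `travel:` prefixes, the properties and the flight identifiers) is made up for illustration.

```python
# Schema-level statements (RDFS-style) link each airline's own property
# to a shared concept; instance-level statements (RDF-style) carry the data.
TRIPLES = [
    ("airA:fare_usd", "rdfs:subPropertyOf", "travel:price"),
    ("airB:ticketPrice", "rdfs:subPropertyOf", "travel:price"),
    ("airA:KL641", "airA:fare_usd", 420.0),
    ("airB:BA117", "airB:ticketPrice", 390.0),
]


def sub_properties(triples, prop):
    """Return prop plus every property declared (transitively) beneath it."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "rdfs:subPropertyOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def values_of(triples, prop):
    """Collect (subject, value) pairs for prop and all of its sub-properties."""
    props = sub_properties(triples, prop)
    return sorted((s, o) for s, p, o in triples if p in props)
```

Asking for `values_of(TRIPLES, "travel:price")` returns the fares of both airlines even though neither source uses the term `travel:price` itself; the link lives in the metadata, not in the structure of either source.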

For the scenario under consideration, semantic integration with an on-demand (lazy) approach addresses these problems better than the warehousing approach.

Companies have wrestled with the integration of legacy systems for some time, but with the explosion of web-based resources and the advent of Web 2.0, the interoperability of information has become an even greater problem. The essence of this problem is the implicit and frequently inconsistent semantics of the information. Because the web has made access to information much easier, the potential for companies to leverage information has grown tremendously. For the most part, however, each information resource was created for a single purpose, whereas the power of integration lies in merging information, particularly in unanticipated ways.

(Ref: http://www.cs.rutgers.edu/~shklar/www11/final_submissions/paper3.pdf)