ERCIM News No.35 - October 1998
Metadata: An Overview and some Issues
by Keith G Jeffery
Information systems today face large problems. There is a need to manage and exploit the explosion of information appearing on multiple web sites with highly variable standards of data quality and currency - and, of course, there is the need to know that such data sources exist at all. This has four major aspects: Data Quality, Query Quality, Answer Quality and the Integration of Heterogeneous Sources.
Data quality: Data quality can be improved by better data collection facilities (including help and explanation with examples) and by better validation controlled by constraints, with automated conversion of unit values where required.
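As an illustration, constraint metadata can drive both validation and unit conversion at collection time. The sketch below is a minimal, invented example - the attribute, units and ranges are assumptions, not taken from any particular system:

```python
# Hypothetical constraint metadata for one attribute: a permitted range in the
# canonical unit (Kelvin), plus conversions for other units a contributor
# might supply. All names and values are illustrative.
TEMPERATURE_CONSTRAINT = {
    "unit": "K",
    "range": (0.0, 400.0),
    "conversions": {
        "C": lambda v: v + 273.15,  # Celsius -> Kelvin
        "K": lambda v: v,           # already canonical
    },
}

def validate_and_convert(value, unit, constraint):
    """Convert the value to the canonical unit, then check the range constraint."""
    convert = constraint["conversions"].get(unit)
    if convert is None:
        raise ValueError("unknown unit: %s" % unit)
    canonical = convert(value)
    lo, hi = constraint["range"]
    if not lo <= canonical <= hi:
        raise ValueError("value %r outside range %r" % (canonical, constraint["range"]))
    return canonical
```

A value entered as 25 C would be stored as its Kelvin equivalent, while an out-of-range value would be rejected at collection time rather than polluting the database.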
Query quality: Query quality can be improved not only by classical query optimisation using knowledge about the database size and structure, but also by assisting the user to formulate the query that best meets the requirement - by means of online help, explanation and examples.
Answer Quality: The answer to a query commonly includes values and structures that are unfamiliar to the user; explanations and help - hyperlinked descriptions of units, precision, calibration or of similar terms - could help the user to understand the results better.
Integration of Heterogeneous Sources: First, there is the need to know that a source exists and to know something of its characteristics. Heterogeneous data sources commonly have disparate schemas and there is a need to understand the differences, even when apparently reconciled by one of the integration techniques.
The Solution - Metadata
For all of the above to be realised, there is one essential and common ingredient: metadata. Let us consider briefly how it may be used in each of the cases:
Metadata for Data Collection: Metadata is necessary for validation through schema and constraints, using value-sets and domain range limits and even more sophisticated logic tests. It is necessary for online help / explanation, and - in the form of a multilingual thesaurus - for translation.
Metadata for Queries: Metadata is necessary for validation through schema and constraints, for online help / explanation, for translation, and for optimisation: both user assistance in proposing more appropriate terms (synonyms, super- / sub-terms) and performance optimisation since the metadata stores the structural indexes into the databases, optimal access paths, optimal query segmentation and distribution for parallelism and minimal network transfers.
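The user-assistance side of this - proposing synonyms and super-terms - can be sketched with a toy thesaurus held as metadata. The terms and structure below are invented for illustration only:

```python
# A toy thesaurus as metadata: each term maps to its synonyms and to broader
# (super-) terms one level up. The vocabulary here is made up.
THESAURUS = {
    "car": {"synonyms": ["automobile"], "broader": ["vehicle"]},
    "vehicle": {"synonyms": [], "broader": ["transport"]},
}

def expand_query_terms(term, thesaurus):
    """Return the original term plus its synonyms and one level of
    broader terms, so the query system can propose them to the user."""
    entry = thesaurus.get(term, {"synonyms": [], "broader": []})
    return sorted({term, *entry["synonyms"], *entry["broader"]})
```

A query on "car" would then also be offered "automobile" and "vehicle" as candidate terms, widening recall where the user's vocabulary differs from the database's.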
Metadata for Answers: Metadata from the schema and associated metadata as domain ontology information (in a KBS) is necessary for answer consolidation, for online help / explanation, and translation.
Metadata for Integration: Metadata can catalogue sources of information at a high level so that they become visible. The well-known web indexing systems such as [AltaVista] or [Excite] do this in a very general way. An example in the field of CRIS (Current Research Information Systems) is the Bergen system [BergenCRIS] which points to structured information systems for CRIS. Metadata, when used with an inferencing mechanism, is the key resource to find matching data structure and content despite heterogeneous representations and languages so allowing integration across heterogeneous data sources.
Similarly, metadata provides the information necessary for customisation of standard products allowing integration into the desktop / office environment.
Metadata
At present most Information Systems make very limited use of metadata. Since metadata supports all the user-friendly, easy-to-use features and the extended range of information features outlined above, perhaps this explains why these Information Systems have been less successful than the few information systems that really use metadata. Having outlined how useful metadata could be for Information Systems, let us consider exactly what metadata is. The aim is to decide what kinds of metadata are useful for Information Systems and how best to generate, maintain and use metadata for the benefit of end-users.
Metadata is data about data. It is therefore of great utility:
to any Information System which aspires to be more than a simple, inflexible unfriendly information source - use of metadata can allow dynamic optimisation and flexibility and allow integration over heterogeneous distributed information
to any end-user requiring help, explanation, data quality assurance, assistance in finding relevant information, assistance in integrating information from heterogeneous sources.
Distributed RDBMSs use metadata extensively. Web indexing systems (such as [AltaVista] or [Excite]) are based on sophisticated metadata. Metadata is clearly of great importance. Perhaps the earliest use of metadata was in computerised library catalogue systems based on IR techniques, where the catalogue card record is metadata describing the real data in the book or other primary publication. Sadly, this same field of endeavour is where metadata has hardly been developed further - and yet this is the very area of Information Systems technology where metadata could exert the greatest leverage.
There have been many attempts to standardise metadata structure and content for specific application areas. In the world of libraries the [MARC] standard for catalogue records allows some interworking. Unfortunately there are more than 50 major variants, so interworking is not as easy as one might expect. Similarly, in many scientific areas - eg space science, particle physics - there are metadata standards. In the world of commerce there is [EDI] / [EDIFACT]. There have been attempts to agree a standard European Patient Medical Record. Perhaps the most successful is in the field of engineering: the EXPRESS language describing the STEP data exchange format, with commercial support [STEPTOOLS].
The increasing requirement for interworking among systems handling grey electronic literature has caused the internet community to propose the [Dublin Core] as a metadata standard and, subsequently, to provide converters between the standards [UKOLN]. In the field of CRIS a common metadata form for exchange has been proposed and is now used for metadata catalogues in the ERGO Project.
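A Dublin Core catalogue record is essentially a set of element/value pairs drawn from a small element set (Title, Creator, Subject, Date and so on). The sketch below shows one common embedding - HTML meta tags - using an invented record:

```python
# An invented catalogue record using a few Dublin Core elements.
dc_record = {
    "DC.Title": "Metadata: An Overview and some Issues",
    "DC.Creator": "Keith G Jeffery",
    "DC.Subject": "metadata; information systems",
    "DC.Date": "1998-10",
}

def to_meta_tags(record):
    """Render the record as HTML <meta> tags, one common way of embedding
    Dublin Core metadata in a web page for indexers to harvest."""
    return "\n".join(
        '<meta name="%s" content="%s">' % (name, value)
        for name, value in sorted(record.items())
    )
```

The point of the standard is that an indexing or cataloguing system encountering any page can rely on the element names, whatever the publishing system behind it.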
The great spread of the Web has dramatically increased the requirement for metadata standards to allow a global browsing and querying capability. The creation of the [W3C] (World Wide Web Consortium) provided the forum for intense work on metadata [W3Cmetadata]. The main results have been PICS (Platform for Internet Content Selection), which allows categorisation of Web information in a way similar to film censorship, and - following the Netscape MCF (Meta Content Framework) and Microsoft XML-Data proposals - the W3C standard named RDF (Resource Description Framework, which is XML based), which has gained widespread acceptance and subsumes PICS.
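An RDF description attaches property/value statements to a resource identified by URL. The sketch below builds a minimal RDF/XML fragment combining RDF with Dublin Core properties; the resource URL is invented, and the namespaces are the ones these standards eventually settled on:

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and the Dublin Core element set.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

# One rdf:Description about an invented resource, carrying a dc:title property.
root = ET.Element("{%s}RDF" % RDF_NS)
desc = ET.SubElement(root, "{%s}Description" % RDF_NS,
                     {"{%s}about" % RDF_NS: "http://example.org/article"})
title = ET.SubElement(desc, "{%s}title" % DC_NS)
title.text = "Metadata: An Overview and some Issues"

xml_text = ET.tostring(root, encoding="unicode")
```

Because the syntax is XML, generic XML tooling can parse the description, while the RDF model gives the property/value statements a common interpretation across vocabularies.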
Kinds
Here we propose that there are three main kinds of metadata: schema, navigational and associative.
Schema metadata is an intensional description of extensional instances. Typically a schema consists of: database {name, size, security authorisations}, attributes {name, type, constraints}. Some of the constraints concern the attribute domain; some are inter-attribute and as such may express relationships. The schema intension has a formal logic relationship to the data instances. This is important in ensuring data quality, and it also provides a formal basis for systems.
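A minimal sketch of the schema acting as an intension that every extensional instance must satisfy - the attribute names, types and constraints below are invented:

```python
# Schema metadata: attribute name -> (declared type, domain constraint).
SCHEMA = {
    "name": (str, lambda v: len(v) > 0),
    "year": (int, lambda v: 1900 <= v <= 1998),
}

def validate_instance(instance, schema):
    """Check an extensional instance against the intensional schema:
    every attribute must be present, of the declared type, and satisfy
    its domain constraint. Returns a list of violations (empty if valid)."""
    errors = []
    for attr, (attr_type, constraint) in schema.items():
        if attr not in instance:
            errors.append("missing attribute: %s" % attr)
        elif not isinstance(instance[attr], attr_type):
            errors.append("wrong type for %s" % attr)
        elif not constraint(instance[attr]):
            errors.append("constraint violated for %s" % attr)
    return errors
```

The formal relationship is exactly this: an instance belongs to the database only if it satisfies every predicate the schema metadata declares.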
Navigational metadata provides information on how to get to an information resource. Mechanisms include: filename, DB name + navigational algorithm, DB name + predicate (query), URL (Uniform Resource Locator), URL + predicate (query), or various combinations of these. Such access paths may also be obtained via a web-indexing mechanism (such as [AltaVista] or [Excite]), which itself makes extensive use of metadata. Navigational metadata has no formal logic relationship to the data instances.
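The mechanisms listed above can all be represented uniformly as locator/predicate pairs. The sketch below is one invented encoding; all locators and predicates are illustrative:

```python
# Navigational metadata descriptors: a locator plus an optional predicate
# (query), mirroring the mechanisms listed above. All values are invented.
NAVIGATIONAL_METADATA = [
    {"kind": "filename", "locator": "/data/survey1998.dat", "predicate": None},
    {"kind": "database", "locator": "projectsDB",
     "predicate": "SELECT * FROM projects WHERE year = 1998"},
    {"kind": "url", "locator": "http://example.org/cris", "predicate": "country=NO"},
]

def describe(descriptor):
    """Produce a human-readable access instruction from one descriptor."""
    if descriptor["predicate"] is None:
        return "%s: %s" % (descriptor["kind"], descriptor["locator"])
    return "%s: %s with predicate %r" % (
        descriptor["kind"], descriptor["locator"], descriptor["predicate"])
```

Note that nothing here constrains what the resource contains - which is precisely the sense in which navigational metadata lacks a formal logic relationship to the data instances.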
Associative metadata provides additional information for application assistance. The assistance may improve performance, accuracy or precision of the system and / or provide assistance to the end-user through a domain aware supportive user interface. The main kinds of associative metadata are:
- descriptive: catalogue record (eg [Dublin Core])
- restrictive: content rating (eg PICS) or security, privacy (cryptography, digital signatures) [W3C]
- supportive: dictionaries, thesauri, hyperglossaries [VHG], domain ontologies eg [PROTÉGÉ]
Associative metadata usually does not have a formal logic relationship to data instances although there may be systematic association relationships.
Metadata and Dataweb
In order to combine the benefits of universal access (the Web) with the benefits of data managed, with structure and quality, in a database, various teams have worked on linking Web and database systems. CLRC-RAL was early into this field and has experimented with several techniques since 1993, currently basing the departmental web on Microsoft ASP technology. Now much of the information available over the web is held and managed within databases linked to the web through CGI (Common Gateway Interface) and scripts in a language such as Perl or Tcl.
The data in these structured databases behind a web interface is essentially invisible to web indexing systems such as [AltaVista] or [Excite]. Since this is usually structured, managed, high-quality data, its use might be preferable to authored HTML pages. The problem is how to make it visible to web-indexing or information-cataloguing systems in a way that is universally acceptable and utilised.
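One conceivable approach, sketched below under invented names and fields, is to generate a crawlable index page from the database: each record contributes a metadata tag and a link back through the query interface, so the indexer sees the metadata even though the data itself stays in the database:

```python
# Invented database rows that would otherwise sit invisibly behind a query form.
ROWS = [
    {"id": 1, "title": "CRIS metadata formats", "date": "1998"},
    {"id": 2, "title": "Dataweb techniques", "date": "1998"},
]

def index_page(rows, base_url):
    """Build a simple HTML page that a web-indexing robot can crawl:
    one <meta> tag and one link per database record."""
    lines = ["<html><head>"]
    for row in rows:
        lines.append('<meta name="DC.Title" content="%s">' % row["title"])
    lines.append("</head><body>")
    for row in rows:
        lines.append('<a href="%s?id=%d">%s</a>' % (base_url, row["id"], row["title"]))
    lines.append("</body></html>")
    return "\n".join(lines)
```

Whether indexers would treat such generated pages as first-class sources is, of course, the open standardisation question the conclusion raises.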
Conclusion
The key to the Future of Information Systems is Metadata. However, there are serious issues to be addressed:
- standard form for metadata: the W3C RDF is general and uses XML as the language - is this sufficient?
- sub-forms of metadata by application domain: will they all be based on the same basic data model and language to allow cross-domain interoperation?
- progressive adoption of dataweb technology: is there a standard mechanism for making such structured and, hopefully, quality information sources visible on the web through metadata?
The set of articles within this special theme in ERCIM News documents all the aspects of metadata mentioned above. They cover data quality, query quality, answer quality and heterogeneous information integration, and they exhibit aspects of schema, navigational and associative metadata. These principles can be seen in the following ways: KINE uses a knowledge-based programming approach to hold metadata (see article), whereas GEN.LIB from CRCIM uses a programming library approach (see article). SARI and the W3C work (page 25) are concerned with RDF.
The use of metadata for enhanced query is discussed in the articles from ETH Zurich (see article) and the joint INRIA/FORTH work on Artemis (see article), the latter also using metadata for integration. Articles from ICS/FORTH on health care (see article) and RAL on ERGO and CERIF (see article) describe applications using metadata and Webstore from GMD (see article) details middleware for integration.
Several of the articles describe the use of intelligent agents (with associated persistent metadata stores) for reconciling heterogeneity and for assisting at other interfaces (eg query) - many see this as the way forward for using metadata to improve the effectiveness of information systems in a global setting.
Please contact:
Keith Jeffery - CLRC
Tel: +44 1235 44 6103
E-mail: kgj@rl.ac.uk