ERCIM News No.25 - April 1996 - CNR

DBT on Internet

by Eugenio Picchi, Lisa Biagini and Luca Fiorani


We describe the Internet versions of DBT, a textual database system developed in Pisa, at the Istituto di Linguistica Computazionale (ILC-CNR). The aim is to provide a system that allows linguists and language scholars to access and query textual archives located on servers throughout the world, offering them a range of search and language analysis tools.

Over the last ten years, DBT, a textual database system designed to meet the needs of literary and linguistic text analysis applications, has been widely adopted by the academic and research world in Italy and elsewhere. The procedures for structuring a machine-readable text in DBT format are so simple that a new text can be prepared for inclusion in a database within a few minutes. As a result, there are now thousands of texts structured in DBT form in the archives of universities and research institutes throughout Italy and beyond, and this number is growing rapidly.

The importance of making language and text resources reusable is strongly felt in the scientific community. There is thus a concerted move towards making existing resources available as widely as possible, while respecting copyright and intellectual property rights. For this reason we began to study the best way of allowing not only local users but also scholars working anywhere in the world to consult the geographically distributed DBT archives. The Internet was clearly the ideal medium for this.

Two distinct approaches have been developed to make the DBT system and DBT structured archives accessible over the network. In the first approach, known as DBTNET, a client-server version of the system has been created. It allows the user to directly access and query texts located on servers at geographically remote sites in the same way as when using the stand-alone DBT system. The client-server dialog uses the TCP/IP protocol. Our objective has been to offer the same user interface and the full range of sophisticated search and analysis capabilities of the stand-alone system. As much information as possible is maintained at the client site in order to keep client-server exchanges to a minimum, thus optimizing system response times.
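The idea of keeping information at the client site to reduce client-server exchanges can be illustrated with a minimal sketch. It is not the DBTNET implementation: the class `CachingClient`, the function `fake_server` and the two-line sample corpus are hypothetical, and the network request over TCP/IP is replaced by a plain function call so that the example runs on its own.

```python
from typing import Callable, Dict, List


class CachingClient:
    """Hypothetical client that remembers answers to avoid repeated requests."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        # `fetch` stands in for a request sent to a remote server over TCP/IP.
        self._fetch = fetch
        self._cache: Dict[str, List[str]] = {}
        self.round_trips = 0  # number of requests actually sent to the server

    def query(self, word: str) -> List[str]:
        key = word.lower()
        if key not in self._cache:
            # Only contact the server when the answer is not held locally.
            self.round_trips += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def fake_server(word: str) -> List[str]:
    """Assumed stand-in for a remote text archive: returns lines containing `word`."""
    corpus = [
        "nel mezzo del cammin di nostra vita",
        "mi ritrovai per una selva oscura",
    ]
    return [line for line in corpus if word in line.split()]


client = CachingClient(fake_server)
first = client.query("selva")
second = client.query("Selva")  # answered from the local cache
```

In this sketch the second query is served locally, so `client.round_trips` stays at one; a real client would also have to decide when cached results become stale.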

The alternative version, DBTWEB, developed in parallel, uses the HTTP protocol, the HTML formatting language and the most common Web browsers (Mosaic and Netscape). The main advantage of adopting this technology is that it facilitates navigation over the network. Information is made available in the form of pages of multimedia and hypertext data. The hypertext links point to other pages, which can be located anywhere in the network. These standards do not depend on the platform or computing system employed by the client, so they can be used directly on any platform that can communicate with the Internet. This has contributed greatly to their popularity.

DBTWEB is now in an advanced stage of development. Distributed textual databases can be consulted through an IR system based on a traditional client-server model. Using CGI (Common Gateway Interface) scripts, the HTTP server can retrieve structured or compound information that is not generally directly accessible to the most widely used browsers. The gateways act as interfaces, in both directions, between the Web and the database.

DBTWEB creates a hypertextual study environment, dynamically transforming the results of a generic query into an HTML page, which can in turn be used as the starting point for further queries. It offers all the main functions of the standard DBT system, such as retrieval of frequency values, contexts, extended contexts, structured searches, etc. Routines for user identification and authentication make it possible to organise the consultation in work sessions; save and restore facilities are also provided.
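The transformation of query results into an HTML page that can itself launch further queries can be sketched as follows. This is only an illustration under stated assumptions, not the DBTWEB code: the function `render_results` and the gateway path `QUERY_URL` are hypothetical, and the contexts are passed in as a plain list rather than retrieved from a DBT archive.

```python
from html import escape
from typing import List
from urllib.parse import quote

# Hypothetical gateway path, chosen for illustration; not the real DBTWEB URL.
QUERY_URL = "/cgi-bin/query"


def render_results(word: str, contexts: List[str]) -> str:
    """Turn the contexts found for `word` into an HTML page.

    Every token in every context becomes a hypertext link that would
    submit a new query for that token through the gateway.
    """
    items = []
    for context in contexts:
        linked = " ".join(
            f'<a href="{QUERY_URL}?word={quote(token)}">{escape(token)}</a>'
            for token in context.split()
        )
        items.append(f"<li>{linked}</li>")
    return (
        f"<html><head><title>Contexts for {escape(word)}</title></head>"
        f"<body><h1>{escape(word)}: {len(contexts)} context(s)</h1>"
        f"<ul>{''.join(items)}</ul></body></html>"
    )


page = render_results("selva", ["mi ritrovai per una selva oscura"])
```

Here the words are escaped for HTML and percent-encoded in the link, so that a result page remains well formed whatever characters the text contains. A CGI script would write such a page to its standard output after an HTTP header block.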


Please contact:
Eugenio Picchi - ILC-CNR
Tel: +39 50 540681
E-mail: picchi@ilc.pi.cnr.it

