Integrating Database and Information Retrieval: A Parallel DBMS Approach

Abstract

Continuing improvements in communication and storage technology have made terabytes of unstructured text readily available. These data must be integrated with the existing structured databases.

Relational systems hold roughly 90% of the United States database market. The relational market is in excess of $5 billion per year. Both sequential and parallel implementations exist commercially. To preserve the large investment in relational systems while providing integration of structured data and text, we model an inverted index using standard relational technology.

By developing a standard template of queries, we obtained an approach that integrates structured data and text using standard, unchanged SQL. The relational model is typically not well equipped for queries over free text, and storage overhead has typically been the limiting factor. Our implementations, however, demonstrate that acceptable run-time and storage performance is achievable using the relational model. Furthermore, the accuracy results are comparable to those of special-purpose information retrieval systems.
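
To make the idea concrete, the following is a minimal sketch of an inverted index stored as an ordinary relation and queried with unchanged SQL. It is an illustration only, not the schema or query template used in this work: the table names (documents, inverted_index, query_terms), the column names, the whitespace tokenizer, and the scoring by summed term frequency are all assumptions chosen for brevity. SQLite is used only because it ships with Python.

```python
import sqlite3
from collections import Counter


def build_database(docs):
    """Load (doc_id, year, text) rows into an in-memory relational database.

    Hypothetical schema: one structured table and one inverted-index table.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE inverted_index (term TEXT, doc_id INTEGER, tf INTEGER);
        CREATE TABLE query_terms (term TEXT);
    """)
    for doc_id, year, text in docs:
        conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, year))
        # One row per distinct term per document, with its term frequency.
        counts = Counter(text.lower().split())
        conn.executemany(
            "INSERT INTO inverted_index VALUES (?, ?, ?)",
            [(term, doc_id, n) for term, n in counts.items()],
        )
    return conn


# A fixed query template: only the contents of query_terms and the
# structured predicate's parameter change from one search to the next.
RANK_SQL = """
    SELECT d.doc_id, SUM(i.tf) AS score
    FROM inverted_index AS i
    JOIN query_terms AS q ON q.term = i.term
    JOIN documents AS d ON d.doc_id = i.doc_id
    WHERE d.year >= ?
    GROUP BY d.doc_id
    ORDER BY score DESC, d.doc_id ASC
"""


def search(conn, terms, min_year):
    """Rank documents by summed term frequency, filtered on a structured field."""
    conn.execute("DELETE FROM query_terms")
    conn.executemany(
        "INSERT INTO query_terms VALUES (?)",
        [(t.lower(),) for t in set(terms)],
    )
    return conn.execute(RANK_SQL, (min_year,)).fetchall()


docs = [
    (1, 1995, "parallel database systems and parallel query processing"),
    (2, 1997, "information retrieval with an inverted index"),
    (3, 1990, "parallel information retrieval"),
]
conn = build_database(docs)
results = search(conn, ["parallel", "retrieval"], 1994)
print(results)  # [(1, 2), (2, 1)]
```

In this sketch the text condition (matching query terms) and the structured condition (year) are evaluated in a single SQL statement. Because the ranking is expressed as joins and a grouped aggregate, a parallel relational engine could in principle distribute it like any other relational query.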

While parallel relational database systems are common, parallel information retrieval engines are in their infancy. By implementing information retrieval as an application of a relational database system, we obtain a portable, parallel means of implementing a variety of common information retrieval algorithms. Using a parallel database machine, we observed good load balancing across all processors for a problem that has typically resisted parallel algorithms. Results from both parallel and sequential implementations are presented.


Philip Chan, pkc@cs.fit.edu
Last modified: Mon Jan 26 10:43:57 EST 1998