Abstract
One of the crucial problems in virtually every digital government application is locating and integrating information that is spread across many organizations, held in many databases, and stored in many different formats. Areas as diverse as crisis management, government statistics, and legislative support have all identified the integration of heterogeneous information as a major step towards more effective information systems. Government employees, ordinary citizens, and small businesses would all benefit from government information systems that locate, retrieve, and integrate desired information quickly, transparently handling the details of which databases contain the information and in what format it is presented. No system should expect its patrons to trust its results unquestioningly, so these information systems should also make it easy to examine the relationships among documents and/or databases with similar content when desired. The basis for this type of system is metadata: data that describes data or collections of data. We propose a completely new approach to metadata, one based on language models instead of ontologies or controlled vocabularies. Simple language models represent basic vocabulary and frequency information; more complex language models represent phrases, names, and other speech patterns. Language models are a far more detailed representation of document or database contents than a few controlled vocabulary terms. Language models also enable a system to generate descriptions (metadata) directly from the content of its databases, without trying to match database contents to a controlled vocabulary. Language models are easily updated as information is added to a database, they support an unlimited range of subjects (because they are generated directly from database contents), and they support a wide range of information seeking activities.
The proposed research will demonstrate that language models are a sound and effective foundation on which to build large-scale, distributed information systems for government applications. Together with our government partners (U.S. Geological Survey, U.S. Department of Commerce, General Services Administration/Regulatory Information Service Center, and the U.S. Library of Congress), we will produce a prototype of a complete system for accessing distributed, heterogeneous, government information, and demonstrate its utility. Building this system will be an important part of evaluating the research, and the first step in transferring the new technology to government systems.
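As a rough illustration of the simplest kind of language model mentioned above (vocabulary plus frequency information), the following Python sketch builds a unigram model directly from database contents and updates it as documents are added. The class name `UnigramModel`, its methods, the naive tokenizer, and the sample documents are all assumptions made for illustration; they are not taken from the proposed system.

```python
from collections import Counter
import re


class UnigramModel:
    """Vocabulary and term frequencies describing one database (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()  # term -> number of occurrences seen so far
        self.total = 0           # total number of tokens seen so far

    def add_document(self, text):
        # Naive tokenizer (an assumption): lowercase runs of ASCII letters.
        tokens = re.findall(r"[a-z]+", text.lower())
        self.counts.update(tokens)
        self.total += len(tokens)

    def probability(self, term):
        # Maximum-likelihood estimate with no smoothing; unseen terms get 0.0.
        return self.counts[term] / self.total if self.total else 0.0

    def top_terms(self, n=5):
        # The n most frequent terms, usable as a short generated description.
        return self.counts.most_common(n)


# Invented sample documents standing in for one database's contents.
model = UnigramModel()
model.add_document("Earthquake hazard maps for California")
model.add_document("Earthquake data and seismic hazard reports")
```

Because the model is only a table of counts, adding a document updates it in place, and no controlled vocabulary is consulted at any point. A system could compare such models across databases to decide where to send a query, though that selection step is not shown here.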