Bridging the data gap
But the Digital Government Research Center - a collaboration between Columbia University and USC's Information Sciences Institute - is working to change that.
A series of projects studying search ontologies and mappings across huge government databases has laid the groundwork for future methods of efficient, precise interoperability of government databases that currently are nowhere near being on speaking terms.
Researchers in the Center for Research on Information Access, led by Prof. Judith Klavans, are studying methods to learn about database contents from published text, while Prof. Eduard Hovy is focusing USC/ISI's team on database standardization and access technology applied to massive government databases ranging from 100 MB to 100 GB.
Beginning with the CARDGIS Energy Data Collection projects and continuing with the latest phase - "Bringing Complex Data to Users" - DGRC's studies have partnered researchers with government data collectors such as the Bureau of the Census, the Bureau of Labor Statistics (BLS), the Energy Information Administration (EIA) and the National Council on Health Statistics (NCHS) to explore ways of letting citizens make multiple, fast queries to large datasets in a variety of languages.
"The problem persists," Hovy says of the disconnect between government datasets - and ultimately, users. "The more you computerize, the more databases you have, and eventually, everybody's dealing with the problem: they can't link their databases."
Early phases of the research done at USC/ISI focused on using scripts and XML "wrapper" agents to construct searchable pages from already-published HTML pages of EIA data. This led to the Automatic State Electricity Web Page Generation sub-project.
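The wrapper idea can be illustrated with a minimal sketch: parse a published HTML table and re-emit its rows as queryable XML. The table layout and field names below are invented for illustration, not EIA's actual markup.

```python
# Minimal sketch of an HTML-to-XML "wrapper" agent: extract rows from
# an already-published HTML table and re-emit them as queryable XML.
# The (state, price) table layout here is a hypothetical example.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TableWrapper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

def to_xml(html):
    """Wrap each (state, price) row in a <record> element."""
    w = TableWrapper()
    w.feed(html)
    root = ET.Element("records")
    for state, price in w.rows:
        rec = ET.SubElement(root, "record", state=state)
        rec.text = price
    return ET.tostring(root, encoding="unicode")

page = "<table><tr><td>CT</td><td>1.45</td></tr></table>"
print(to_xml(page))  # <records><record state="CT">1.45</record></records>
```

Once data is lifted into a common XML form like this, downstream tools can query it without knowing anything about the original page layout.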
These collaborations taught the ISI team a few hard lessons about working with government data keepers, Hovy said - they are wary of releasing the very kind of unwashed datasets that researchers need, preferring instead to scrub away anomalies, inaccuracies and privacy concerns before passing on the data to researchers.
"We had sanitized data, and no user queries," he said. "After two and a half years, we focused instead on the core problem - mapping issues."
A five-month sub-project called "AskCal" developed a natural-language query engine that explored how simple questions - "What is the price of gasoline in Connecticut?" - might be interpreted through natural language understanding to deliver useful results.
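The flavor of such a front end can be sketched with simple pattern matching. AskCal's actual grammar and data are not described here, so the question pattern, the lookup table, and the price value below are all invented for illustration.

```python
# Toy sketch of a natural-language query front end: match a question
# template and translate the captured slots into a dataset lookup.
# The pattern, the lookup table, and the 1.45 price are hypothetical.
import re

PATTERN = re.compile(
    r"what is the (?P<measure>price) of (?P<fuel>\w+) in (?P<state>[\w ]+)\?",
    re.IGNORECASE)

# Hypothetical stand-in for a real energy dataset.
DATA = {("gasoline", "connecticut"): 1.45}

def answer(question):
    """Return the looked-up value, or None if the question doesn't parse."""
    m = PATTERN.match(question.strip())
    if not m:
        return None
    return DATA.get((m.group("fuel").lower(), m.group("state").lower()))

print(answer("What is the price of gasoline in Connecticut?"))  # 1.45
```

A real system would of course replace the single regular expression with a grammar and the dictionary with queries against the underlying databases; the sketch only shows the question-to-query translation step.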
Meanwhile, Columbia researchers have been focusing on presentation technology and metadata construction - the front and back ends of search interfaces that determine how users interact with databases, and how search tools should best interpret natural-language questions.
"Once someone has searched these databases, how do you present them back to the user," said Klavans. "(Also) important is what kind of questions do users ask? How do they find data?"
Working with the Columbia School of Public Affairs, the Columbia team is studying the way users query the U.S. Bureau of the Census.
Data librarians and GUI specialists are running user-behavior studies on queries made through Columbia's Electronic Data Service (EDS) - a storehouse for many kinds of data. Using questionnaires, search-interface mockups and other usability-testing tools, the Columbia team is studying how users seek information, to better frame how search engines can be configured to query very large government databases.
They have built a crawler that sifts through thousands of government web pages to recognize glossaries. Once recognized, the pages are parsed into a large common representation format in XML. Individual definitions are being parsed for loading into USC/ISI's SENSUS system, for use in browsing and exploring this complex data.
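The normalization step can be sketched as follows: once a page is identified as a glossary, its entries are parsed into one common XML representation. The "Term: definition" line format assumed here is an illustrative simplification, not the project's actual parser.

```python
# Sketch of glossary normalization: parse "Term: definition" lines
# from a recognized glossary page into a common XML representation.
# The entry format and sample text are illustrative assumptions.
import re
import xml.etree.ElementTree as ET

ENTRY = re.compile(r"^(?P<term>[A-Z][\w /-]+):\s+(?P<definition>.+)$")

def parse_glossary(text):
    """Return a <glossary> element with one <entry> per recognized line."""
    root = ET.Element("glossary")
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            e = ET.SubElement(root, "entry")
            ET.SubElement(e, "term").text = m.group("term")
            ET.SubElement(e, "definition").text = m.group("definition")
    return root

page = """Btu: A unit of heat energy.
Spot Price: The price for a one-time open-market transaction."""
g = parse_glossary(page)
print(len(g.findall("entry")))  # 2
```

With every glossary in one schema, the individual term/definition pairs become uniform records that a system like SENSUS can ingest.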
The researchers are also building a new system to identify the codebooks associated with large government surveys, in order to automatically recognize terms and their definitions in text.
This research is forming the basis of grant applications for two new midsize ITR grants - jointly with government partners, one project will process heterogeneous Census data, and the other will focus on metadata in the EPA air emissions domain.
One key to progress in the quest for easier database interoperability may be USC/ISI's ongoing work in machine translation. After successful experiments with natural-language query systems such as AskCal, they have submitted DG grant proposals for two new projects with the U.S. Environmental Protection Agency:
The first involves mapping data relationships among disparate databases used by regional and state offices of the Air Quality Resource Board against data formats used by the EPA, says Hovy.
If funded, USC/ISI would build an automatic mapping induction system that would read data statements for the regional and Sacramento arms of the AQRB and map relationships between the way they are expressed - thus creating a system of metadata rules by which other databases might be interpreted.
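A crude sketch of what "mapping induction" means: propose correspondences between the columns of two databases by comparing how much their values overlap. The field names and sample rows below are invented; real induction would use far richer evidence than raw value overlap.

```python
# Illustrative sketch of mapping induction between two schemas:
# propose a correspondence from each column of table_a to the
# table_b column whose values overlap it most.
# All field names and sample data here are hypothetical.
def induce_mapping(table_a, table_b):
    """Return {column_a: best-matching column_b} by value overlap."""
    mapping = {}
    for col_a, vals_a in table_a.items():
        best, best_score = None, 0.0
        for col_b, vals_b in table_b.items():
            overlap = len(set(vals_a) & set(vals_b))
            score = overlap / max(len(set(vals_a)), 1)
            if score > best_score:
                best, best_score = col_b, score
        if best:
            mapping[col_a] = best
    return mapping

# Hypothetical regional table vs. a hypothetical EPA-style table.
regional = {"st": ["CA", "NV"], "pm10": [12, 40]}
epa      = {"state_code": ["CA", "NV", "AZ"], "pm10_ugm3": [12, 40, 9]}
print(induce_mapping(regional, epa))
# {'st': 'state_code', 'pm10': 'pm10_ugm3'}
```

The induced correspondences are exactly the kind of metadata rules the article describes: once recorded, they let other tools interpret the regional database in the EPA's terms without manual translation.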
Based on the Egypt machine translation toolkit that USC/ISI built in 1999, the project could break new ground for database translation - and for the ultimate challenge of bringing these huge, information-rich databases to the people.
The second project - recently funded - will study ways to standardize environmental regulations.
The work has broad commercial implications, if it succeeds, Hovy says: Private companies spend millions each year to integrate, repurpose and manage heterogeneous databases - all of which could be made simpler with accurate machine translation. About the pending funding request, he says with a characteristic grin, "We're cautiously optimistic."
This site is maintained by the Digital Government Research Center at the University of Southern California's Information Sciences Institute.