DigitalGovernment.org - Home of the Nat'l. Science Foundation Digital Government Research Program
DG Researchers Pool Pollution Data Sources
With EPA and California Air Resources Board, Scientists Explore Automating Construction of a National Air Quality Database
By Karen Heyman
For the DGRC

Automating Air Quality Databases
 

Institution:
Information Sciences Institute
Government Partners:

  • California Air Resources Board
  • U.S. Environmental Protection Agency

Abstract:
Automating the Integration of Heterogeneous Databases
Project profile:
Automating the Integration of EPA Databases

    In order to develop effective air quality regulations, government agencies must develop accurate emission estimates for all sources of air pollution, ranging from the dramatic, like forest fires, to the prosaic, like dry cleaners.

    Under the Federal Clean Air Act, states must submit all their emissions inventory data to the EPA for a nationwide emissions inventory. Before the states can send in their reports, they must first gather all the data from their various data sources. The resulting databases are highly detailed, enormous and varied.

    States' emissions inventory databases may differ significantly in their scale and structure. Depending on a state's size, budget and when it installed its IT infrastructure, databases can range anywhere from custom-modified enterprise software to one scientist filling in an Excel spreadsheet.

    In all cases, switching to a standard, nationwide system would mean an enormous investment of funds and staff time. Most challengingly, every state, regardless of what software it chooses, sets up its database categories according to strictly local concerns. It makes no sense to have fields in your database for farm equipment if you monitor emissions in suburban New Jersey. Likewise, "diesel boat fuel" is not going to be a category in most of Kansas.

    It feels like trying to solve a dozen Rubik's Cubes all at once, yet every year, dedicated state and US EPA employees compile all of it into something comprehensible. It's hardly surprising that one of the federal officials most concerned, Brooke Hemming, sought to see if there might be a way to automate all of this.

    A partnership between the U.S. EPA and the University of Southern California's Information Sciences Institute (ISI) is now working on a comprehensive pollution database management system.

    In 2002, the EPA and the NSF Digital Government program held a one-day workshop to discuss the coordination problem. There Hemming met Eduard Hovy, head of the Natural Language Group at ISI. His lab would bid successfully for a Digital Government grant to create a system that would coordinate the databases with minimal human intervention.

    Hovy, an expert in natural language processing and machine translation, has spent his career on similar challenges. Much of the work involves creating ontologies, a kind of linguistic process in which one needs to consider how words are used in particular contexts. He explains one project, "In 1996 I took two ontologies made by different people, and tried to integrate them. One said a library is a building, the other said a library is an institution. A 'building' has a physical meaning, but it doesn't have a budget. An 'institution' can hire and fire, but it doesn't have a physical space."

    The solution, when you are trying to teach a computer to understand such distinctions, isn't to create a dictionary with multiple entries for each word. Instead, you create a set of logic trees, where words can be classed by contextual meaning, and then you develop a set of rules the computer can learn from, Hovy says. Rather than coding, "Robin equals color or bird," you program an understanding of the concepts of color and bird, so when the computer encounters a word like "cardinal" it can decide automatically what the word refers to. It is very much like the scene in War Games, where the computer realizes for itself that tic-tac-toe is unwinnable.
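
The idea of classing a word by concept rather than by dictionary entry can be illustrated with a very small sketch. It is not Hovy's system: the two concepts, their feature words, and the `classify` function are all hypothetical, and a real ontology would hold far richer relations than a set of context words per concept.

```python
# Hypothetical concepts, each described by words that tend to appear near it.
# These feature sets are invented for illustration only.
CONCEPT_FEATURES = {
    "bird": {"fly", "nest", "feather", "sing", "wing"},
    "color": {"paint", "shade", "bright", "dye", "hue"},
}


def classify(context_words, concept_features=CONCEPT_FEATURES):
    """Pick the concept whose feature words overlap most with the context.

    Returns None when no concept shares any word with the context.
    """
    words = {w.lower() for w in context_words}
    scores = {
        concept: len(words & features)
        for concept, features in concept_features.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# "cardinal" never appears in the table; the surrounding words decide.
print(classify("the cardinal built a nest and began to sing".split()))
print(classify("walls in a bright cardinal shade".split()))
```

The word "cardinal" is never listed anywhere; the program only knows something about birds and colors, and lets the context choose between them.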

    "Ten years ago there was a revolution in machine translation research," says Hovy. "Researchers had been simply plugging in dictionaries. Then people at IBM created an entirely novel method of doing machine translation using the forward/backward algorithm. They took proceedings from the Canadian parliament that were in both French and English and had the software learn the correspondences by constantly going back and forth over the texts, until it could create reasonable expectations of what words meant in context. So for example, it might find that 90% of the time 'chien' means 'dog,' and the other 10% it means 'imbecile.'"
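
How software can learn such correspondences from paired texts alone is easier to see in miniature. The sketch below is an assumption-laden toy, not the IBM system or the ISI code: it uses plain expectation-maximization in the style of the simplest IBM word-alignment model rather than the forward/backward algorithm named in the quote, and the three-sentence corpus and the function names `learn_correspondences` and `best_match` are invented for illustration.

```python
from collections import defaultdict


def learn_correspondences(sentence_pairs, iterations=20):
    """Estimate P(target word | source word) from aligned sentence pairs.

    Each pass re-reads the whole corpus: it shares out every target word
    among the source words in the same sentence pair, in proportion to the
    current estimates, then re-estimates the probabilities from those shares.
    """
    target_vocab = {t for _, tgt in sentence_pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)  # initial guess: every pairing equally likely

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (source, target) pairings
        total = defaultdict(float)  # expected count of each source word
        for src, tgt in sentence_pairs:
            for t in tgt:
                norm = sum(prob[(s, t)] for s in src)
                for s in src:
                    share = prob[(s, t)] / norm
                    count[(s, t)] += share
                    total[s] += share
        # Re-estimate; only pairings seen together keep any probability.
        prob = {pair: c / total[pair[0]] for pair, c in count.items()}
    return prob


def best_match(prob, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {t: p for (s, t), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Invented toy corpus: (French words, English words).
pairs = [
    (["le", "chien"], ["the", "dog"]),
    (["le", "chat"], ["the", "cat"]),
    (["un", "chien"], ["a", "dog"]),
]
model = learn_correspondences(pairs)
print(best_match(model, "chien"))
```

No dictionary is supplied. Because "chien" and "dog" keep turning up together while "le" is better explained by "the", repeated passes over the pairs push the estimate for "chien" toward "dog".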

    Hovy decided to use this technique to automatically discover correspondences between the EPA's databases. California was chosen as the prototype test bed because, as Michael Benjamin, manager of the Emissions Inventory Systems Section of the California Air Resources Board (CARB), explains, "California is kind of a mini version of the U.S. EPA, in the sense that we are the only state that has 35 local air districts that are autonomous. As a result, we have some of the same challenges in dealing with data that are present at the national level. So if the work is successful here, it can scale up to the national and international level."

    In fact, Benjamin's job has another level of challenge. The CARB is charged with monitoring mobile emissions, those, like automobile exhaust, that by their very nature can't be limited to one region of the state. The local boards monitor stationary emissions, like dry cleaners and manufacturing plants.

    Benjamin must put together reports from the 35 autonomous districts, and then add in his own, with its very different approach. So that's 36 heterogeneous databases to compile on deadline, which begins to approach the level of those fairytales where the princess has to separate a roomful of wheat from chaff by morning. And that's not the business environmental scientists are in, says Benjamin, who holds a PhD in the field. "For the most part, the districts are staffed by people who don't come from a computer science background. That's why it's a great opportunity to have people like Ed help out in terms of the knowledge base."

    Hovy and his colleague Andrew Philpot are starting with a bite-sized piece (albeit a large bite).

    Their first task is to integrate the Santa Barbara local air quality district data into the CARB data. The initial step is a process of statistical matching, in order to find out what terms, categories and distinctions recur across the Santa Barbara and corresponding CARB databases. The work is made harder because some officials have created their own unique abbreviation systems, so Hovy and Philpot are aided by the reports Benjamin compiled in previous years. "Now it's not French/English, it's Humboldt/Los Angeles," says Hovy. "The reason why we can succeed is we have Sacramento [CARB] in the middle, which actually did all of this by hand." If this approach works, it will begin to serve as a template for incorporating the data from the rest of the districts. "If you give the system the ability to associate features, it will learn," says Hovy. "You give it the language to think in."
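
A rough sense of what statistical matching against hand-compiled reports might look like is given below. This is a minimal sketch under stated assumptions, not the project's method: it supposes that earlier years yield pairs of a district's own label and the category CARB staff filed it under, and every label, category name and function here (`normalize`, `learn_category_mapping`, `suggest`) is hypothetical.

```python
from collections import Counter, defaultdict


def normalize(label):
    """Fold case, periods and spacing so 'DRY CLN' and 'dry cln.' compare equal."""
    return " ".join(label.lower().replace(".", " ").split())


def learn_category_mapping(paired_records):
    """Count how often each district label was filed under each CARB category.

    paired_records: iterable of (district_label, carb_category) pairs taken
    from records that were already matched by hand.
    """
    counts = defaultdict(Counter)
    for district_label, carb_category in paired_records:
        counts[normalize(district_label)][carb_category] += 1
    return counts


def suggest(counts, district_label):
    """Return (most frequent CARB category, share of past cases), or None.

    None means the label was never seen and is left for a person to resolve.
    """
    seen = counts.get(normalize(district_label))
    if not seen:
        return None
    category, n = seen.most_common(1)[0]
    return category, n / sum(seen.values())


# Invented examples of hand-matched records from earlier years.
history = [
    ("DRY CLN", "Dry Cleaning"),
    ("dry cln.", "Dry Cleaning"),
    ("DRY CLN", "Laundering"),
    ("MFG", "Manufacturing"),
]
mapping = learn_category_mapping(history)
print(suggest(mapping, "Dry Cln"))
print(suggest(mapping, "BOAT FUEL"))
```

The share returned alongside each suggestion plays the same role as the 90% and 10% in the translation example: a low share or an unseen label flags a case that still needs a person.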

    "We're very grateful that this contract was granted. The funding is 50% NSF, 50% EPA, so California has not committed any funds," says Benjamin. "Sometimes research projects deliver, sometimes they don't, but regardless of the pragmatic deliverables, there will be a lot of knowledge gained. We're all very excited by it. We're excited that Ed and Andrew are working on it and that California has been selected as the test bed. I think if it's successful, it will have a lot of very positive outcomes. We would rather spend our time working to improve the quality of the emissions data than trying to collate it."