DG Researchers Pool Pollution Data Sources
Under the Federal Clean Air Act, states must submit all their emissions inventory data to the EPA for a nationwide emissions inventory. Before the states can send in their reports, they must first gather the data from their various sources. The resulting databases are enormous, highly detailed and varied, and they can differ significantly from state to state in scale and structure. Depending on a state's size, its budget and when it installed its IT infrastructure, a database can be anything from custom-modified enterprise software to one scientist filling in an Excel spreadsheet. In all cases, switching to a standard, nationwide system would mean an enormous investment of funds and staff time.
Most challengingly, every state, regardless of what software it chooses, sets up its database categories according to strictly local concerns. It makes no sense to have fields in your database for farm equipment if you monitor emissions in suburban New Jersey. Likewise, "diesel boat fuel" is not going to be a category in most of Kansas. It feels like trying to solve a dozen Rubik's Cubes all at once, yet every year, dedicated state and US EPA employees compile all of it into something comprehensible.
It's hardly surprising that one of the federal officials most concerned, Brooke Hemming, looked for a way to automate all of this. A partnership between the U.S. EPA and the University of Southern California's Information Sciences Institute (ISI) is now working on a comprehensive pollution database management system. In 2002, the EPA and the NSF Digital Government program held a one-day workshop to discuss the coordination problem. There Hemming met Eduard Hovy, head of the Natural Language Group at ISI. His lab went on to bid successfully for a Digital Government grant to create a system that would coordinate the databases with minimal human intervention.
Hovy, an expert in natural language processing and machine translation, has spent his career on similar challenges. Much of the work involves creating ontologies, a kind of linguistic process in which one needs to consider how words are used in particular contexts. He explains one project: "In 1996 I took two ontologies made by different people, and tried to integrate them. One said a library is a building, the other said a library is an institution. A 'building' has a physical meaning, but it doesn't have a budget. An 'institution' can hire and fire, but it doesn't have a physical space."
The solution, when you are trying to teach a computer to understand such distinctions, isn't to create a dictionary with multiple entries for each word. Instead, you create a set of logic trees, where words can be classed by contextual meaning, and then you develop a set of rules the computer can learn from, Hovy says. Rather than coding, "Robin equals color or bird," you program an understanding of the concepts of color and bird, so when the computer encounters a word like "cardinal" it can decide automatically what the word refers to. It is very much like the scene in WarGames, where the computer realizes for itself that tic-tac-toe is unwinnable.
"Ten years ago there was a revolution in machine translation research," says Hovy. "Researchers had been simply plugging in dictionaries. Then people at IBM created an entirely novel method of doing machine translation using the forward/backward algorithm. They took proceedings from the Canadian parliament that were in both French and English and had the software learn the correspondences by constantly going back and forth over the texts, until it could create reasonable expectations of what words meant in context. So for example, it might find that 90% of the time 'chien' means 'dog,' and the other 10% it means 'imbecile.'" Hovy decided to use this technique to automatically discover correspondences between the EPA's databases.
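To make the idea of learning correspondences from parallel text concrete, here is a minimal sketch in Python. It is not the IBM system or ISI's code: it uses a simple expectation-maximization loop in the style of IBM Model 1 as a stand-in for the method Hovy describes, and the three-sentence corpus and the names `TOY_CORPUS`, `train_word_table` and `best_translation` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (French tokens, English tokens).
TOY_CORPUS = [
    (["le", "chien"], ["the", "dog"]),
    (["le", "chat"], ["the", "cat"]),
    (["un", "chien"], ["a", "dog"]),
]


def train_word_table(pairs, iterations=10):
    """Estimate t[(english, french)] = P(english | french) with EM."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    tgt_vocab = {e for _, es in pairs for e in es}
    # Start with no preference: every English word is equally likely.
    t = {(e, f): 1.0 / len(tgt_vocab) for e in tgt_vocab for f in src_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per French word
        # E-step: share each English word among the French words
        # in the same sentence, in proportion to the current table.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # M-step: renormalize the expected counts into probabilities.
        for (e, f) in t:
            t[(e, f)] = count[(e, f)] / total[f]
    return t


def best_translation(t, source_word):
    """Return the English word with the highest probability for source_word."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train_word_table(TOY_CORPUS)
    for word in ["le", "chien", "chat", "un"]:
        print(word, "->", best_translation(table, word))
```

Nothing in the sketch is told that "chien" means "dog." The pairing emerges because the two words keep appearing in the same sentence pairs while their neighbors change, which is the same kind of evidence the project hopes to find between database fields.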
California was chosen as the prototype test bed, because as Michael Benjamin, Manager of the Emissions Inventory Systems Section, of the
In fact, Benjamin's job has another level of challenge. The CARB is charged with monitoring mobile emissions, those, like automobile exhaust, that by their very nature can't be limited to one region of the state. The local boards monitor stationary emissions, like those from dry cleaners and manufacturing plants.
Hovy and his colleague Andrew Philpot are starting with a bite-sized piece (albeit a large bite). Their first task is to integrate the Santa Barbara local air quality district data into the CARB data. The initial step is a process of statistical matching, to find out which terms, categories and distinctions recur across the Santa Barbara and corresponding CARB databases. The work is made harder because some officials have created their own unique abbreviation systems, so Hovy and Philpot are aided by the reports Benjamin compiled in previous years. "Now it's not French/English, it's Humboldt/Los Angeles," says Hovy. "The reason why we can succeed is we have Sacramento [CARB] in the middle, which actually did all of this by hand." If this approach works, it will begin to serve as a template for incorporating the data from the rest of the districts. "If you give the system the ability to associate features, it will learn," says Hovy. "You give it the language to think in."
"We're very grateful that this contract was granted; the funding is 50% NSF, 50% EPA, so California has not committed any funds," says Benjamin. "Sometimes research projects deliver, sometimes they don't, but regardless of the pragmatic deliverables, there will be a lot of knowledge gained. We're all very excited by it. We're excited that Ed and Andrew are working on it and that California has been selected as the test bed. I think if it's successful, it will have a lot of very positive outcomes. We would rather spend our time working to improve the quality of the emissions data rather than trying to collate it."
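The statistical matching step described for the Santa Barbara and CARB data can be pictured with a small Python sketch. The article does not say which statistics the ISI team uses, so this is only an assumption-laden illustration: it pairs fields by how much their stored values overlap (Jaccard similarity), which sidesteps mismatched abbreviations in the field names. The field names, values and the `match_columns` helper are all hypothetical.

```python
# Hypothetical district table: abbreviated field names, sample values.
LOCAL = {
    "fac_nm": ["Acme Cleaners", "Delta Plating", "Coast Refinery"],
    "poll": ["NOx", "VOC", "PM10"],
    "tpy": ["12.5", "3.1", "40.0"],
}

# Hypothetical state table: different field names, overlapping values.
STATE = {
    "facility_name": ["Acme Cleaners", "Coast Refinery",
                      "Delta Plating", "Valley Foundry"],
    "pollutant": ["NOx", "VOC", "PM10", "SOx"],
    "county": ["Santa Barbara", "Humboldt", "Los Angeles"],
}


def jaccard(a, b):
    """Share of distinct values that two columns have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(local, state, threshold=0.5):
    """Map each local field to the state field with the most similar values.

    Fields whose best score falls below the threshold are left unmatched,
    so a person can review them by hand.
    """
    mapping = {}
    for local_name, local_values in local.items():
        score, state_name = max(
            (jaccard(local_values, state_values), name)
            for name, state_values in state.items()
        )
        if score >= threshold:
            mapping[local_name] = state_name
    return mapping


if __name__ == "__main__":
    print(match_columns(LOCAL, STATE))
```

Here `fac_nm` and `poll` find their state counterparts because the values overlap, while `tpy` is left unmatched for human review. A real system would also need the hand-compiled reports the article mentions to check and correct matches like these.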
This site is maintained by the Digital Government Research Center at the University of Southern California's Information Sciences Institute.