Required by law to publish that data in increasing volume, quality and detail, they still cannot risk violating the privacy of the citizens whom the data represents - people whose identity and personal information might be revealed and exploited by anyone sifting the datasets with little more than a copy of Microsoft Excel.
But a recent Digital Government research project has led to the development of a toolset that addresses this problem with a relatively simple strategy that is gaining favor at the federal level despite its apparently heretical nature: data-swapping.
Forced to publish a database that could reveal how much some individuals earn, where they live or how well they perform in school? Then simply choose data attached to those people - say, their ages, marital status or hometown - and juggle it so that no one can tell who's who.
However, a second conundrum lies embedded in the first: How much data must you swap to protect someone's identity? And how much can you swap before you have made the bulk of the data unreliable or worthless?
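The trade-off can be made concrete with a small sketch. The function below is an illustration only, not the toolkit's method: the name `swap_attribute`, the `swap_rate` parameter and the toy records are all assumptions made for this example. It exchanges one attribute between randomly paired records, so the overall distribution of that attribute is unchanged while its link to any individual record is blurred.

```python
import random
from collections import Counter


def swap_attribute(records, attribute, swap_rate, seed=0):
    """Return a copy of `records` in which about `swap_rate` of the rows
    have exchanged `attribute` with another randomly chosen row.

    Hypothetical helper for illustration; not part of any real toolkit.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]      # leave the input untouched
    n_selected = int(len(swapped) * swap_rate)
    n_selected -= n_selected % 2              # pairs need an even count
    chosen = rng.sample(range(len(swapped)), n_selected)
    for i, j in zip(chosen[0::2], chosen[1::2]):
        swapped[i][attribute], swapped[j][attribute] = (
            swapped[j][attribute],
            swapped[i][attribute],
        )
    return swapped


# Invented toy data: every age is distinct so each swap is visible.
records = [
    {"age": 25, "town": "Ames", "salary": 41000},
    {"age": 31, "town": "Ames", "salary": 52000},
    {"age": 38, "town": "Cary", "salary": 47000},
    {"age": 42, "town": "Cary", "salary": 68000},
    {"age": 47, "town": "Apex", "salary": 59000},
    {"age": 53, "town": "Apex", "salary": 75000},
    {"age": 58, "town": "Ames", "salary": 62000},
    {"age": 64, "town": "Cary", "salary": 80000},
]

swapped = swap_attribute(records, "age", swap_rate=0.5)
changed = sum(a["age"] != b["age"] for a, b in zip(records, swapped))
same_ages = Counter(r["age"] for r in records) == Counter(r["age"] for r in swapped)
print(f"rows altered: {changed} of {len(records)}; age distribution preserved: {same_ages}")
```

Raising `swap_rate` hides more individuals but also weakens any relationship between the swapped attribute and the rest of the record, which is the second conundrum described above.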
An answer to that puzzle has been developed in the Data-Swapping Toolkit (DSTK), the product of an NSF Digital Government research project by the National Institute of Statistical Sciences (NISS) and the National Center for Education Statistics (NCES).
The NCES provided raw data on school pupils, which NISS used to test various methods of data-swapping, says Marilyn Seastrom, the center's chief statistician.
NCES has been examining data-swapping techniques as a way of safely releasing large datasets that might otherwise be parsed for personal information. The other method of protecting the data was table suppression - essentially removing or collapsing one or more types of data such as age or ethnicity from one table that might be used to extract personal information from another, otherwise safe table.
"The only way you can control table suppression successfully is if you can control every single table that is run off that data," Seastrom says. "We realized that since we let all of our data out, that isn't really a feasible option. There's a potential for disclosure there.
"For example," she says, "identifying that there's a teacher or faculty member at a particular institution with a unique set of characteristics - such as educational information or salary information. It would take somebody who had a unique profile of race, ethnicity or education, but particularly in higher education, you have people who have unique strings of those things who might be identified this way."
New policy standards issued by NCES' Disclosure Review Board in response to the 2002 Information Quality Act now require that data be released in one of two ways: either to licensed organizations such as universities through a limited-access system that restricts republishing of the data and ensures compliance with on-site inspections and other forms of oversight; or to the public using some method of data-swapping.
Enter NISS, which had collaborated with NCES on other projects and was interested in the data-swapping problem.
The trick to effective data-swapping is choosing the cells in a table that can be interchanged without ruining the core utility of the data or in some other way revealing private information, says Alan Karr, who is leading the project for NISS.
"But it turns out that in some examples we've looked at, some choices seem to be better than others, and no one before has had a tool to let you see that in a systematic way," he says.
NISS developed both a downloadable application and a set of web-based tools for data-swapping.
NISS' toolkit allows users to preview all possible choices dynamically: "There's a manual version where you can select which category - you have to select one, but you can pick two - and you can see the risk and utility associated with this. But there's also a kind of a batch facility that will do all the one-variable and two-variable swaps and produce a file which you can visualize, and compare these visually so you can see which ones are better than the others." This can be particularly useful for agencies that have to weigh the effects of swapping multiple variables in tables to ensure the greatest security and reliability of the end product, he says.
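A rough sketch of what such a batch comparison might look like follows. It is not the toolkit's code, and the risk and distortion measures are stand-ins chosen for this example: risk is taken as the share of records that remain unique and unaltered in the release, and distortion as the total variation distance between the original and released cross-tabulations. The function names, the swap rate and the generated records are likewise assumptions.

```python
import random
from collections import Counter
from itertools import combinations


def swap(records, attrs, rate, rng):
    """Exchange the attributes in `attrs` between randomly paired records."""
    out = [dict(r) for r in records]
    k = int(len(out) * rate)
    k -= k % 2                                 # pairs need an even count
    picked = rng.sample(range(len(out)), k)
    for i, j in zip(picked[0::2], picked[1::2]):
        for a in attrs:
            out[i][a], out[j][a] = out[j][a], out[i][a]
    return out


def table(records, keys):
    """Cross-tabulate the records on `keys` (cell -> count)."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def risk(original, released, keys):
    """Assumed measure: share of released records that are unique on
    `keys` and identical to the original record in the same position."""
    counts = table(released, keys)
    exposed = sum(
        1
        for o, r in zip(original, released)
        if counts[tuple(r[k] for k in keys)] == 1 and o == r
    )
    return exposed / len(released)


def distortion(original, released, keys):
    """Assumed measure: total variation distance between the two
    cross-tabulations; 0 means the tables are identical."""
    t0, t1 = table(original, keys), table(released, keys)
    cells = set(t0) | set(t1)
    return sum(abs(t0[c] - t1[c]) for c in cells) / (2 * len(original))


def batch(records, keys, rate, seed=0):
    """Score every one-variable and two-variable swap at the same rate."""
    results = []
    for size in (1, 2):
        for attrs in combinations(keys, size):
            released = swap(records, attrs, rate, random.Random(seed))
            results.append(
                (attrs, risk(records, released, keys),
                 distortion(records, released, keys))
            )
    return results


# Invented toy data; salary is the sensitive value and is never swapped.
gen = random.Random(1)
keys = ("age_group", "ethnicity", "degree")
records = [
    {
        "age_group": gen.choice(["20s", "30s", "40s", "50s"]),
        "ethnicity": gen.choice(["A", "B", "C"]),
        "degree": gen.choice(["BA", "MA", "PhD"]),
        "salary": gen.randrange(40000, 90000, 1000),
    }
    for _ in range(40)
]

results = batch(records, keys, rate=0.2)
for attrs, r, d in sorted(results, key=lambda row: (row[1], row[2])):
    print(f"swap {'+'.join(attrs):22s} risk={r:.3f} distortion={d:.3f}")
```

Listing the candidates side by side, lowest risk first, is one simple way to compare swaps systematically; a real agency would substitute its own measures and its own (undisclosed) swap rate.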
NCES is gearing up to begin using the NISS toolkit in its data releases.
But how much of the published data would be subject to swapping? Seastrom declined to say, since it could give outsiders the tools they need to penetrate the veil of privacy the agency maintains so carefully around its data.
"Census, for example, has maybe half a dozen people who know what the swapping rate is," she says. "We are rather closemouthed about what we do and how much we do, for obvious reasons."
This site is maintained by the Digital Government Research Center at the University of Southern California's Information Sciences Institute.