DigitalGovernment.org - Home of the Nat'l. Science Foundation Digital Government Research Program

Protecting Your Digital Identity
DG-funded Toolkit Safeguards Privacy by Data-Swapping in Public Records
By Mack Reed
DGRC Communications Manager

NISS Data-Swapping Toolkit
• Researcher profile: Alan Karr, NISS
• Project profile: A Web-Based Query System for Disclosure-Limited Statistical Analysis of Confidential Data
• Gov't Partner: National Center for Education Statistics
• Project home page
• Download: NISS Data-Swapping Toolkit

illustration by Mack Reed

Government agencies rich in data find themselves on the horns of a dilemma:

Required by law to publish that data in increasing volume, quality and detail, they still cannot risk violating the privacy of the citizens whom the data represents - people whose identities and personal information might be revealed and exploited by anyone sifting the datasets with little more than a copy of Microsoft Excel.

But a recent Digital Government research project has led to development of a toolset that solves this problem with a relatively simple strategy that is gaining favor at the federal level despite its apparently heretical nature: data-swapping.

Forced to publish a database that could reveal how much some individuals earn, where they live or how well they perform in school? Then simply choose data attached to those people - say, their ages, marital status or hometown - and juggle it so that no one can tell who's who.
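The basic move can be sketched in a few lines of Python. This is a minimal illustration of the idea only, not the DSTK's actual algorithm: the function name, the `swap_rate` parameter and the sample records are all invented for this example.

```python
import random

def swap_attribute(records, attribute, swap_rate, seed=None):
    """Return a copy of `records` with `attribute` exchanged between
    randomly chosen pairs of records.

    `swap_rate` (a hypothetical parameter, not taken from the toolkit)
    is the fraction of records selected for swapping.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]      # leave the input untouched
    n_chosen = int(len(swapped) * swap_rate)
    n_chosen -= n_chosen % 2                  # pairs need an even count
    chosen = rng.sample(range(len(swapped)), n_chosen)
    # Walk the chosen indices two at a time and exchange the attribute.
    for i, j in zip(chosen[0::2], chosen[1::2]):
        swapped[i][attribute], swapped[j][attribute] = (
            swapped[j][attribute], swapped[i][attribute])
    return swapped

# Invented toy records, for illustration only.
records = [
    {"age": 34, "hometown": "Raleigh", "salary": 52000},
    {"age": 51, "hometown": "Durham", "salary": 87000},
    {"age": 29, "hometown": "Cary", "salary": 45000},
    {"age": 46, "hometown": "Apex", "salary": 63000},
]
released = swap_attribute(records, "hometown", swap_rate=1.0, seed=1)
```

The released table still lists every hometown exactly once and every salary in its original row, so simple counts are unchanged. What is broken is the link between a given salary and a given hometown.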

However, a second conundrum lies embedded in the first: How much data must you swap to protect someone's identity? And how much can you swap before you have made the bulk of the data unreliable or worthless?

An answer to that puzzle has been developed in the Data-Swapping Toolkit (DSTK), the product of an NSF Digital Government research project by the National Institute of Statistical Sciences (NISS) and the National Center for Education Statistics (NCES).

The NCES provided raw data on school pupils, which NISS used to test various methods of data-swapping, says Marilyn Seastrom, the center's chief statistician.

NCES has been examining data-swapping techniques as a way of safely releasing large datasets that might otherwise be parsed for personal information. The other method of protecting the data was table suppression - essentially removing or collapsing one or more types of data, such as age or ethnicity, in one table that might be used to extract personal information from another, otherwise safe table.

"The only way you can control table suppression successfully is if you can control every single table that is run off that data," Seastrom says. "We realized that since we let all of our data out, that isn't really a feasible option. There's a potential for disclosure there.

"For example," she says, "identifying that there's a teacher or faculty member at a particular institution with a unique set of characteristics - such as educational information or salary information. It would take somebody who had a unique profile of race, ethnicity or education, but particularly in higher education, you have people who have unique strings of those things who might be identified this way."

New policy standards issued by NCES' Disclosure Review Board in response to the 2002 Information Quality Act now require that data be released in one of two ways: either to licensed organizations such as universities, through a limited-access system that restricts republishing of the data and ensures compliance with on-site inspections and other forms of oversight; or to the public, using some method of data-swapping.

Enter NISS, which had collaborated with NCES on other projects and was interested in the data-swapping problem.

The trick to effective data-swapping is choosing the cells in a table that can be interchanged without ruining the core utility of the data or in some other way revealing private information, says Alan Karr, who is leading the project for NISS.

"A good choice would be one that creates a high level of protection, but a low level of distortion in the data," Karr says. "You want to avoid things where you create 4-year-olds with 10 children [but] essentially, it's always going to be the case that the more protection you have, the more distortion you have. There's just no way around that.

"But it turns out that in some examples we've looked at, some choices seem to be better than others, and no one before has had a tool to let you see that in a systematic way," he says.

NISS developed both a downloadable application and a set of web-based tools for data-swapping.
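The trade-off Karr describes, protection on one side and distortion on the other, can be made concrete with two simple scores. The measures below are illustrative stand-ins chosen for this sketch; the article does not say which risk and utility measures the NISS toolkit computes, and the function names and sample records are invented.

```python
from collections import Counter

def changed_fraction(original, released, swapped_vars):
    """Crude protection score: the share of records in which at least one
    swapped variable no longer shows its true value."""
    changed = sum(
        any(o[v] != r[v] for v in swapped_vars)
        for o, r in zip(original, released)
    )
    return changed / len(original)

def table_distortion(original, released, row_var, col_var):
    """Crude distortion score: the share of records that land in a
    different cell of the row_var x col_var cross-tabulation (the total
    variation distance between the two tables)."""
    before = Counter((r[row_var], r[col_var]) for r in original)
    after = Counter((r[row_var], r[col_var]) for r in released)
    cells = set(before) | set(after)
    return sum(abs(before[c] - after[c]) for c in cells) / (2 * len(original))

# Invented toy data: age group has been swapped between the first two records.
original = [
    {"age_group": "under 18", "children": 0},
    {"age_group": "35-49", "children": 3},
    {"age_group": "35-49", "children": 2},
    {"age_group": "18-34", "children": 0},
]
released = [
    {"age_group": "35-49", "children": 0},
    {"age_group": "under 18", "children": 3},   # an implausible record
    {"age_group": "35-49", "children": 2},
    {"age_group": "18-34", "children": 0},
]
protection = changed_fraction(original, released, ["age_group"])
distortion = table_distortion(original, released, "age_group", "children")
```

In this toy case half the records carry a false age group, and half the records have moved to a different cell of the age-by-children table. The swap has also produced a minor with three children, the kind of implausible record Karr warns against.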

NISS' toolkit allows users to preview all possible choices dynamically: "There's a manual version where you can select which category - you have to select one, but you can pick two - and you can see the risk and utility associated with this. But there's also a kind of a batch facility that will do all the one-variable and two-variable swaps and produce a file which you can visualize, and compare these visually so you can see which ones are better than the others." This can be particularly useful for agencies that have to weigh the effects of swapping multiple variables in tables to ensure the greatest security and reliability of the end product, he says.
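A batch run of the kind Karr describes might look like the sketch below. It enumerates every one-variable and two-variable choice, scores each with a caller-supplied function, and keeps the choices that no other choice beats on both counts. The function names and the `evaluate` callback are assumptions made for this illustration, not the toolkit's interface.

```python
from itertools import combinations

def candidate_swaps(variables):
    """All one-variable and two-variable choices, as tuples."""
    return [c for k in (1, 2) for c in combinations(variables, k)]

def run_batch(variables, evaluate):
    """Score every candidate choice.

    `evaluate(choice)` is a hypothetical callback that performs the swap
    and returns (risk, distortion); lower is better for both.
    """
    return [(choice, *evaluate(choice)) for choice in candidate_swaps(variables)]

def undominated(results):
    """Keep the candidates that no other candidate matches or beats on
    both risk and distortion while being strictly better on one."""
    keep = []
    for choice, risk, dist in results:
        beaten = any(
            r <= risk and d <= dist and (r < risk or d < dist)
            for _, r, d in results
        )
        if not beaten:
            keep.append((choice, risk, dist))
    return keep
```

With n candidate variables the batch covers n + n(n-1)/2 choices, so a table with six swappable variables yields 21 runs to compare. Plotting risk against distortion for those runs gives the kind of visual comparison the quote refers to.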

NCES is gearing up to begin using the NISS toolkit in its data releases.

But how much of the published data would be subject to swapping? Seastrom declined to say, since it could give outsiders the tools they need to penetrate the veil of privacy the agency maintains so carefully around its data.

"Census, for example, has maybe half a dozen people who know what the swapping rate is," she says. "We are rather closemouthed about what we do and how much we do, for obvious reasons."