Census Bureau’s use of ‘synthetic data’ worries researchers
ORLANDO, Fla. — First came the “noise” — minor errors the U.S. Census Bureau introduced into the 2020 census data to protect participants’ privacy. Now the bureau is looking into “synthetic data,” manipulating the numbers widely used for economic and demographic research to obscure the identities of people who provided information.
The moves have some researchers up in arms, worried that the statistical agency could sacrifice accuracy to protect privacy. At a virtual conference last week, Census Bureau statisticians disclosed that over the next three years they will work toward developing a method to create “synthetic data” for files on individuals and homes that are already devoid of personalized information. These files, known as American Community Survey microdata, are used by researchers to create customized tables tailored to their research.
Census Bureau statisticians said more privacy protections are needed as technological innovations magnify the threat of people being identified through their confidential survey answers. Computing power is now so vast that it can quickly crunch third-party data sets that combine personal information from credit rating and social media companies, purchasing records, voting patterns, and public documents, among other things.
“It’s a balancing act. The law requires us to do competing things. We need to release national statistics to allow people to make good decisions. But we also have to protect the privacy of our respondents,” said Rolando Rodriguez, a Census Bureau statistician, at the conference.
But critics say the proposal, coupled with an ongoing effort to add minor inaccuracies to the 2020 census data to protect participants’ privacy, undermines the Census Bureau’s credibility as the go-to provider of precise data about the U.S. population.
University of Minnesota demographer Steven Ruggles said bluntly that synthetic data “will not be suitable for research.” “The Census Bureau is inventing imaginary threats to confidentiality to reduce public access to data sharply,” Ruggles said. “I do not think this will stand because society needs information to function.”
The microdata comes from the American Community Survey, which samples 3.5 million households each year. The results are extrapolated to populations of all sizes, from the entire nation down to neighborhoods, yielding estimates of the nation’s demographic makeup and housing characteristics. Ruggles said that microdata is used in around 12,000 research papers a year.
The synthetic data are created by using variables in the microdata to build models that recreate the variables’ interrelationships, then constructing a simulated population based on those models. Scholars would conduct their research using the simulated population, that is, the synthetic data, and could then submit their analyses to the Census Bureau for double-checking against the actual data to make sure the results are correct.
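For readers who want a concrete picture of that process, here is a minimal sketch in Python. It is an illustration only, not the Census Bureau's method: the records, the two variables (`age_group` and `tenure`) and every function name are invented for the example. It fits a very simple model of how the two variables relate, draws a simulated population from that model, and then compares one statistic on the simulated and the original records.

```python
import random
from collections import Counter, defaultdict

# Toy stand-in for confidential microdata: (age_group, tenure) pairs.
# These values are invented for illustration, not real survey data.
real_records = (
    [("under_35", "rent")] * 30 + [("under_35", "own")] * 10
    + [("35_plus", "rent")] * 15 + [("35_plus", "own")] * 45
)

def fit_model(records):
    """Count how often each age group occurs and, within each age group,
    how often each tenure occurs (the variables' interrelationship)."""
    age_counts = Counter(age for age, _ in records)
    tenure_counts = defaultdict(Counter)
    for age, tenure in records:
        tenure_counts[age][tenure] += 1
    return age_counts, tenure_counts

def synthesize(model, n, seed=0):
    """Draw n simulated people from the fitted model, one variable at a time."""
    age_counts, tenure_counts = model
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    ages = list(age_counts)
    age_weights = [age_counts[a] for a in ages]
    synthetic = []
    for _ in range(n):
        age = rng.choices(ages, weights=age_weights)[0]
        tenures = list(tenure_counts[age])
        weights = [tenure_counts[age][t] for t in tenures]
        tenure = rng.choices(tenures, weights=weights)[0]
        synthetic.append((age, tenure))
    return synthetic

def own_rate(records, age_group):
    """Share of people in the given age group who own their home."""
    group = [tenure for age, tenure in records if age == age_group]
    return sum(tenure == "own" for tenure in group) / len(group)

model = fit_model(real_records)
synthetic_records = synthesize(model, n=10_000)

# The "double-checking" step: run the same analysis on both data sets.
print("real:     ", own_rate(real_records, "35_plus"))
print("synthetic:", own_rate(synthetic_records, "35_plus"))
```

No simulated person corresponds to a real respondent, yet the ownership rate comes out close to the real one because the model captured that relationship. The researchers' worry is the flip side: any relationship the model does not capture will be missing or distorted in the simulated population.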