Plans to make transcribed data from the 2015 Hugo nominating ballots available upon request have been put on hold.
E Pluribus Hugo advocates, who want to use the data to demonstrate the EPH vote tallying method is effective at coping with slates, got the Sasquan business meeting to pass a non-binding resolution (item B.2.3) asking for the release of anonymized raw nominating data from the 2015 Hugo Awards.
When the resolution passed, Sasquan Vice-Chair Glenn Glazer announced Sasquan would comply with the request. The intent was to provide equal access to the data, and those interested in receiving a copy were invited to e-mail the committee.
However, Glazer confirms he recently e-mailed the following update to a person who requested the data, as reported by Vox Day:
Back at Sasquan, the BM passed a non-binding resolution to request that Sasquan provide anonymized nomination data from the 2015 Hugo Awards. I stood before the BM and said, as its official representative, that we would comply with such requests. However, new information has come in which has caused us to reverse that decision. Specifically, upon review, the administration team believes it may not be possible to anonymize the nominating data sufficiently to allow for a public release. We are investigating alternatives.
Thank you for your patience in this matter. While we truly wish to comply with the resolution and fundamentally believe in transparent processes, we must hold the privacy of our members paramount and I hope that you understand this set of priorities.
Best, Glenn Glazer
Vice-Chair, Business and Finance
Sasquan, the 73rd World Science Fiction Convention
And Hugo Administrator John Lorentz added information in this follow-up e-mail:
What wasn’t included in Glenn’s statement is that this year’s Hugo system administrators are working with a committee composed of proponents of EPH, so that proposal can be tested without any privacy violations that might occur by releasing the data with no controls.
As Hugo administrators, we have always assured members that their votes are private and secret, and we don’t want to do something that might change that. That is our primary responsibility.
Sasquan Hugo Administrator
On September 1, in an exchange among several commenters here at File 770, Lorentz remarked on the difficulties of anonymizing voter data:
[Commenter] “With the Hugo data, the only identifying info is the membership number. Remove that, and the ballot has been anonymized.”
[Brian C] No, it’s not nearly that simple.
You also need to eliminate any nominations that are unique to one or a handful of people, as otherwise those nominations could be used to identify people. But then those ballots aren’t actually representative for the purpose of testing the algorithm. So you need to actually replace those with other nominations, that happen not to perturb the algorithm in any way.
[John Lorentz] And that is the problem that our Hugo system admin folks have been running into. When one of them generated a draft of anonymized nominating data, it didn’t take the other very long to determine who some of the voters were, simply from the voting patterns.
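The point Brian C raises can be illustrated with a minimal Python sketch. It is not the method used by the Sasquan administrators, which has not been described publicly. The ballots, the work titles, and the function names `rare_nominations` and `identifiable_ballots` are all hypothetical, and the threshold `k` is an assumption chosen for illustration. The sketch shows only that stripping membership numbers leaves some ballots marked by nominations that almost nobody else made.

```python
from collections import Counter

def rare_nominations(ballots, k=2):
    """Return the set of works that appear on fewer than k ballots."""
    counts = Counter(work for ballot in ballots for work in set(ballot))
    return {work for work, n in counts.items() if n < k}

def identifiable_ballots(ballots, k=2):
    """Return indices of ballots containing at least one rare nomination."""
    rare = rare_nominations(ballots, k)
    return [i for i, ballot in enumerate(ballots) if rare & set(ballot)]

# Hypothetical ballots with membership numbers already removed.
ballots = [
    ["Novel A", "Novel B", "Novel C"],
    ["Novel A", "Novel B", "Obscure Novel X"],
    ["Novel A", "Novel C"],
    ["Novel B", "Novel C", "Obscure Novel Y"],
]

print(sorted(rare_nominations(ballots)))  # ['Obscure Novel X', 'Obscure Novel Y']
print(identifiable_ballots(ballots))      # [1, 3]
```

Anyone who knows which fan championed "Obscure Novel X" can read the rest of that person's ballot. Deleting the rare entries avoids this but changes the ballots, which is the trade-off described in the exchange above.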
Vox Day terms the latest development a “scandal.” Peter Grant was equally prompt to accuse Sasquan of having something to hide in “What, precisely, is going on with the Hugo Awards data?”
Folks, back in the 1980’s I was a Systems Engineer at IBM. I’ve had well over a decade in the commercial information technology and computer systems business, in positions ranging from Operator to Project Manager, from Programmer to End-User Computing Analyst to a directorship in a small IT company. Speaking from that background, let me assure you: I can ‘anonymize’ almost any data set in a couple of hours, no matter how complicated it may be. To allege that ‘it may not be possible to anonymize the nominating data sufficiently to allow for a public release’ is complete and utter BULL. Period. End of story.
However, one of Grant’s commenters pointed out: “Anonymizing data is harder than you think, if your goal is to actually make it truly anonymous. See what happened when AOL tried to anonymize search results, or when Netflix tried to anonymize movie recommendations.” He cited a 2009 Ars Technica article, adding “and metadata analysis hasn’t exactly gotten worse since then.”
The article says:
Examples of the anonymization failures aren’t hard to find.
When AOL researchers released a massive dataset of search queries, they first “anonymized” the data by scrubbing user IDs and IP addresses. When Netflix made a huge database of movie recommendations available for study, it spent time doing the same thing. Despite scrubbing the obviously identifiable information from the data, computer scientists were able to identify individual users in both datasets. (The Netflix team then moved on to Twitter users.)…
The Netflix case illustrates another principle, which is that the data itself might seem anonymous, but when paired with other existing data, reidentification becomes possible. A pair of computer scientists famously proved this point by combing movie recommendations found on the Internet Movie Database with the Netflix data, and they learned that people could quite easily be picked from the Netflix data.
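The linkage principle described in the article can be sketched in a few lines of Python. This is a deliberately simplified exact-subset match, not the statistical technique the researchers actually used. The record IDs, names, film titles, and the function `link_records` are hypothetical and exist only for illustration.

```python
def link_records(anonymous_records, public_profiles):
    """Match each public profile to an anonymous record when exactly one
    record contains every item the person has disclosed publicly."""
    matches = {}
    for name, public_items in public_profiles.items():
        candidates = [
            record_id
            for record_id, items in anonymous_records.items()
            if public_items <= items  # public items are a subset of the record
        ]
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            matches[name] = candidates[0]
    return matches

# Hypothetical "anonymized" release: IDs carry no personal information.
anonymous_records = {
    "r1": {"Film A", "Film B", "Film C"},
    "r2": {"Film A", "Film D", "Film E"},
    "r3": {"Film A", "Film B", "Film F"},
}

# Hypothetical public data, such as reviews posted under a real name.
public_profiles = {
    "alice": {"Film D", "Film E"},  # fits only r2
    "bob": {"Film A", "Film B"},    # fits r1 and r3, so stays ambiguous
}

print(link_records(anonymous_records, public_profiles))  # {'alice': 'r2'}
```

In this toy data, two public reviews are enough to single out one record. The same logic would apply to a nominating ballot if a member had blogged about some of their choices.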
EPH backers want to use the data to demonstrate their voting system. By contrast, a commenter at Vox Popoli said he wants to analyze the data to learn:
- How many slates there were in competition
- How good party discipline was for the various slates
- How many voted mixed slates of sad/rabid, TOR/SJW, etc.
- How the 4/6 and EPH proposals would have affected the outcome of the competing slates
Update 09/08/2015: Corrected the attribution of Brian C’s comment.