Benefiting from Big Data While Protecting Individual Privacy
Khaled El Emam
Most people would agree we are entering the age of “big data.” This is a time when large amounts of data from multiple sources are being collected and linked to perform sophisticated analytics for many different purposes. The data tends to be personal, in that it characterizes individual behaviors and conditions: Internet surfing patterns, purchasing behavior in stores, health information, details of financial transactions, and physical movements, to name just a few examples. All this personal information, especially when combined, paints a detailed picture of individuals: their likes and dislikes; what they do; and when and where they do it.
Many discussions about big data center around the technology that is needed to process such large volumes of information. Our traditional data management and data processing tools cannot handle the large volumes of data that are being collected. Therefore, completely new systems and algorithms are being developed to process big data efficiently and accurately to “find the signal in the noise.” Particular challenges include extracting information from unstructured data (e.g., free-form text instead of fields in a database), and linking data from multiple sources accurately to obtain detailed profiles about individuals.
The analytics performed on big data can be beneficial to the individuals themselves, and to society as a whole. For example, analytics can recommend products that individuals may be interested in or need. Similarly, analytics on linked health data may identify interventions that are beneficial to people with a particular disease or condition, or detect adverse drug events that are serious and warrant removing a drug from the market or restricting the indications for a drug or device.
One of the questions that comes up when we talk about big data is, where does all of this information come from in the first place? Some of it is customer data collected by the various organizations that are providing different products and services. Another large source of data is freely available online as individuals provide more details about their lives and interests on social-networking sites, on blogs, and in their “tweets” (via Twitter). In some cases, it is possible to buy individual-level data, for example, about magazine subscriptions or financial transactions. Government registries also provide useful information, such as date of birth information and data on things such as liens.
Aggregate or summary data (e.g., averages or percentages) can be helpful for this kind of analytics as well. For example, by just knowing an individual’s zip or postal code, it is possible to get a good estimate of that person’s income, level of education, and number of children using just aggregate data. Existing legal frameworks allow the collection, use, and disclosure of personal information as long as it is de-identified (or anonymized), and there is no requirement to obtain individuals’ consent if this is the case. However, de-identification not only applies to the original data; it also applies to data that has been linked with other information. Therefore, as different data sources are integrated, there is a constant need to evaluate identifiability to ensure that the risk of re-identification remains acceptably low.
One advantage of having lots of data, or big data, to analyze is that it makes de-identification easier to achieve. The reason is that there is a greater likelihood of finding many similar people in a big data set than in a smaller one. By definition, smaller data sets are more challenging to manage from an identifiability perspective because it is easier to be unique in a smaller database. In order to more fully understand the nuances around de-identification practice and de-identification regulations, it is important to understand the distinction between “identity disclosure” and “attribute disclosure.” Privacy laws only regulate identity disclosure, which is when the identity of an individual can be determined by examining a database. For example, if an “adversary” (someone who tries to re-identify records in the data set) can determine that record number 7 belongs to Bob Smith, this would be considered “identity disclosure” because the identity of record number 7 is now known to be Bob’s.
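The intuition that records are easier to single out in small data sets can be sketched in a few lines of code. The example below is a minimal illustration, not a real de-identification tool: the quasi-identifiers (an age band and a truncated zip code) and the records are hypothetical, and it simply counts how many records share each combination of values.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Group records by their quasi-identifier values and count how many
    records share each combination (the equivalence class size)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return Counter(keys)

def unique_fraction(records, quasi_identifiers):
    """Fraction of records that are unique on the quasi-identifiers.
    Unique records are the easiest to re-identify."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    unique = sum(1 for count in sizes.values() if count == 1)
    return unique / len(records)

# Hypothetical records, with (age band, 3-digit zip) as quasi-identifiers.
small = [
    {"age": "40-49", "zip3": "123"},
    {"age": "30-39", "zip3": "456"},
    {"age": "50-59", "zip3": "789"},
]
big = small * 100  # many people now share each combination

print(unique_fraction(small, ["age", "zip3"]))  # 1.0 -- every record is unique
print(unique_fraction(big, ["age", "zip3"]))    # 0.0 -- no record is unique
```

In the small data set every record is one of a kind, so every record is at risk; in the larger one, each person hides among ninety-nine others with the same quasi-identifier values.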
“Attribute disclosure” is less straightforward to understand, but this example, pertaining to vaccination of teenage girls against HPV (the human papillomavirus, a virus that causes cervical cancer), should serve the purpose. Suppose someone performs an analysis on an HPV data set that includes information on religious affiliation and discovers that most people of religion “A” do not vaccinate their teenage daughters against HPV, because HPV transmission is correlated with sexual activity and they therefore argue their daughters do not need the vaccine. This is an example of “attribute disclosure”: we discover that a particular group, characterized by their religion in this instance, has a particular attribute or behavior.
Although no individual records in the database were identified, if it is known that Bob Smith follows religion “A,” one can learn something new about him whether he is in the database or not. We can generalize this example to, say, retail. From analyzing a large retail database linked with a magazine subscription list, we may discover that the majority of forty-year-old women who are stay-at-home moms in zip code 12345 like tea, read a particular type of magazine, and have a particular political affiliation. This conclusion does not identify any individuals, but we are still able to come to certain conclusions about these women and their lifestyles. With this information, it is possible to target advertisements precisely to these women, even though no one’s identity was revealed in drawing the conclusion from the database.

As mentioned, privacy laws do not regulate attribute disclosure. Therefore, drawing inferences from databases is still a valid exercise, as long as the original data and any linked data sets are convincingly de-identified. In fact, an examination of the evidence on real-world re-identification attacks reveals that they are all cases of “identity disclosure,” which is the main type of attack one needs to pragmatically protect against. But to address concerns about “attribute disclosure” inferences, transparency is important. By transparency, I mean letting individuals know what data is being collected about them, what data is being linked or could be linked to it, and how it is being used. Giving individuals in a database an opt-out option would not be practical because the data would already be de-identified.
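The mechanics of attribute disclosure can be made concrete with a short sketch. The records and field names below are hypothetical; the point is that the computation touches no identities at all, yet its output supports an inference about anyone known to belong to a group.

```python
from collections import defaultdict

def group_rates(records, group_field, attribute_field):
    """Compute the rate of a boolean attribute within each group from
    de-identified records. No individual record is identified."""
    counts = defaultdict(lambda: [0, 0])  # group -> [with attribute, total]
    for r in records:
        stats = counts[r[group_field]]
        stats[0] += 1 if r[attribute_field] else 0
        stats[1] += 1
    return {g: with_attr / total for g, (with_attr, total) in counts.items()}

# Hypothetical de-identified records.
records = [
    {"religion": "A", "vaccinated": False},
    {"religion": "A", "vaccinated": False},
    {"religion": "A", "vaccinated": True},
    {"religion": "B", "vaccinated": True},
    {"religion": "B", "vaccinated": True},
]

rates = group_rates(records, "religion", "vaccinated")
print(rates)
# If Bob Smith is known to follow religion "A", one can infer his daughters
# are unlikely to be vaccinated, whether or not he is in the database.
# That inference is the attribute disclosure.
```

Nothing in the output names an individual, which is exactly why such inferences fall outside identity-disclosure protections and why transparency about collection and linkage matters.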