Think of the last time you had food poisoning. Did you tweet about it? Did you Google your symptoms? Or did you write an angry review on Yelp?
Every day, people use the internet to seek and share health information. This opens up exciting new ways for scientists to study the health of a population, an approach known as digital epidemiology.
But, in most cases, we do not know much about the individuals who post this information. We don’t know if the data include people from poor households, or how the data break down according to race, gender or age group. We also don’t know if they include those who are most vulnerable to the disease of interest in a particular study.
Before we can start addressing disparities in digital data, we need to show that these disparities exist. Our study of more than one million Yelp reviews suggests that poorer populations are being left out of digital data used for disease surveillance.
Since digital data are generated in nearly real time, they can be a valuable way for researchers to track disease trends.
For example, health departments in New York, Las Vegas and Chicago can now choose which restaurants to inspect by tracking reports of food poisoning on Twitter and Yelp. They also use the data to monitor disease outbreaks from contaminated food.
However, the evidence suggests that these cutting-edge techniques overlook a major segment of the population. For example, in 2015, a research team led by biologist Samuel Scarpino modeled influenza by combining digital and nondigital data sources, ranging from Google searches to ILINet, a nondigital government project that monitors outpatient health care providers.
ILINet might not cover populations of lower socioeconomic status, Scarpino told me. Adding social media data to his model didn’t seem to mitigate these disparities in representation. The team discovered they could accurately predict influenza hospitalizations for wealthier zip codes in the U.S., but not for poorer ones.
This suggests that other public health surveillance systems powered by digital data likely suffer from the same problems. We wanted to see if similar patterns were reflected in data from Yelp.com, and to better understand which factors are most strongly correlated with U.S. restaurant reviews.
Yelp provided us with more than 1.5 million reviews posted between 2004 and 2014 for food service businesses in Oregon, Massachusetts and Georgia. We looked at how the volume of reviews changed with seasons and day of the week. We also studied the most recent data, from June 2013 through May 2014, to assess food poisoning reporting at the county level.
Since the Yelp data included both good and bad reviews, we built a machine learning algorithm to extract the bad reviews that talked about food poisoning. Next, we estimated the correlation between reports of food poisoning and various socioeconomic factors, demographic factors and the geographic concentration of food service establishments.
We discovered that traits typically associated with higher socioeconomic status (such as a high percentage of residents with higher education or higher income) were consistently positively correlated with reports of food poisoning. For example, the correlation between the percentage of residents with a bachelor’s degree and reporting of foodborne illness was 0.44, on a scale where 0 indicates no association and 1 indicates a perfect positive association.
Meanwhile, people who were unemployed, didn’t have health insurance or were living in poverty were less likely to report food poisoning from restaurants.
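To make the correlation figures above concrete, here is a minimal sketch of how such a coefficient can be computed. It assumes a Pearson correlation, which may not be the exact measure used in our study, and the county values, variable names and the pearson_r helper are hypothetical, invented purely for illustration.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical county-level values (NOT data from the study).
pct_bachelors = [12.0, 18.5, 25.0, 31.0, 40.5, 47.0]   # % of residents with a bachelor's degree
pct_poverty = [24.0, 19.0, 17.5, 12.0, 10.5, 8.0]      # % of residents living in poverty
reports_per_100k = [0.4, 1.1, 0.9, 2.3, 2.0, 3.5]      # food poisoning reports per 100,000 residents

r_education = pearson_r(pct_bachelors, reports_per_100k)
r_poverty = pearson_r(pct_poverty, reports_per_100k)

print(f"education vs. reports: r = {r_education:.2f}")
print(f"poverty vs. reports:   r = {r_poverty:.2f}")
```

In this made-up example, the education coefficient comes out positive and the poverty coefficient negative, mirroring the direction of the relationships described above. The coefficient measures only how closely reporting tracks each factor across counties, not whether one causes the other.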
Our models suggested that counties with a high concentration of restaurants and of people with bachelor’s degrees were most likely to report food poisoning on Yelp.
However, this does not imply that these counties have higher incidences of foodborne illness or outbreaks. Data from the Centers for Disease Control and Prevention suggest that populations of low socioeconomic status have higher incidences of foodborne illness. The disparities in reporting of illness could be explained by differences in internet access or in health and computer literacy.
Dealing with disparities
If we do not monitor populations of low socioeconomic status, then we cannot adequately address their public health concerns.
While these studies hint at disparities in digital data, we need additional demographic data to properly quantify the representation of different populations.
Knowing the demographic and socioeconomic breakdown of the data can inform us about its biases toward particular groups. That can shape how we design research studies and public health surveillance systems.
It can also help us develop methods to address data limitations, so that we do not perpetuate existing health disparities due to poverty, educational inequalities and other factors.