What reading 3.5 million books tells us about gender stereotypes

"We have been able to confirm a widespread perception, only now at a statistical level," a lead researcher says

By Nicole Karlis

Senior Writer

Published August 28, 2019 6:00PM (EDT)

Books (Getty/Sandy Huffaker)

Huge social questions like "how are men and women perceived differently?" cannot be easily answered without analyzing rhetoric on a massive scale. But what if we could analyze millions of words, all at once, to get a sense of what patterns emerge in how men and women are described? It wasn't until recently that machine learning algorithms could help researchers do just that.

In a recent study, Dr. Isabelle Augenstein, a computer scientist at the University of Copenhagen, worked with fellow researchers from the United States to analyze 11 billion words in an effort to find out whether there was a difference between the adjectives used to describe men and women in literature. The researchers examined a dataset of 3.5 million books, all published in English between 1900 and 2008.

They found that “beautiful” and “sexy” were the two most frequently used adjectives to describe women in literature. The most frequent adjectives used to describe men were “righteous,” “rational” and “brave.”

"We are clearly able to see that the words used for women refer much more to their appearances than the words used to describe men,” Augenstein said. “Thus, we have been able to confirm a widespread perception, only now at a statistical level.”

Using machine learning, the researchers extracted adjectives and verbs connected to gender-specific nouns, like “daughter.” Then the researchers analyzed whether each word carried a positive, negative or neutral sentiment.
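To make that approach concrete, here is a minimal sketch of this kind of extraction. It is not the researchers' actual pipeline: it assumes spaCy's English dependency parser and a toy sentiment lexicon standing in for the full-scale resources a study like this would use.

```python
# Minimal sketch: find adjectives grammatically attached to gender-specific
# nouns, then label their sentiment. NOT the study's pipeline; assumes spaCy
# ("en_core_web_sm") and a toy, hand-written sentiment lexicon.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # small English model with a dependency parser

FEMALE_NOUNS = {"woman", "women", "girl", "daughter", "mother", "she", "her"}
MALE_NOUNS = {"man", "men", "boy", "son", "father", "he", "him"}

# Toy lexicon for illustration; the study used proper sentiment resources.
SENTIMENT = {"beautiful": "positive", "sexy": "positive", "brave": "positive",
             "rational": "positive", "weak": "negative", "plain": "negative"}

def gendered_adjectives(texts):
    """Count adjectives that modify, or are predicated of, gendered nouns."""
    counts = {"female": Counter(), "male": Counter()}
    for doc in nlp.pipe(texts):
        for tok in doc:
            if tok.pos_ != "ADJ":
                continue
            # Covers "the beautiful daughter" (amod head = noun) and
            # "she was brave" (head = verb, gendered subject among its children).
            head = tok.head
            targets = {head.lemma_.lower()} | {c.lemma_.lower() for c in head.children}
            if targets & FEMALE_NOUNS:
                counts["female"][tok.lemma_.lower()] += 1
            elif targets & MALE_NOUNS:
                counts["male"][tok.lemma_.lower()] += 1
    return counts

counts = gendered_adjectives(["The beautiful daughter smiled.", "He was a brave man."])
for gender, adjs in counts.items():
    for adj, n in adjs.items():
        print(gender, adj, n, SENTIMENT.get(adj, "neutral"))
```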

The analysis determined that negative verbs associated with appearance are used five times more often for women than for men. Likewise, positive and neutral adjectives relating to physical appearance occur twice as often in descriptions of women. The adjectives used to describe men in literature are more frequently ones that describe behavior and personal qualities.

The researchers noted that, although many of the analyzed books were published decades ago, they still play an active role in perpetuating gender discrimination, particularly when machine learning systems trained on such text are used to sort people in professional settings.

"The algorithms work to identify patterns, and whenever one is observed, it is perceived that something is ‘true’. If any of these patterns refer to biased language, the result will also be biased,” Augenstein said.  “The systems adopt, so to speak, the language that we people use, and thus, our gender stereotypes and prejudices.”

Augenstein explained this can be problematic if, for example, machine learning is used to sift through employee recommendations for a promotion.

"If the language we use to describe men and women differs, in employee recommendations for example, it will influence who is offered a job when companies use IT systems to sort through job applications,” she said.

There are some limitations to the analysis. First, the report does not factor in who wrote the passages that were analyzed, nor the degree to which gender biases already existed. It also does not distinguish between genres.

As artificial intelligence and computational linguistics become more prominent in our world, it is important to be aware of gendered language. Research shows that language can influence a society’s attitudes toward gender equality. Likewise, algorithms can often unintentionally reinforce existing biases.

The study confirms what previous research has found: that much of literature exhibits gender discrimination. For example, a 2011 study, "Gender in Twentieth-Century Children's Books," found that males are central characters in 57 percent of children's books published each year.

"The messages conveyed through representation of males and females in books contribute to children's ideas of what it means to be a boy, girl, man, or woman. The disparities we find point to the symbolic annihilation of women and girls, and particularly female animals, in 20th-century children's literature, suggesting to children that these characters are less important than their male counterparts," the authors of that study wrote. "The disproportionate numbers of males in central roles may encourage children to accept the invisibility of women and girls and to believe they are less important than men and boys, thereby reinforcing the gender system."

Augenstein said she hopes her research will raise awareness of how gendered language and stereotypes can shape future machine learning algorithms.

"We can try to take this into account when developing machine-learning models by either using less biased text or by forcing models to ignore or counteract bias,” Augenstein said. “All three things are possible."


Nicole Karlis is a senior writer at Salon, specializing in health and science. Tweet her @nicolekarlis.