University Libraries project uses machine learning to identify racist language in state laws
Two years ago, a high school social studies teacher in Caldwell County, North Carolina, approached Sarah Carrier, North Carolina research and instruction librarian at the University Libraries, in search of a resource for teaching about the era of Jim Crow. Had anyone produced a comprehensive list of all the Jim Crow laws passed in the state of North Carolina?
Carrier’s short answer to his question was no. The closest source would be Pauli Murray’s “States’ Laws on Race and Color,” published in 1951.
Though volumes of public and private North Carolina session laws have been digitized, their pages exist online only as images, so the text they contain cannot be searched or analyzed.
“Helping teachers is a big part of what I do, and I try to do it as fast and efficiently as possible. But downloading and searching through files to find race-based legislation was incredibly time consuming. It wasn’t feasible,” explains Carrier.
“I was taking a workshop to learn more about text analysis,” she recalls, “and I brought this to Matt Jansen, our data analysis librarian. Was this something we could do?”
Thanks to an interdisciplinary team of librarians with expertise in special collections, data analysis, digital research and data visualization, plus subject matter experts in African American history and African American studies, the answer was yes. The result is On the Books: Jim Crow and Algorithms of Resistance, a project that uses text mining and machine learning to identify racist language in legal documents.
The first iteration of On the Books went live in August 2020. Viewers can read or search through all the Jim Crow laws that the project identified. The site also includes a downloadable text file of the laws; a separate file of all North Carolina statutes from 1866 to 1967; the computer programs written for the project; a white paper describing the project’s methods; and resources for educators and researchers that contextualize North Carolina segregation laws.
The Andrew W. Mellon Foundation supported the first phase of On the Books through the Collections as Data—Part to Whole initiative, based at the University of Nevada, Las Vegas and the University of Iowa. The project continues thanks to a grant from the Association of Research Libraries.
Project lead and co-principal investigator Amanda Henley, head of the University Libraries’ Digital Research Services, says projects such as On the Books treat library collections as rich data sources. It’s one way that libraries can lead in the emerging field of data science.
“For the first phase, we were putting together the best corpus we could. This took everyone,” says Henley.
“We had to collect all the images from more than 100 years of laws, prepare them to be read, removing blank pages and marginalia on the page edges, smoothing and brightening the images to get the best optical character recognition and dividing text into individual laws,” she explains.
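The last step Henley describes, dividing the recognized text into individual laws, can be pictured with a short Python sketch. It is an illustration only, not the project's published code: it assumes each law begins with a heading line such as "CHAPTER 12", and the `split_into_laws` function and the sample text are hypothetical.

```python
import re

# Assumption: every law starts with a line like "CHAPTER 12".
# Real session-law volumes may use other heading styles.
CHAPTER_HEADING = re.compile(r"^CHAPTER[ \t]+\d+[ \t]*$", re.MULTILINE)


def split_into_laws(volume_text):
    """Return a list of (heading, body) pairs, one per chapter heading found."""
    matches = list(CHAPTER_HEADING.finditer(volume_text))
    laws = []
    for i, match in enumerate(matches):
        # A law's body runs from the end of its heading to the next heading,
        # or to the end of the volume for the final law.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(volume_text)
        heading = match.group().strip()
        body = volume_text[match.end():end].strip()
        laws.append((heading, body))
    return laws


# Invented sample text standing in for OCR output.
sample_volume = (
    "CHAPTER 1\n"
    "An act to authorize the town of Example to issue bonds.\n"
    "\n"
    "CHAPTER 2\n"
    "An act to levy a tax for the repair of public roads.\n"
)

laws = split_into_laws(sample_volume)
for heading, body in laws:
    print(heading, "|", body)
```

In practice, OCR errors in the headings themselves would likely call for a more forgiving pattern than the exact match assumed here.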
An important step was engaging with scholars to analyze the laws. William Sturkey, associate professor of history at Carolina and an expert on the history of race in the American South, and Kimber Thomas, CLIR postdoctoral fellow in data curation for African American collections at the University Libraries, provided the analysis that the project team relied on.
Sturkey and Thomas went through a large sample of laws and categorized them as “Jim Crow” or “not Jim Crow.” The classified laws served as a training set to teach the computer program to identify additional laws on its own.
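The idea of a training set can be sketched in a few lines of Python. The story does not say which algorithm the project used, so the sketch below substitutes a simple naive Bayes word-count classifier as a stand-in. The function names, the labels and the four sample "laws" are invented for illustration, and a real training set would contain far more examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_laws):
    """Count how often each word appears under each expert-assigned label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labeled_laws:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label whose word statistics best match the text."""
    word_counts, label_counts, vocabulary = model
    total_laws = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total_laws)
        # Add-one smoothing keeps unseen word/label pairs from zeroing a score.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented examples standing in for the laws the scholars labeled by hand.
training_set = [
    ("Separate schools shall be maintained for the children of each race.", "jim_crow"),
    ("Railroad companies shall provide separate waiting rooms for each race.", "jim_crow"),
    ("The county commissioners may levy a tax to repair public roads.", "not_jim_crow"),
    ("The town of Example is authorized to issue bonds for a water system.", "not_jim_crow"),
]

model = train(training_set)
prediction = classify("The board shall provide separate libraries for each race.", model)
print(prediction)
```

Once trained on the hand-labeled sample, the same `classify` call can be run over laws no one has reviewed, which is the step that lets a program flag candidates for experts to check.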
“In many of the laws, there’s no question about the intent—the law segregating schools is clearly a Jim Crow law. Other laws might be up for interpretation. We were looking for anything that required racial segregation or stratification in any way,” says Sturkey.
The machine learning model eventually uncovered more than 900 laws that could be classified as Jim Crow. Being able to see and study them as a single body clarifies the extensive scope of Jim Crow in the American South and can help people to understand the history of race in America.
“This wasn’t just a couple of laws from the 1860s,” says Sturkey. “These laws were pervasive, inconvenient and unconstitutional, and they were the result of intricate, detailed planning to build the system of Jim Crow. These laws intended to maintain white supremacy, and they went on for decades and decades.”
Jansen, co-principal investigator and technical lead of the project, says that by publishing the team’s process and scripts, they can help others tackle similar text analysis projects focusing on laws elsewhere.
“Reading and understanding someone else’s code or adapting someone else’s code can be harder than writing your own,” he says. “We have provided explanations and examples to go along with our scripts to make them as easy to understand as possible. We continue to improve the corpus and identify additional laws, as the project is funded into 2021. This makes our outcomes better and helps support future users.”
“The library is the lab for liberal arts scholars,” Henley says. “We have unique collections, and through this project we’ve gained the expertise to make them available for computational use. Now that we have this corpus, what other kinds of research questions can be asked of it to dig deeper?”
Story by Courtney Mitchell
This story originally appeared in the fall/winter 2020 issue of Windows, the magazine of the Friends of the Library at the University Libraries.