
How The Guardian and AFP used machine learning to understand quotes

2022-05-11. Quotes have always been used in news articles to bring life to a story and, more importantly, to add authenticity, accuracy and balance. Data scientists at The Guardian and AFP have found a way to give quotes a life of their own and to ensure that the accuracy and attribution of the sources are ironclad – thanks in part to AI.

by Neha Gupta neha.gupta@wan-ifra.org | May 11, 2022

In 2021, The Guardian took part in the Journalism AI Collab Challenges, a project connecting global newsrooms to understand how artificial intelligence can improve journalism. 

One particular challenge was to answer the question “How might we use modular journalism and AI to assemble new storytelling formats and reach underserved audiences?”

Anna Vissens, Lead Scientist, and Michel Schammel, Senior Data Scientist, at Guardian News & Media, United Kingdom, joined WAN-IFRA’s virtual Newsroom Summit in late April to talk about the learnings from this project.

What are quotes?

The team defined modules as fragments of a story that can live independently but can also be repurposed, or even replaced, by another fragment. By this definition, quotes clearly qualify as modules.

Taking Wikipedia as the starting point, here’s how the team defined a quote: 

A quotation is the repetition of a sentence, phrase, or passage from speech or text that someone has said or written. In oral speech, it is the representation of an utterance that is introduced by a quotative marker, such as a verb of saying. For example: John said: “I saw Mary today.” 

In written text, quotations are signaled by quotation marks.

“It looks simple but we wrestled with questions like – what about song lyrics? Or poems? Are they quotes? What if someone doesn’t say it but thinks about it? Do we treat thoughts as we would speech?” said Vissens.

Why are they doing this?

There are several use cases for this type of module – from creating new content to tracking shifting opinions on the same subject over time. 

According to Vissens, fact-checking and investigation teams might also benefit from such a tool. 

Another interesting use case the team found was revealing hidden insights about the diversity of their content, which raised several questions around The Guardian’s sources, their diversity, how often the brand quotes the same people, and if different ethnic groups and genders receive the same exposure. 

Deciding what to keep and what to exclude

The team listed about eight exclusions. For instance, they decided not to label text without quotation marks as a quote. They also made a design decision to clearly separate paraphrases and quotes, and focus their efforts on identifying the text in quotation marks only. 

However, at the same time, Vissens and Schammel wanted to teach their model to distinguish between quotes and random words within quotation marks. “Our aim from a machine learning perspective was to accurately detect real quotes and bring context surrounding these quotes back later,” Vissens said.
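To see why quotation marks alone are not enough, consider a minimal Python sketch that simply pulls out every span between double quotation marks. It is not the team's method: the regular expression, the function name `quoted_candidates` and the sample sentences are all assumptions made for illustration.

```python
import re

# Matches text between curly or straight double quotation marks.
# This pattern is an illustrative assumption, not the team's rule set.
QUOTED = re.compile(r'[“"]([^“”"]+)[”"]')


def quoted_candidates(text):
    """Return (start, end, inner_text) for each span in double quotation marks."""
    return [(m.start(), m.end(), m.group(1)) for m in QUOTED.finditer(text)]


# Hypothetical sample text: one stylistic phrase and one real quote.
sample = (
    'The minister called the plan a “game changer”. '
    '“We will publish the full report next week,” she said.'
)

candidates = quoted_candidates(sample)
for start, end, inner in candidates:
    print(start, end, repr(inner))
```

Both spans come back as candidates, although only the second is a quote with a source and a cue. Telling the two apart is the part that needs a trained model.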

The team created a clear and concise guide for annotating the company’s data. This was done to help multiple annotators understand the task in a uniform way to minimise noise and uncertainty in the training data set. 

“We started by looking at data to find out how quotes are constructed and found around 15 different constructs. The main challenge in building the training data set was navigating the ambiguity of different journalistic styles,” Vissens said.

Annotation workflow

Along with AFP, the team annotated nearly 1,000 news articles with three entities: content (the quote, in quotation marks), source (people, organisations, etc.) and cue (usually a verb phrase indicating the act of speech or expression). These annotations were then used to train a named entity recognition model.
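As a rough illustration of what one annotated example could look like, the sketch below labels the Wikipedia sentence quoted earlier with character-offset spans. The layout is loosely modelled on the span records that annotation tools commonly work with; the field names, the label strings (`SOURCE`, `CUE`, `CONTENT`) and the helper `span` are assumptions, not the team's actual schema.

```python
text = 'John said: “I saw Mary today.”'


def span(text, phrase, label):
    """Build a character-offset span for the first occurrence of `phrase`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# One annotated record: the raw text plus its three labelled spans.
record = {
    "text": text,
    "spans": [
        span(text, "John", "SOURCE"),
        span(text, "said", "CUE"),
        span(text, "“I saw Mary today.”", "CONTENT"),
    ],
}

for s in record["spans"]:
    print(s["label"], repr(text[s["start"]:s["end"]]))
```

Storing offsets rather than copied strings keeps each label tied to an exact position in the article, which matters when the same word appears more than once.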

Annotation tool Prodigy’s UI with three labels for source, content and cue.

The team used two tools created by Explosion to train the model to identify quotes in text. 

  • spaCy: An open-source library for advanced natural language processing using deep neural networks.
  • Prodigy: An annotation tool that provides an easy-to-use web interface for quick and efficient labelling of training data.

“After manually annotating those 1,000 articles, we had our first baseline model ready. We put it in the loop so we could already see predictions coming in,” Vissens said. Not only did the prototype model speed up the team’s workflow, it also gave them insight into where the model was lacking or not working altogether.  

She said it was interesting to observe the improvement of the model over time, because it also helped the team become better annotators through the learning process.  

The first batch of the team’s annotations turned out to be noisy and inconsistent but they got increasingly better with each iteration. Once the team had collected enough training data, it launched the final version of the model. 

Findings from the final model

The trained model managed to correctly identify all three entities in 90 percent of the cases. 

  • Cue showed the highest precision, at 96 percent.
  • Content followed at 91 percent.
  • Source came last at 82 percent.

To evaluate the model, the team used the strictest way of measuring the performance of named entity recognition, where each predicted entity needed to match exactly (from start to end) with respect to the annotated data. Even in cases where the model was getting it wrong, the team often found it managed to partially match the entity. This was especially true for source entities.
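The strict matching described here can be sketched in a few lines of Python. This is a generic illustration of exact-match scoring, not the team's evaluation code: the function names, the tuple layout `(start, end, label)` and the offsets are assumptions.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F1 where a prediction counts only if its
    (start, end, label) is identical to a gold annotation."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


def partial_matches(gold, predicted):
    """Predictions that overlap a gold span with the same label
    but do not match it exactly."""
    gold_set = set(gold)
    return [
        p for p in predicted
        if p not in gold_set
        and any(p[2] == g[2] and p[0] < g[1] and g[0] < p[1] for g in gold)
    ]


# Hypothetical offsets: the predicted source is cut short by three characters,
# for example "Jane Smith" instead of "Dr Jane Smith".
gold = [(0, 13, "SOURCE"), (14, 18, "CUE"), (20, 45, "CONTENT")]
predicted = [(3, 13, "SOURCE"), (14, 18, "CUE"), (20, 45, "CONTENT")]

precision, recall, f1 = exact_match_scores(gold, predicted)
near_misses = partial_matches(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))
print(near_misses)
```

Under strict scoring the truncated source counts as a full miss, even though it overlaps most of the annotated span. That is the kind of partial match the team reported seeing, particularly for sources.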

Schammel said the gap in performance between the entities was not surprising.

“The content entity has an advantage that it has a strong signal coming from the quotation marks and the difficulty is to distinguish between real quotes and quoted text for stylistic reasons,” he said.

“However, the model has also learned to exclude phrases in quotation marks that are not real quotes,” he continued. “For source and cue, we have false positives. Sometimes the model would flag up source-cue pairs without associated content, and we are aiming to overcome this issue with co-reference and the post-processing step.”
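The article does not describe what the post-processing step does. One simple possibility, shown here purely as an assumption, is to discard source and cue predictions in any sentence that contains no content span. The function name `drop_orphan_pairs`, the data layout and the sample predictions are all hypothetical.

```python
def drop_orphan_pairs(sentences):
    """Keep SOURCE and CUE spans only in sentences that also contain CONTENT.

    `sentences` is a list with one inner list per sentence; each inner list
    holds (text, label) tuples for the entities predicted in that sentence.
    """
    cleaned = []
    for spans in sentences:
        has_content = any(label == "CONTENT" for _, label in spans)
        # Without a CONTENT span, any SOURCE or CUE is treated as a false positive.
        cleaned.append(list(spans) if has_content else [])
    return cleaned


# Hypothetical model output for two sentences.
predictions = [
    [("she", "SOURCE"), ("said", "CUE"), ("“We will publish.”", "CONTENT")],
    [("The minister", "SOURCE"), ("announced", "CUE")],  # paraphrase, no quote
]

cleaned = drop_orphan_pairs(predictions)
print(cleaned)
```

A per-sentence rule like this would wrongly drop a source whose quote sits in a neighbouring sentence, so a production step would need a wider window or the coreference information Schammel mentions.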

What’s next?

Moving forward, the team aims to build a robust system for coreference resolution, the process of linking a source that is mentioned only by a pronoun back to the person or organisation it refers to. After trying various machine learning approaches based on existing libraries, none of which worked out, the team ended up building its own coreference model.

The team at Agence France-Presse (AFP) built a prototype of a quote search engine called QuoteMachine. An application like this could enable journalists to surface previous quotes quickly to check them against current statements and to enrich their articles. 

A prototype of QuoteMachine, a user-facing tool, built by AFP’s Arnaud Pichon and Fred Bourgeais.

Schammel acknowledged another challenge would be to identify meaningful quotes.

“We are confident that a combination of machine learning, existing metadata about articles, and additional information extracted from sources and content might give us a strong signal for classifying quotes,” he said. 

Neha Gupta

Research Editor

neha.gupta@wan-ifra.org