Sibel Adalı: Research Blog -> NELA Toolkit and Datasets

NELA Toolkit and Datasets

One of the main problems in studying misinformation is that it comes in many forms: not only completely fabricated information, but also stories that distort the truth, remove context, or exaggerate the importance of certain facts. As a result, the study of misinformation must concentrate not only on factually false or fabricated information, but also on intentionally misleading and biased news coverage. Furthermore, media sources often operate with complex incentives, ranging from propagating incorrect narratives, to creating confusion for political reasons, to simple financial gain. All of these affect how information is created, presented, published, and copied across networks.

To help study this complex landscape, we have created the NELA Toolkit (nelatoolkit.science), which was demonstrated at the WWW 2018 conference and incorporates a number of components. The first component is a combination of classifiers that analyze news content based on stylistic markers and word usage, with the aim of estimating the potential objectivity and reliability of a story. The reliability estimate is based on how the writing style of the article differs from that of mainstream news sources. As described in our previous work, this method is quite effective in distinguishing between different types of sources. The second component is a visualization of news and media sources along many different axes, using our large set of content-based analysis features. This allows us to see how sources differ from each other across a spectrum of features. It also allows a researcher to investigate the most highly engaged news content published by a source in different time frames. The toolkit is continuously being developed and expanded with new modules.
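
To make the idea of style-based analysis concrete, here is a minimal Python sketch. It is not the NELA Toolkit's actual feature set or classifier: the four features, the function names (`style_features`, `mean_profile`, `style_distance`), and the simple distance measure are all assumptions chosen for illustration. It only shows the general shape of the approach: compute stylistic markers for an article and compare them with a profile built from reference (for example, mainstream) articles.

```python
import re


def style_features(text):
    """Compute a few simple stylistic markers for one article (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "exclaim_per_word": text.count("!") / n_words,
        "all_caps_ratio": sum(1 for w in words if len(w) > 1 and w.isupper()) / n_words,
    }


def mean_profile(texts):
    """Average the features of several reference articles into one profile."""
    rows = [style_features(t) for t in texts]
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}


def style_distance(features, reference):
    """Mean absolute difference between an article's features and a reference profile."""
    return sum(abs(features[k] - reference[k]) for k in reference) / len(reference)


# Hypothetical example texts, not taken from any dataset.
reference_texts = ["The city council approved the budget on Tuesday."]
candidate = "SHOCKING news! You WON'T believe this!"

profile = mean_profile(reference_texts)
print(style_distance(style_features(candidate), profile))
```

A real system would use many more features and a trained classifier rather than a raw distance; the sketch only illustrates how stylistic differences from a reference set can be turned into a number.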

To better understand news production and consumption, we have also collected and published two broad-based datasets (NELA2018 in ICWSM 2019 and NELA2017 in ICWSM 2018) containing all news stories from many different types of sources: mainstream, satire, hyperpartisan, and sources that are known to have published completely fabricated stories. For each source, we provide all political news published through its RSS feed from the moment collection started. This allows us to study not just the activity around a single topic, but the overall behavior of a source over time. NELA 2017 contains 136K news articles from 92 sources between April 2017 and October 2017, while NELA 2018 contains 713K articles from 194 sources between February 2018 and November 2018. In NELA 2018, we also collect ground truth labels for sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust.
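
As a small illustration of studying source behavior over time, the Python sketch below counts articles per source per month. The record layout (a dict with `source` and `published` fields) and the function name `monthly_counts` are hypothetical; they are not the published schema of the NELA datasets, and the sample records are made up for the example.

```python
from collections import Counter
from datetime import date


def monthly_counts(articles):
    """Count articles per (source, year-month) pair.

    Each article is assumed to be a dict with a 'source' string and a
    'published' date; this layout is hypothetical.
    """
    counts = Counter()
    for article in articles:
        month = article["published"].strftime("%Y-%m")
        counts[(article["source"], month)] += 1
    return counts


# Made-up records for illustration only.
sample = [
    {"source": "source_a", "published": date(2018, 2, 3)},
    {"source": "source_a", "published": date(2018, 2, 17)},
    {"source": "source_a", "published": date(2018, 3, 1)},
    {"source": "source_b", "published": date(2018, 2, 9)},
]

for (source, month), n in sorted(monthly_counts(sample).items()):
    print(source, month, n)
```

Grouping by source and month rather than by topic mirrors the point above: the collection covers everything a source publishes, so its overall publishing activity can be tracked across the whole collection period.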