A New Content Agnostic Solution for Fake News Detection

Exploring Fake News Detection as a Service

Automated, machine-learning-based fake news detection is both necessary and challenging in the fight against misinformation. This post explores FNDaaS, the first automatic, content-agnostic approach to fake news detection, which considers both new and previously unstudied website features.

The Challenges of Fake News Detection Using Current Methods

From misleading narratives to outright lies in journalism, fake news isn’t exactly a new problem, but the proliferation of social media and the growing pains of society’s adjustment to the Internet create fertile ground for malicious actors who wish to spread misinformation. The effects of this misinformation are as apparent as they are varied. Here are just a few examples:

  • An armed man storms a pizza parlor.
  • People drink bleach to cure COVID-19.
  • Parents fail to vaccinate their children.
  • And many more.

These effects and the misinformation that causes them are quite alarming to those who want to maintain some level of journalistic integrity on the Internet. This situation has led to a sort of “arms race” between those who wish to detect and moderate fake news, and those who wish to spread it.

A significant amount of effort and research has already been applied toward automated detection of fake news. These approaches generally use textual and content analysis to detect dubious truthfulness, and while they may be effective in their own ways, they are limited in their ability to cross language and semantic barriers. For example, a machine-learning algorithm that is good at detecting fake vaccine news in English will struggle to detect fake cryptocurrency news in Chinese.

This paper proposes a novel fake news detection method that uses only network and structural characteristics of the website hosting the articles in order to determine if it is fake.

How to Detect Fake News Without Reading It

The paper’s authors (Papadopoulos, Spithouris, Markatos, Kourtellis) took a sample of 637 fake and 1183 real news websites and documented numerous features about the architectural characteristics of the sites. Some of the more notable characteristics include:

  • Number of days the site’s domain has been in existence
  • Number of days a given IP address has been associated with the specific domain
  • DOM loading time
  • JavaScript heap size
  • Number of classes in the HTML page
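
To make the feature set concrete, here is a minimal sketch (my own illustration, not the paper’s code) of how the first two characteristics might be derived from hypothetical WHOIS and DNS records, using ISO-formatted dates:

```python
from datetime import datetime, timezone

def domain_age_days(whois_created: str, now: datetime) -> int:
    """Days since the domain was registered, per its WHOIS creation date."""
    created = datetime.fromisoformat(whois_created).replace(tzinfo=timezone.utc)
    return (now - created).days

def ip_age_days(first_seen: str, now: datetime) -> int:
    """Days the current IP address has been associated with the domain."""
    seen = datetime.fromisoformat(first_seen).replace(tzinfo=timezone.utc)
    return (now - seen).days

now = datetime(2022, 12, 1, tzinfo=timezone.utc)
features = {
    "domain_age": domain_age_days("2010-05-17", now),  # long-lived domain
    "ip_age": ip_age_days("2022-11-30", now),          # freshly moved IP
}
```

The dates and field names here are invented; the point is only that each feature reduces to a single number per site, ready to feed into a classifier.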

The authors compared these characteristics across the sample sites and analyzed the data for statistically significant differences between fake and real news sites. Some characteristics, such as domain age, turned out to be very good predictors of whether a site was fake or real. Others, such as the number of websites owned by the same entity, showed very little statistical difference between the two groups. Here are some of the observations made about these characteristics:

  • Domains of fake news sites have a much shorter median lifetime than those of real news sites; in fact, almost 90% of the fake news sites had a domain age of under 24 hours.
  • There is a significant difference between the average IP age of fake and real news sites. Fake news sites generally keep the same IP address for a much shorter timeframe than real news sites.
  • It generally takes real news sites longer to load the DOM tree than fake news sites.
  • Real news sites have more HTML nodes per page than fake news sites.
  • The JavaScript heap size of real news sites is larger than that of fake news sites.
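
The kind of comparison behind these observations can be illustrated with a toy sketch (invented numbers, not the paper’s measurements) contrasting the median domain age of two small samples:

```python
from statistics import median

# Hypothetical domain ages in days -- not the paper's actual data.
fake_sites = [0, 0, 1, 2, 30, 400]
real_sites = [800, 1500, 3200, 5400, 7300, 9100]

fake_median = median(fake_sites)   # middle of the sorted fake-site ages
real_median = median(real_sites)   # middle of the sorted real-site ages

# A large gap between the medians suggests the feature discriminates well.
discriminative = real_median / max(fake_median, 1) > 10
```

Features where this gap is large (like domain age) make strong predictors; features where the distributions overlap heavily add little.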

The authors fed these features into a variety of machine learning algorithms and found that a Random Forest classifier could achieve about 90% accuracy in detecting fake news across their set of sample websites. They then tested this model on 100 newly flagged fake/real news sites (from December 2022) and found that it detected fake news with 77% accuracy. These results demonstrate that their content-agnostic approach is a useful tool in the fight against fake news.
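In practice one would use an off-the-shelf Random Forest from a library such as scikit-learn; as a dependency-free stand-in, here is a toy majority-vote ensemble of hand-written decision stumps over the hypothetical features above, echoing how a forest aggregates many weak trees (thresholds are invented for illustration):

```python
def stump_domain_age(f):   # young domains look suspicious
    return "fake" if f["domain_age"] < 365 else "real"

def stump_ip_age(f):       # short-lived IP associations look suspicious
    return "fake" if f["ip_age"] < 30 else "real"

def stump_heap_size(f):    # small JS heaps were more common on fake sites
    return "fake" if f["js_heap_mb"] < 5 else "real"

STUMPS = [stump_domain_age, stump_ip_age, stump_heap_size]

def classify(features: dict) -> str:
    """Majority vote across the stumps, like a forest's aggregation step."""
    votes = [stump(features) for stump in STUMPS]
    return max(set(votes), key=votes.count)

site = {"domain_age": 12, "ip_age": 3, "js_heap_mb": 2.1}
label = classify(site)  # every stump votes "fake"
```

A real Random Forest learns its thresholds and tree structure from the labeled training sites instead of hard-coding them, which is where the reported 90% accuracy comes from.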

FNDaaS Implementation for Servers and Browsers

The paper goes on to outline a prototype implementation of this fake news detector as a centralized server plus a browser extension, together called FNDaaS (Fake News Detection as a Service). The server crawls websites to collect their characteristics and runs the ML algorithm to build a filter-list containing a whitelist of credible news sources and a blacklist of fake news sites. The client-side browser extension enforces the filter-list in the user’s browser by warning users when they visit a blacklisted domain. Trusted super-users can also manually label sites as fake or credible; these labels are stored locally in the filter-list and pushed to the server to periodically retrain the ML algorithm.
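The client-side enforcement amounts to a lookup against the downloaded filter-list. Here is a minimal sketch of that logic (the actual extension is a browser add-on, and the list structure and domain names here are invented):

```python
from urllib.parse import urlparse

# Hypothetical filter-list entries pushed down from the FNDaaS server.
FILTER_LIST = {
    "trusted-news.example": "credible",    # whitelist entry
    "totally-real-news.example": "fake",   # blacklist entry
}

def check_site(url: str) -> str:
    """Return 'warn' for blacklisted domains, 'allow' otherwise."""
    host = urlparse(url).hostname
    return "warn" if FILTER_LIST.get(host) == "fake" else "allow"
```

On a "warn" verdict the extension would interrupt the page load with a warning rather than silently blocking it, leaving the final choice to the user.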

Thinking Adversarially When It Comes to Fake News Detection

Considering the arms race between those supplying fake news and those who want to moderate it, the content-agnostic method of detection outlined in this paper is a valuable tool for those on the side of information integrity. Unfortunately, this tool does have its weaknesses.

The paper points out that sometimes new websites must be classified in real-time, meaning that only website characteristics that are available to the browser at runtime can be used for fake news detection. These include features such as JavaScript heap size and HTML node count but exclude useful characteristics pertaining to IP and DNS activity. Since these runtime features are more easily controlled by the creators of fake news sites (as opposed to DNS and IP records which are much harder to fudge), it is possible for a savvy purveyor of fake news to perform similar statistical analysis on real news sites and modify their fake news sites to have indistinguishable runtime characteristics. Doing so would weaken the efficacy of the fake news detection algorithm, especially in the real-time detection case.

Additionally, the FNDaaS server administrators must perform the security due diligence typical of any client-server system. For example, the browser extension communicates with the server using JSON, so the channel must be secured over TLS, and the server logic must be free of injection vulnerabilities that would let a malicious ordinary user tamper with the filter-list database. The paper notes that super-users authenticate to the server using JSON Web Tokens (JWTs), so proper JWT implementation must be ensured to prevent attackers from impersonating trusted users and feeding false labels into the ML algorithm’s training data.
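To illustrate the kind of check a proper JWT implementation performs, here is a minimal HS256 sign-and-verify sketch in stdlib Python. This is a simplification for illustration only; production code should use a vetted JWT library and also validate expiry and claims:

```python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> bytes:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def sign(header: dict, payload: dict, secret: bytes) -> str:
    """Build an HS256-signed token: header.payload.signature."""
    signing_input = (b64url(json.dumps(header).encode()) + b"." +
                     b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return (signing_input + b"." + b64url(sig)).decode()

def verify(token: str, secret: bytes) -> bool:
    """Recompute the HMAC; a mismatch means a forged or tampered token."""
    signing_input, _, sig = token.rpartition(".")
    expected = b64url(hmac.new(secret, signing_input.encode(),
                               hashlib.sha256).digest()).decode()
    return hmac.compare_digest(sig, expected)

secret = b"server-side-secret"
token = sign({"alg": "HS256", "typ": "JWT"}, {"sub": "super-user-42"}, secret)
```

The signature check is what stops an attacker without the server-side secret from minting a token that claims super-user privileges.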

The authors also consider the performance of the extension. As the filter-list grows, so do the local storage footprint and the processing power required to query it. In its current state, the extension uses around 50 MB of memory and takes about 400 ms per search with a filter-list of 100,000 entries. Considering that popular ad-blocking extensions use blacklists of roughly 10,000 – 100,000 rules, the FNDaaS extension performs reasonably well given the current state of the art. Still, according to the performance metrics shown in the paper, the extension’s search time increases significantly as the filter-list size approaches 1,000,000 entries and beyond. While it is unlikely that we will need to filter millions of fake news sites in the near future, the Internet is only growing in complexity, and tools such as this one must maintain good performance; otherwise, end users will grow frustrated with the latency and stop using the tool entirely.
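Much of that scaling concern comes down to data-structure choice. A hash-based set keeps membership tests effectively constant-time regardless of list size, whereas a linear scan degrades with every added entry. A small sketch (invented domains, not the extension’s actual storage format):

```python
# Hash-set membership is O(1) on average, so query latency stays flat
# even as the blacklist grows toward millions of entries; scanning a
# list would be O(n) per lookup instead.
blacklist = {f"fake-site-{i}.example" for i in range(100_000)}

def is_blacklisted(host: str) -> bool:
    return host in blacklist
```

Whether the real extension can use such a structure depends on the browser’s storage APIs, but the asymptotic point stands: lookup cost, not list size alone, is what the user feels on every page load.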

Key Takeaways

Overall, I found this paper to be an interesting insight into the struggle of making concise observations about websites on a mass scale. I was especially impressed by the approach of using trends across structural elements of websites as opposed to just the contents and wonder if such an approach could be used in a more security-relevant manner. For instance, would it be possible to flag sites as more likely to have XSS vulnerabilities simply based on characteristics like DNS age and DOM size?