Detecting Misinformation with AskNews and Qdrant — CrediRAG

Emergent Methods
6 min read · Nov 27, 2024

If you suspect a Reddit post might be misinformation, how would you check? You could search for relevant information in the news across all languages, countries, and through time. You might build out a timeline of events, and extract the factual consistencies and inconsistencies present in the news articles. You may even look at the Reddit post author and commenters to check which other posts they have written or commented on.

Sounds laborious.

In an era where fake news spreads faster than ever, misinformation poses a distinct threat to public discourse and decision-making. Contrary to popular belief, the latest tools, like LLMs and vector databases, are stymieing the propagation of misinformation rather than facilitating it.

For example, Qdrant’s ability to accurately index millions of multilingual news articles enables fast retrieval of highly relevant news context across numerous languages and sources. It may not be obvious, but when hundreds or thousands of relevant news articles can be retrieved quickly in any language, we can feed all those competing perspectives to a state-of-the-art LLM, build a timeline of events, and extract factual consistencies and inconsistencies, all in a matter of seconds.

That all sets the stage for a highly resolved picture of multi-lingual context surrounding a possible fake news post — more context equals higher accuracy when classifying misinformation.

Indeed, these tools and methods yield 95% misinformation detection accuracy, a relative improvement of about 26% over previous state-of-the-art methods. Check out the full paper here: https://arxiv.org/pdf/2410.12061, written by some of the best minds in the UTexas Center for Autonomy. The paper describes the retrieval setup as follows:

“We use the AskNews corpus to dynamically retrieve relevant similar news articles to the submission body of a post [43]. The AskNews API provides an efficient way to integrate real-time news into LLM applications. Using its RAG architecture, AskNews processes over 300,000 articles daily, embedding them into a vector database that can be queried with natural language.”

The Challenge: Fighting Misinformation with Precision

The main issue with traditional misinformation detection methods is their reliance on either static classifiers or text-based approaches that lack the ability to incorporate the rich, dynamic context provided by evolving news stories, user interactions, and network structures on social media. These methods often suffer from:

  1. Omission of Social Context: They do not utilize user interaction data, such as comment threads and network relationships, which can provide critical insights into the propagation and credibility of information.
  2. Static Data Dependence: Many approaches rely on pre-curated datasets that quickly become outdated, failing to adapt to the evolving nature of misinformation.
  3. Limited Explainability: Methods such as language model-based classifiers or graph neural networks often lack explainability, which is essential for trust and understanding of detection results.
  4. Sensitivity to Noise: Short or poorly articulated posts and adversarial examples can significantly challenge these models’ ability to make accurate predictions.

CrediRAG addresses these limitations by combining language models with graph-based approaches that leverage both textual content and user interaction data.

AskNews and Qdrant: A Dynamic Duo

AskNews collects millions of news articles every week and enriches them with metadata such as “key people”, “key statements”, “key evidence”, “reporting voice”, “sentiment”, translations, entity relationship extraction, and much more. All of this data is embedded and indexed in Qdrant, where it awaits future retrieval.

This is no simple task. In the present scenario, misinformation detection backtesting requires strict temporal filtering to ensure scientific rigor. In other words, we want to understand whether CrediRAG can detect misinformation in real time, not after the fact, when it may have already been publicly debunked. Here’s an example of how Qdrant allows us to restrict the articles by time:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    DatetimeRange,
    FieldCondition,
    Filter,
    MatchAny,
)

# build out the Qdrant filters to accommodate the backtesting date range
must = []
must.append(
    FieldCondition(
        key="metadata.article.pub_date",
        range=DatetimeRange(
            gt="2024-10-08T10:49:00Z",
            lte="2024-11-30T10:14:31Z",
        ),
    )
)

Beyond temporal filtering, fighting misinformation may also require filtering out any articles that have a “sensational” or “persuasive” reporting voice, since those hold less credibility than “objective” or “analytical” reporting voices.

Inside Qdrant, that ends up looking like this:

# constrain reporting voice
reporting_voice = ["Objective", "Analytical", "Investigative"]
must.append(
    FieldCondition(
        key="metadata.article.reporting_voice",
        match=MatchAny(any=reporting_voice),
    )
)

query_filter = Filter(must=must)

# now instantiate your client and query the collection
client = QdrantClient(url="http://localhost:6333")

# embedded_post is the embedding vector of the Reddit post
records = client.query_points(
    collection_name="news_articles",
    query=embedded_post,
    query_filter=query_filter,
)

AskNews wraps all of this into a single call instead, without the need to collect hundreds of thousands of articles per day or to embed the Reddit post yourself:

from asknews import AskNewsSDK

ask = AskNewsSDK()

# reddit_post is the text of the post, k the number of articles to return
relevant_news = ask.news.search_news(
    query=reddit_post,
    n_articles=k,
    reporting_voice=["Objective", "Analytical", "Investigative"],
    start_timestamp=1728418521,  # Unix seconds: 2024-10-08T20:15:21Z
    stop_timestamp=1733084121,  # Unix seconds: 2024-12-01T20:15:21Z
    return_type="dicts",
    method="both",
    diversify_sources=True,
    historical=True,
    similarity_score_threshold=0.8,
)
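
The `start_timestamp` and `stop_timestamp` values are Unix epoch seconds, while the Qdrant `DatetimeRange` example uses ISO 8601 strings. Here is a minimal sketch for moving between the two, using only the Python standard library; the helper names `epoch_to_iso` and `iso_to_epoch` are hypothetical and not part of either SDK.

```python
from datetime import datetime, timezone

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def epoch_to_iso(ts: int) -> str:
    """Convert Unix epoch seconds to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(ISO_FORMAT)


def iso_to_epoch(iso: str) -> int:
    """Convert an ISO 8601 UTC string (ending in 'Z') to Unix epoch seconds."""
    dt = datetime.strptime(iso, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


print(epoch_to_iso(1728418521))  # 2024-10-08T20:15:21Z
print(epoch_to_iso(1733084121))  # 2024-12-01T20:15:21Z
```

Keeping every timestamp in UTC avoids accidentally shifting the backtesting window by a local time zone offset.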

Between temporal filters, reporting voice filters, and vast similarity search, it’s all a walk in the park for AskNews + Qdrant in the fight against misinformation.

Here’s how it works in plain terms:

  1. Input a Reddit Post: We take any Reddit post and pass it through AskNews.
  2. Retrieve Relevant News: AskNews scours its expansive Qdrant database for enriched news articles that match the post’s content and conform to time/reporting voice filters.
  3. Compare and Classify: The retrieved articles are compared against the original post to classify it as factual or misinformative.
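
The three steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the CrediRAG implementation: the `retrieve` and `compare` callables are hypothetical placeholders (for a news search and an LLM comparison, respectively), and the majority-vote rule is an assumed aggregation chosen for simplicity.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical verdict labels for how one article relates to the post.
SUPPORTS, CONTRADICTS, UNRELATED = "supports", "contradicts", "unrelated"


def classify_post(
    post: str,
    retrieve: Callable[[str], List[Dict]],
    compare: Callable[[str, Dict], str],
) -> str:
    """Label a post by majority vote over per-article verdicts.

    `retrieve` stands in for step 2 (a news search) and `compare` for
    step 3 (for example an LLM prompt); both are placeholders.
    """
    articles = retrieve(post)
    votes = Counter(compare(post, article) for article in articles)
    if votes[CONTRADICTS] > votes[SUPPORTS]:
        return "misinformative"
    if votes[SUPPORTS] > votes[CONTRADICTS]:
        return "factual"
    return "undetermined"  # no usable articles, or a tie


# Toy stand-ins so the sketch runs without any network access.
def fake_retrieve(post: str) -> List[Dict]:
    return [
        {"title": "Agency confirms event", "stance": SUPPORTS},
        {"title": "Officials deny event", "stance": CONTRADICTS},
        {"title": "Second outlet denies event", "stance": CONTRADICTS},
    ]


def fake_compare(post: str, article: Dict) -> str:
    return article["stance"]


print(classify_post("Example post text", fake_retrieve, fake_compare))
```

Passing the retrieval and comparison steps in as functions keeps the voting logic testable without API keys or model calls.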

This method raised misinformation detection accuracy from 0.75 to 0.95, a 26% relative improvement (20 percentage points) over previous state-of-the-art methods.
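
The 26% figure reads as a relative gain rather than a difference in percentage points. Taking the two accuracies quoted here at face value, the arithmetic works out as:

```latex
\[
\text{absolute gain} = 0.95 - 0.75 = 0.20,
\qquad
\text{relative gain} = \frac{0.95 - 0.75}{0.75} \approx 0.267
\]
```

That is roughly 26.7%, in line with the 26% quoted.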

Why Qdrant is a Game-Changer

Qdrant’s technology is pivotal to our success. By embedding news articles into a high-dimensional vector space, it enables lightning-fast semantic searches. This means AskNews can retrieve the most contextually relevant articles, even from datasets updated as frequently as every five minutes. The result? Real-time misinformation detection that’s both scalable and reliable.
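
To make "semantic search in a vector space" concrete, here is a minimal brute-force sketch in pure Python. It is not how Qdrant works internally (a vector database relies on an approximate nearest-neighbor index rather than a linear scan), and the three-dimensional vectors and article IDs are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], index: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up 3-dimensional embeddings; real ones have hundreds of dimensions.
index = {
    "article_a": [0.9, 0.1, 0.0],
    "article_b": [0.0, 1.0, 0.0],
    "article_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

for doc_id, score in top_k(query, index, k=2):
    print(doc_id, round(score, 3))
```

The linear scan costs time proportional to the number of articles, which is why an indexed vector store matters once the corpus reaches millions of documents.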

Head to https://qdrant.tech to start building your own misinformation detector.

Further, backtesting these methods is not optional. We want to test whether we *could have* detected misinformation using only the news articles available at the time of the original post. This can only be achieved by precise time filtering of news articles.
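
The core of that constraint is simple to state in code. Here is a minimal sketch, assuming each article is a dictionary with a timezone-aware `pub_date`; the field name and the sample data are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def articles_available_at(articles: List[Dict], post_time: datetime) -> List[Dict]:
    """Keep only articles published at or before the post's timestamp."""
    return [a for a in articles if a["pub_date"] <= post_time]


post_time = datetime(2024, 10, 20, 12, 0, tzinfo=timezone.utc)
articles = [
    {"title": "Early report", "pub_date": datetime(2024, 10, 19, 8, 0, tzinfo=timezone.utc)},
    {"title": "Same-day coverage", "pub_date": datetime(2024, 10, 20, 9, 30, tzinfo=timezone.utc)},
    {"title": "Later debunk", "pub_date": datetime(2024, 10, 25, 15, 0, tzinfo=timezone.utc)},
]

usable = articles_available_at(articles, post_time)
print([a["title"] for a in usable])  # ['Early report', 'Same-day coverage']
```

Excluding the later debunk is what keeps the backtest honest: the classifier only sees what a reader could have seen when the post appeared.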

Bridging Retrieval and Networks

CrediRAG takes it a step further. Using network-based refinements, it is possible to build a post-to-post graph that considers relationships between Reddit posts — such as shared commenters and interaction patterns. This graph, combined with retrieval insights, refines the classification process, significantly enhancing its accuracy.
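
A post-to-post graph of this kind can be sketched in a few lines. The following is an illustration under assumptions, not the method from the paper: edges are weighted by the number of shared commenters, and each post's retrieval-based score is blended with the weighted mean of its neighbors' scores using a hypothetical mixing factor `alpha`.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def build_post_graph(commenters: Dict[str, Set[str]]) -> Dict[Tuple[str, str], int]:
    """Edge weight = number of commenters two posts have in common."""
    edges = {}
    for a, b in combinations(sorted(commenters), 2):
        shared = len(commenters[a] & commenters[b])
        if shared > 0:
            edges[(a, b)] = shared
    return edges


def refine_scores(
    scores: Dict[str, float],
    edges: Dict[Tuple[str, str], int],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Blend each post's score with the weighted mean of its neighbors' scores."""
    neighbor_sum = defaultdict(float)
    neighbor_weight = defaultdict(float)
    for (a, b), w in edges.items():
        neighbor_sum[a] += w * scores[b]
        neighbor_weight[a] += w
        neighbor_sum[b] += w * scores[a]
        neighbor_weight[b] += w
    refined = {}
    for post, score in scores.items():
        if neighbor_weight[post] > 0:
            mean = neighbor_sum[post] / neighbor_weight[post]
            refined[post] = alpha * score + (1 - alpha) * mean
        else:
            refined[post] = score  # isolated post: keep the retrieval score
    return refined


# Hypothetical data: which users commented on which post, plus a
# per-post misinformation score from the retrieval step.
commenters = {"post_1": {"u1", "u2", "u3"}, "post_2": {"u2", "u3"}, "post_3": {"u9"}}
scores = {"post_1": 0.9, "post_2": 0.4, "post_3": 0.2}

edges = build_post_graph(commenters)
print(edges)
print(refine_scores(scores, edges))
```

Here `post_1` and `post_2` share two commenters, so their scores are pulled toward each other, while the isolated `post_3` keeps its original score.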

CrediRAG is more than an algorithm; it’s a vision for a more informed and resilient digital society. By integrating cutting-edge AI with robust datasets and dynamic retrieval, we are setting a new benchmark for misinformation detection.

The Road Ahead

While the current system of methods and tools demonstrates the power of AI in tackling misinformation, the journey is far from over. Future innovations will focus on expanding this technology to other social platforms and refining it to detect increasingly complex forms of misinformation.

Written by Emergent Methods

A computational science company focused on applied machine learning for real-time adaptive modeling of dynamic systems.
