AskNews’ Transparency Protocols
We created AskNews to help readers navigate the unprecedented amounts of news generated every day across the world. Getting a broad and diverse overview of what is going on often requires going to multiple news outlets, avoiding geo-fences, navigating ad-heavy sites, translating other languages, and reading long reports. Staying up to date has become a daunting task, and AskNews is the tool to help.
So how is AskNews tackling transparency? As we discussed in our previous blog post, we believe transparency in news is grounded in maintaining a healthy balance of competing perspectives. We are committed to explaining, in simple terms, how we pick, process, and curate our news balance.
This blog post aims to provide a deeper look into our transparency protocols, so that you have a better sense of what goes into our reporting.
Our editorial process
The primary goal of our editorial process is to introduce and maintain competing perspectives from around the world. This is what we call “enforcing diversity,” and it takes shape at several steps in our editorial pipeline.
Quality Control
We start with quality control. We filter out paid content, content shorter than 150 characters, and content in languages beyond our translation capacity (we currently support 14). We also filter out republishers when possible, by detecting duplicates and checking for the original.
We also identify and remove what we call “bad scrapes” by using a Large Language Model (LLM) to evaluate whether the title matches the content. Once the articles have passed quality control, they head to the diversification stage.
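The filters above can be sketched roughly as follows. This is a minimal illustration and not the production implementation: the `Article` fields, the `SUPPORTED_LANGUAGES` set, the exact-match duplicate check, and the `title_matches_content` callable (a stand-in for the LLM check) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

MIN_CHARS = 150  # from the text: content shorter than 150 characters is dropped
# Hypothetical subset of language codes; the post only says 14 languages are supported.
SUPPORTED_LANGUAGES = {"en", "fr", "de", "es"}


@dataclass
class Article:
    url: str
    title: str
    content: str
    language: str
    is_paid: bool = False


def passes_quality_control(
    article: Article,
    seen_contents: set,
    title_matches_content: Callable[[str, str], bool],
) -> bool:
    """Return True if the article survives every filter in this sketch."""
    if article.is_paid:
        return False
    if len(article.content) < MIN_CHARS:
        return False
    if article.language not in SUPPORTED_LANGUAGES:
        return False
    # Naive duplicate detection (an assumption): exact match on normalized text.
    # It keeps the first copy seen, which is not necessarily the original.
    fingerprint = " ".join(article.content.lower().split())
    if fingerprint in seen_contents:
        return False
    # Stand-in for the LLM "bad scrape" check on title versus content.
    if not title_matches_content(article.title, article.content):
        return False
    seen_contents.add(fingerprint)
    return True


def filter_articles(
    articles: Iterable[Article],
    title_matches_content: Callable[[str, str], bool],
) -> list:
    """Apply the quality-control filters to a batch of articles."""
    seen: set = set()
    return [
        a for a in articles
        if passes_quality_control(a, seen, title_matches_content)
    ]
```

In practice the title check would call an LLM; passing it in as a callable keeps the sketch runnable without one.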
Diversity enforcement
We monitor close to 1 million articles per day from news outlets all over the world. During this stage, we identify the subset of these articles that we want to enrich and include in our editorial process. We select that subset by evaluating the distribution of geographical origins and sources across the monitored articles, and then sampling so that the subset follows the same distribution.
In simple terms, say that, out of the 1 million articles, 40% are from the United States and the remaining 60% come from other countries, such as France (15%), Russia (5%), and South Africa (1%). When we sample 300k, we ensure that 120k articles are from the US, 45k are from France, 15k are from Russia, and 3k are from South Africa.
When deciding which articles to keep from each country, we sort by page rank and select the top-ranked ones. This means we prefer news outlets with many backlinks (citations from other websites).
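The sampling described above can be sketched as a proportional quota per country, filled from the top of a page-rank ordering. This is a minimal illustration under stated assumptions: the dictionary keys `country` and `page_rank` are hypothetical, and the sketch ignores the per-source distribution that the real pipeline also considers.

```python
from collections import defaultdict


def sample_by_country(articles, sample_size):
    """Select articles so each country keeps its share of the monitored pool.

    `articles` is a list of dicts with hypothetical keys "country" and "page_rank".
    """
    by_country = defaultdict(list)
    for article in articles:
        by_country[article["country"]].append(article)

    total = len(articles)
    selected = []
    for country, group in by_country.items():
        # Quota proportional to the country's share of all monitored articles.
        # Rounding means the quotas may not sum exactly to sample_size.
        quota = round(sample_size * len(group) / total)
        # Within each country, prefer the articles with the highest page rank.
        group.sort(key=lambda a: a["page_rank"], reverse=True)
        selected.extend(group[:quota])
    return selected
```

With a pool that is 40% US and a sample size of 300, the US quota is 120, which mirrors the 120k-out-of-300k example above at a smaller scale.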
Content enrichment
This curated subset of 300k articles is then headed for enrichment. When we use the term “enrich” we mean that the article will receive the following processing:
- Statement/evidence/attribution extraction 📑
- Entity extraction (categorizing key components of the article) 🧩
- Sentiment analysis 😊
- Reporting voice analysis and reporting style analysis 🧑🏽🏫️
- Topic classification 🏰
- Summarization 📃 (if the original content is not in English, this step also includes Translation 🦜)
- Indexing 🧑🏽💻️ (this is the process of storing and making the content searchable in the future; for the AI audience, this is the embedding/vectorDB stage)
The translation 🦜 and summarization 📃 are done by an LLM (Llama3 at the time of writing, but subject to change as improved open-source models are released) that is instructed to abide by journalistic guidelines such as supporting statements with evidence and giving proper source attribution 📑.
On top of summarization, we also ask for an assessment of the writing style 🧑🏽🏫️: is the reporting provocative, or does it avoid provoking an emotional response from the reader? We also ask what “voice” the article is written in: is it objective, subjective, or investigative? Knowing these details helps highlight, filter, and categorize sources and articles for potential bias in the reporting.
For each summary, we extract a topic classification 🏰, a set of keywords, and a sentiment score 😊. We then use our fine-tuned GLiNER Named Entity Recognition (NER) model (our state-of-the-art entity extractor, available on Hugging Face 🤗 for anyone to use!) to label entities in the summary 🧩.
Vector database for semantic and keyword search
The 300k article enrichments are subsequently embedded and stored in a Qdrant Vector Database 🧑🏽💻️. This process involves the use of both dense and sparse embeddings, allowing us to search for articles using either semantic search or keywords, or both!
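To illustrate how dense and sparse representations can serve semantic search, keyword search, or both, here is a minimal pure-Python sketch. It does not use the Qdrant API and is not necessarily how the scores are fused in production: the weighted sum, the `alpha` parameter, and the document layout (`id`, `dense`, `sparse`) are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    return sum(w * b[t] for t, w in a.items() if t in b)


def hybrid_search(query_dense, query_sparse, docs, alpha=0.5):
    """Rank docs by alpha * dense score + (1 - alpha) * sparse score.

    alpha=1.0 gives purely semantic ranking, alpha=0.0 purely keyword ranking.
    """
    scored = []
    for doc in docs:
        dense = cosine(query_dense, doc["dense"])
        sparse = sparse_dot(query_sparse, doc["sparse"])
        scored.append((alpha * dense + (1 - alpha) * sparse, doc["id"]))
    # Highest combined score first.
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Setting `alpha` between 0 and 1 blends the two signals, which corresponds to the "or both" case.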
Orchestration
The entire process, from article selection, to enrichment, to embedding, is orchestrated using our open-source software, Flowdapt, which is designed for highly parallelized real-time operations. For full transparency, all source code is available on the Flowdapt GitHub.
Creating AskNews Stories
Every four hours we pull a new bucket of enriched articles from the vector database and group related reports based on semantic similarity. With our current setup, we end up with 20–40 separate groups, each with 80+ articles reporting on the same topic.
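Grouping by semantic similarity can be sketched with a simple greedy scheme over article embeddings. The post does not name the clustering algorithm actually used, so the seed-based approach and the `threshold` value below are assumptions chosen only to make the idea concrete.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(embeddings, threshold=0.8):
    """Greedy grouping: returns a list of groups, each a list of article indices.

    An article joins the first group whose seed (first member) is similar
    enough; otherwise it starts a new group.
    """
    groups = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine(emb, embeddings[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([i])
    return groups
```

A production system would more likely use a dedicated clustering method, but the input and output have the same shape: embeddings in, groups of related articles out.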
Competing perspectives
Similar to how we decided which articles to index in the first editorial step, we go through each group and check the distribution of countries and outlets. We then require that a minimum number of different outlets report on the topic to ensure diverse reporting; if too few outlets report, we don’t write a story. Using the distribution of countries, we pick a subset of the articles that reflects the full range of countries in the group, and we use these articles to:
- Identify alignment 🔎: Which statements are agreed upon by all sources?
- Identify contradictions 👾: Which sources contradict each other, and what are the contradictions?
- Analyze reporting styles ✅: What is the overall reporting style underpinning this story?
- Write a story 📖: Write an over-arching story about what is going on, based on all competing perspectives. Cite ALL statements with original sources.
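The outlet check and country-balanced selection that precede these steps can be sketched as follows. This is one possible reading of the description: the `MIN_OUTLETS` value, the `per_country` cap, and the dictionary keys `outlet` and `country` are hypothetical, since the post does not state the actual thresholds or data layout.

```python
from collections import defaultdict

MIN_OUTLETS = 5  # hypothetical threshold; the post does not state the actual minimum


def select_for_story(group, per_country=2, min_outlets=MIN_OUTLETS):
    """Return a subset covering every country in `group`, or None if too few outlets.

    `group` is a list of dicts with hypothetical keys "outlet" and "country".
    """
    outlets = {article["outlet"] for article in group}
    if len(outlets) < min_outlets:
        return None  # too few distinct outlets: no story is written

    by_country = defaultdict(list)
    for article in group:
        by_country[article["country"]].append(article)

    subset = []
    for country_articles in by_country.values():
        # Keep up to `per_country` articles so every country is represented.
        subset.extend(country_articles[:per_country])
    return subset
```

The returned subset is what would then be passed to the LLM for the alignment, contradiction, style, and story-writing steps.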
The LLM is provided with (i.e., prompted with) rigorous journalistic guidelines on how to properly distill the information in the articles. Many of these guidelines follow the AP style guide, but they also target recurring errors that our human Editor-in-chief identifies. For example, the LLM needs guidance on:
- Minimizing ambiguous statements
- Avoiding non sequiturs
- Not leaving obvious questions unanswered
- Properly citing statements with evidence and attribution
- Properly handling statements that are “alleged”
Finally, we compute the story’s sentiment and use our GLiNER model to extract the entities it mentions.
Bias disclosure
Is AskNews unbiased? Deterministically speaking, no. And we don’t believe it is possible for any outlet to be 100% unbiased. Instead, we believe the important aspect is identifying bias, tracking how bias propagates through your system, minimizing it as much as possible, and reporting it publicly (hence the present blog article).
Geographical bias
Our underlying coverage is biased toward the geographical and linguistic distributions reported in our Transparency Dashboard. Geographically speaking, our sources have a strong global distribution, with the weakest representation in Africa. Linguistically speaking, we have a strong bias towards Latin-based western languages, but still support Germanic, Arabic, and Cyrillic languages. While we do get coverage from many countries in Asia that have English-reporting outlets (there are many), we are missing a direct representation of languages like Chinese, Japanese, Thai, and Hindi. Improving this is on our roadmap, and is limited by our strict quality control guidelines surrounding translation.
Funding bias
At the time of writing, we have no external funding; we are fully bootstrapped and supported by our community of readers. However, we have a strict investment policy in place to prepare for investment. This policy sets out the standards that investors must meet before being allowed to invest in our company. These standards include a complete lack of association with any governmental body and no investments in any politically motivated organization.
Technological bias
Our LLMs are biased toward the underlying distributions of their training data. This is a known source of bias, and we combat it by training our own models aimed at reducing bias by enforcing cultural diversity (see our paper “Curating Grounded Synthetic Data with Global Perspectives for Equitable AI” on arXiv).
Philosophical bias
While the majority of our editorial process is algorithmic (and we believe that is what makes us the most transparent, least biased, and most predictable source of news), we still need to make some editorial decisions along the way. For example, our Editor-in-chief needs to identify when stories do not fit into our broad curation, such as horoscope articles. We do not believe this content is valuable in our curation, and we therefore filter it out. Technically, this is a form of bias.
Additionally, we are operating internally in predominantly English (and some French). This means that everything we build and create — no matter how much effort we put into removing bias — is filtered through an English culture. This is a much deeper type of bias — it affects *how* we believe news should be reported. Our culture motivates us to *seek* alignment and *seek* contradiction — which is rooted in a “freedom of speech” and “freedom of press” and “truth seeking” philosophy. This kind of philosophy leads to us allowing, algorithmically speaking, known-propaganda outlets to be included in our reports.
Many countries and cultures (even some jurisdictions of the United States) may not agree that this is the proper way of reporting and evaluating news. Therefore — we admit that our philosophical approach is introducing bias into our news reports.
Compliance with the EU AI Act
The EU AI Act, officially known as the Artificial Intelligence Act, is a regulation proposed by the European Commission that aims to ensure that AI systems are used safely and responsibly. It is the first comprehensive regulation on AI by a major regulator and was officially adopted on 21 May 2024.
We believe that regulation is important for establishing accountability, consistency, and ethical standards, which is why we welcome this legislation.
The EU AI Act encompasses a variety of objectives, one of them being Transparency and Accountability. For AskNews, the relevant requirements are:
- Outlining news selection processes 📰
- Labeling AI generated content 🏷
- Observing copyright law ©️
- Overseeing content generation with humans 👁👁
- Documenting algorithmic practices 📝
- Disclosing bias and funding sources 👐
Do we comply with these requirements? Yes, we do!
📰 Transparent news selection process
How we decide which news articles we provide is clearly stated in this blog post, on the AskNews website, and in our public presentations and publications.
🏷 Labeling AI generated content
The article summaries, stories, and chat replies are all clearly labeled as being generated by AI.
©️ Observing copyright law
Copyright in journalism and news reporting is no different than in any other domain. What is important to note is that what is protected is the expression and not the idea, procedure, process, system, method of operation, concept, principle, or discovery that is reported on. AskNews content creation falls under fair use. Here is why:
🦋 We transform the full original article, from its original language, into a set of English statements and evidence contained in a unique summary. We analyze the report for reporting voice and sentiment, and we extract entities. This constitutes an individual analysis of the facts that were stated in the article, and it ensures that we are not reproducing the expression but simply reporting on the events. In fact, our summaries are further from the original articles than those of many other news aggregators, which use simpler sentence-extractor models for their summaries. Our Stories are even further from the original expression, as they are distillations of multiple of our unique enrichments.
🥈 Further, we are only using content that is publicly available. This means that we are not able to “scoop”. We don’t write news; we report on the news other outlets have released, so there is no way for us to be the first to report on an event.
👁👁 Human oversight
While the quality of LLMs is improving at an astounding speed, not even the biggest models are good enough to be used without human oversight. That is why we have a human Editor-in-Chief, who was an editor at the Rocky Mountain News for 20 years, and a nationally recognized journalist for another 20, to ensure the quality of our content. We have also implemented ways for our readers to flag content that is incorrect or has some other issue.
📝 Documented algorithmic practices
For us, this is the same as Outlining news selection processes 📰.
👐 Bias and funding disclosure
Our bias disclosure is fulfilled through the Documenting algorithmic practices 📝 and Outlining news selection processes 📰. As we described above in our Bias disclosure, the LLMs we use result in some bias. These aspects are clearly stated on the AskNews website.
In terms of funding, we are entirely bootstrapped and funded by the users of AskNews.app and the AskNews API.
Want to know more?
We love sharing what we do and how we do it! 😍
🎥 Watch some of our presentations about how we parse close to 1M articles per day and track narratives through time:
- Production-scale Context Engineering for Real-Time News Distillation
- Multilingual narrative tracking in the news — real-time experiments
- Engineering Transparency in the News with GenAI
📜 Read about how you can use AskNews data:
- Context is King — Evaluating real-time LLM context quality with Ragas
- Identifying media bias with AI: Russian news coverage of the death of Alexei Navalny
- Curating Grounded Synthetic Data with Global Perspectives for Equitable AI
- Infusing any LLM with news — one line of code
🙌 Or check out our other publications and presentations here!