Outperforming Claude 3.5 Sonnet with Phi-3-mini-4k for graph entity relationship extraction tasks
High-efficiency production-scale entity relationship extraction
At AskNews, we are re-imagining the way people and LLMs consume and understand news. One of the features we offer is a rich visual representation of the relationships between entities for all our event narratives and news articles. These entity relationship graphs, also referred to as knowledge graphs, offer our users powerful ways to explore and interact with our expansive news database. In fact, we are officially hosting the largest searchable news knowledge graph representation in the world. But how do we handle the generation of 500k knowledge graphs per day? The following blog post highlights the key component underpinning the entire knowledge graph building/indexing process — our fine-tuned Phi-3-mini-4k-instruct-graph.
Authors: Wagner Costa Santos, Elin Törnquist, and Robert Caulk from Emergent Methods
tl;dr
We fine-tuned Phi-3-mini-4k to exceed Claude Sonnet 3.5 in graph extraction quality by up to 20% while reducing cost by orders of magnitude. Further, we improved upon the already impressive JSON output structure of Phi-3-mini-4k, reducing the parsing error rate from 2.5% to 0%. We also release two additional versions, Phi-3-medium-4k-instruct-graph and Phi-3-medium-128k-instruct-graph, aimed at higher reasoning and longer contexts.
We have also set up a Hugging Face Space hosting our fine-tuned model, designed to ingest any text and visualize the output as a graph:
Graphs are all the rage these days
The volume of news pumping through the AskNews system reaches a staggering 500k articles per day. Indexing these with a vector database enables vast semantic exploration, but indexing them with a knowledge graph presents another level of complexity. While vector databases are often coupled with a small embedding model aimed at embedding semantics into vector spaces, knowledge graphs require high-level reasoning, general world knowledge, and context to be properly constructed.
The latest tools now enable such general world knowledge and reasoning to be applied cost-effectively to 500k news articles per day. Phi-3-mini-4k is a capable Small Language Model being applied across the board for tasks like summarization, translation, code generation, and entity extraction. Meanwhile, GPT-4o is a state-of-the-art Large Language Model capable of much higher-level reasoning. Coupling these models with the latest fine-tuning methods and libraries, we can effectively transfer knowledge from GPT-4o to Phi-3-mini-4k, retaining graph quality and accuracy at a fraction of the cost.
By extracting 500k entity relationship graphs per day with Phi-3-mini-4k-graph, we benefit from a powerful knowledge graph representation of the news that promotes:
- Complex search queries across complementary Vector x Graph indices (e.g. RAG)
- Identifying temporal characteristics and trends between entities and relationships which bolster predictive modeling (real-time forecasting)
- Tracking hidden insights from secondary and tertiary relationships
In the following sections, we’ll explore our methodology, including our approach to metric formulation, and share the results of our post-training evaluation.
Methodology
Dataset engineering
We constructed our dataset from the AskNews API “Events”. Events are clusters of hundreds of semantically similar synthetic news article summaries representing a single event. The clustering process inherently identifies disparate topics, which helps diversify the topics and vocabulary of our dataset. Each Event cluster can be connected to other Event clusters occurring at a different point in time. This is called narrative tracking; we won't go into detail here, but suffice it to say that two events connected in time are usually updates of an evolving news story.
The goal is to diversify the topics and vocabulary as much as we can. So we:
- Select an even distribution of samples from all unique events
- Select a maximum of 3 events connected in time
- Ensure that the train/test/validation data sets each contain their own subset of unique events, and do not cross over in time
We pull these Events from the AskNews API, feed the synthetic summaries to GPT-4o to generate the entity-relationship graph, then combine the synthetic Event summary with the GPT-4o generated label to build the full training dataset.
This dataset was split into 90% for training, 5% for validation, and 5% for testing — with a total of 4,000 unique samples. By maintaining this strict separation between events, we obtained a well-distributed representation of the target parameter space in which our LLM will operate.
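For illustration, here is a minimal sketch of the two key steps in Python. The GRAPH_PROMPT, the label_summary helper, and the event_id field are hypothetical stand-ins for our internal tooling; the GPT-4o call uses the standard OpenAI client with JSON mode:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical instruction; our production prompt is more detailed.
GRAPH_PROMPT = (
    'Extract an entity relationship graph from the text. Return JSON with '
    '"nodes" (id, type, detailed_type) and "edges" (from, to, label).'
)

def label_summary(summary: str) -> dict:
    # GPT-4o generates the reference graph; JSON mode keeps the output parseable.
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": GRAPH_PROMPT},
            {"role": "user", "content": summary},
        ],
    )
    return json.loads(response.choices[0].message.content)

def split_by_event(samples: list[dict]) -> tuple[list, list, list]:
    # Split at the event level, not the sample level, so no event leaks
    # across train/validation/test. (Simplification: sorting by event_id
    # stands in for sorting by event time.)
    event_ids = sorted({s["event_id"] for s in samples})
    n_train, n_val = int(len(event_ids) * 0.90), int(len(event_ids) * 0.05)
    train_ids = set(event_ids[:n_train])
    val_ids = set(event_ids[n_train:n_train + n_val])
    train = [s for s in samples if s["event_id"] in train_ids]
    val = [s for s in samples if s["event_id"] in val_ids]
    test = [s for s in samples if s["event_id"] not in train_ids | val_ids]
    return train, val, test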
Loss and validation metrics
Our primary objective was to train Phi-3 to generate story graphs with a similar pattern to GPT-4o, while also ensuring the correct generation of JSON structures.
Evaluation Metrics: Initially, we experimented with standard text similarity metrics such as BLEU and ROUGE to validate and test our model’s performance. However, after several training iterations, we discovered that custom metrics tailored to our specific use case yielded better results:
- JSON Similarity: We developed a custom metric that compares the nodes and edges generated by Phi-3 against those of the reference model (GPT-4o).
- JSON Consistency: This additional metric checks that every edge references entities that exist as nodes. We found that even state-of-the-art models like Claude 3.5 Sonnet sometimes generated graphs with orphaned edges. (A sketch of both metrics follows the JSON example below.)
Here’s a small example of a JSON output:
{
  "nodes": [
    {
      "id": "Viktor Orban",
      "type": "person",
      "detailed_type": "hungarian prime minister"
    },
    {
      "id": "United States",
      "type": "country",
      "detailed_type": "nation"
    }
  ],
  "edges": [
    {
      "from": "Viktor Orban",
      "to": "United States",
      "label": "stated influence"
    }
  ]
}
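To make these metrics concrete, here is a minimal Python sketch that operates on graphs in the format above. It simplifies our internal implementation: overlap is scored with plain Jaccard similarity on lowercased ids, whereas our production metrics also handle fuzzy matches and relationship labels:

def json_consistency(graph: dict) -> float:
    # Fraction of edges whose endpoints both exist as node ids.
    node_ids = {node["id"] for node in graph.get("nodes", [])}
    edges = graph.get("edges", [])
    if not edges:
        return 1.0
    valid = sum(1 for e in edges if e["from"] in node_ids and e["to"] in node_ids)
    return valid / len(edges)

def node_similarity(pred: dict, ref: dict) -> float:
    # Jaccard overlap between the sets of node ids.
    pred_ids = {n["id"].lower() for n in pred.get("nodes", [])}
    ref_ids = {n["id"].lower() for n in ref.get("nodes", [])}
    if not pred_ids and not ref_ids:
        return 1.0
    return len(pred_ids & ref_ids) / len(pred_ids | ref_ids)

def edge_similarity(pred: dict, ref: dict) -> float:
    # Same idea over (from, to) pairs; free-form labels are left out
    # of this simplified version.
    pred_edges = {(e["from"].lower(), e["to"].lower()) for e in pred.get("edges", [])}
    ref_edges = {(e["from"].lower(), e["to"].lower()) for e in ref.get("edges", [])}
    if not pred_edges and not ref_edges:
        return 1.0
    return len(pred_edges & ref_edges) / len(pred_edges | ref_edges)

def json_similarity_avg(pred: dict, ref: dict) -> float:
    # Composite score: mean of node similarity, edge similarity, and consistency.
    return (node_similarity(pred, ref) + edge_similarity(pred, ref)
            + json_consistency(pred)) / 3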
Foundation model selection
We chose the Phi-3-mini-4k-instruct model for this project. Phi-3 is classified as a Small Language Model (SLM) and is exceptionally efficient: even workstations with NVIDIA RTX GPUs or PCs with GeForce RTX GPUs can run the model locally. The model received a significant update in June 2024 that brought substantial improvements in instruction following, structured output, and reasoning. According to the model release notes, JSON structured-output performance jumped from 11.5 to 52.3 on public and internal benchmark datasets, an improvement that is especially significant for our use case.
These improvements reinforced Phi-3 as an excellent choice for our task. The combination of its efficiency as an SLM and its enhanced capabilities in structured output aligns perfectly with our goal of processing large volumes of news articles while maintaining high-quality entity relationship extraction.
Our fine-tuning approach leverages the Hugging Face ecosystem: TRL's SFTTrainer for supervised fine-tuning, PEFT for parameter-efficient fine-tuning, and QLoRA for quantized low-rank adaptation, enabling effective adaptation of Phi-3 to our task while conserving computational resources.
The code snippet below illustrates key parts of our implementation:
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments, EarlyStoppingCallback)
from trl import SFTTrainer
from peft import LoraConfig


class ModelTrainer:
    def __init__(self, model_id, tokenizer_id, dataset_path, output_dir):
        self.model_id = model_id
        self.tokenizer_id = tokenizer_id
        self.dataset_path = dataset_path
        self.output_dir = output_dir

    def setup_model_and_tokenizer(self):
        # Load the base model in 4-bit NF4 (the quantization side of QLoRA)
        # so that fine-tuning fits on a single GPU.
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_id,
            device_map="auto",
            attn_implementation="flash_attention_2",
            torch_dtype=torch.bfloat16,
            quantization_config=bnb_config,
            trust_remote_code=True,
            use_cache=False,  # required with gradient checkpointing
        )
        self.tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_id,
                                                       trust_remote_code=True)

    def setup_trainer(self, train_dataset, eval_dataset):
        # LoRA adapters on all linear layers keep the trainable parameter
        # count small (the low-rank side of QLoRA).
        peft_config = LoraConfig(
            lora_alpha=8,
            lora_dropout=0.05,
            r=6,
            bias="none",
            target_modules="all-linear",
            task_type="CAUSAL_LM",
        )
        args = TrainingArguments(
            output_dir=self.output_dir,
            num_train_epochs=5,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=1,
            per_device_eval_batch_size=8,
            eval_accumulation_steps=2,
            gradient_checkpointing=True,
            logging_steps=10,
            save_strategy="steps",
            evaluation_strategy="steps",
            eval_steps=100,
            save_steps=100,
            bf16=True,
            tf32=True,
            learning_rate=2e-4,
            max_grad_norm=0.3,
            warmup_ratio=0.03,
            lr_scheduler_type="cosine",
            # Select checkpoints on our custom graph metric, not eval loss.
            metric_for_best_model="json_similarity_avg",
            greater_is_better=True,
            load_best_model_at_end=True,
        )
        self.trainer = SFTTrainer(
            model=self.model,
            args=args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            peft_config=peft_config,
            max_seq_length=3072,
            tokenizer=self.tokenizer,
            packing=False,
            compute_metrics=self.compute_metrics,
            callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
        )

    def train_model(self):
        self.trainer.train()

    def save_model(self):
        self.trainer.save_model(self.output_dir)


def main():
    # output_model_name and the ChatML-formatted datasets are produced
    # elsewhere in our pipeline.
    trainer = ModelTrainer(
        model_id="microsoft/Phi-3-mini-4k-instruct",
        tokenizer_id="microsoft/Phi-3-mini-4k-instruct",
        dataset_path="src/dataset/ds_1ea20812-afa8-4fab-8fba-dfb566c4775f",
        output_dir=f"models/{output_model_name}",
    )
    trainer.setup_model_and_tokenizer()
    trainer.setup_trainer(chatml_train_dataset, chatml_eval_dataset)
    trainer.train_model()
    trainer.save_model()
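The compute_metrics hook referenced above is not shown in full. A plausible sketch, assuming a preprocess_logits_for_metrics hook has already reduced logits to token ids and that prompt tokens are masked so only the graph JSON remains, could reuse the metric helpers defined earlier:

import json
import numpy as np

def compute_metrics(self, eval_pred):
    pred_ids, label_ids = eval_pred
    # -100 marks ignored positions in the labels; restore a real token id
    # before decoding.
    pad_id = self.tokenizer.pad_token_id or self.tokenizer.eos_token_id
    label_ids = np.where(label_ids == -100, pad_id, label_ids)
    preds = self.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    refs = self.tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    scores, parse_errors = [], 0
    for pred_text, ref_text in zip(preds, refs):
        try:
            score = json_similarity_avg(json.loads(pred_text), json.loads(ref_text))
        except json.JSONDecodeError:
            parse_errors += 1  # unparseable output scores zero
            score = 0.0
        scores.append(score)
    return {"json_similarity_avg": float(np.mean(scores)),
            "json_parse_errors": parse_errors}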
For testing and result evaluation, we established GPT-4o as the ground truth. We ran comparisons between our fine-tuned model, the original Phi-3 without fine-tuning, and Claude Sonnet 3.5 (considered state-of-the-art alongside GPT-4o). We compared the previously mentioned indicators across these models. Additionally, the JSON consistency indicator proved valuable as it doesn’t require a ground truth for comparison, allowing for more independent evaluation of each model.
Comparison Examples
Before we jump into the details of our post-training evaluation, let’s look at some real-world examples to get a better grasp of our results and what our metrics actually mean. We’ll walk through three specific cases to show you how our JSON similarity metric and consistency check work in practice, and why they matter.
1) High similarity
The first example is of a response with high JSON similarity (this means greater similarity to the model we want to imitate, which is GPT-4o).
Story: In a coordinated law enforcement operation, Vasily Burakov was apprehended in the Tver region for the fatal shooting of two police officers in Shchelkovo, a suburb of Moscow. The attack, which took place on April 7, led to the death of one officer and left another in serious condition. Burakov, who fled the scene and was in hiding, was located following a six-hour manhunt that ended with his arrest in a local forest. Upon his capture, Burakov admitted to the crime and has since been charged with attempted murder of law enforcement officers and illegal possession of firearms. The incident has stirred significant concern as it highlights the dangers faced by officers in the line of duty, particularly in operations related to drug trafficking. The Russian Ministry of Internal Affairs and the Russian Federal Security Service are continuing their investigation into the circumstances surrounding the shooting and the subsequent flight and arrest of Burakov.
GPT-4o output:
Phi-3-mini-instruct-graph output:
As we can see in the comparison, the entities are exactly the same with some differences in the description of relationships. However, even with these differences, the correct meaning of the story is conveyed in both versions.
2) Low similarity
The second example is of a response with low JSON similarity (this means less similarity to the model we want to imitate, which is GPT-4o).
Story: In a surprising display of resilience, the US labor market added 272,000 jobs in May, far exceeding the Dow Jones consensus estimate of 190,000 and countering the narrative of a labor market slowdown. Despite the robust job growth, the unemployment rate ticked up to 4%, the highest it has been since January 2022. The sectors of healthcare, government, and leisure and hospitality were the main drivers of this growth, while average hourly earnings increased, suggesting a continued trend of wage growth. The implications of the strong job report are significant for the Federal Reserve’s monetary policy. Initially, there were expectations that the Fed might cut interest rates to support the economy. However, the unexpected surge in job creation and wage increases may lead the Fed to hold off on any rate cuts, with some experts now predicting that the first rate cut might not occur until September. The labor market’s strength is seen as a key factor that could keep the Fed in a holding pattern. The stock and bond markets reacted negatively to the report, with S&P 500 futures dropping and government bond yields rising, reflecting investor concerns that the Fed might delay interest rate cuts due to the strong job market data. The current situation presents a complex scenario for the Fed, which is balancing the need to manage inflation with the desire to support economic growth and maintain labor market strength.
GPT-4o output:
Phi-3-mini-instruct-graph output:
As we can see in this comparison, even with low similarity, the fine-tuned Phi-3 better captures the details of the story through entities and their relationships, keeping the US labor market as the main entity and maintaining the connection to the Federal Reserve (which GPT-4o does not).
3) JSON consistency
The json_consistency metric measures the coherence between nodes and edges by verifying that every edge references existing entities. Our fine-tuned model achieved 99% on this metric, proving superior to Claude 3.5 Sonnet (97%). Below, we show a specific example of low consistency in Claude 3.5 Sonnet and compare it with the fine-tuned Phi-3.
Story: In a tragic confrontation, a 16-year-old boy was fatally shot by police in Perth, Western Australia, after stabbing a man and refusing to surrender his weapon. The incident, which took place in the suburb of Willetton, has been characterized by authorities as having ‘hallmarks’ of terrorism due to the boy’s reported online radicalization. He had been previously identified as a risk and was part of a de-radicalization program. The police had been tipped off about a potential attack the evening before, but were unable to prevent the stabbing. When they arrived at the scene, the boy, armed with a 30-centimeter kitchen knife, charged at the officers despite being tased twice and was subsequently shot. The victim of the stabbing is currently in a stable but critical condition. This event has raised significant concerns regarding the spread of radicalization among young people in Australia and the challenges of intervening effectively. Western Australia Premier Roger Cook and Australian Prime Minister Anthony Albanese have both addressed the incident, emphasizing the country’s commitment to combating violent extremism. The incident is under investigation, and a meeting between religious leaders and city authorities has been scheduled to address community concerns.
Claude 3.5 Sonnet output:
Phi-3-mini-instruct-graph output:
In this case, Claude 3.5 Sonnet scores 57% consistency versus 100% for the fine-tuned Phi-3. Of Claude's 7 edges, 3 are invalid because they reference a non-existent node called “Incident”. As a result, the fine-tuned Phi-3 conveys more of the story's details.
Post-training evaluation
The results of our post training evaluation, presented below through comparative charts, demonstrate the effectiveness of our fine-tuning process and offer valuable insights into the capabilities and limitations of each model in the task of generating entity relationship graphs. As we can see in the table below, both Claude Sonnet 3.5 and Phi-3-mini-instruct-graph show zero errors, indicating perfect performance in this metric. However, the Phi-3-mini-4k-instruct (base) model exhibits 5 errors, which represents 2.5% of the total. This suggests that our fine-tuning process significantly improved the base Phi-3 model’s performance, bringing it on par with the more advanced Claude Sonnet 3.5 in terms of JSON output error reduction.
Results
This set of bar charts provides a comprehensive comparison of four key metrics across the three models. For Node Similarity, Phi-3 Fine-tuned outperforms both Claude Sonnet 3.5 and the base Phi-3 model, achieving a score of 0.78. In Edge Similarity, Phi-3 Fine-tuned again leads with a score of 0.49, showing significant improvement over the base model. The JSON Consistency metric reveals high performance across all models, with Phi-3 Fine-tuned slightly edging out the others at 0.99. Finally, the JSON Similarity Average, which is calculated as the mean of Node Similarity, Edge Similarity, and JSON Consistency, shows Phi-3 Fine-tuned maintaining its lead with a score of 0.75. This composite metric provides a holistic view of each model’s performance across all aspects of JSON structure and content similarity. These results demonstrate that our fine-tuning process successfully enhanced Phi-3’s performance across all measured aspects, often surpassing the capabilities of Claude Sonnet 3.5.
The box plots offer a more detailed view of the distribution of scores for each metric. For Node Similarity, Phi-3 Fine-tuned shows a higher median and a tighter interquartile range, indicating more consistent performance. In Edge Similarity, while Phi-3 Fine-tuned has a higher median, it also shows more variability, suggesting room for further improvement in consistency. The JSON Consistency plot reveals that all models perform exceptionally well, with Phi-3 Fine-tuned showing the least variability. Lastly, the JSON Similarity Average plot, representing the combined performance across Nodes, Edges, and Consistency metrics, demonstrates that Phi-3 Fine-tuned not only has the highest median score but also maintains a relatively tight distribution. This showcases its robust and consistent performance across various test cases, balancing strengths in all three component metrics. These detailed distributions reinforce the success of our fine-tuning approach while also highlighting areas for potential future enhancements.
Cost comparison
As we mentioned earlier, our goal is to obtain entity relationships for approximately 500,000 articles per day. Using an LLM via API makes this a costly venture. Let's simulate a cost comparison (with July 2024 pricing) between GPT-4o and our fine-tuned Phi-3. For this, we'll assume an average of 905 prompt tokens and 525 output tokens, which is the average we observed when running our fine-tuned Phi-3 on the test dataset. Token counts vary with the tokenizer, but we'll use Phi-3's as the basis for comparison.
Hosting 2x A100 SXM (runpod.io)
$3.88/hour => $93.12/day
OpenAI (GPT-4o)
Input pricing: 271.5M (905 x 300k) tokens at $5.00/1M: $1,357.50/day
Output pricing: 157.5M (525 x 300k) tokens at $15.00/1M: $2,362.50/day
Total: $3,720.00/day
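The same arithmetic spelled out in code, as a quick sanity check of the figures above:

PROMPT_TOKENS, OUTPUT_TOKENS = 905, 525  # averages from our test runs
ARTICLES = 300_000                       # volume used in the simulation above

# GPT-4o API pricing, July 2024 (USD per 1M tokens)
input_cost = PROMPT_TOKENS * ARTICLES / 1e6 * 5.00    # $1,357.50/day
output_cost = OUTPUT_TOKENS * ARTICLES / 1e6 * 15.00  # $2,362.50/day
api_total = input_cost + output_cost                  # $3,720.00/day

# Self-hosting: 2x A100 SXM at $1.94 per GPU-hour on runpod.io
hosting_total = 2 * 1.94 * 24                         # $93.12/day

print(f"GPT-4o API: ${api_total:,.2f}/day vs hosting: ${hosting_total:,.2f}/day")
print(f"cost reduction: {1 - hosting_total / api_total:.1%}")  # 97.5%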
The cost comparison reveals a significant difference between a fine-tuned Phi-3 hosted on two A100 GPUs and OpenAI's GPT-4o API. Hosting the fine-tuned model costs approximately $93.12 per day, while using the GPT-4o API for the same volume of work would cost around $3,720.00 per day, a cost reduction of roughly 97.5% with the fine-tuned model. The substantial savings are primarily due to the elimination of per-token pricing and the ability to process a large volume of requests on dedicated hardware. However, it's important to note that this comparison doesn't account for the initial costs of fine-tuning or potential differences in output quality. For example, our training process for Phi-3-mini took almost 3 hours on an A100 SXM (currently $1.94 per hour on runpod.io), which adds to the upfront investment. Nevertheless, for high-volume applications like ours, the economic benefits of using a fine-tuned model are clear and significant.
Limitations
Most approaches to entity and relationship extraction constrain the types of entities and relationships. As our main goal is to use this model to explain articles and stories and to capture as many details as possible, we decided to let the model define these types freely. We understand that this may bring disadvantages due to the lack of standardization; however, the result is a model richer in detail.
Another important point is that we don’t have a human-guaranteed ground truth. In our case, we used GPT-4o, which was the state-of-the-art model when our dataset was generated. The fine-tuning was done using this reference. However, even if we had a better ground truth for our dataset, it might still be biased and not “perfect” because, as we showed in the qualitative comparisons above, entity relationships are very complex and subjective in many ways.
Models
We are pleased to make our models publicly available on Hugging Face in three versions, linked below:
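As a quick start, the snippet below loads the mini variant with the standard transformers text-generation pipeline. The repo id and the system prompt shown here are illustrative; see the links below for the exact paths and the model cards for the expected prompt format:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Illustrative repo id; swap in the medium 4k/128k variants as needed.
model_id = "EmergentMethods/Phi-3-mini-4k-instruct-graph"

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    # Placeholder system prompt; the released models expect the prompt
    # format documented on their model cards.
    {"role": "system", "content": "Extract an entity relationship graph from the "
     "user's text as JSON with 'nodes' and 'edges', as described above."},
    {"role": "user", "content": "Viktor Orban stated that the United States ..."},
]
output = generator(messages, max_new_tokens=512, return_full_text=False)
print(output[0]["generated_text"])  # the graph JSON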