Machine Reading Comprehension Using Knowledge Graph

🔍 The Problem

Legal and medical text is exactly where keyword search breaks down — the meaning lives in relationships between entities (which clause modifies which obligation, which drug interacts with which condition), not in matching strings. Linear, keyword-based retrieval over 10M+ documents was both semantically wrong and too slow to be usable at that scale.

🧠 The Decision

Rather than a separate model for each sub-task, I trained one multi-task RoBERTa model to handle entity recognition, relation extraction, and coreference resolution simultaneously, fine-tuned on legal and medical corpora. A multi-task architecture is harder to train well than three separate single-purpose models — the tasks can interfere with each other during training — but it's what let a single fine-tuning pass produce output structured enough to assemble directly into a graph, instead of reconciling three independent models' outputs downstream.
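To make "structured enough to assemble directly into a graph" concrete, here is a minimal sketch of how the three task outputs might be merged into triples. The `Entity` and `Relation` types, the `assemble_triples` helper, and the toy drug example are all illustrative assumptions, not the project's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: int
    text: str
    label: str


@dataclass(frozen=True)
class Relation:
    head: int  # id of the head entity mention
    tail: int  # id of the tail entity mention
    type: str


def assemble_triples(entities, relations, coref_clusters):
    """Merge coreferent mentions, then emit (head, relation, tail) triples."""
    by_id = {e.id: e for e in entities}
    # Every mention starts as its own canonical mention.
    canonical = {e.id: e.id for e in entities}
    # Assumption: the first mention in each cluster is the canonical one.
    for cluster in coref_clusters:
        representative = cluster[0]
        for mention_id in cluster:
            canonical[mention_id] = representative
    triples = set()
    for rel in relations:
        head, tail = canonical[rel.head], canonical[rel.tail]
        if head != tail:  # drop self-loops created by merging mentions
            triples.add((by_id[head].text, rel.type, by_id[tail].text))
    return sorted(triples)


# Toy outputs standing in for the three task heads of one model pass.
entities = [
    Entity(0, "warfarin", "DRUG"),
    Entity(1, "the anticoagulant", "DRUG"),
    Entity(2, "aspirin", "DRUG"),
]
relations = [Relation(head=1, tail=2, type="INTERACTS_WITH")]
coref_clusters = [[0, 1]]  # "the anticoagulant" refers to "warfarin"

triples = assemble_triples(entities, relations, coref_clusters)
print(triples)
```

Because all three outputs share one set of mention ids, the coreference step is a dictionary lookup rather than a fuzzy alignment between separately tokenized model outputs.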

For the query side, once the graph reached millions of entities, naive traversal wasn't going to hold up. I added GNN-based query optimization with indexed traversal and semantic caching specifically for the complex, multi-hop relationship queries that dominate legal and medical lookups — a plain search index handles simple lookups fine, but falls over on "find everything connected to X through a specific relationship type," which is the query shape this domain actually needed.
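The query shape described above can be sketched with a small in-memory index. This covers only the indexed-traversal idea, not the GNN-based optimization, and the `RelationIndex` class, its method names, and the clause data are hypothetical.

```python
from collections import defaultdict, deque


class RelationIndex:
    """Adjacency index keyed by (node, relation type)."""

    def __init__(self):
        self._out = defaultdict(set)  # (node, rel_type) -> set of neighbours

    def add_edge(self, head, rel_type, tail):
        self._out[(head, rel_type)].add(tail)

    def reachable(self, start, rel_type, max_hops):
        """Return nodes reachable from start via rel_type edges within max_hops."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue  # hop budget exhausted on this path
            # Direct lookup: edges of other relation types are never scanned.
            for neighbour in self._out.get((node, rel_type), ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    frontier.append((neighbour, depth + 1))
        seen.discard(start)
        return seen


index = RelationIndex()
index.add_edge("Clause 4", "MODIFIES", "Clause 2")
index.add_edge("Clause 2", "MODIFIES", "Obligation A")
index.add_edge("Clause 4", "REFERENCES", "Clause 9")

two_hop = index.reachable("Clause 4", "MODIFIES", max_hops=2)
one_hop = index.reachable("Clause 4", "MODIFIES", max_hops=1)
print(sorted(two_hop))
```

Keying the index on the relation type means each hop costs one lookup per node, instead of filtering every outgoing edge of that node.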

🏗️ How It Was Built

Figure: Knowledge graph construction pipeline, showing the big picture and then two independent loops: entity extraction and graph query.

Two loops:

Extract. Domain documents go through the fine-tuned multi-task RoBERTa model for entity + relation extraction, then automated graph assembly, then an ML-based validation pass that checks consistency before anything is added to the production graph.
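As a rough illustration of gating triples before they reach the production graph, the sketch below uses a rule-based schema check. This is a deliberately simpler stand-in for the ML-based validation pass described above, and the `SCHEMA` table, the `validate` function, and the example data are assumptions made for illustration.

```python
# Hypothetical schema: relation type -> (required head label, required tail label)
SCHEMA = {
    "INTERACTS_WITH": ("DRUG", "DRUG"),
    "TREATS": ("DRUG", "CONDITION"),
}


def validate(candidates, entity_labels, schema=SCHEMA):
    """Split candidate triples into accepted and rejected lists."""
    accepted, rejected = [], []
    for head, rel_type, tail in candidates:
        expected = schema.get(rel_type)
        actual = (entity_labels.get(head), entity_labels.get(tail))
        # Reject unknown relation types and label mismatches alike.
        if expected is not None and actual == expected:
            accepted.append((head, rel_type, tail))
        else:
            rejected.append((head, rel_type, tail))
    return accepted, rejected


entity_labels = {
    "warfarin": "DRUG",
    "aspirin": "DRUG",
    "atrial fibrillation": "CONDITION",
}
candidates = [
    ("warfarin", "INTERACTS_WITH", "aspirin"),
    ("atrial fibrillation", "TREATS", "warfarin"),  # head and tail are swapped
]

accepted, rejected = validate(candidates, entity_labels)
print(accepted)
print(rejected)
```

Only the `accepted` list would be written to the graph. Rejected triples could be logged for review rather than silently dropped.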

Query. The graph — sharded to handle millions of entities — is queried through GNN-based traversal and semantic caching, with transformer embeddings providing contextually-aware search rather than plain keyword matching.
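Semantic caching can be illustrated as follows: a query whose embedding is close enough to a previously answered one reuses the stored result instead of traversing the graph again. The `SemanticCache` class, the 0.95 threshold, and the two-dimensional toy vectors are assumptions. A real system would use transformer embeddings and most likely an approximate nearest-neighbour index rather than the linear scan shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, result) pairs

    def put(self, embedding, result):
        self._entries.append((embedding, result))

    def get(self, embedding):
        """Return the result of the most similar cached query, or None on a miss."""
        best_score, best_result = 0.0, None
        for stored, result in self._entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], ["Clause 2", "Obligation A"])

hit = cache.get([0.99, 0.05])  # near-duplicate query: served from the cache
miss = cache.get([0.0, 1.0])   # unrelated query: falls through to the graph
print(hit, miss)
```

The threshold trades hit rate against the risk of returning an answer to a subtly different question, which matters in legal and medical lookups.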

📈 Impact & Results

92% F1-score in domain-specific entity extraction, from the fine-tuned multi-task RoBERTa model
60% faster queries via indexed graph traversal and semantic caching, at a scale of 10M+ processed documents and millions of graph entities
25% better recommendation accuracy from semantic graph representations replacing keyword-based retrieval
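For readers unfamiliar with the metric, entity-level F1 is the harmonic mean of precision and recall over predicted entity spans. The sketch below assumes exact-match scoring on `(start, end, label)` tuples. The write-up does not say which matching scheme was used, and the spans are toy values, not project data.

```python
def entity_f1(predicted, gold):
    """Exact-match F1 over sets of (start, end, label) entity spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy spans: two of three predictions match the gold annotations exactly.
gold = {(0, 8, "DRUG"), (20, 27, "DRUG"), (40, 59, "CONDITION")}
predicted = {(0, 8, "DRUG"), (20, 27, "DRUG"), (30, 35, "DRUG")}

score = entity_f1(predicted, gold)
print(round(score, 3))
```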

This work established an end-to-end approach — raw text to queryable knowledge graph — that's held up across both legal and medical domains, and has been presented at academic conferences.
