LLMs, Knowledge Graphs, and the Fine-Tuning Landscape

The Incompatibility of Language Models and Knowledge Graphs

Large language models and knowledge graphs solve fundamentally different problems with fundamentally different architectures. LLMs encode probabilistic relationships between tokens learned from billions of text sequences. Knowledge graphs encode explicit, structured relationships between entities stored as directed graphs with typed edges. They are, in a deep sense, incompatible.

This incompatibility is not a practical engineering problem to be solved with a clever adapter. It is a structural mismatch. An LLM's knowledge is distributed across parameters in a dense, compressed format. A knowledge graph's knowledge is localized in nodes and edges in a sparse, explicit format. Bridging them in either direction requires translating between dense distributed representations and sparse discrete structures.

The question then becomes: can we make them work together anyway?

Neo4j Queries as a RAG Alternative

The most direct approach is to have the LLM generate knowledge graph queries instead of retrieving text documents. Rather than embedding a query into a vector store and finding similar text chunks (standard RAG), the model generates a Neo4j Cypher query that traverses the graph structure, retrieves entities and their relationships, and returns structured results.

The appeal is straightforward. Knowledge graphs excel at multi-hop reasoning across entities — exactly the kind of structured relational reasoning that LLMs often fail at. If the model can generate Cypher queries, it gains access to precise, deterministic relationship traversal without the hallucination risk of relying solely on parametric memory.

The challenge is query generation quality. Generating correct Cypher requires understanding the graph schema, the types of nodes and edges, and the semantics of the traversal. For small, well-defined graphs, this is tractable. For large, heterogeneous graphs, it is difficult. But the approach is worth pursuing because it sidesteps the harder problem of graph embedding altogether.
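One way to contain the query-quality risk is to put the schema into the prompt and to check the generated Cypher against that schema before running it. The sketch below is a minimal illustration under stated assumptions: the two-label schema is hypothetical, the checks use simple regular expressions rather than a real Cypher parser, and the names SCHEMA, build_prompt, and validate_cypher are illustrative and not part of any Neo4j API.

```python
import re

# Hypothetical schema for illustration; a real one would be read from the database.
SCHEMA = {
    "labels": {"Person", "Company"},
    "relationships": {"WORKS_AT", "FOUNDED"},
}

# Clauses that modify the graph; a read-only retrieval step should never emit them.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def build_prompt(question: str, schema: dict) -> str:
    """Embed the schema in the prompt so the model sees the valid labels and edge types."""
    labels = ", ".join(sorted(schema["labels"]))
    rels = ", ".join(sorted(schema["relationships"]))
    return (
        f"Node labels: {labels}\n"
        f"Relationship types: {rels}\n"
        "Write one read-only Cypher query that answers the question.\n"
        f"Question: {question}\n"
    )


def validate_cypher(query: str, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the query passed these checks."""
    problems = []
    if WRITE_CLAUSES.search(query):
        problems.append("query contains a write clause")
    # Node labels appear as (var:Label); relationship types appear as [var:TYPE].
    for label in re.findall(r"\(\s*\w*\s*:\s*(\w+)", query):
        if label not in schema["labels"]:
            problems.append(f"unknown label: {label}")
    for rel in re.findall(r"\[\s*\w*\s*:\s*(\w+)", query):
        if rel not in schema["relationships"]:
            problems.append(f"unknown relationship: {rel}")
    return problems
```

A query that fails validation can be sent back to the model together with the list of problems. Checks like these catch schema mismatches only; they say nothing about whether the traversal answers the question that was asked.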

Why Graph Embedding Is Not Practical at Industry Scale

Graph embedding — mapping nodes and edges into a continuous vector space for retrieval — is a well-studied research topic. In practice, it rarely works at scale. The reasons are computational and architectural:

Graph embeddings require training on the graph structure itself, which means retraining whenever the graph changes.
The embedding space is specific to the graph topology. Different graphs produce incompatible embeddings.
Combining graph embeddings with text embeddings for unified retrieval requires either joint training (expensive) or cross-modal alignment (fragile).
Fine-tuning a full LLM on domain-specific graph embeddings would require compute budgets that most organizations cannot sustain.

For academic demonstrations on small graphs, graph embedding works well. For operational systems with millions of entities and frequent updates, it does not.

Natural Language as a Universal Interface

Despite its limitations, natural language has one property that makes it indispensable: it is the common communication format between diverse models and external query systems. The fact that an LLM understands English and a database accepts SQL-like queries means there is a bridge, however imperfect. The model translates intent into query, the system returns structured results, and the model reconstructs a natural language response.

This translation pipeline is lossy, but it is the only pipeline that works universally across systems that were designed independently.
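The three steps of this pipeline can be sketched with an in-memory SQLite database standing in for the external query system. Everything model-related here is an assumption: translate_to_query and render_answer are hard-coded stand-ins for LLM calls, and the table and its single row are toy data invented for the example.

```python
import sqlite3


def translate_to_query(question: str) -> str:
    """Stand-in for the LLM call that maps intent to a query (hypothetical, hard-coded)."""
    if "founded" in question.lower():
        return "SELECT founder FROM companies WHERE name = 'Acme'"
    raise ValueError("this stub only handles one question")


def render_answer(question: str, rows: list[tuple]) -> str:
    """Stand-in for the LLM call that turns structured rows back into prose."""
    names = ", ".join(row[0] for row in rows) or "no one on record"
    return f"Acme was founded by {names}."


def answer(question: str, conn: sqlite3.Connection) -> str:
    query = translate_to_query(question)   # step 1: intent -> query
    rows = conn.execute(query).fetchall()  # step 2: query -> structured rows
    return render_answer(question, rows)   # step 3: rows -> natural language


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT, founder TEXT)")
conn.execute("INSERT INTO companies VALUES ('Acme', 'Jane Doe')")
print(answer("Who founded Acme?", conn))
```

The loss is visible in the shape of the code: step 3 sees only the returned rows, so anything the query failed to select is unavailable to the final response.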

Domain-Specific Models: From-Scratch vs. Fine-Tuning

The question of how to give domain-specific knowledge to an LLM has two answers, one practical and one theoretical.

From-scratch training of a small model on domain data is theoretically sound. If you have a modest corpus of domain-specific text, you can train a small model that encodes that knowledge in its parameters. This is what blockLLM and similar approaches demonstrate — domain-specialized small models trained on curated corpora. But from-scratch training has diminishing returns once you want the model to also be a general-purpose assistant capable of dialogue, coding, and reasoning.

Fine-tuning an existing LLM on domain data is what most organizations actually do. The idea is simple: take a capable open model, fine-tune it on domain documents plus dialogue data, and deploy the result. The domain knowledge gets encoded in the parameters, and the post-training alignment gives it conversational ability.

Evidence suggests this approach often outperforms a larger model with in-context learning alone. A fine-tuned 7B model with domain knowledge in its weights can be more accurate and more consistent than a 70B model prompted with domain documents, because the knowledge is not subject to retrieval failure or context dilution.

Is Pre-Training Supervised or Unsupervised?

This question has no clean answer, because the answer depends on the level of abstraction.

By definition, pre-training is supervised learning. The model minimizes cross-entropy loss over next-token prediction. For each prefix of an input sequence, the target is the token that comes next. This is a supervised prediction task. The label for each position is the actual token that follows it in the text.

By process, pre-training is self-supervised. No human labeling is involved. The data generates its own targets from its own structure. The model learns statistical relationships between tokens without any external supervision signal beyond the data itself.

Both descriptions are correct. Pre-training is supervised in its loss function and self-supervised in its data pipeline. The distinction matters less in practice than in taxonomy, but it illuminates what the model is actually learning: token-level statistical dependencies in the absence of semantic labels.
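Both views show up in a few lines of code. The sketch below uses toy token ids and a stand-in uniform "model", both assumptions made for illustration: the targets are produced by shifting the sequence (no human labels), and the loss is ordinary cross-entropy against those targets.

```python
import math


def next_token_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    """Build (context, target) pairs from raw text alone: the target is the next token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(probs: list[list[float]], targets: list[int]) -> float:
    """Mean negative log-probability the model assigned to each true next token."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


tokens = [3, 1, 4, 1]                      # toy token ids, vocabulary size 5
pairs = next_token_pairs(tokens)
targets = [target for _, target in pairs]  # [1, 4, 1]: labels come from the data itself

# A stand-in model that predicts uniformly over 5 tokens scores log(5) per position.
uniform = [[0.2] * 5 for _ in targets]
loss = cross_entropy(uniform, targets)
```

The function next_token_pairs is the self-supervised part, and cross_entropy is the supervised part. Neither needs to know where the text came from.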

The same logic applies to domain-specific fine-tuning. Phase one, encoding domain knowledge through next-token (or masked-token) prediction on domain documents, is self-supervised learning of domain-specific token distributions. Phase two, instruction tuning and dialogue alignment, is supervised learning of response format and interaction patterns.

The Erosion of Fine-Tuning Accessibility

OpenAI once offered a service where users could upload their own corpora and train custom models through the API. This was available from the early days of ChatGPT and extended to models such as GPT-4 Turbo. As of the most recent model releases, this custom fine-tuning capability has been deprioritized, and the most advanced models do not support user-level fine-tuning.

More broadly, the trend toward quantized model deployment has made fine-tuning harder across the board. Most fine-tuning methods require full-precision weights. Full-precision 7B models require substantial GPU memory. LoRA and other PEFT methods reduce memory requirements but operate within a narrow low-rank subspace, and their capacity to encode new knowledge is limited and poorly understood.
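A back-of-the-envelope calculation makes the memory gap concrete. The numbers below rest on assumptions that are not from the text above: a common mixed-precision Adam recipe costing 16 bytes per parameter (bf16 weights and gradients, an fp32 master copy, and two fp32 Adam moments), activations ignored, and a hypothetical 7B layout of 32 layers with hidden size 4096 and rank-8 adapters on four square attention projections per layer.

```python
def full_finetune_bytes(n_params: int) -> int:
    """Weights + gradients + optimizer state under the assumed mixed-precision Adam recipe."""
    bytes_per_param = 2 + 2 + 4 + 4 + 4  # bf16 weights, bf16 grads, fp32 master, Adam m, Adam v
    return n_params * bytes_per_param


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter on a d_out x d_in matrix adds two factors: r*d_in and d_out*r."""
    return rank * (d_in + d_out)


n_params = 7_000_000_000
full_gb = full_finetune_bytes(n_params) / 1e9       # 112.0 GB before activations

# Assumed shape: 32 layers, hidden size 4096, adapters on 4 square attention projections.
adapter = 32 * 4 * lora_params(4096, 4096, rank=8)  # 8,388,608 trainable parameters
fraction = adapter / n_params                       # roughly 0.12% of the model
```

Under these assumptions, full fine-tuning needs on the order of 112 GB for model state alone, while the adapter trains about a tenth of a percent of the parameters. That small fraction is where the memory saving comes from, and it is also why the capacity question is open.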

The practical consequence is that the fine-tuning approach — once the most accessible path to domain-specific models — is becoming less accessible over time. Organizations that previously fine-tuned open models are finding that quantized deployments make fine-tuning impractical, and PEFT methods may not encode enough domain knowledge to justify the effort.

The Reasoning Model Fine-Tuning Hazard

A more recent and more specific problem is the fragility of chain-of-thought (CoT) reasoning under fine-tuning.

Reasoning models, meaning models that produce a visible chain of thought between <think> and </think> tokens before generating their final answer, derive their reasoning capability from post-training alignment, not from pre-training weights. The think-token output pattern is learned during supervised fine-tuning on reasoning traces. If you then fine-tune the same model on domain data that does not include reasoning traces, the think-token capability degrades. In extreme cases, it collapses entirely.

This has been observed empirically. A fine-tuned Qwen model without sufficient CoT data in the training mix loses its reasoning mode. The model stops producing coherent chains of thought. The output quality drops, and the model's most distinctive capability — the one that justified its adoption in the first place — is destroyed.

The mitigation, as far as practitioners can tell, is a data mix ratio of roughly 3:1 — three parts reasoning-trace data to one part non-reasoning data. This preserves the think-token output pattern while allowing domain knowledge to be encoded. But acquiring sufficient reasoning-trace data for any non-trivial domain is labor-intensive.
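Assembling such a mix is mechanical once the two pools exist. The sketch below is one possible way to do it, not an established recipe: mix_datasets is a hypothetical helper, the default ratio of 3 simply mirrors the rough figure above, and the examples are placeholder strings.

```python
import random


def mix_datasets(
    reasoning: list[str], plain: list[str], ratio: int = 3, seed: int = 0
) -> list[str]:
    """Combine examples so reasoning traces outnumber plain examples by `ratio` to 1."""
    rng = random.Random(seed)
    # Traces are the scarce resource: use only as many plain examples as they can cover.
    n_plain = min(len(plain), len(reasoning) // ratio)
    n_reasoning = n_plain * ratio
    mixed = rng.sample(reasoning, n_reasoning) + rng.sample(plain, n_plain)
    rng.shuffle(mixed)  # shuffle so the two kinds are not grouped together
    return mixed


reasoning_pool = [f"trace-{i}" for i in range(10)]
plain_pool = [f"doc-{i}" for i in range(5)]
mixed = mix_datasets(reasoning_pool, plain_pool)
```

With 10 traces and 5 plain examples, the helper keeps 9 traces and 3 plain examples. The supply of traces, not the supply of domain documents, caps the size of the training set.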

This is a critical insight for the fine-tuning community. Fine-tuning a reasoning model is not the same as fine-tuning a standard model. The post-train capabilities are more fragile and require more careful data management.

Synthetic Data: The Path Forward for Small Labs

Given all of this, the trajectory becomes clear. Small labs and individual researchers cannot compete on compute, cannot afford full-precision fine-tuning at scale, and cannot rely on the accessibility of custom fine-tuning APIs. Their path is synthetic data.

The synthetic data pipeline works as follows:

A strong open-source model (e.g., a 70B or larger model) is given domain-specific documents in its context window through in-context learning.
The model generates reasoning traces — chain-of-thought outputs that demonstrate how to apply domain knowledge to problems.
These generated traces become training data for a smaller model.
The smaller model is fine-tuned on this synthetic CoT data, mixed with the raw domain documents.

The result is a smaller model that has encoded domain knowledge through self-supervised learning on the raw documents and has learned reasoning patterns through supervised learning on the synthesized traces. The inference cost drops by orders of magnitude, while the capability remains competitive.
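The four pipeline steps above can be sketched as plain functions. This is an outline under assumptions rather than a working distillation setup: the teacher is passed in as a callable, fake_teacher is a hypothetical stand-in that returns a canned trace, and the record fields (prompt, completion, text) are illustrative rather than the format of any particular training library.

```python
from typing import Callable


def teacher_prompt(document: str, question: str) -> str:
    """Step 1: place the domain document in the teacher's context window."""
    return f"Document:\n{document}\n\nQuestion: {question}\nThink step by step, then answer."


def synthesize_traces(
    documents: list[str],
    questions: list[str],
    teacher: Callable[[str], str],
) -> list[dict]:
    """Steps 2-3: collect teacher outputs as supervised examples for the student."""
    examples = []
    for doc in documents:
        for q in questions:
            trace = teacher(teacher_prompt(doc, q))
            if trace.strip():  # drop empty generations; real curation would be stricter
                examples.append({"prompt": q, "completion": trace})
    return examples


def build_training_set(documents: list[str], traces: list[dict]) -> list[dict]:
    """Step 4: mix raw documents (next-token objective) with traces (supervised objective)."""
    raw = [{"text": d} for d in documents]
    return raw + traces


def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large model; returns a canned trace."""
    return "<think>The document states the limit is 30 days.</think> 30 days."
```

Note that the student's prompt contains only the question, not the document. That choice forces the domain knowledge to come from the student's weights instead of its context.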

The quality of the output depends on the quality of the generation, which depends on the quality of the source model and the selection of domain data. There is an ongoing loop of human curation and automated generation. But this is the most practical path for labs that cannot afford to train large models from scratch or fine-tune them at full precision.

Why Companies Fear Model Distillation

All of this explains why major model providers publicly resist distillation. When a small lab distills a strong model's capabilities into a smaller, domain-specific model, the resulting system can deliver comparable performance at a fraction of the inference cost. If enough labs do this at scale, the demand for the original model's API decreases. The economic advantage of proprietary models erodes.

The companies' position is understandable from a business perspective. But from a technical perspective, distillation is simply the most efficient way to transfer knowledge between models of different capacities. Whether it should be restricted is a policy question. Whether it is inevitable is not.

Conclusion

The landscape of LLM deployment is shifting. Freely available pre-training data is close to exhausted. Fine-tuning is becoming less accessible. Reasoning capabilities are fragile under naive fine-tuning. Knowledge graphs and language models remain structurally incompatible despite the value of bridging them. And synthetic data distillation is becoming the dominant path for capacity-constrained researchers.

None of this is a crisis. It is a recalibration. The field is moving from the era of scaling on freely available data to the era of strategic data curation, targeted fine-tuning with careful data mix design, and systematic synthetic data generation. The models that do well in this era will not necessarily be the largest. They will be the ones whose training data is the most carefully constructed.