Methodologies and practices for building AI systems: approaches such as RAG, prompt engineering, agent design patterns and evaluation methodologies. The “how” of AI development.

Adopt

Mature, well-supported approaches ready for production use.

Classical ML

Classical machine learning approaches such as random forests, gradient boosting (XGBoost, LightGBM), linear/logistic regression and support vector machines remain the best balance of explainability and efficiency for structured data problems. These techniques routinely outperform more complex approaches on tabular data while training faster and costing less to run.

Realising these benefits requires quality training data and staff with appropriate expertise. Unlike LLM-based solutions that have democratised AI for organisations without data science teams, classical ML demands specialised knowledge in feature engineering and model selection. For organisations with the necessary capabilities, these methods work well even with smaller enterprise datasets, matching or exceeding the performance of more complex approaches while remaining more interpretable and easier to maintain.

RAG

Retrieval-Augmented Generation (RAG) combines search and text generation to produce more accurate responses, helping prevent confabulation by grounding responses in real data. The technique is particularly valuable when accuracy and traceability are crucial, such as in customer service or compliance scenarios. While implementing RAG requires attention to document processing and embedding strategies, widespread tooling has lowered the barriers to adoption.

We’re monitoring how RAG develops alongside techniques such as Self-RAG, which recognises when more evidence needs to be gathered or responses refined. This self-criticism mechanism has shown promising results in reducing confabulations.

RAG introduces an indirect prompt injection attack surface. Retrieved documents are injected into model context, so adversarial content in any document reaches the prompt via the retrieval step. Retrieval access controls and provenance tracking for ingested documents help mitigate this risk.

See also Cross-encoder reranking, Structured RAG, Hypothetical document embeddings (HyDE).

LLM-as-a-judge

LLM-as-a-judge has proven one of the most practical techniques for evaluating AI system outputs. Today’s strongest models provide nuanced, multidimensional critique that simpler evaluation methods cannot match, except for very constrained metrics such as exact match or BLEU scores.

The technique is widely adopted in both offline and online evaluation. Offline, it scales far better than human assessment, allowing teams to test thousands of outputs quickly. Online, an LLM judge can evaluate another LLM’s output in real-time, enabling dynamic workflow adjustments based on quality assessments.

Research demonstrates that frontier models provide judgements correlating strongly with human preferences across many evaluation dimensions. We recommend using a different LLM as the judge than the one being evaluated, and viewing this as an augmentation to human evaluation rather than a replacement. The strongest LLMs can identify nuanced issues in reasoning and factuality that would otherwise require substantial human review time.

See also: DeepEval

BERT variants

Bidirectional Encoder Representations from Transformers (BERT) revolutionised NLP by processing words in relation to their entire context rather than sequentially. The original BERT spawned a family of variants, with ModernBERT representing the latest evolution, improving training times and accuracy through architectural updates.

BERT-style models serve fundamentally different purposes than generative models such as GPT. Where GPT models excel at generating text, BERT models are optimised for understanding and analysis tasks such as classification and sentiment analysis. They’re particularly valuable for creating semantic vector embeddings, making them essential components in RAG systems where BERT embeddings retrieve relevant information that generative models then use for output.

We recommend DeBERTa for new NLP projects, as it handles word relationships more effectively using a disentangled attention mechanism. DistilBERT is smaller and faster whilst retaining most performance, valuable for production deployments with strict latency requirements. Domain-specific variants exist for biomedical (BioBERT) and financial text (FinBERT), though these require expertise to use effectively.

Few-shot prompting

Providing examples to guide an AI model’s responses has proven consistently effective across Large Language Models.

This is shifting. As models become more capable, interactive multi-turn approaches are gaining favour: rather than providing examples upfront, practitioners prompt models to ask clarifying questions and iterate toward a solution. This collaborative pattern often produces better results, particularly in agentic workflows where the model can refine its approach based on feedback.

However, few-shot prompting retains an important role in non-interactive contexts. System prompts and automated pipelines don’t afford clarifying dialogue. Here, well-chosen examples remain the most effective way to establish output format and domain conventions. We typically see diminishing returns beyond 3-5 examples, and the main trade-off remains token consumption.

Agentic tool use

We’ve moved agentic tool use to the Adopt ring for local, sandboxed environments. AI coding assistants that can edit files, run tests, execute shell commands and perform web searches deliver considerably more value than those limited to conversation.

The ecosystem has matured to support this. Standards such as MCP and OpenAI’s Function Calling provide reliable integration patterns, while improved observability tooling lets teams monitor what agents are doing. The Development Containers specification makes it straightforward to isolate agent execution.

The risks magnify for applications accepting external user input. Prompt injection attacks remain an unsolved problem. An agent that safely edits files for a developer becomes a liability when processing untrusted input. Our recommendation: adopt for local developer tooling and internal workflows, but proceed with caution for customer-facing systems, treating each tool permission as a potential attack vector.

See also: Visual computer use agents, Model Context Protocol, Temporal

Spec-driven development

Spec-driven development inverts the traditional workflow: specifications become the source of truth, and code is generated or verified against them. AI coding assistants work dramatically better with structured specifications than vague instructions. Without specifications, “vibe coding” works for prototypes but falls apart for production systems. Specifications constrain the AI, reduce confabulation, save tokens and produce more maintainable code.

The ecosystem is growing fast. GitHub Spec Kit supports 22+ AI agent platforms. Kiro offers a dedicated specs mode alongside conversational coding. We’re seeing this pattern emerge organically in teams using Claude Skills to codify project requirements as structured markdown that the AI references during generation.

There is a spectrum of formality, from informal markdown specifications through structured specification languages to full formal methods. The right level depends on what you need to assert about your software. The point is to give the AI an unambiguous reference rather than relying on conversational context that drifts with each interaction.

See also: Formal specification languages, Claude Skills

Trial

Promising approaches with growing adoption, worth exploring for teams ready to invest in emerging patterns.

Cross-encoder reranking

Cross-encoder reranking enhances AI search and chat systems by examining initial search results more carefully. While embedding search is fast and good at finding broadly relevant content, cross-encoder reranking excels at understanding subtle relevance signals by examining the query and potential results together.

Most teams use a two-step process: embedding search finds 50-100 potentially relevant items, then cross-encoder reranking sorts these candidates to surface the most relevant. The technique often reduces confabulations in downstream LLM responses by ensuring higher quality context selection. Implementation has become straightforward with libraries such as sentence-transformers providing ready-to-use models. Teams should be mindful of the additional latency and may need to tune the number of candidates based on performance requirements.

Ontologies for AI grounding

As AI systems scale beyond isolated experiments, shared meaning becomes critical infrastructure. Ontologies provide what LLMs lack: authoritative definitions of entities and relationships that don’t shift with statistical probability. They ground responses in agreed definitions, enable knowledge graph traversal that pure RAG cannot achieve and support the structured outputs that agentic systems require.

Traditional ontology development tends toward two failure modes: academic approaches aiming for formal completeness using OWL, and pragmatic approaches creating spreadsheets that grow unmaintainable. The key is to start lightweight and formalise selectively. Mark Burgess argues that traditional ontologies impose rigid hierarchies that don’t match how language models represent meaning, proposing alternative graph structures designed to work with vector embeddings. For organisations needing to ground AI in domain knowledge today, ontologies offer a practical path with mature tooling.

Graph databases such as Neo4j provide accessible implementation options, while LinkML offers YAML-based modelling without deep ontology expertise. Start with a painful, high-value domain rather than attempting to model the entire organisation.

See also: LinkML, Neurosymbolic AI, Prolog

Model distillation & synthetic data

Model distillation involves training a smaller, more efficient model to mimic a larger one. A common pattern uses LLMs to generate synthetic training data for the smaller model: the large LLM acts as a “teacher”, creating diverse examples that help the “student” learn desired behaviour. This makes AI deployment more practical for edge devices or resource-constrained environments.

We’re keeping it in Trial because the process requires considerable expertise. Teams need to validate the quality of generated training data and ensure the distilled model maintains acceptable performance. There is ongoing debate about amplification of biases through this approach.

Check the licence of models used for distillation. Llama forbids using its output to train other models. DeepSeek R1’s launch in January 2025 brought distillation into popular consciousness, as it has been widely assumed to represent a distillation of existing foundation models.

UMAP

UMAP (Uniform Manifold Approximation and Projection) enters our Trial ring as a promising dimensionality reduction technique that’s gaining traction in the AI community. While t-SNE has been the go-to choice for visualising high-dimensional data, UMAP offers better preservation of global structure and runs significantly faster, making it particularly valuable for large-scale AI applications such as exploring embedding spaces and analysing neural network activations.

We’re seeing successful applications across AI projects, especially for understanding LLM behaviours and exploring semantic relationships in vector spaces. Teams should invest time understanding UMAP’s parameters, which require careful tuning to avoid misleading visualisations.

The Python UMAP library provides extensive documentation and explanation, with implementations also available for Rust, Java and R.

Claude Skills

Claude Skills are reusable prompt templates that codify workflows and domain expertise into repeatable patterns for AI coding assistants. Our teams have found them valuable for drafting proposals, structured debugging, generating commit messages and writing PR descriptions. The common thread is tasks that benefit from consistent approach and structured output.

Skills provide a simpler solution than MCP servers for many problems. Where MCP requires implementing a server and managing the protocol lifecycle, Skills are markdown files that encode expertise directly. Skills work well with data that exists as files in your project, since the AI assistant can already read those. MCP extends reach to running services and systems beyond filesystem access.

We recommend teams start by identifying repetitive tasks where consistency matters, then experiment with Skills before investing in MCP server development.

Structured RAG

Structured RAG extends basic RAG by organising retrieved knowledge as graphs, schemas or typed records rather than flat text chunks. Microsoft’s GraphRAG uses an LLM to build a knowledge graph from source documents during indexing, then queries that graph at retrieval time. This addresses a weakness in standard RAG: questions requiring synthesis across many documents rather than finding a single relevant passage.

GraphRAG has matured since our last radar. LazyGraphRAG reduced indexing costs to a fraction of the original, removing the biggest barrier to adoption. Neo4j provides dedicated examples combining graph-based retrieval with LLM generation.

The trade-off remains upfront investment. Graph-based indexing requires more compute and design than vector-based RAG, and the knowledge graph must be maintained as source documents change. If your queries are primarily about finding relevant passages, standard RAG with cross-encoder reranking may suffice. If they require reasoning across documents, structured approaches earn their cost.

See also: RAG, Ontologies for AI grounding, Cross-encoder reranking

Assess

Emerging or specialised approaches that warrant investigation for specific use cases, but require careful evaluation before adoption.

Neurosymbolic AI

Neurosymbolic AI combines neural networks with symbolic reasoning to address fundamental limitations of pure LLM approaches. Neural networks excel at pattern recognition and handling ambiguity, while symbolic AI provides logical reasoning and explainable inference. LLMs understand natural language well but cannot guarantee rule compliance or explain their reasoning in auditable ways.

The root issue is architectural. LLMs operate through probabilistic pattern matching over language, not causal modelling. As Mark Burgess argues in his work on semantic spacetime, language models “paraphrase intentional knowledge” rather than tracing actual causal chains. Precise answers to precise questions require systems that explicitly encode what causes what.

This matters most in regulated sectors. Regulatory rules are non-negotiable constraints, not suggestions a model can approximate. Risk models need to know what entities are and how they relate, and compliance requires explainable decision trails. Similar pressures apply across financial services, healthcare and insurance.

Practical implementations range from lightweight to sophisticated. On the simpler end, teams constrain LLM outputs to valid ontology terms or use knowledge graphs to ground RAG retrieval. More advanced implementations use symbolic reasoning engines to validate LLM-generated conclusions. Renewed interest in Prolog reflects exploration of logic programming alongside LLMs.

We’ve placed this in Assess because production patterns are still emerging, but organisations in regulated sectors should be experimenting now.

See also: Prolog, Ontologies for AI grounding, Agentic tool use, World models

World models

World models sit in the Assess ring as an emerging alternative to pure language model architectures for tasks requiring causal reasoning and planning. Where LLMs predict the next token based on statistical patterns in text, world models build internal representations of how environments behave, enabling systems to simulate outcomes before acting.

The field is developing along several paths. Yann LeCun’s Joint Embedding Predictive Architecture (JEPA) learns by predicting missing information in an abstract embedding space rather than reconstructing raw pixels or tokens. Meta’s V-JEPA and VL-JEPA extend this to video and vision-language tasks with significantly fewer parameters than autoregressive alternatives. Karl Friston’s active inference framework, implemented by Verses AI in their AXIOM system, takes a different approach rooted in how biological systems model their environments. Rather than chasing reward signals, active inference agents build generative models and act to minimise prediction error, with Verses reporting 60% performance improvement using only 3% of comparable deep learning compute. Generative world models form a third strand, with NVIDIA Cosmos and Google DeepMind’s Genie 3 creating physically plausible simulated environments for training robots and autonomous systems.

For financial services, MarS from Microsoft Research demonstrates the pattern applied to market simulation, generating realistic interactive market scenarios for forecasting and anomaly detection without real capital at risk. The paper was accepted at ICLR 2025.

The enterprise value: these approaches offer causal modelling rather than statistical pattern matching. An LLM asked “what happens if I do X?” can only paraphrase similar scenarios from its training data. A world model can simulate the consequences. For teams wanting to experiment, Meta’s V-JEPA 2 and NVIDIA Cosmos models are available on HuggingFace under permissive licences.

See also: Neurosymbolic AI, Physical AI and robotics foundation models

LLM reproducibility

Large language models are non-deterministic even at temperature zero. This presents a fundamental challenge for regulated industries where Model Risk Management frameworks require reproducible, auditable decision-making. Banking regulations such as OCC/SR 11-7 assume a level of model stability that generative AI does not provide.

The root cause extends beyond floating-point arithmetic. Research demonstrates that batch-dependent kernel operations cause outputs to vary with server load rather than input alone. Smaller open weight models on controlled infrastructure tend to achieve more reproducible outputs than larger models served via shared APIs. Where stochastic behaviour is acceptable, the variation must be well-characterised so it can be explained to regulators as a designed property rather than an infrastructure artefact. Prompts and model versions should be treated as versioned code with change control and rollback procedures.

For teams requiring determinism from larger models, SGLang now offers deterministic inference building on batch-invariant operators, with the underlying research selected for oral presentation at NeurIPS 2025. Teams subject to MRM requirements should be actively evaluating their options now.

See also: Neurosymbolic AI, LLM-as-a-judge

Hypothetical document embeddings (HyDE)

HyDE (Hypothetical Document Embeddings) addresses a common problem in search systems: poor performance when searching content that differs from training data. HyDE asks a large language model to imagine what an ideal document answering the query might look like, bridging the gap between how users ask questions and how information is written.

The system creates several hypothetical documents, converts them into embeddings and blends them together. This averaged representation finds real documents that are mathematically similar, often leading to more relevant results than traditional methods. The approach is particularly effective within RAG systems where accurate retrieval is crucial. Teams should evaluate HyDE for cases where high-precision retrieval is needed and the additional latency is acceptable.

See also: RAG, BERT variants, Cross-encoder reranking

Fine-tuning with LoRA

Low-Rank Adaptation (LoRA) makes model customisation more practical by adding a small set of trainable parameters while keeping the original model unchanged, reducing computing requirements by 3-4 orders of magnitude while maintaining most of the performance of full fine-tuning.

Tools such as Lightning AI’s lit-gpt and axolotl support implementation. We place it in Assess rather than Trial because successfully applying LoRA still requires significant ML expertise and careful attention to training data quality. Fine-tuning ties you to a specific model architecture, and given the pace of AI advancement, tomorrow’s general-purpose models may outperform your carefully tuned older models. Migrating fine-tuned weights between architectures is particularly challenging. LoRA should only be deployed when the immediate business value clearly outweighs the technical and opportunity costs.

Physical AI and robotics foundation models

Physical AI represents the convergence of foundation model capabilities with robotics. Where traditional robotics relied on brittle, task-specific programming, robotics foundation models enable machines to generalise across tasks and adapt to novel situations.

The technical breakthrough is Vision-Language-Action (VLA) models, which extend vision-language models to include physical action outputs. NVIDIA’s Isaac GR00T N1 represents the first open humanoid robot foundation model, using a dual-system architecture that separates deliberate planning from rapid reactive control. Google’s Gemini Robotics is advancing similar capabilities. World Foundation Models complement these by enabling simulation-based training: NVIDIA Cosmos generates physically plausible synthetic environments that can train robots on scenarios too dangerous or rare to capture in the real world.

Production deployments remain concentrated in well-resourced organisations. The gap between research demonstrations and reliable industrial deployment is substantial. Hardware costs have decreased (capable platforms now available from 2,000comparedto2,000 compared to 75,000+ three years ago), but perception and control challenges in unstructured environments remain formidable. Organisations with physical AI ambitions should be experimenting, but should approach production timelines with caution.

See also: Digital twin platforms, World models

CaMeL

CaMeL (CApabilities for MachinE Learning) is a defence architecture from Google DeepMind for mitigating prompt injection in agentic systems. It splits responsibilities across two models: a privileged P-model processes only user instructions and outputs a program defining execution steps, while a quarantined Q-model handles external data but cannot call tools directly. A custom interpreter tracks data provenance, enforcing capability-based security so untrusted data cannot escalate privileges.

CaMeL solves 77% of tasks with provable security guarantees in the AgentDojo benchmark. There are limitations: users must define security policies (fatigue risk), running two models adds latency and cost, and it hasn’t been battle-tested in production.

We’ve placed CaMeL in Assess because it addresses prompt injection more rigorously than anything else we’ve seen, but production patterns haven’t emerged. Teams building agentic systems for regulated industries should study this paper.

See also: Agentic tool use, Visual computer use agents, OpenClaw

Hold

Not recommended for new projects; better alternatives exist.

Word2Vec & GloVe

We’ve placed both GloVe (Global Vectors for Word Representation) and Word2Vec (Word to Vector) in the Hold ring of our techniques quadrant. While these word embedding techniques were groundbreaking when introduced and served as fundamental building blocks for many NLP applications, they have been largely superseded by more advanced approaches.

These older embedding techniques, though computationally efficient, lack the contextual understanding that modern transformer-based models provide. Modern large language models and contextual embeddings such as BERT produce more nuanced representations that capture word meaning based on surrounding context, rather than the static embeddings that GloVe and Word2Vec generate. For new projects, we recommend exploring more recent embedding techniques (see “BERT Variants” in our Adopt ring) unless you have very specific constraints around computational resources or model size that make these older approaches necessary.

t-SNE

We’ve placed t-SNE (t-distributed Stochastic Neighbor Embedding) in the Hold ring of our techniques quadrant. While t-SNE was groundbreaking when introduced for visualising high-dimensional data in lower dimensions, particularly for understanding the internal representations of neural networks, we’re seeing its limitations become more apparent in modern AI workflows.

The core issue is that t-SNE can be misleading when interpreting AI model behaviour, as it prioritises preserving local structure at the expense of global relationships. This can lead teams to draw incorrect conclusions about their models’ decision boundaries and feature representations. We’re increasingly recommending alternatives such as UMAP (Uniform Manifold Approximation and Projection), which better preserves both local and global structure while offering superior computational performance. For projects requiring dimensionality reduction and visualisation of AI model internals, we suggest exploring these newer techniques rather than defaulting to t-SNE.

Zero-shot prompting

Zero-shot prompting, the practice of asking Large Language Models to perform tasks without examples or training, has been a quick way to get started with AI. However, we strongly recommend against using zero-shot prompts in production without appropriate guardrails and safety measures. We’ve heard of multiple incidents where unprotected prompts led to harmful or inappropriate outputs, potentially exposing organisations to significant risks.

Our view is that zero-shot prompting should always be combined with input validation and output filtering. While it can be valuable for prototyping and exploration, moving to few-shot prompting or fine-tuning with careful guardrails is a more robust approach for production systems. The current placement in Hold reflects our concern about organisations rushing to deploy unsafe prompt patterns rather than taking the time to implement proper controls.

Chain of thought (CoT)

Chain of Thought (CoT) has moved to Hold. While useful when it emerged, research from Wharton’s Generative AI Labs demonstrates diminishing returns: gains are rarely worth the time cost, and for reasoning models such as o3 and GPT-5.2, CoT prompting can decrease performance since step-by-step reasoning is already internalised at the architecture level.

For non-reasoning models, CoT still shows modest benefits on mathematical and symbolic reasoning tasks, but these are precisely the domains where better alternatives are emerging. Dedicated reasoning models handle them natively, while neurosymbolic architectures offer more reliable solutions by coupling LLMs with explicit reasoning engines.

The frontier of prompt engineering has moved to structuring problems effectively. Frameworks such as the 5 Whys and inversion now offer more value than CoT prompting. Step-by-step reasoning is now handled by the models and architectures rather than the prompts.

See also: Neurosymbolic AI

AI pull request review

AI’s code review capabilities have improved substantially. Developers accomplished at multi-turn AI conversations can now get valuable feedback across the full spectrum, from syntax issues through architectural patterns to subtle runtime concerns such as race conditions.

Yet we’ve kept AI Pull Request Review in Hold, for organisational rather than technical reasons. PR review isn’t just about finding errors; it’s a knowledge-sharing mechanism where senior developers mentor juniors and the team maintains awareness of how the codebase evolves. Teams who delegate review to AI often see a decline in collective code ownership.

We recommend using AI as a first-pass reviewer to catch issues before human review, but preserving the human step as deliberate practice for team alignment and knowledge transfer.

Get industry news, insights, research, updates and events directly to your inbox

Sign up for our newsletter