Software that enhances AI development workflows without being embedded in your application code: IDE extensions, CLI utilities, testing frameworks and observability solutions.

Adopt

Mature, well-supported tools with proven track records in production development workflows.

Software engineering copilots

AI-augmented development represents a permanent shift in software engineering. Teams not actively building capability here are falling behind.

The tooling falls into two categories. Model-agnostic interfaces let teams switch between providers: OpenCode stands out for its terminal experience and breadth of integration, Cursor, Windsurf and Zed are standalone editors, and CLI tools such as Aider and Cline work across providers. Provider-specific tools such as Claude Code, Gemini CLI and OpenAI Codex are optimised for their respective models. GitHub Copilot and Tabnine offer traditional IDE integrations.

Two approaches have emerged: free-form “vibe coding” and structured methodologies. Kiro offers both, with a conversational mode and a dedicated specs mode for drafting requirements before code generation. Cursor enables teams to codify standards through .cursorrules.

Senior engineers derive the greatest value, using AI for routine tasks whilst maintaining quality oversight. Junior developers often struggle to evaluate AI suggestions. Success correlates with intentional training around effective AI collaboration and a “trust but verify” mindset.

Provider-agnostic LLM facades

The LLM market evolves rapidly, making today’s optimal choice potentially outdated within months. We recommend implementing a facade pattern between your application and LLM providers, rather than building directly against specific APIs. This approach reduces vendor lock-in and enables easier testing of alternative models as they emerge. When considering whether to write your own code, be sure to consider tools such as the lightweight AISuite, Simon Willison’s LLM library and CLI tool, or heavyweight alternatives such as LangChain and LlamaIndex.

This recommendation reflects our team’s experience seeing projects hampered by tight coupling to specific LLM providers, and the subsequent maintenance burden when transitioning to newer, more capable models.

Notebooks

Notebooks are the de facto standard for data science and ML experimentation. Combining code execution with rich text and visualisations, they’re particularly valuable for AI/ML workflows where iterative exploration and documentation of model development are essential. We especially value how notebooks facilitate collaboration between technical and non-technical team members.

Jupyter is the most widely used, supporting multiple languages including Python and Julia. Cloud platforms provide their own implementations: Google Colab, AWS Sagemaker Notebooks, Azure Notebooks, Databricks Notebooks. Language-specific options include Pluto.jl for Julia, Clerk for Clojure and Polynote for Scala.

Trial

Promising tools with growing adoption that are worth exploring for teams building AI systems.

MLflow

MLFlow is a lightweight, modular open-source option for managing the machine learning lifecycle. It avoids the vendor lock-in of monolithic cloud-based MLOps platforms from AWS, Microsoft and Google, offering teams flexibility to maintain infrastructure control and adapt workflows as needs evolve.

Realising the benefits requires technical expertise to configure and integrate effectively. Unlike SageMaker or Vertex AI, MLFlow does not provide a plug-and-play experience; it offers modular components that must be tailored to specific use cases. We recommend it for organisations that value flexibility and have the proficiency to manage integrations.

Vector databases

Vector databases have emerged as specialised tools for managing the high-dimensional embeddings required by AI models. Prominent solutions include Pinecone, Qdrant, Milvus and Weaviate.

Their adoption should be carefully evaluated. Traditional databases may suffice for simpler operations, and alternative approaches such as Timescale’s PGAI vectorizer bring vector search directly into Postgres, avoiding the data consistency challenges of keeping embeddings synchronised across databases. If a dedicated vector database is required, Pinecone leads in production readiness but comes with managed service costs, while open-source alternatives such as Qdrant and Milvus offer greater control but demand more operational expertise.

For prototyping, Chroma offers a Python-first approach with minimal configuration. A 2025 Rust rewrite improved performance, though it remains best suited for small-to-medium scale applications. LanceDB takes a different approach as an embedded vector database, similar in philosophy to SQLite. It operates as a library within your application using Apache Arrow’s columnar format, making it compelling for local AI assistants and edge deployments where data must remain on-device. The trade-off is limited high-concurrency support.

Local model execution environments

Tools such as Ollama, LM Studio and AnythingLLM provide accessible ways to run open weight models on local hardware. These enable experimentation with models from Meta, Mistral, DeepSeek, Alibaba and OpenAI without API costs or sending data to external services. Many now support tool calling via MCP and connections to commercial APIs for hybrid workflows.

These tools serve developers testing AI features, teams comparing model responses and organisations exploring capabilities with sensitive data that cannot leave their infrastructure. They range from CLI tools such as Ollama to graphical applications such as LM Studio.

We’ve placed these in Trial as they’re particularly useful for privacy-sensitive prototyping and scenarios where extensive experimentation would be cost-prohibitive via APIs.

LLM observability tools

Modern agentic builds involve multi-step reasoning, tool orchestration, RAG retrieval and chains of LLM calls where a single user request might trigger dozens of internal operations. Debugging why an agent produced an unexpected result requires visibility into every step of that chain. This is distinct from production AI monitoring, which focuses on drift detection in deployed systems.

Phoenix, from Arize AI, has emerged as a leading open-source option. Built on OpenTelemetry, it provides tracing and evaluation with auto-instrumentation for LangChain, LlamaIndex, DSPy and direct integrations with OpenAI, Anthropic and AWS Bedrock. Langfuse is the most popular fully open-source alternative (MIT licence), combining tracing and evaluation with strong multi-turn conversation support.

For LangChain-committed teams, LangSmith provides native integration that surfaces framework internals in debugging views. Helicone takes a lightweight proxy approach: route API calls through its endpoint for observability without SDK changes. Since these tools capture prompt and response data, data sovereignty matters. Phoenix and Langfuse both offer self-hosting for teams with data residency requirements.

LLM API gateways

As organisations adopt multiple model providers, an infrastructure layer emerges between applications and providers. LLM API gateways handle routing, caching, failover, rate limiting and cost tracking at the proxy level, complementing code-level abstraction from libraries such as AISuite. A facade is a developer choice about how to call models; a gateway is a platform decision about how to manage model traffic across the organisation.

LiteLLM is the most widely adopted open-source option, with SOC-2 Type 2 and ISO 27001 certification. It provides an OpenAI-compatible proxy with spend tracking, budget controls and key management. Portkey offers a managed alternative with semantic caching and conditional routing. Kong brings enterprise API management experience to LLM traffic.

The operational value for multi-provider deployments is clear: centralised audit logging, per-team budget enforcement and automatic failover. LiteLLM can be self-hosted; Portkey’s managed service routes traffic through its infrastructure. Teams running multi-provider deployments should evaluate whether a gateway simplifies their operational story before building equivalent functionality in-house.

AI red teaming tools

The radar’s security coverage has grown this quarter, with guidance on prompt injection in our MCP and RAG entries and architectural defences such as CaMeL. What has been missing is offensive testing: tools that systematically probe AI systems for vulnerabilities before attackers do.

Promptfoo is the standout, evolving from a prompt evaluation CLI into a comprehensive red teaming platform. OpenAI acquired Promptfoo in March 2026, though it remains open-source under MIT. It ships dedicated financial services plugins for PCI DSS and banking regulation testing. Microsoft’s PyRIT focuses on multi-turn attack orchestration with Azure AI Foundry integration. NVIDIA’s Garak takes an agentic approach, autonomously probing for prompt injection, data leakage and toxicity.

The EU AI Act will require adversarial testing for high-risk systems by August 2026. We recommend starting with Promptfoo for breadth and CI/CD integration. The OpenAI acquisition raises questions about neutrality, but the MIT licence provides a backstop.

AI-assisted code migration

Large-scale code migration sits in a gap between what copilots do (assist line-by-line) and what bootstrappers do (generate new projects). Migration tools operate at codebase scale, applying thousands of coordinated changes to upgrade language versions, swap frameworks or modernise APIs. The pattern that works best combines deterministic code transformations with AI for edge cases that rules alone cannot handle.

Moderne is the leading platform, built on the open-source OpenRewrite engine. OpenRewrite provides deterministic “recipes” for common transformations (Java version upgrades, Spring Boot migrations, Jakarta EE transitions), while Moderne adds AI-assisted handling of non-standard patterns and enterprise-scale orchestration. AWS has integrated OpenRewrite into Q Code Transformation for Java modernisation, and GitHub Copilot uses it for automated dependency upgrades.

Java 8 reached end of public updates in 2022, Java 11 follows in 2026 and Spring Boot 2.x is end-of-life. These migrations are well-understood but labour-intensive, exactly where deterministic transformation augmented by AI pays off. We recommend trialling OpenRewrite on a representative repository before committing to the Moderne platform.

Assess

Emerging tools that require careful evaluation before adoption.

AI application bootstrappers

AI application bootstrappers generate complete applications from prompts or designs. Lovable (formerly GPT Engineer) has emerged as a leader alongside V0, Bolt.new and Replit Agent. Google entered the space with Firebase Studio. These tools can take projects from concept to working application in hours.

Capabilities are improving rapidly. Lovable’s visual editor allows Figma-like manipulation with automatic code updates. V0 excels at production-ready React components. Bolt.new runs full-stack development in the browser.

However, success still correlates strongly with existing engineering expertise. Senior developers use them as accelerators, understanding how to refactor generated code. Teams without this expertise risk shipping code they cannot maintain or debug. The gap between “working demo” and “production-ready system” remains substantial.

We recommend these primarily for prototyping and proof-of-concept work, with clear separation from production codebases unless your team has the engineering depth to take ownership of generated code.

Visual computer use agents

AI agents that interact with computers through visual understanding have matured but remain risky. Claude Computer Use controls desktops and browsers by seeing the screen and reasoning about interface elements. OpenAI Operator focuses on web browser automation through a managed environment. Browser Use offers an open-source alternative across multiple providers.

Reliability for bounded tasks has improved, with standard office workflows seeing success rates in the high 80s. However, prompt injection attacks, where malicious instructions hidden on web pages hijack agent behaviour, represent a systemic vulnerability. OpenAI has acknowledged this problem “may never be fully solved”.

For many automation needs, programmatic approaches via APIs and workflow automation platforms remain more reliable and secure. Visual computer use is best suited to isolated environments where the agent cannot access sensitive data. Teams should grant minimal permissions and maintain human oversight for high-stakes actions.

Lakera

Lakera was acquired by Check Point Software for approximately $300M in November 2025. Lakera Guard, its core AI safety scanning product, is being integrated into Check Point’s CloudGuard WAF as part of a broader application security offering. The underlying capability of scanning LLM inputs and outputs for prompt injection, toxic content and data leakage remains relevant, but the product context has changed substantially.

Technical limitations from our earlier evaluation still apply: scanning is text-only with no multimodal support, custom rules rely on regex patterns rather than context-aware analysis, and scanning is non-stateful with no awareness of conversation history. Teams evaluating Lakera should now assess it as part of the Check Point ecosystem rather than as a standalone product.

Structured output libraries

Libraries such as Instructor, Outlines and Marvin address a common challenge: LLMs naturally produce freeform text, but applications need structured data. These libraries constrain outputs to match specified structures through prompting, logit manipulation or grammar-based generation. Instead of hoping an LLM produces valid JSON, developers specify Pydantic models and receive guaranteed-valid objects. For agentic systems this is essential, as agents need to produce function calls and decision objects that downstream code can reliably process.

The space is evolving quickly. Instructor has gained traction for its simplicity and Pydantic integration, while Outlines offers more sophisticated constrained generation. Native structured output features from model providers (OpenAI’s JSON mode, Anthropic’s tool use) may reduce the need for external libraries in some scenarios.

A broader category of runtime guardrails has grown up alongside these libraries. NVIDIA’s NeMo Guardrails and Guardrails AI go beyond schema conformance to include prompt injection scanning, content filtering and hallucination checks. Teams building production LLM applications should evaluate both levels: structured outputs for type safety and guardrails for content safety.

Hold

Not recommended for new projects due to better alternatives or limited long-term viability.

OpenClaw

OpenClaw is an open-source agent runtime created by Peter Steinberger, who later joined OpenAI. It runs persistent, always-on AI agents that execute multi-step tasks by controlling computers: clicking, typing, navigating applications, browsing the web. OpenClaw supports Claude, DeepSeek and OpenAI as backends, and has spawned a wave of imitators. NVIDIA’s NemoClaw wraps OpenClaw in the NVIDIA Agent Toolkit with sandboxed execution against Nemotron 3 Super models. Moonshot’s KimiClaw runs natively on kimi.com with a community skill marketplace and persistent cloud memory.

The security model has not been figured out across any of these variants. Persistent agent runtimes grant broad computer access to AI agents processing potentially untrusted instructions. The same prompt injection vulnerabilities that affect all visual computer use agents apply here, amplified by the breadth of access and the always-on character of the deployment. An agent with permission to control your browser, email and file system has an enormous blast radius if compromised.

We do not recommend OpenClaw or its variants for new projects until the security model matures. Teams already using one should enforce strict sandboxing, limit accessible applications and maintain human oversight for actions involving sensitive data.

Conversational data analysis

Tools such as pandas-ai, tablegpt, promptql and Julius enable natural language querying of databases. Modern MCP servers can provide substantial context to models, including schema understanding and data contents. Our experience with JUXT’s XTDB revealed remarkable moments where models traversed complex table structures with apparent ease.

For experienced analysts, these tools represent a meaningful productivity boost, converting natural language into draft queries that can be refined. However, generated queries can be inefficient or incorrect despite appearing plausible. Uber’s QueryGPT demonstrates both the potential and complexity, highlighting the guardrails required for reliable results.

We’ve placed this in Hold because successful deployment requires users capable of understanding and validating generated queries. These tools offer substantial benefits for data teams with appropriate expertise, but should be approached cautiously by those unable to review AI-generated database queries.

Tools