6 Mar 2026

AI Starts With Data: How to Prepare Your Organisation for AI Success

Most organisations begin their AI journey focused on tools, models, and use cases. The ones that succeed start somewhere else entirely: with the quality, structure, and accessibility of their data. Without that foundation in place, even the most capable AI systems will underperform.
Matt Wicks | 10 min read

There is a persistent assumption in enterprise AI conversations that success is primarily a technology question. Choose the right model, select the right platform, and the results will follow. The organisations that have learned this lesson the hard way know that this assumption is wrong.

The real foundation of every successful AI initiative is not the AI itself. It is the data that feeds it. Before any organisation can meaningfully benefit from AI, it needs to understand what data it holds, what condition that data is in, and how it needs to be structured and governed to support the outcomes it is trying to achieve. That discipline has a name: AI data readiness. And most organisations are further behind on it than they realise.

Why Data Readiness Is Still the Biggest AI Challenge

The Garbage-In, Garbage-Out Problem Has Not Gone Away

Data engineers and analysts have understood for decades that the quality of outputs depends entirely on the quality of inputs. That principle has not changed with the arrival of large language models and generative AI tools. If anything, it has become more important.

According to a Q3 2024 survey of data management leaders published by Gartner, 63% of organisations either do not have or are unsure whether they have the right data management practices in place for AI. The same research predicts that through 2026, organisations will abandon 60% of AI projects that are not supported by AI-ready data. These are not hypothetical risks. They reflect what is already happening across enterprise AI programmes.

Informatica's CDO Insights 2025 survey of 600 data leaders identified the top obstacles to moving GenAI initiatives from pilot to production as data quality and readiness (43%), lack of technical maturity (43%), and skills shortages (35%). Data problems consistently rank first, above model selection, compute costs, and every other technical consideration. Separately, two-thirds of those surveyed reported that they had been unable to transition even half of their GenAI pilots to production.

The reason is straightforward: AI is only as good as what it learns from. A well-architected model trained on poor-quality data will produce unreliable results. A simpler model trained on clean, well-labelled, well-structured data will outperform it every time.

What Poor Data Readiness Actually Looks Like

The problem is rarely a complete absence of data. Enterprises typically have substantial volumes of it. The issue is that this data exists in a condition that makes it difficult or impossible for AI systems to use effectively.

Common symptoms include fragmented data sources spread across disconnected systems, inconsistent formats where the same information is stored differently depending on which platform recorded it, poor or absent data labelling, and large repositories of unstructured content that has never been organised for analytical use. Each of these problems compounds the others. An AI system asked to draw on this kind of data does not produce nuanced output. It produces noise.
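To make these symptoms concrete, here is a minimal sketch of the kind of automated readiness audit a data team might run over merged records. Everything here is invented for illustration: the records, field names, and checks are hypothetical, and a real audit would cover far more dimensions.

```python
import re

# Invented records merged from two systems that store the same customer
# differently -- the kind of fragmentation described above.
records = [
    {"customer": "Acme Ltd", "signed": "2024-03-06", "segment": "enterprise"},
    {"customer": "acme ltd.", "signed": "06/03/2024", "segment": None},
    {"customer": "Bright Co", "signed": "2024-01-15", "segment": "smb"},
]

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def audit(rows):
    """Return simple readiness signals: missing labels, format drift, duplicates."""
    missing_labels = sum(1 for r in rows if not r.get("segment"))
    bad_dates = sum(1 for r in rows if not ISO_DATE.match(r["signed"]))
    # Naive duplicate check: strip case and punctuation from the name.
    keys = [re.sub(r"[^a-z0-9]", "", r["customer"].lower()) for r in rows]
    duplicates = len(keys) - len(set(keys))
    return {"missing_labels": missing_labels,
            "bad_dates": bad_dates,
            "duplicates": duplicates}

print(audit(records))  # {'missing_labels': 1, 'bad_dates': 1, 'duplicates': 1}
```

Even a toy audit like this surfaces all three problems at once: the same customer recorded twice under different spellings, two date formats for the same field, and a missing label.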

A 2024 survey of nearly 4,000 business leaders and technology practitioners conducted on behalf of Capital One found that 73% of business leaders ranked data quality as one of their top concerns in AI initiatives, second only to data security. Despite this, only 53% believed their organisation actively prioritised data management to mitigate that risk. That disconnect, between recognising the problem and acting on it, is exactly where AI projects run into trouble.

Two Ways Enterprises Use Data to Power AI

Understanding the Architectural Choice

Once an organisation begins to address its data foundations, it faces a fundamental architectural question: how should its data actually be used by AI systems? There are two primary approaches, and the right choice depends on the organisation's objectives, its data maturity, and its timeline.

Option 1: Training Custom AI Models

The first approach involves training a bespoke AI model directly on the organisation's own data. In this model, the AI learns patterns, relationships, and behaviours from large volumes of historical data specific to the organisation. The result is a model with deep embedded knowledge of the business.

The potential benefits are significant. A custom-trained model can support advanced analytics, predictive modelling, and domain-specific tasks that general-purpose models cannot replicate. Use cases include customer behaviour analysis, demand forecasting, operational anomaly detection, and complex risk modelling.

However, this approach carries significant prerequisites. It requires large volumes of data that are clean, consistently labelled, and appropriately structured. It requires mature data infrastructure capable of supporting the training process at scale. And it requires meaningful data consulting and engineering investment before any model work begins. For organisations whose data foundations are still being built, attempting to train custom models prematurely is one of the most common and costly mistakes in enterprise AI adoption.

Option 2: Retrieval-Augmented Generation

The second approach, and the one that offers a more accessible entry point for most enterprises, is Retrieval-Augmented Generation, commonly referred to as RAG.

Rather than embedding organisational knowledge into a model through training, RAG works by connecting a pre-trained AI model to external data sources at the point of use. When a user asks a question, the system retrieves the most relevant information from the organisation's knowledge base and supplies it to the model as context. The model generates a response grounded in that specific, retrieved content rather than relying solely on what it learned during training.
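The retrieve-then-generate loop can be sketched in a few lines. The snippet below is illustrative only: it uses plain word overlap as a stand-in for a production embedding-based retriever, and the documents, filenames, and prompt wording are invented.

```python
# Hypothetical internal knowledge base -- in practice this would be a
# document store or vector index, not an in-memory dict.
KNOWLEDGE_BASE = {
    "hr-policy.md": "Employees accrue 25 days of annual leave per year.",
    "supplier-notes.md": "The Northwind contract renews every March.",
    "it-faq.md": "Password resets are handled via the self-service portal.",
}

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by shared words with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

def build_prompt(question: str) -> str:
    """Assemble the context the model sees at the point of use.

    Because retrieval happens at query time, the model needs no retraining
    when documents change. In production, this prompt would be sent to an LLM.
    """
    context = "\n".join(KNOWLEDGE_BASE[name] for name in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When does the Northwind contract renew?"))
```

The key property is visible even in this toy version: updating `KNOWLEDGE_BASE` immediately changes what the model is grounded in, with no training step involved.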

As IBM explains, RAG extends the capabilities of large language models to specific domains or an organisation's internal knowledge base without the need to retrain the model. It is a cost-effective approach to improving output so it remains relevant, accurate, and useful across different contexts.

The practical implications are significant for organisations at an early stage of AI data readiness. RAG does not require a fully mature data infrastructure before deployment. It can work with existing data environments, including document repositories, internal knowledge bases, and structured records, while the broader work of data engineering and quality improvement continues in parallel.

RAG as a Practical Bridge to AI Adoption

How RAG Works in an Enterprise Context

Microsoft's guidance on RAG architecture describes it as an AI framework that retrieves relevant information from external sources to inform and enhance the generation of responses. The dual capability of retrieval and generation allows RAG systems to produce more informed and reliable outputs than purely generative models working from training data alone.

In practical terms, RAG operates like a highly intelligent search and synthesis process. A member of the legal team asks a question about a specific contractual clause; the system retrieves the relevant documents, identifies the pertinent sections, and provides a grounded answer with references. An operations manager asks about the status of a supplier relationship; the system draws from current records rather than generalised knowledge. The AI does not guess. It retrieves and contextualises.

Why Many Enterprises Begin With RAG

The ability to deploy AI capabilities without waiting for full data maturity is the main reason many organisations begin their AI data strategy with RAG. It allows businesses to demonstrate AI value to internal stakeholders relatively quickly, build organisational confidence in AI-assisted workflows, and continue improving data quality in a structured way, without those improvements being a prerequisite for initial deployment.

It also provides something that custom-trained models do not always offer at deployment: transparency. Because RAG systems retrieve and cite specific sources, the basis for any AI-generated response can be traced and verified. That is a meaningful advantage in regulated industries and in any context where accountability for AI outputs matters.
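One way to picture that traceability: a RAG response can carry the identifiers of the documents it was grounded in, so any answer can be audited after the fact. The sketch below is hypothetical; the class, answer text, and source names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class GroundedAnswer:
    """A generated answer that keeps references to its retrieved sources."""
    text: str
    sources: list[str] = field(default_factory=list)

    def audit_trail(self) -> str:
        # Render the answer alongside its provenance for review.
        return f"{self.text} [sources: {', '.join(self.sources)}]"

answer = GroundedAnswer(
    text="The indemnity clause caps liability at 12 months of fees.",
    sources=["msa-2023.pdf#clause-9", "legal-playbook.md"],
)
print(answer.audit_trail())
```

A custom-trained model offers no equivalent: its knowledge is diffused through its weights, with no per-answer record of where a claim came from.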

Choosing the Right AI Data Strategy

Key Questions to Consider

The choice between RAG, custom model training, or a combination of both is not a binary technical decision. It is a strategic one, shaped by several interconnected factors.

Data strategy and governance is the most important starting point. Organisations with fragmented, inconsistently formatted, or poorly labelled data are not yet ready for custom model training. RAG can provide value while that maturity is being developed. Organisations with well-governed, structured, and high-volume data may be ready to invest in custom model development for specific high-value use cases.

Business objectives also shape the decision. Organisations looking to improve knowledge retrieval, accelerate internal search, or support frontline staff with faster access to accurate information are natural candidates for RAG-first approaches. Organisations seeking deep predictive insights, behavioural modelling, or advanced forecasting may ultimately require custom-trained models with sufficient data volume to support that level of sophistication.

Implementation timelines and cost constraints matter too. Custom model training is a long-horizon investment. RAG can deliver meaningful results within shorter timeframes and with lower initial infrastructure requirements.

The Case for a Blended Approach

For most enterprises, the most effective long-term AI data strategy combines elements of both approaches. RAG provides early value and operational capability while data engineering work matures the foundations required for more advanced modelling. Custom training becomes viable and worthwhile once that foundation is in place. These are not competing choices. They are sequential stages of a coherent strategy.

Why Data Strategy Is an AI Strategy

The Foundational Principle

Many organisations still approach AI as primarily a technology procurement decision. They evaluate tools, compare vendors, and select platforms without first making an honest assessment of their data readiness. This sequencing is back to front, and it is a significant reason why AI project abandonment rates remain high.

The Virtual Forge works with enterprises to start that conversation where it should begin: with the data. What does the organisation actually hold? How is it structured? Where are the quality gaps? What governance exists around it? The answers to those questions determine which AI approaches are viable, on what timescale, and with what investment.

Building the Infrastructure That AI Needs

Successful enterprise AI data preparation is not a one-time project. It requires ongoing data engineering, quality monitoring, data architecture planning, and the kind of strategic thinking that aligns data infrastructure with evolving AI objectives. As noted in this strategic guide to implementing AI in business, sustainable AI adoption requires building the right foundations before scaling deployment. That principle is consistent with what the data shows: per Gartner's February 2025 research, organisations that lack AI-ready data are not simply slower to succeed with AI. They abandon their projects entirely.

The organisations that are extracting real, sustained value from AI share a common characteristic: they invested in their data foundations before they scaled their AI ambitions. They built pipelines. They cleaned and labelled datasets. They established governance over how data is produced, stored, and accessed. They treated data readiness as a strategic priority rather than a precondition to be addressed later.

Those that skipped this stage are disproportionately represented in the failure statistics. The technology was capable. The data was not ready.

Moving Forward

AI has advanced dramatically, and it will continue to do so. The underlying principle has not changed and will not: good AI starts with good data.

Organisations that invest in AI data readiness, whether through structured data engineering pipelines, RAG architectures, custom model training, or a phased combination of all three, will see faster deployment, more reliable outputs, and more durable competitive advantage from their AI initiatives.

Those that treat data readiness as an afterthought will build on foundations that cannot support the weight of what they are trying to achieve.

Preparing your data for AI is the most important step in any AI initiative. If you are exploring how to structure your data and AI architecture effectively, our AI strategy and data services team can help you design the right approach for your organisation and the outcomes you are working towards.
