AI and Data at Enterprise Scale
There is a funny thing about the way the major AI service providers address the enterprise market. Have you noticed that they love to talk about workflows and agents, but somehow treat data access as an exercise for the user? When you consider that enterprises have thousands of different users needing to access a myriad of data sources for various purposes, it's not hard to see that intelligent data access at enterprise scale is a significant gap that an increasing number of IT organizations will have to address.
Yes, Anthropic introduced MCP and skills, but each of these is still a protocol that can be used for data access, not a solution for making AI models understand enterprise data better and at scale. Neither MCP nor skills makes integrating AI into existing enterprise operations any easier. If anything, enterprise IT teams are faced with a growing variety of disparate data access tools and methods, few of which offer the depth of security or usage guardrails necessary to manage the new risks these tools create.
What I want to explore today is the significant impact that scale will have on the integration between AI (including agents) and enterprise data. Not simply the scale of actions taken (aka throughput), but scale in the sense of the numerous individuals, teams, agents, applications, and more making independent decisions about what data to use when, for what purpose.
Enterprise Data and AI Today
AI’s introduction into the enterprise has been somewhat rocky, in no small part due to concerns about safety and security. Will this model consistently return answers to prompts we depend on to make key decisions? Can we identify the user that initiated a call to a database through a request to an agent? Can we limit the movement of data when necessary in a global workflow? All of these questions and more must be addressed—and proven—before an enterprise can trust AI as a key part of its operations.
MCP and skills, as I mentioned earlier, are incredibly useful tools for experimenting with AI, but they are not currently built for what I like to call “enterprise scale” use cases—situations in which hundreds or thousands of independently owned and executed processes are vying for the same resources. This is a concurrency problem that, unlike typical throughput issues, cannot be managed and controlled by a single team without careful architectural planning.
The entire development platform market is built on this principle. Providing throughput, consistent results, and common guardrails to hundreds or thousands of developers requires an architecture built to enable them. It is also true of relational databases and other formal data stores. Providing common storage, retrieval, and manipulations of enterprise data is key to allowing a host of applications, services, and (now) agents to safely and securely share that data.
In the same vein, our customers are faced with the dilemma of how to enable hundreds, thousands, or perhaps even millions of independent AI agents to safely and securely access enterprise data. Allowing each one to choose its own MCP tools, skills, or SDKs to directly access that data simply does not scale.
How Could AI Work Better With Enterprise Data?
The solution isn’t to ban MCP or any other method of accessing data, but to provide a common architecture in which that task is surrounded by capabilities that both improve access and provide guardrails to prevent its misuse.
In the world of AI, improving access doesn’t just mean providing the right hooks where data can be retrieved and injected into inference. It also means giving models more context about what data means and where it resides, and more agency in determining what data needs to be retrieved from relevant sources. Both efficiency and accuracy require building contextual support to minimize the amount of “slop” generated by a data access request—slop that can lead to hallucinations or misleading conclusions.
So, the platform that would best enable safe and secure AI access to enterprise data would need to provide the following:
- An understanding of what entities the business tracks, and how those entities are related
- The ability to find the right data despite semantic differences between prompts and the terminology used by the business
- Knowledge of where data about those entities resides, in what form, and how to best access it
- An understanding of how to restrict access based on the relationship between users, agents, organizations, workflows/processes, and any of the entities that the data represents
- Complete provenance of how answers were generated and decisions made based on that data
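To make the first three capabilities concrete, here is a minimal sketch of an entity registry that maps business entities to aliases, data sources, and relationships. All names here (`EntityRegistry`, `resolve`, the `Invoice` entity, the `erp.invoices` source) are hypothetical, not any real product's API.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                                    # business entity, e.g. "Invoice"
    aliases: set = field(default_factory=set)    # semantic synonyms seen in prompts
    sources: list = field(default_factory=list)  # where data about it resides
    related: set = field(default_factory=set)    # names of related entities

class EntityRegistry:
    """Hypothetical registry of what the business tracks and where it lives."""

    def __init__(self):
        self._entities = {}

    def register(self, entity: Entity):
        self._entities[entity.name] = entity

    def resolve(self, term: str):
        """Map a prompt term to a business entity despite naming differences."""
        term = term.lower()
        for e in self._entities.values():
            if term == e.name.lower() or term in {a.lower() for a in e.aliases}:
                return e
        return None

registry = EntityRegistry()
registry.register(Entity("Invoice", aliases={"bill", "receivable"},
                         sources=["erp.invoices"], related={"Customer"}))

match = registry.resolve("bill")  # resolves the colloquial term to "Invoice"
```

A real implementation would back this with an ontology and embedding-based matching rather than exact alias lookup, but the shape of the capability is the same.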
For example, an agent may use a prompt that doesn’t specify the exact data sources to use, or the specific tools needed to access that data. The process of retrieving the right enterprise data for that prompt might look something like this:
- A user or agent sends the prompt to an AI model
- The model determines that external data is required to complete inference
- The model initiates a retrieval process, which may employ multiple complementary techniques: vector search (semantic similarity), classic text search, and ontology-oriented graph traversal to identify relevant entities and their relationships
- Before any data is retrieved, the process verifies that the requesting user and/or agent is authorized to access the identified data sources
- The retrieval process may choose to expand the query — determining that related entities or additional context are needed — with each expansion subject to the same authorization checks before data is accessed
- The retrieval process may also perform self-inquiry to validate that the results actually address the original prompt
- The model completes inference and returns the result to the user or agent. Only the data determined to be relevant and authorized has been used by the model as context to generate the result
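The steps above can be sketched in code. This is an illustrative skeleton only: `authorize`, `search`, and the `ALLOWED`/`CORPUS` tables are stand-ins for real policy engines and retrieval backends.

```python
def authorize(principal, source):
    # Stand-in policy check; a real system would consult access policies.
    return source in ALLOWED.get(principal, set())

def search(query, source):
    # Stand-in for vector, text, or graph retrieval against one source.
    return [doc for doc in CORPUS.get(source, []) if query in doc]

def retrieve(principal, query, sources, expansions=()):
    context = []
    for source in sources:
        # Authorization is verified before any data leaves a source.
        if not authorize(principal, source):
            continue
        context += search(query, source)
    # Query expansion: related terms are subject to the same checks.
    for extra in expansions:
        for source in sources:
            if authorize(principal, source):
                context += search(extra, source)
    # Self-inquiry stand-in: keep only results that address the query.
    return [c for c in context
            if query in c or any(e in c for e in expansions)]

ALLOWED = {"agent-7": {"crm"}}                       # agent-7 may read crm only
CORPUS = {"crm": ["acme renewal notes", "acme billing dispute"],
          "finance": ["acme invoice totals"]}

results = retrieve("agent-7", "acme", ["crm", "finance"])
# finance data is silently excluded: the agent is not authorized for it
```

Note that the authorization check sits inside the retrieval loop, so even query expansion cannot pull in data the caller has no right to see.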
Of course, this is only one pattern for data retrieval when the prompt doesn’t indicate exactly what data is needed. Data can also be retrieved programmatically and added to the prompt by the user or agent. Tools, skills, or traditional RAG methods can also be used to retrieve the data more directly during inference.
But in every case, the enterprise wants to ensure that data is only accessed when the requesting entities (which can include the user or an agent/application acting on the user’s behalf) have the right to access that data. This means that even tools, skills, and other programmatic means of accessing enterprise data should be run through access control policies that address these concerns.
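One way to route every tool invocation through policy, regardless of whether it originates from MCP, a skill, or an SDK, is a gating wrapper. The policy table and tool names below are hypothetical.

```python
from functools import wraps

# Hypothetical policy table: (principal, tool) -> allowed?
POLICY = {("alice", "query_sales_db"): True,
          ("alice", "export_pii"): False}

class AccessDenied(Exception):
    pass

def policy_gated(tool_name):
    """Decorator that checks access policy before any tool runs."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(principal, *args, **kwargs):
            if not POLICY.get((principal, tool_name), False):
                raise AccessDenied(f"{principal} may not call {tool_name}")
            return fn(principal, *args, **kwargs)
        return wrapper
    return decorator

@policy_gated("query_sales_db")
def query_sales_db(principal, region):
    # Stand-in for a real data access tool.
    return f"sales rows for {region}"

result = query_sales_db("alice", "EMEA")  # permitted by policy
```

The key property is that the check is enforced at the access layer, not left to each agent's good behavior.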
Securing AI Data Access At Enterprise Scale
Which brings me to another challenge that enterprises face when AI is used at scale across their organizations: how do you manage the sheer complexity of different conditions that determine whether access should be granted?
Role-based access control served the industry well when the primary concern was granting humans access to digital systems. Applications could be designed to pass the user’s identity through to backend systems so that the role of the “actor” could be determined directly.
However, agentic systems challenge this model to the point it just isn’t enough anymore. Agents are almost like employees themselves, with their own set of functions and restrictions that often differ from those of the users on whose behalf they are acting. Attribute-based access methods can adjust to this reality, but maintaining accurate representation of a user’s tags and other attributes gets extremely unwieldy when there are dozens or even hundreds of attributes to consider. And even then, the key “attributes” may not be values at all, but relationships between people, projects, data entities, and so on.
Enter Relationship-Based Access Control. ReBAC relies on understanding the relationships between entities and concepts in your organization, and establishes policy based on those relationships. For example, policies can be set to limit access to sensitive data surrounding, say, a financial report to anyone with specific clearance working on the report during a specific time period. All of this is driven by a living ontology that, with human guidance, continuously evaluates business data to surface these kinds of relationships.
ReBAC doesn’t outright replace RBAC or ABAC options, but it is significantly easier to manage at scale, since policies adapt as the relationships change (as well as when humans determine they should change independent of relationships). The work of keeping access aligned with policy is significantly more automated when the ontology is maintained by AI every moment of every day.
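A minimal ReBAC check can be sketched as a path search over relationship tuples, in the spirit of relationship-graph access models. The subjects, relations, and resources below are invented for illustration.

```python
# (subject, relation, object) tuples; a real system would store millions.
RELATIONS = {
    ("dana", "member_of", "audit-team"),
    ("audit-team", "assigned_to", "q3-financial-report"),
    ("q3-financial-report", "contains", "q3-revenue-data"),
}

def related(subject, obj, max_hops=4):
    """Breadth-first search: is there a relationship path from subject to obj?"""
    frontier, seen = {subject}, set()
    for _ in range(max_hops):
        nxt = set()
        for s, _rel, o in RELATIONS:
            if s in frontier:
                if o == obj:
                    return True
                nxt.add(o)
        seen |= frontier
        frontier = nxt - seen
        if not frontier:
            break
    return False

def can_access(user, resource):
    # Policy sketch: access is granted only via a relationship path.
    return related(user, resource)
```

Here `dana` can reach `q3-revenue-data` through team membership and report assignment; someone with no path to the resource is denied, with no role or attribute bookkeeping required.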
Provenance at Scale
The final problem I want to comment on today is what we like to call “action provenance”: the tracking of data demonstrating what went into an AI decision, execution of a task, or evaluation of the results of an action. This has the potential to be a data tracking problem that mirrors what we see from observability, Real User Monitoring, and other detail-oriented tracking mechanisms.
However, AI systems can learn a lot from observability. I suspect we will start with tracking key signals, such as when data access occurred (including what data was accessed by whom), where policy was applied (including authentication/authorization, prompt refusal, result guardrails, and so on), and other easily identified indicators of AI action.
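A provenance signal of this kind could be captured as a compact, structured event, much like an observability log line. The field names below are illustrative assumptions, not a standard schema.

```python
import json
import time

def provenance_event(actor, action, resource, policy_decisions, outcome):
    """Build one compact, append-only provenance record."""
    return {
        "ts": time.time(),            # when the action occurred
        "actor": actor,               # user or agent identity
        "action": action,             # e.g. "data_access", "inference"
        "resource": resource,         # what data was touched
        "policy": policy_decisions,   # which guardrails fired, and how
        "outcome": outcome,           # allowed / refused / flagged
    }

event = provenance_event(
    actor="agent-42",
    action="data_access",
    resource="erp.invoices",
    policy_decisions={"authz": "granted", "prompt_guardrail": "passed"},
    outcome="allowed",
)
line = json.dumps(event)  # ship to an append-only store, like any telemetry
```

Recording only these decision points, rather than full request payloads, keeps storage proportional to actions taken instead of tokens processed.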
From there, we might go on to capture everything we can about requests, inference, and result generation, but that will generate immense amounts of data that must be stored for significant periods of time “just in case”.
Alternatively, we can evolve the core signals to highlight just what is needed to prove how a decision or other action was taken. Focus on proof of AI action, rather than debugging the inner workings of AI processing. This will scale much better, and satisfy the key need for this kind of data: evidence that can be used to defend such decisions or actions, or begin the process of correcting them if they are undesirable.
All of this should be tied into the way enterprise data is accessed, evaluated, and consumed by AI agents and applications.
Why a Platform Is the Solution
As you can see, there is a lot to coordinate here, especially when you have the enterprise scaling problems we are discussing today. How does an enterprise ensure that data can be accessed consistently (and with excellent performance) across a vast array of independent agents and other uses? How can it also ensure that risks to the enterprise are minimized, and that behavior is provable when things go wrong?
Kamiwaza addresses this in a single comprehensive AI orchestration platform focused on providing a layer between enterprise data and AI that simultaneously secures data in an enterprise setting while giving AI the context it needs to use that data safely, wisely, and with the performance necessary to keep up with the business. Learn more about how Kamiwaza addresses all of these challenges at https://www.kamiwaza.ai/. Feel free to reach out to us there, or in the comments below. We look forward to showing you how Kamiwaza is the one AI orchestration platform that lets you use all of your existing systems of record at enterprise scale.
As always, I write to learn, and look forward to hearing your thoughts and questions.