Corvus
Landscape Market · 4aecfec7

Data Infrastructure and Analytics Platform Market

The global market for the platforms, tools, and managed services that ingest, store, process, govern, and analyze enterprise data — spanning data warehouses, data lakes/lakehouses, streaming/integration, BI and analytics, data-science platforms, and vector/AI data stores.

Includes hyperscaler-native data platforms (AWS, Azure, Google Cloud), independent data-cloud platforms (Snowflake, Databricks), the BI/analytics layer (Microsoft Power BI, Tableau, Looker, Qlik), specialty databases and streaming platforms (MongoDB, Confluent/Kafka, ClickHouse, vector DBs), and the enterprise-software incumbents (Oracle, IBM, SAP, Teradata). Excludes pure-play GenAI model providers and consumer analytics (web analytics, tag managers, etc.) except where they consume Data Infrastructure and Analytics Platform (DIAI) infrastructure.

Completed
2026-06-16 00:00 UTC

Bottom Line Up Front

The Data Infrastructure and Analytics Platform (DIAI) market is in a structural multi-year growth phase driven by generative-AI workloads — independent third-party forecasts converge on a 15-17% CAGR and a 2029-2030 magnitude in the $700-880 billion range. Competitive structure is consolidating around three hyperscalers (AWS, Microsoft Azure, Google Cloud) and two large independents (Snowflake, Databricks), with Apache Iceberg emerging as the de facto open table format. SAP's May 2026 Dremio + Prior Labs acquisitions and Microsoft Fabric's explicit anti-Snowflake/Databricks positioning at Build 2026 signal another wave of platform-stack consolidation through 2027.

§ 01

What it is

DIAI is the layer of enterprise software that turns raw operational data into governed, queryable, and AI-usable assets. It covers (i) data warehouses — centralized integrated repositories for reporting and BI (Snowflake, BigQuery, Redshift, Synapse/Fabric, Teradata, Oracle ADW); (ii) data lakes — raw-format object storage for ELT and ML (S3, GCS, ADLS); (iii) data lakehouses — the converged architecture pairing object storage with ACID open table formats such as Apache Iceberg; (iv) streaming and integration (Confluent/Kafka, Fivetran, dbt); (v) BI and analytics (Power BI, Tableau, Looker, Qlik); (vi) data-science platforms (Databricks ML, SageMaker, Vertex AI); and (vii) vector / AI data stores supporting RAG and semantic search. The market is bounded by the layer below (raw compute / storage from hyperscalers, which DIAI vendors consume rather than sell) and the layer above (LLMs / GenAI applications, which DIAI vendors increasingly host but do not author).

§ 02

Who operates in it

The DIAI landscape segments into five archetypes. (1) Hyperscaler-native platforms: AWS (Redshift, S3, EMR, Glue, Athena), Microsoft Azure (Fabric, Synapse, Power BI), and Google Cloud (BigQuery, Looker, Dataflow, Vertex AI) bundle DIAI into their broader cloud stacks and leverage their distribution. (2) Independent data clouds: Snowflake and Databricks sit on top of all three hyperscalers and have become the standalone leaders, with Databricks reportedly running $5.4B ARR at ~65% growth and in talks for a $165-175B valuation, and Snowflake reporting product revenue of ~$997M in Q1 FY26 (+26% YoY). (3) Enterprise-software incumbents: Oracle, IBM (watsonx.data), SAP (HANA, Datasphere, and, as of May 2026, Dremio + Prior Labs), Salesforce (Tableau, Data Cloud), and Teradata defend large installed bases and are increasingly Iceberg-friendly. (4) Specialty platforms: MongoDB (document/operational + analytical), Confluent (Kafka streaming), ClickHouse (real-time analytics, ~$250M ARR, IPO-bound), Cloudera (Hadoop heritage, now Iceberg-aligned), and Palantir (Foundry/AIP integration and analytics). (5) Open-format / open-source substrate: Apache Iceberg, Apache Spark, Apache Hadoop, and the vector-database ecosystem, which are increasingly the lingua franca every platform supports.
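
As a rough sanity check on the figures quoted above, the sketch below backs out the ARR multiple implied by Databricks's reported valuation talks and the prior-year quarter implied by Snowflake's reported growth rate. The function names are illustrative, and the inputs are the approximate, reported figures from the text rather than audited numbers.

```python
def arr_multiple(valuation_usd_b: float, arr_usd_b: float) -> float:
    """Return valuation divided by annual recurring revenue (both in USD billions)."""
    if arr_usd_b <= 0:
        raise ValueError("ARR must be positive")
    return valuation_usd_b / arr_usd_b


def prior_period_revenue(current: float, yoy_growth: float) -> float:
    """Back out the year-ago figure from a current figure and its YoY growth rate."""
    if yoy_growth <= -1:
        raise ValueError("growth rate must be greater than -100%")
    return current / (1 + yoy_growth)


# Reported (approximate) inputs taken from the text above.
DATABRICKS_ARR_USD_B = 5.4
VALUATION_RANGE_USD_B = (165.0, 175.0)
SNOWFLAKE_Q1_PRODUCT_REVENUE_USD_M = 997.0
SNOWFLAKE_YOY_GROWTH = 0.26

low_multiple = arr_multiple(VALUATION_RANGE_USD_B[0], DATABRICKS_ARR_USD_B)
high_multiple = arr_multiple(VALUATION_RANGE_USD_B[1], DATABRICKS_ARR_USD_B)
snowflake_year_ago_usd_m = prior_period_revenue(
    SNOWFLAKE_Q1_PRODUCT_REVENUE_USD_M, SNOWFLAKE_YOY_GROWTH
)

print(f"Implied Databricks ARR multiple: {low_multiple:.1f}x to {high_multiple:.1f}x")
print(f"Implied Snowflake year-ago quarter: ~${snowflake_year_ago_usd_m:.0f}M")
```

On the reported figures this works out to roughly 31x to 32x ARR for Databricks and a year-ago Snowflake quarter of about $790 million in product revenue.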

§ 03

How it works

The DIAI value chain runs from compute-and-storage suppliers (the three hyperscalers: AWS, Azure, GCP) up to applications. DIAI platforms ingest data via change-data-capture or streaming (Kafka/Confluent, Fivetran), land it in object storage (S3 / GCS / ADLS), organize it as open tables (Iceberg / Delta / Hudi) on a lakehouse, model and transform it (dbt, Spark, Snowflake SQL, BigQuery SQL), expose it via a semantic layer, and serve it to BI tools (Power BI, Tableau, Looker, Qlik), ML/AI platforms (Databricks ML, SageMaker, Vertex AI), and increasingly to LLM agents through tool-call APIs and vector retrieval. Monetization is overwhelmingly consumption-based (pay per query, per credit, per token, per ingested row), which aligns vendor revenue with customer usage and creates the high net-revenue-retention dynamics underpinning Snowflake's and Databricks's growth. The competitive lever has shifted from proprietary storage formats to proprietary compute engines, governance, and AI integration, because Iceberg has made the storage layer effectively a commodity.
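
To make the net-revenue-retention point concrete, here is a minimal sketch of how NRR is commonly computed for a fixed customer cohort under consumption pricing. The customer names and spend figures are hypothetical and chosen only for illustration; they are not drawn from any vendor's disclosures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerSpend:
    """Annual consumption spend for one customer in a fixed cohort (hypothetical data)."""
    name: str
    prior_year_usd: float
    current_year_usd: float  # 0.0 means the customer churned


def net_revenue_retention(cohort: list[CustomerSpend]) -> float:
    """Current-year revenue from last year's customers divided by their prior-year revenue.

    New customers acquired during the current year are excluded by construction,
    so the ratio isolates expansion, contraction, and churn within the cohort.
    """
    prior = sum(c.prior_year_usd for c in cohort)
    if prior <= 0:
        raise ValueError("cohort must have positive prior-year revenue")
    current = sum(c.current_year_usd for c in cohort)
    return current / prior


# Hypothetical cohort: two customers expand their usage, one churns entirely.
cohort = [
    CustomerSpend("acme", prior_year_usd=100_000, current_year_usd=140_000),
    CustomerSpend("globex", prior_year_usd=250_000, current_year_usd=300_000),
    CustomerSpend("initech", prior_year_usd=50_000, current_year_usd=0),
]

nrr = net_revenue_retention(cohort)
print(f"NRR: {nrr:.0%}")
```

In this illustrative cohort, usage growth at the two retained customers more than offsets the churned account, so NRR lands above 100%. That is the mechanism by which usage-linked pricing lets revenue grow without new-customer acquisition.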

§ 04

Why it exists

The market grows for three reinforcing reasons. (1) Data volume: total data created globally has risen exponentially for decades, mechanically expanding spend on ingestion, storage, and processing. (2) Decision economics: enterprises that can act on operational data faster than peers earn measurable margin and share advantages, so analytics modernization is treated as a cost of doing business rather than discretionary spend. (3) Generative-AI demand: large-language-model workloads require governed, queryable, semantically structured corpora and embeddings, which in turn require data lakehouses, vector stores, and semantic-layer infrastructure to be reliable. The Semantic Layer Summit 2026 explicitly framed the semantic layer as 'critical infrastructure for enterprise AI.' Counter-forces are real but limited: enterprises push back on hyperscaler lock-in (which drives Iceberg / open-format adoption), and weak macroeconomic conditions can compress short-term consumption growth, but no scenario plausibly reverses the secular trend over the forecast horizon.

§ 05

When — the chronology

The DIAI market's modern arc spans roughly four decades. Data-warehouse architecture coalesces in the late 1980s, providing the conceptual backbone for enterprise BI. The 2006 release of Apache Hadoop and AWS's launch of S3 + EC2 the same year together open the 'big data' era — commodity-cluster distributed storage on open-source code, paired with rentable hyperscale compute. The 2009-2013 wave of Apache Spark research at UC Berkeley's AMPLab seeds the Databricks generation. Snowflake (2012) and Databricks (2013) are the founding cohort of the cloud-native independent data clouds, joined by Confluent (2014) for streaming. Apache Iceberg, developed at Netflix in 2017 and donated to Apache in 2018, becomes a top-level project in May 2020 — the technical condition for the lakehouse convergence. Snowflake's September 2020 IPO marks public-markets validation of the category. The November 2022 launch of ChatGPT triggers the GenAI demand cycle that has dominated DIAI procurement decisions since 2023. The current moment (2026 Q2) is defined by Microsoft Fabric's anti-Snowflake/Databricks positioning at Build 2026, SAP's near-simultaneous Dremio + Prior Labs acquisitions announced 4 May 2026, and Databricks's reported $165-175B valuation talks at ~$5.4B ARR.

§ 06

Where

Global market, anchored in the United States. The platform vendors are overwhelmingly headquartered in the US — Microsoft (Redmond, WA), Amazon/AWS (Seattle, WA), Alphabet/Google Cloud (Mountain View, CA), Snowflake (Bozeman, MT and Menlo Park, CA), Databricks (San Francisco, CA), Palantir (Denver, CO), Confluent (Mountain View, CA), Oracle (Austin, TX), IBM (Armonk, NY), Salesforce (San Francisco, CA) — with SAP SE (Walldorf, Germany) the principal non-US incumbent. Demand is global with concentrations in North America, Western Europe, and North-East Asia; the data centers underlying the workloads are in turn concentrated in the same regions, with the US projected as the leading data-center market by 2029.

§ 07

Players

17 in the space

§ 07b

Chronology

14 events
  1. 1988-01-01 The term 'data warehouse' enters mainstream computing literature, framing the BI-substrate problem the modern DIAI market now solves at hyperscale.
  2. 2006-04-01 Apache Hadoop's initial release seeds the 'big data' era — commodity-cluster distributed storage and MapReduce processing on open-source code.
  3. 2009-01-01 Apache Spark research begins at UC Berkeley's AMPLab, producing the unified analytics engine that later becomes Databricks's commercial backbone.
  4. 2012-07-23 Snowflake Inc. is incorporated in Delaware (entity creation date per GLEIF).
  5. 2013-05-31 Databricks, Inc. is incorporated in Delaware by the original creators of Apache Spark (GLEIF entity creation; Wikipedia founder history).
  6. 2014-09-11 Confluent, Inc. is incorporated (Apache Kafka commercialization).
  7. 2017-01-01 Apache Iceberg is developed at Netflix as a high-performance open table format addressing Hive's scaling limits.
  8. 2018-01-01 Iceberg is donated to the Apache Software Foundation, starting the path to becoming the lakehouse standard.
  9. 2020-05-01 Apache Iceberg graduates to a top-level Apache project, marking maturation of the open lakehouse stack.
  10. 2020-09-16 Snowflake IPOs on the NYSE in one of the largest software public offerings to date, catalyzing investor focus on cloud-native data warehousing.
  11. 2022-11-30 ChatGPT's public launch triggers the GenAI demand cycle that becomes the dominant DIAI tailwind from 2023 onward.
  12. 2026-05-04 SAP announces near-simultaneous acquisitions of Dremio (Apache Iceberg-native agentic lakehouse) and Prior Labs (tabular-AI models), reshaping the enterprise-AI data-platform race.
  13. 2026-05-21 Microsoft Build 2026 unveils Microsoft Fabric updates explicitly positioned as the data platform for AI agents, targeting Snowflake and Databricks head-on.
  14. 2026-06-10 Reports surface that Databricks is in advanced funding talks at a $165-175 billion valuation on roughly $5.4 billion ARR growing ~65% year over year.

§ 08

Market

Independent third-party forecasts converge on a high-growth profile: Technavio sees the data-analytics market expanding by USD 375.6 billion over 2025-2030 at a 16.4% CAGR; Statista's analytics-as-a-service forecast reaches ~USD 69B in 2028; a Mordor Intelligence high-performance data analytics report projects USD 152.6B in 2026 growing to USD 398.2B by 2031 (21.14% CAGR). These individual scopes are narrower than the DIAI envelope but collectively support a $700-880B 2029-2030 magnitude with mid-teens CAGR. Concentration is moderate-to-high: three hyperscalers plus Snowflake and Databricks plausibly take a majority of new platform spend, with the long tail of incumbents (Oracle, IBM, SAP, Teradata, Salesforce) defending installed bases. The most important structural dynamic is the open-table-format convergence around Apache Iceberg, which simultaneously raises the commodity floor (any vendor can read your data) and intensifies competition for the engine-and-governance layer above it.
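
The growth rates quoted above can be checked with the standard compound-annual-growth-rate formula. The sketch below recomputes the Mordor Intelligence CAGR from its endpoints and, as an assumption not stated in the source, backs out the 2025 base that Technavio's USD 375.6 billion increment would imply if the 16.4% rate compounds over five annual periods.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values separated by `years` periods."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def implied_base(incremental_growth: float, rate: float, years: int) -> float:
    """Starting value implied by an absolute increase achieved at a constant CAGR.

    Solves base * ((1 + rate) ** years - 1) == incremental_growth for base.
    """
    growth_factor = (1 + rate) ** years - 1
    if growth_factor <= 0:
        raise ValueError("rate and years must imply positive growth")
    return incremental_growth / growth_factor


# Mordor Intelligence endpoints from the text (USD billions), 2026 to 2031.
mordor_cagr = cagr(152.6, 398.2, years=2031 - 2026)

# Technavio: +USD 375.6B over 2025-2030 at 16.4% CAGR.
# Assumption: five compounding periods measured from a 2025 base.
technavio_base = implied_base(375.6, 0.164, years=2030 - 2025)

print(f"Mordor implied CAGR: {mordor_cagr:.2%}")
print(f"Technavio implied 2025 base: USD {technavio_base:.0f}B")
```

Under these assumptions the Mordor endpoints reproduce a rate of roughly 21.1%, consistent with the quoted 21.14%, and the Technavio increment implies a 2025 base of roughly USD 330 billion. A different base year or period count would shift the implied base.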

Size
Aggregate DIAI 2029-2030 magnitude in the high hundreds of billions USD; data-analytics segment alone +USD 375.6B over 2025-2030 (Technavio); analytics-as-a-service ~USD 69B by 2028 (Statista).
Segments
Data warehouses + cloud data platforms · Data lakes / lakehouses (Iceberg-based) · Streaming + integration (Kafka, CDC, ELT) · Business intelligence + analytics (Power BI, Tableau, Looker, Qlik) · Data-science / ML platforms · Vector + AI data stores (for RAG, semantic search) · Specialty / real-time analytical databases (ClickHouse, Druid, Pinot) · Enterprise-software-embedded analytics (Oracle ADW, SAP Datasphere, IBM watsonx.data, Teradata VantageCloud)
Dynamics
Growth: mid-teens CAGR through 2030 driven by GenAI workloads, with consumption-based pricing aligning revenue to usage. Consolidation: SAP's May 2026 Dremio + Prior Labs acquisitions; Databricks's 2024 Tabular acquisition; Microsoft Fabric's bundling-via-distribution strategy. Disintermediation: Apache Iceberg standardizes the storage layer, raising switching pressure on proprietary formats and pushing differentiation up to engines, governance, and AI-native interfaces.

§ 09

Outlook

Moderate confidence

Over the next 24 months it is highly likely that the DIAI market continues to grow at a mid-teens CAGR driven by GenAI workloads, with Apache Iceberg solidifying as the open table standard. It is likely that Microsoft Fabric becomes a credible third 'unified data platform' alongside Snowflake and Databricks by end-2027, given Microsoft's distribution leverage and explicit positioning at Build 2026. There is a roughly even chance that further enterprise-software consolidation follows SAP's Dremio playbook over the same horizon, with another large vendor (plausibly Oracle, IBM, or Salesforce) acquiring a stand-alone Iceberg-aligned lakehouse vendor. It is unlikely that any single vendor reaches a clearly dominant position by 2027, given the structurally multi-cloud demand pattern and the leveling effect of Iceberg.

§ 10

Key Judgments

graded per ICD 203
KJ-01 High Confidence

The data infrastructure and analytics platform market is likely in a structural multi-year growth phase driven by generative-AI workloads, with independent third-party forecasts converging on a 15-17% CAGR and a 2029-2030 magnitude in the $700-880 billion range.

KJ-02 High Confidence

The competitive structure is almost certainly settling around three hyperscalers (AWS, Microsoft Azure, Google Cloud) and two large independents (Snowflake, Databricks), with Apache Iceberg emerging as the de facto open table format that pulls the rest of the field toward a shared lakehouse substrate.

KJ-03 Moderate Confidence

SAP's near-simultaneous May 2026 acquisitions of Dremio (Iceberg-native lakehouse) and Prior Labs (tabular-AI models) signal that a fresh wave of platform-stack consolidation around Iceberg + agentic-AI data plumbing is likely over the next 12-24 months.

KJ-04 Moderate Confidence

Microsoft Fabric is more likely than not to become the third credible 'unified data platform' contender alongside Snowflake and Databricks by end-2027, given Microsoft's distribution leverage and explicit Fabric positioning against both incumbents.