Intermediate

watsonx.data

Learn about IBM's open data lakehouse for managing AI-ready data across hybrid cloud environments, with query federation and fit-for-purpose engines.

What is watsonx.data?

watsonx.data is IBM's open, hybrid, and governed data store built on an open lakehouse architecture. It uses Apache Iceberg as its table format and Presto and Spark as its query engines, and it supports fit-for-purpose compute so that each workload runs on the engine that best balances cost and performance.

💡 Cost optimization: IBM claims watsonx.data can reduce data warehouse costs by up to 50% by letting you move suitable workloads off expensive warehouse engines and onto lower-cost lakehouse engines and object storage, while maintaining SQL access.

Architecture

Layer         | Technology                      | Purpose
Storage       | Object storage (S3, COS, HDFS)  | Scalable, low-cost data storage
Table format  | Apache Iceberg                  | ACID transactions, time travel, schema evolution
Query engines | Presto, Spark                   | Fit-for-purpose compute for different workloads
Metadata      | Hive Metastore / Unity Catalog  | Centralized catalog for data discovery
Governance    | watsonx.governance integration  | Access control, lineage, compliance
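
To make the "time travel" capability of the table-format layer concrete, here is a minimal sketch that builds a query reading an Iceberg table as it existed at a past timestamp. It assumes the FOR TIMESTAMP AS OF clause offered by Presto's Iceberg connector; check the syntax against your engine version. The table name iceberg_data.sales.orders is hypothetical.

```python
from datetime import datetime, timezone


def time_travel_query(table: str, as_of: datetime) -> str:
    """Build a SELECT that reads an Iceberg table as of a past timestamp.

    `as_of` should be timezone-aware; it is normalized to UTC so the
    timestamp literal in the query is unambiguous.
    """
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    # Assumed syntax: Presto Iceberg connector time travel clause.
    return (
        f"SELECT * FROM {table} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"
    )


# Hypothetical catalog.schema.table name, for illustration only.
sql = time_travel_query(
    "iceberg_data.sales.orders",
    datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
print(sql)
```

The resulting string would be submitted through whatever SQL client you use with the Presto engine; the sketch only constructs the statement and does not connect to anything.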

Key Features

  • Query federation: Query data across multiple sources (databases, data lakes, warehouses) with a single SQL interface
  • Fit-for-purpose engines: Route workloads to the optimal engine — Presto for interactive queries, Spark for batch processing
  • Open formats: Apache Iceberg tables reduce vendor lock-in, because any engine that supports Iceberg can read and write the same data
  • Hybrid deployment: Run on IBM Cloud, AWS, or on-premises with consistent management
  • Data sharing: Share governed datasets across teams and applications securely
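
The first two features can be sketched together. The example below shows one plausible way to express the routing rule from the list (Presto for interactive queries, Spark for batch processing) and what a federated query looks like when every table is addressed as catalog.schema.table. The Workload class, the choose_engine helper, and both catalog names are hypothetical; this illustrates the idea and is not a watsonx.data API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    interactive: bool  # True for ad hoc, latency-sensitive queries


def choose_engine(workload: Workload) -> str:
    """Pick an engine family using the rule of thumb from the feature list."""
    return "presto" if workload.interactive else "spark"


# Federated query: one SQL statement joins an Iceberg table in the lakehouse
# with a table that lives in an external database. The catalog names
# iceberg_data and postgres_crm are hypothetical.
FEDERATED_SQL = """
SELECT c.region, SUM(o.amount) AS revenue
FROM iceberg_data.sales.orders AS o
JOIN postgres_crm.public.customers AS c
  ON o.customer_id = c.id
GROUP BY c.region
""".strip()

jobs = [
    Workload("dashboard_revenue", interactive=True),
    Workload("nightly_compaction", interactive=False),
]
for job in jobs:
    print(job.name, "->", choose_engine(job))
```

In practice the routing decision usually depends on more than one flag (data volume, concurrency, cost targets), so treat the boolean here as a placeholder for whatever policy your team defines.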

Integration with watsonx.ai

watsonx.data connects directly to watsonx.ai for AI workflows:

  • Access training data directly from the lakehouse for model development
  • Store and serve feature data for ML models at scale
  • Use vectorized data for RAG applications with foundation models
  • Track data provenance from source through model training
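
As a rough illustration of how vectorized data supports RAG, the sketch below ranks stored embeddings by cosine similarity to a query embedding and returns the closest rows, keeping a source field so each retrieved chunk can be traced back to where it came from. It is an in-memory toy under stated assumptions: the rows, field names, and three-dimensional vectors are made up, and a real deployment would delegate the search to a vector store rather than plain Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=2):
    """Return the k rows whose embeddings are most similar to query_vec."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vec, row["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical rows: an id, a provenance pointer, and a toy embedding.
rows = [
    {"id": "doc-1", "source": "sales/orders", "embedding": [0.9, 0.1, 0.0]},
    {"id": "doc-2", "source": "hr/policies", "embedding": [0.0, 1.0, 0.0]},
    {"id": "doc-3", "source": "sales/returns", "embedding": [0.7, 0.3, 0.1]},
]

hits = top_k([1.0, 0.0, 0.0], rows, k=2)
print([hit["id"] for hit in hits])  # ['doc-1', 'doc-3']
```

The retrieved rows would then be passed to a foundation model as context, with the source values recorded alongside the answer.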

Key takeaway: watsonx.data provides a cost-effective, open data foundation for enterprise AI. Its fit-for-purpose engine approach lets you match compute to workload requirements, while Apache Iceberg keeps data portable across engines and reduces vendor lock-in.