Intermediate
watsonx.data
Learn IBM's open data lakehouse for managing AI-ready data across hybrid cloud environments with query federation and fit-for-purpose engines.
What is watsonx.data?
watsonx.data is IBM's open, hybrid, governed data store built on a lakehouse architecture. It uses Apache Iceberg as its table format and Presto and Spark as its query engines, so each workload can run on the engine that best balances cost and performance.
Cost optimization: IBM claims watsonx.data can reduce data warehouse costs by up to 50%. The saving comes from offloading workloads from expensive warehouse engines to lower-cost lakehouse engines and object storage while keeping SQL access to the data.
Architecture
| Layer | Technology | Purpose |
|---|---|---|
| Storage | Object storage (Amazon S3, IBM Cloud Object Storage) or HDFS | Scalable, low-cost data storage |
| Table format | Apache Iceberg | ACID transactions, time travel, schema evolution |
| Query engines | Presto, Spark | Fit-for-purpose compute for different workloads |
| Metadata | Hive Metastore / Unity | Centralized catalog for data discovery |
| Governance | watsonx.governance integration | Access control, lineage, compliance |
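To make the table format row concrete, here is a minimal sketch of how an Iceberg time-travel query might be assembled for a Presto-style engine. The catalog, schema, and table names are hypothetical, and the `FOR TIMESTAMP AS OF` clause is an assumption about the engine's SQL dialect; the exact syntax varies by engine and version, so check your engine's documentation before relying on it.

```python
from datetime import datetime, timezone


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table identifier, the three-part naming Presto-style engines use."""
    return f"{catalog}.{schema}.{table}"


def time_travel_query(catalog: str, schema: str, table: str, as_of: datetime) -> str:
    """Return SQL that reads the table as it existed at `as_of`.

    Assumption: the engine accepts `FOR TIMESTAMP AS OF TIMESTAMP '...'`.
    """
    if as_of.tzinfo is None:
        raise ValueError("as_of must be timezone-aware")
    # Normalize to UTC so the generated SQL does not depend on the local zone.
    ts = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"SELECT * FROM {qualified_name(catalog, schema, table)} "
        f"FOR TIMESTAMP AS OF TIMESTAMP '{ts} UTC'"
    )


# Hypothetical names, for illustration only.
sql = time_travel_query(
    "iceberg_data", "sales", "orders",
    datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc),
)
print(sql)
```

The function only builds a SQL string. Submitting it to an engine would go through whichever client or console your deployment provides.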
Key Features
- Query federation: Query data across multiple sources (databases, data lakes, warehouses) with a single SQL interface
- Fit-for-purpose engines: Route each workload to the engine that suits it: Presto for interactive queries, Spark for batch processing
- Open formats: Apache Iceberg tables reduce vendor lock-in because other Iceberg-compatible engines and platforms can read the same data
- Hybrid deployment: Run on IBM Cloud, AWS, or on-premises with consistent management
- Data sharing: Share governed datasets across teams and applications securely
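The first two features above can be sketched in a few lines. This is an illustrative model, not watsonx.data's own routing logic or API: the `Workload` class, the `choose_engine` rule, and the catalog names are all hypothetical. It shows the idea of sending interactive work to Presto and batch work to Spark, and of joining tables from two different catalogs in one SQL statement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    interactive: bool  # True for ad hoc or dashboard queries, False for batch jobs


def choose_engine(workload: Workload) -> str:
    """Hypothetical routing rule: Presto for interactive queries, Spark for batch."""
    return "presto" if workload.interactive else "spark"


def federated_join(left: str, right: str, key: str) -> str:
    """Build one SQL statement that joins tables living in different catalogs.

    `left` and `right` are fully qualified catalog.schema.table names.
    """
    return (
        f"SELECT l.*, r.* FROM {left} AS l "
        f"JOIN {right} AS r ON l.{key} = r.{key}"
    )


# Hypothetical catalogs: an Iceberg lakehouse catalog and an operational database.
query = federated_join(
    "iceberg_data.sales.orders",
    "postgres_crm.public.customers",
    "customer_id",
)
print(choose_engine(Workload("sales dashboard", interactive=True)))
print(choose_engine(Workload("nightly aggregation", interactive=False)))
print(query)
```

In practice the routing decision is usually made by people or schedulers choosing which engine to submit a job to. The point of the sketch is that the data stays in one place while the engine changes.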
Integration with watsonx.ai
watsonx.data connects directly to watsonx.ai for AI workflows:
- Access training data directly from the lakehouse for model development
- Store and serve feature data for ML models at scale
- Use vectorized data for RAG applications with foundation models
- Track data provenance from source through model training
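The retrieval step behind the RAG bullet above can be illustrated with a small, self-contained sketch. It ranks stored embeddings by cosine similarity to a query embedding and returns the closest document IDs. The in-memory dictionary is a stand-in for a real vector store, and the two-dimensional vectors are toy values chosen for readability. None of this reflects a specific watsonx.data or watsonx.ai API.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], documents: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the IDs of the k documents whose embeddings are most similar to the query."""
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy embeddings standing in for vectors stored alongside lakehouse data.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(top_k([1.0, 0.0], docs, k=2))
```

In a RAG application, the text of the retrieved documents would then be passed to a foundation model as context for the user's question.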
Key takeaway: watsonx.data provides a cost-effective, open data foundation for enterprise AI. Its fit-for-purpose engine approach lets you match compute to workload requirements, while Apache Iceberg keeps data portable and limits vendor lock-in.
Lilly Tech Systems