We built a real-time lakehouse using S3 Tables + AWS Glue + Apache Doris (https://doris.apache.org/) to query Iceberg data in S3 directly: no data copies, no endless ETL jobs, no waiting.
🧱 The stack:
1️⃣ S3 Tables store data in the Iceberg format, bringing ACID, schema evolution, and time travel to S3 buckets.
2️⃣ AWS Glue Data Catalog serves as the metadata layer, keeping track of table schemas, snapshots, and partitions.
3️⃣ Apache Doris (via VeloDB Cloud) as the compute engine. Doris connects to Glue and queries S3 Tables directly, delivering sub-second analytics and high concurrency, all without data movement.
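For a feel of what "Doris connects to Glue" might look like, here is a minimal Python sketch that only renders the SQL; it does not talk to a cluster. The catalog name `glue_lake`, the database and table `demo_db.events`, the region, and the credential placeholders are all hypothetical. The property keys follow the Glue-backed Iceberg catalog style from the Doris docs as I understand them, and they may differ by Doris version or for S3 Tables specifically, so check them against the blog post before use.

```python
from typing import Mapping


def render_create_catalog(name: str, properties: Mapping[str, str]) -> str:
    """Render a Doris-style CREATE CATALOG statement from a property map."""

    def quote(value: str) -> str:
        # Escape backslashes and double quotes so each value stays inside its literal.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    props = ",\n".join(
        f'    "{quote(key)}" = "{quote(val)}"' for key, val in properties.items()
    )
    return f"CREATE CATALOG {name} PROPERTIES (\n{props}\n);"


# ASSUMPTION: property keys for an Iceberg catalog backed by AWS Glue.
# Verify against the Doris documentation for your version.
glue_properties = {
    "type": "iceberg",
    "iceberg.catalog.type": "glue",
    "glue.endpoint": "https://glue.us-east-1.amazonaws.com",  # hypothetical region
    "glue.access_key": "<your-access-key>",  # placeholder, not a real credential
    "glue.secret_key": "<your-secret-key>",  # placeholder, not a real credential
}

create_catalog_sql = render_create_catalog("glue_lake", glue_properties)

# Once the catalog exists, tables are addressed as catalog.database.table.
# The database and table names here are hypothetical.
sample_query_sql = "SELECT * FROM glue_lake.demo_db.events LIMIT 10;"

if __name__ == "__main__":
    print(create_catalog_sql)
    print(sample_query_sql)
```

Since Doris speaks the MySQL protocol, both statements could be sent from any MySQL-compatible client. The data stays in S3 and only the catalog definition lives in Doris.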
P.S. Doris can be both a query engine on top of table formats and a real-time data warehouse when you need to materialize and accelerate results.
The same pattern applies to many other open-source combinations: table formats like Iceberg and Paimon, catalogs like Unity, Polaris, and Gravitino, and query engines like Spark, Flink, and Trino.
Full demo steps in the blog post: https://www.velodb.io/blog/real-time-lakehouse-s3-tables-aws-glue-and-apache-doris