This document discusses Presto, an open source distributed SQL query engine for interactive analysis of large datasets. It describes Presto's architecture including its coordinator, connectors, workers and storage plugins. Presto allows querying of multiple data sources simultaneously through its connector plugins for systems like Hive, Cassandra, PostgreSQL and others. Queries are executed in a pipelined fashion without disk I/O or waiting between stages for improved performance.
19. All stages are pipe-lined
✓ No wait time
✓ No fault-tolerance
MapReduce vs. Presto
MapReduce Presto
map map
reduce reduce
task task
task task
task
task
memory-to-memory
data transfer
✓ No disk IO
✓ Data chunk must
fit in memory
task
disk
map map
reduce reduce
disk
disk
Write data
to disk
Wait between
stages
22. Presto
JOIN
Hive
MySQLINSERT INTO
client
create table mysql.presto_test.recent_user_info
as
select users.id, users.email, count(1) as count
from orders
join mysql.presto_test.users
on orders.custkey = users.id
group by 1, 2;