Distributed SQL with Presto + MySQL

Distributed SQL with Presto + MySQL
Dogenzaka LT Festival (道玄坂LT祭り)
Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, Inc.

A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> ServerEngine - A Ruby framework to build multiprocess servers
> Prestogres - PostgreSQL protocol gateway for Presto
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed, strongly consistent key-value data store

Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!

What’s Presto?
A distributed SQL query engine
for interactive data analysis
against GBs to PBs of data.

[Architecture diagram: Client; Coordinator with Connector Plugin; three Workers; Storage / Metadata; Discovery Service]

1. Find servers in a cluster (Discovery Service)

2. Client sends a query using HTTP

3. Coordinator builds a query plan. The connector plugin provides metadata (table schema, etc.)

4. Coordinator sends tasks to workers

5. Workers read data through the connector plugin

6. Workers run tasks in memory

7. Client gets the result from a worker
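
Steps 2 and 7 can be sketched from the client side. The following is a minimal sketch, assuming the Presto HTTP protocol in which the client POSTs the SQL text to /v1/statement and then follows the nextUri link in each JSON response until no link is returned. The host name, the user name, and the injectable `http` callable are illustrative assumptions and do not come from the slides.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional

# (method, url, headers, body) -> decoded JSON response
Http = Callable[[str, str, Dict[str, str], Optional[bytes]], dict]


def urllib_http(method: str, url: str, headers: Dict[str, str],
                body: Optional[bytes]) -> dict:
    """Default transport built on the standard library."""
    req = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(sql: str,
              coordinator: str = "http://coordinator.example:8080",  # hypothetical host
              user: str = "demo",                                     # hypothetical user
              catalog: str = "hive",
              schema: str = "default",
              http: Http = urllib_http) -> Iterator[List]:
    """Yield result rows for `sql`, one row at a time."""
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # Step 2: the client sends the query text over HTTP.
    page = http("POST", coordinator + "/v1/statement", headers, sql.encode("utf-8"))
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        # Step 7: rows arrive page by page while the query runs.
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        if next_uri is None:
            return  # no further pages: the query has finished
        page = http("GET", next_uri, headers, None)
```

Passing the transport in as a parameter keeps the paging loop separate from the network code, so the loop can be exercised without a running cluster.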

Hive connector
[Diagram: Client; Coordinator with Hive Connector; three Workers; HDFS and Hive Metastore; Discovery Service (find servers in a cluster)]

Cassandra connector
[Diagram: Client; Coordinator with JDBC Connector; three Workers; Cassandra; Discovery Service]

Multiple connectors in a query
[Diagram: Client; Coordinator; three Workers; Discovery Service; Hive Connector to HDFS / Metastore; JDBC Connector to PostgreSQL; other connectors to other data sources...]
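
The role of a connector plugin can be sketched as a small interface: the coordinator asks it for table metadata while planning, and workers read rows through it while running tasks. This is a toy model, not Presto's actual connector SPI (which is written in Java); the class names, method names, and sample rows below are all hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


class InMemoryConnector:
    """Hypothetical stand-in for a connector plugin (not Presto's real SPI)."""

    def __init__(self, tables: Dict[Tuple[str, str], Tuple[List[str], List[tuple]]]):
        # (schema, table) -> (column names, rows)
        self._tables = tables

    def columns(self, schema: str, table: str) -> List[str]:
        """Metadata side: consulted while the query is planned."""
        return self._tables[(schema, table)][0]

    def scan(self, schema: str, table: str) -> Iterator[tuple]:
        """Data side: consulted while tasks run."""
        return iter(self._tables[(schema, table)][1])


class Catalogs:
    """Routes catalog.schema.table names to the matching connector."""

    def __init__(self, default_catalog: str, default_schema: str):
        self._connectors: Dict[str, InMemoryConnector] = {}
        self._default = (default_catalog, default_schema)

    def register(self, catalog: str, connector: InMemoryConnector) -> None:
        self._connectors[catalog] = connector

    def resolve(self, name: str) -> Tuple[InMemoryConnector, str, str]:
        parts = name.split(".")
        if not 1 <= len(parts) <= 3:
            raise ValueError("expected [catalog.][schema.]table, got " + name)
        # Missing catalog/schema parts fall back to the session defaults.
        catalog, schema = self._default
        if len(parts) == 3:
            catalog, schema, table = parts
        elif len(parts) == 2:
            schema, table = parts
        else:
            table = parts[0]
        return self._connectors[catalog], schema, table


# Two catalogs registered side by side, with made-up sample data.
catalogs = Catalogs(default_catalog="hive", default_schema="default")
catalogs.register("hive", InMemoryConnector(
    {("default", "orders"): (["orderkey", "custkey"], [(1, 10)])}))
catalogs.register("mysql", InMemoryConnector(
    {("presto_test", "users"): (["id", "email"], [(10, "a@example.com")])}))

connector, schema, table = catalogs.resolve("mysql.presto_test.users")
user_columns = connector.columns(schema, table)  # planning: metadata
user_rows = list(connector.scan(schema, table))  # execution: data
```

Because every source sits behind the same two calls, one query can mix tables from different catalogs simply by naming them with different prefixes.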

MapReduce vs. Presto

MapReduce (map, reduce, disk, map, reduce, disk):
> Write data to disk
> Wait between stages

Presto (task to task):
> All stages are pipelined
  ✓ No wait time
  ✓ No fault-tolerance
> Memory-to-memory data transfer
  ✓ No disk IO
  ✓ Data chunk must fit in memory
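
The difference between the two execution models can be illustrated with a toy example. In the staged version, the first stage writes its whole output to a temporary file before the second stage starts; in the pipelined version, the stages are chained generators that hand rows over in memory. This is an analogy under simplifying assumptions (a single process and made-up stage functions), not a description of how either system is implemented.

```python
import itertools
import json
import os
import tempfile
from typing import Iterable, Iterator, List


def staged(rows: Iterable[int]) -> List[int]:
    """MapReduce-like: materialize stage 1 on disk, then run stage 2."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        # Stage 1 runs to completion and writes every row to disk.
        with os.fdopen(fd, "w") as out:
            for r in rows:
                out.write(json.dumps(r * 2) + "\n")
        # Stage 2 starts only after stage 1 has finished.
        with open(path) as src:
            return [v for v in map(json.loads, src) if v % 3 == 0]
    finally:
        os.remove(path)


def double(rows: Iterable[int]) -> Iterator[int]:
    for r in rows:
        yield r * 2


def keep_multiples_of_three(rows: Iterable[int]) -> Iterator[int]:
    for r in rows:
        if r % 3 == 0:
            yield r


def pipelined(rows: Iterable[int]) -> Iterator[int]:
    """Presto-like: stages are chained and each row flows through in memory."""
    return keep_multiples_of_three(double(rows))


staged_result = staged(range(10))
pipelined_result = list(pipelined(range(10)))
# The pipelined version can emit a row before its input has ended,
# so it even works on an endless input stream.
first_row = next(pipelined(itertools.count(1)))
```

The trade-offs listed on the slide show up here as well: the pipelined version never touches the disk and does not wait between stages, but if it fails midway there is no intermediate file to restart from.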

[Diagram: a client queries Presto, which JOINs data from Hive and MySQL]
select orderkey, orderdate, custkey, email 
from orders 
join mysql.presto_test.users 
on orders.custkey = users.id 
order by custkey, orderdate;
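
Conceptually, the query above has Presto read `orders` from Hive and `users` from MySQL, then join the two row streams itself. A minimal in-memory hash join sketch follows; the sample rows are made up for illustration, and Presto's real join operator is distributed and considerably more involved.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def hash_join(left: Iterable[Row], right: Iterable[Row],
              left_key: str, right_key: str) -> Iterator[Row]:
    """Inner join: build a hash table on `right`, then stream `left` past it."""
    table: Dict[object, List[Row]] = defaultdict(list)
    for r in right:
        table[r[right_key]].append(r)
    for l in left:
        # Rows without a match on the other side are dropped (inner join).
        for r in table.get(l[left_key], []):
            yield {**l, **r}


# Made-up rows standing in for what each data source would return.
orders = [  # from Hive
    {"orderkey": 2, "orderdate": "2014-02-01", "custkey": 20},
    {"orderkey": 1, "orderdate": "2014-01-05", "custkey": 10},
    {"orderkey": 3, "orderdate": "2014-01-02", "custkey": 10},
    {"orderkey": 4, "orderdate": "2014-03-01", "custkey": 99},  # no matching user
]
users = [  # from MySQL
    {"id": 10, "email": "a@example.com"},
    {"id": 20, "email": "b@example.com"},
]

# JOIN ... ON orders.custkey = users.id
joined = hash_join(orders, users, "custkey", "id")
# ORDER BY custkey, orderdate
ordered = sorted(joined, key=lambda r: (r["custkey"], r["orderdate"]))
# SELECT orderkey, orderdate, custkey, email
projected = [(r["orderkey"], r["orderdate"], r["custkey"], r["email"])
             for r in ordered]
```

Building the hash table on the smaller input (here `users`) keeps memory use low, which matters when data has to fit in memory.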

[Diagram: a client queries Presto, which JOINs data from Hive and MySQL and INSERTs the result INTO MySQL]
create table mysql.presto_test.recent_user_info 
as
select users.id, users.email, count(1) as count 
from orders 
join mysql.presto_test.users 
on orders.custkey = users.id 
group by 1, 2;

[Diagram: $ psql connects through Prestogres to Presto, which JOINs data from Hive and MySQL]
Prestogres: PostgreSQL protocol gateway for Presto
