Facebook today announced Presto as a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. Facebook developed even though Hive provides a query interface over big data because of a need of having lower latency interactive queries over web scale of data that Facebook has. By looking into the results, it seems that Facebook engineering has successfully able to accomplish this mission . It is deployed in three (may be more) geographical regions with scaled a single cluster to 1,000 nodes. This is being used by 1000+ employees firing more than 30,000 (up 3000 since June when Presto was first time revealed at WebScale event) over petabyte of data.
So, what does it do differently?
1. Optimized query planner and scheduler firing multiple stages
2. Stages are executed in parallel
3. Intermediate results are kept in memory as against persisting on HDFS thus saving IO cost
4. Optimized Java for key code directly generating optimized byte code.
More importantly, when contrasting with Hive, Presto does not use MapReduce for query processing.
One thing I liked about Presto is that it is built on a pluggable architecture. It can work on with Hive, Scrub, and potentially your own storage. That opens up a good opportunity for its adaption. Of course, we need to compare this with Impala from Cloudera.
In the Web Scale conference at Facebook menlo Park office back in the June, it was told that Presto would explore probabilistic sampling for quicker results at error (concept that BlinkDB implemented). I am not sure where is it, however, BlinkDb already supports Presto in addition to Hive and Shark.
Code and docs: