DRILL: A New Project Added to Apache Incubation for Low-Latency Queries on Large Data Sets in Hadoop

Just as Google’s MapReduce and Google File System papers were the basis for Hadoop’s MapReduce and HDFS respectively, another Google paper has become the basis for a new Apache project. A project named DRILL has recently been submitted to the Apache Incubator, mainly by developers from MapR. The inspiration for this project is Google’s paper on Dremel, its system for interactive queries over very large data sets. Dremel has been the basis for Google BigQuery for a while.

This project would provide ‘low latency’ querying on HDFS. Hadoop is a great platform for big data, though it is best suited to offline / batch processing based on the MapReduce pattern. Many customers, however, need a way to run real-time queries on the data residing in Hadoop / HDFS. DRILL will address that need.

This project has just started, and the first code is yet to be contributed. However, it will be an important addition to the Hadoop ecosystem. It will co-exist with Hive, which also provides query access.



Hadoop Affiliations and Partnerships Are Coming Up


It’s truly an era of collaboration. There is no time to build products: either acquire or partner. That’s the way to get to market quickly. Likewise, EMC is moving very fast to get on the Hadoop train. A couple of months back it affiliated with Cloudera. However, a couple of weeks back it made other announcements at EMC World. Now it has entered into a licensing agreement with MapR. It seems that MapR will be powering EMC’s Hadoop efforts. Here is what the MapR stack looks like.

It seems that Hadoop is bringing many folks together to team up quickly and build an ecosystem. Yesterday Cloudera, a leader in Hadoop products and services, partnered with RainStor (http://www.marketwire.com/press-release/rainstor-delivers-big-data-retention-on-clouderas-distribution-including-apache-hadoop-1518189.htm). RainStor claims compression resulting in a 97% smaller physical footprint. RainStor is in the data retention business and would provide access to massive data sets on the Hadoop Distributed File System (HDFS).

Here is Cloudera’s stack, which includes:

Anyway, coming back to the momentum: it seems that many such partnerships are coming up, and more may come by the time we meet at the Hadoop Summit, organized mainly by Yahoo, next month. By then, I hope Yahoo will have worked out the details of spinning off Hadoop before it is too late 🙂