One of the main announcements at the Google I/O conference yesterday, other than wearables and Android L, was the availability of a streaming analytics service called Dataflow on Google Cloud. This service will enable application developers to quickly assemble a real-time, high-volume data ingestion and processing pipeline and scale it on Google Cloud infrastructure. Google Dataflow combines Google’s internal streaming engine, MillWheel, with FlumeJava, an easy-to-program abstraction for big data processing. Dataflow is compatible with Google BigQuery, so the output from Dataflow can be fed to BigQuery. With this pipeline stack, building applications such as real-time analytics and dashboards will be easier.
Google published papers on both of these technologies a few years back. However, Google did not open source either of them. As with other Google papers, these triggered the development of similar technologies outside Google. For example, Cloudera developed Crunch, which is based on the concepts from FlumeJava, and then open sourced it to Apache. There is another project called Puma based on the same concepts.
On the streaming side, various other open source projects have come up. Twitter has open sourced its stream processor, Storm. Apache Spark, the now popular open source project from UC Berkeley's AMPLab, has streaming as one of its important components. Yahoo open sourced S4 to Apache, and recently LinkedIn open sourced Samza, which is based on Kafka, a distributed messaging system that LinkedIn had open sourced earlier. Sometime later I will compare these streaming technologies on my texploration blog at WordPress.
However, though all these technologies aim to solve the same real-time stream processing problem that Google Dataflow is solving, none of them is a cloud service. That is where Amazon comes into the picture. Google is fighting this war in the cloud.
Back in December 2013, Amazon made available an AWS cloud service called Kinesis for real-time processing of streaming data. This service makes it quick to assemble an application that processes the massive streams of data generated by mobile games or sensors. It is cost effective: just 2.8 cents per million records ingested! Of course, Amazon earns money not only from ingestion but also from storing, further processing, and using the data. The service can be used for dashboards as well as for real-time processing applications.
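To get a feel for what that rate means in practice, here is a rough back-of-the-envelope sketch. It assumes only the 2.8 cents per million records figure quoted above and ignores any other charges (capacity, storage, downstream processing), so treat it as an illustration rather than a real bill. The function names are my own and not part of any AWS API.

```python
# Rough ingestion-cost estimate using only the per-record rate quoted in the post.
# Assumption: 2.8 cents (0.028 USD) per million records; all other charges are ignored.
PRICE_PER_MILLION_RECORDS_USD = 0.028


def ingestion_cost_usd(records: float) -> float:
    """Cost in USD to ingest the given number of records at the quoted rate."""
    return records / 1_000_000 * PRICE_PER_MILLION_RECORDS_USD


def monthly_cost_usd(records_per_second: float, days: int = 30) -> float:
    """Cost in USD for a steady stream over `days` days (86,400 seconds per day)."""
    total_records = records_per_second * 86_400 * days
    return ingestion_cost_usd(total_records)


if __name__ == "__main__":
    # Hypothetical workload: a steady 1,000 records per second for 30 days.
    print(f"{monthly_cost_usd(1_000):.2f} USD per month")
```

At a steady 1,000 records per second, that is about 2.6 billion records in a 30-day month, or roughly 72.58 USD for ingestion alone under this assumption.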
Google has yet to price the service. I am not sure whether pricing will be based on the volume of data processed, but I expect it to be in the same range as Amazon's.
However, the main obstacle both of them will face is that neither technology is open source. Hence, it will be a one-way entry for customers using them. Vendor lock-in! Hence, their biggest competition will be application developers running Spark, Storm, etc. on EC2 or Google Cloud. Metamarkets had a similar problem, which it tackled by open sourcing Druid (BTW, though Druid can be used for real-time processing and dashboards, its architecture and programming model are different from most of the above). I hope Google open sources MillWheel too.