Machine Learning on Big Data gets Big Momentum

Big Data without algorithms is dumb data. Algorithms such as machine learning, text processing and data mining extract knowledge from the data and make it smart data. These algorithms make the data consumable or actionable for humans and businesses. Such actionable data can drive business decisions or predict the products that customers are most likely to buy next. Amazon and Netflix are popular examples of how learnings from data can be used to influence customer decisions. Hence, machine learning algorithms are very important in the era of Big Data. BTW, in the field of Big Data, ‘machine learning’ is used more broadly than machine learning professionals really mean by it, and includes pure statistical algorithms as well as other algorithms that are not based on ‘learning’.

Earlier today, on 16th June, Microsoft announced a preview of a machine learning service called AzureML on its Azure cloud platform. With this service, business analysts can easily apply machine learning algorithms, such as those used in predictive analytics, to their data.

Machine learning itself has been popular for the last few years. Microsoft has recognized the trend and jumped on it. Among the big players offering machine learning services in the cloud, Google pioneered the approach with its PredictionEngine as a cloud service a few years back.

Traditionally, data scientists use tools like Matlab, R, Python (NumPy, SciKit, Sklearn) and others for analyzing data. Programmers use open source libraries like Apache Mahout and Weka for developing application services that use machine learning algorithms. However, having machine learning algorithms is not enough; scaling those algorithms to big data is very important.

Last year Cloudera did an acqui-hire of Myrrix and open-sourced its machine learning on Hadoop as Oryx. Berkeley’s AMPLab has open-sourced its Big Data machine learning work, called MLBase, in Apache Spark, an open source big data stack that is rapidly becoming popular.

The momentum in machine learning has already fueled a good amount of venture funding in this area.

  • SkyTree got $18 million in funding from U.S. Venture Partners, UPS and Scott McNealy.
  • Nutonian grabbed $4 million from Atlas Ventures for Big Data analytics.
  • Another startup, wise.io, raised $2.5 million from VCs led by Voyager Ventures. Wise.io makes it easy to predict customer behavior using machine learning.
  • AlpineLabs, which came out of EMC, raised a series B last year from Sierra Ventures, Mission Ventures and others. It provides a studio and an easy-to-assemble set of standard machine learning and analytics algorithms.
  • Oregon-based BigML raised $1.2 million last year to provide an easy-to-use machine learning cloud service.
  • RevolutionAnalytics, which got $37 million (in total), makes R algorithms work on MapReduce.
  • and the list goes on

There is an interesting machine learning project called Vowpal Wabbit (VW) that started at Yahoo and continued at Microsoft. Interestingly, however, instead of VW, Microsoft is making the R language and its algorithms available on the Azure cloud.

Anyway, the trend of making machine learning services easy to run on Big Data and in the cloud will continue. But having the tools and algorithms available will not be enough to solve the problem. We need qualified people who understand which algorithms to use for which cases and how to use (parameterize) them. Moreover, what we really need is applications that use such algorithms to solve business problems without users even needing to understand the algorithms. In my opinion, what we will see in the future is vertical applications and services that abstract (use but hide) machine learning or prediction algorithms to serve domain-specific business needs.

Data on BigData

According to Transparency Market Research:
  • Compound Annual Growth Rate (CAGR) of Big Data is projected to be 40% from 2012 to 2018
  • The global big data market was worth USD 6.3 billion in 2012 and is expected to reach USD 48.3 billion by 2018
  • Big Data tools: CAGR of 41.4% from 2012 to 2018
  • Storage: CAGR of 45.3% from 2012 to 2018
  • Major players (by revenue) last year: HP Co., Teradata, Opera Solutions, Mu Sigma and Splunk
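
As a quick sanity check on the figures above, the growth rate implied by the two market-size numbers can be computed directly. This is only a sketch: the function name `cagr` is made up for illustration, and it assumes six compounding years between 2012 and 2018.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.40 means 40%)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted above: USD 6.3 billion (2012) to USD 48.3 billion (2018).
# Assumption: six compounding periods between the two years.
implied = cagr(6.3, 48.3, 2018 - 2012)
print(f"Implied CAGR: {implied:.1%}")
```

The result comes out at roughly 40%, which is consistent with the headline growth rate in the list.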

Oracle’s Big Data Appliance Puts Hadoop, NoSQL, R in a Box

According to Oracle PR:

The Oracle Big Data Appliance is a new engineered system that includes
            an open source distribution of Apache(TM) Hadoop(TM), Oracle NoSQL
            Database, Oracle Data Integrator Application Adapter for Hadoop,
            Oracle Loader for Hadoop, and an open source distribution of R.

Engineered to work together, the Oracle Big Data Appliance is easily
            integrated with Oracle Database 11g, Oracle Exadata Database Machine,
            and Oracle Exalytics Business Intelligence Machine, and is designed to
            deliver extreme analytics on all data types, with enterprise-class
            performance, availability, supportability and security.

        --  Oracle NoSQL Database: Oracle NoSQL Database Enterprise Edition is a
            distributed, highly scalable, key-value database. Unlike competitive
            solutions, Oracle NoSQL Database is easy to install, configure and
            manage, supports a broad set of workloads, and delivers
            enterprise-class reliability backed by enterprise-class Oracle
            support.
        --  Oracle Data Integrator Application Adapter for Hadoop: The new Hadoop
            adapter simplifies data integration from Hadoop and an Oracle Database
            through Oracle Data Integrator's easy to use interface.
        --  Oracle Loader for Hadoop: Oracle Loader for Hadoop enables customers
            to use Hadoop MapReduce processing to create optimized data sets for
            efficient loading and analysis in Oracle Database 11g. Unlike other
            Hadoop loaders, it generates Oracle internal formats to load data
            faster and use less database system resources.
        --  Oracle R Enterprise: Oracle R Enterprise integrates the open-source
            statistical environment R with Oracle Database 11g. Analysts and
            statisticians can run existing R applications and use the R client
            directly against data stored in Oracle Database 11g, vastly increasing
            scalability, performance and security. The combination of Oracle
            Database 11g and R delivers an enterprise-ready deeply-integrated
            environment for advanced analytics.
        --  Oracle NoSQL Database, Oracle Data Integrator Application Adapter for
            Hadoop, Oracle Loader for Hadoop, and Oracle R Enterprise will be
            available both as standalone software products independent of the
            Oracle Big Data Appliance.

Hadoop Affiliations and Partnerships Are Coming Up

Follow khanderao on Twitter

It’s truly an era of collaboration. There is no time to build products: either acquire or partner. That’s the way to get to market quickly. Likewise, EMC is moving very fast to get on the Hadoop train. A couple of months back it affiliated with Cloudera. However, a couple of weeks back, it made other announcements at EMC World. Now it has entered into a licensing agreement with MapR. It seems that MapR will be powering EMC’s Hadoop efforts. Here is what the MapR stack looks like.

It seems that Hadoop is bringing many folks together to quickly team up and build an ecosystem. Yesterday Cloudera, a leader in Hadoop products and services, partnered with RainStor (http://www.marketwire.com/press-release/rainstor-delivers-big-data-retention-on-clouderas-distribution-including-apache-hadoop-1518189.htm). RainStor claims compression resulting in a 97% smaller physical footprint. RainStor is in data retention and will provide access to massive data sets on the Hadoop Distributed File System (HDFS).

Here is Cloudera’s stack:

Anyway, coming back to the momentum: it seems that many such partnerships are coming up, and many more may come by the time we meet at the Hadoop Summit, organized mainly by Yahoo, next month. By then, I hope Yahoo will have worked out the details of spinning off Hadoop before it is too late 🙂

Hadoop on fire!

Last week I covered a few product launches that were announced during EMC World. This week there is news about funding and acquisitions related to startups leveraging Hadoop.

Today TechCrunch covered a $9 million funding round by Kleiner Perkins for Datameer, which offers a Hadoop-based analytics platform. Batch processing such as analytics is one of the main use cases of Hadoop, which offers great horizontal scalability and parallelism for such processing on commodity hardware. The key thing is that Datameer offers an easy-to-use spreadsheet interface, so that any business person can work with the data without any coding. Datameer seems to have a good team in place. A product head of the Hadoop group at Yahoo and the developer of Katta, a Lucene-based distributed indexer built on Hadoop, are the co-founders of Datameer.

Of late there have been a lot of stories about Hadoop-based companies. For example, Gigaom covered Opera Solutions, a 100 million dollar analytics company that uses the Hadoop stack as its foundation. A nice quote from its CEO Arnab Gupta: Opera is built to “mine the signal versus mine the data [itself].”

While looking at the momentum around Hadoop, I did not see Red Hat adding Hadoop to OpenShift, its PaaS, which has included MongoDB. However, since Hadoop’s strength is in scaling, it would be difficult to offer a cloud-based PaaS that lets apps scale out massively. Maybe not now, but those who have huge storage capacity or datacenters will sooner or later offer it. At least we are seeing SaaS offerings built on the Hadoop platform that provide Big Data analytics services.

Anyway, another couple of announcements: the EMC Data awards listed Apache Hadoop, while North Ventures named Cloudera, an open source company providing training, consultancy and solutions on Hadoop, as one of the top 10 open source companies to watch.

Hadoop Based Products Recently Launched

At EMC World this week, there were a few announcements of Hadoop-based products. From the number of announcements, it is apparent that Hadoop’s popularity is growing among enterprises for processing very large volumes of data, typically unstructured data like web logs, social media chatter, emails and similar text analytics. Hadoop can scale up to a very large number of nodes, which are typically commodity hardware. Hadoop is based on the MapReduce architecture, which splits a job into map tasks that run across the nodes and then combines their outputs in the reduce phase. Although Hadoop nodes are separate machines, the input and output data still often need pre- and post-processing in a traditional SQL form. That’s where many solutions, as well as Hadoop’s ecosystem projects like Pig, Flume and HBase, come into the picture.
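
To make the map and reduce phases concrete, here is a minimal single-process sketch in plain Python that counts tokens in a few web-log-like lines. It is not Hadoop’s API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample data are invented for illustration, and a real cluster would run the map calls on different nodes and do the grouping step itself.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (token, 1) pair for every token in one input line."""
    return [(token.lower(), 1) for token in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework's job on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Each line is mapped independently, which is what allows the work
    # to be spread across many nodes.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical sample input standing in for web logs.
logs = ["GET /index.html", "GET /about.html", "POST /login"]
counts = word_count(logs)
print(counts)
```

Because every map call sees only its own line and every reduce call sees only one key, both phases can run in parallel, which is where the horizontal scalability comes from.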

EMC itself announced Greenplum HD as a distribution and an appliance. The EMC Hadoop distribution will be available around the 3rd quarter in both community and commercial editions. The Greenplum HD appliance will combine the Greenplum database and the Enterprise Edition distribution of Hadoop on a single appliance. EMC announced this direction a few months back, and very recently it partnered with Cloudera. Of course, with this announcement, the partnership with Cloudera comes under a cloud.

SnapLogic also announced Hadoop integration via SnapReduce, which runs SnapLogic’s data integration pipelines as MapReduce tasks. This is a good way to offer Hadoop’s scalability to SnapLogic’s customers. Also, nice name, Gaurav: SnapReduce! SnapLogic is an open source solution for ETL; of course, there are commercial solutions, training, support and consulting from SnapLogic itself. The following video gives a good introduction to SnapReduce.

Since there is good synergy for Hadoop on the cloud, Mellanox, a data center connectivity company, announced accelerators for Hadoop and Memcached. It announced a Hadoop product called Hadoop-Direct on Mellanox’s InfiniBand adapters and switches. For more information:
http://www.marketwatch.com/story/mellanox-accelerates-hadoop-and-memcached-for-web-20-applications-2011-05-09

There was another product release: Brisk, from a startup, DataStax. This is an interestingly controversial product. It combines Hadoop with a competing open source NoSQL product called Cassandra. Traditional pure-play Hadoopers like Cloudera criticized the integration. However, it will be interesting to watch its adoption. The following diagram shows the integration.

BTW, I am eagerly waiting to hear more from Yahoo about its Hadoop spinoff. Maybe it will be announced during next month’s Hadoop Summit, which is organized mainly by Yahoo!

Yahoo Spinning off Hadoop Development to Ride on Hadoop Wave

According to a WSJ report, http://on.wsj.com/fMzApi , Yahoo is considering spinning off Hadoop development as a separate company like Cloudera. This would help it ride the Hadoop wave. Yahoo has done significant work on Hadoop. Hadoop is an Apache open source project based on the MapReduce concept first introduced by Google. Hadoop is increasingly popular in big data analytics for providing scalability, especially on commodity hardware platforms. Platforms that use or support MapReduce, like Hadoop, Cassandra, CouchDB and MongoDB, are getting increasingly popular, especially in social sites and Web 2.0, where the volume of data is huge.

Yahoo has been one of the core contributors to Hadoop. It has contributed from the beginning and still commits a large team to Hadoop development. It also developed additional layers like Pig to enable data warehousing and business analytics apps to leverage Hadoop. I believe Yahoo uses Hadoop extensively in its content mining, email and so on.

Still, why does Yahoo want to spin it off? Definitely to take advantage of Hadoop’s commercial potential. Startups like Cloudera have been floated around the Hadoop ecosystem. According to analysts, there is a multi-billion dollar market based on the Hadoop ecosystem. By spinning off a separate company, Yahoo can monetize the market without getting distracted from its main business. For customers, it is better to have a company with backing from Yahoo. Such a company can focus on delivering specialized solutions around Hadoop. Having said that, I am not yet sure whether this rumored company will also focus on training and services. In any case, this move would help in further maturing the Hadoop ecosystem.