Machine Learning on Big Data gets Big Momentum

Big Data without algorithms is dumb data. Algorithms such as machine learning, text processing and data mining extract knowledge from the data and make it smart data. These algorithms make the data consumable or actionable for humans and businesses. Such actionable data can drive business decisions or predict the products that customers are most likely to buy next. Amazon and Netflix are popular examples of how learnings from data can be used to influence customer decisions. Hence, machine learning algorithms are very important in the era of Big Data. BTW, in the field of Big Data, ‘machine learning’ is interpreted more broadly (than what machine learning professionals really mean by it) and includes purely statistical algorithms as well as other algorithms that are not based on ‘learning’.

Earlier today, on 16th June, Microsoft announced a preview of a machine learning service called AzureML on its Azure cloud platform. With this service, business analysts can easily apply machine learning algorithms, such as those used for predictive analytics, to their data.

Machine learning itself has been popular for the last few years. Microsoft has recognized the trend and jumped on it. Among the big players offering machine learning services on the cloud, Google pioneered its Prediction API as a cloud service a few years back.

Traditionally, data scientists use tools like Matlab, R, Python (NumPy, scikit-learn) and others for analyzing data. Programmers use open source libraries like Apache Mahout and Weka to develop application services that use machine learning algorithms. However, having machine learning algorithms is not enough; scaling them to big data is very important.

Last year Cloudera did an acqui-hire of Myrrix and open-sourced its machine learning on Hadoop as Oryx. Berkeley’s AMPLab has open-sourced its Big Data machine learning work, called MLbase, in Apache Spark, an open source big data stack that is rapidly becoming popular.

The momentum in machine learning has already fueled a good amount of venture funding in this area.

  • Skytree got $18 million in funding from U.S. Venture Partners, UPS and Scott McNealy.
  • Nutonian grabbed $4 million from Atlas Ventures for Big Data analytics.
  • Another startup raised $2.5 million from VCs led by Voyager Ventures. It makes it easy to predict customer behavior using machine learning.
  • AlpineLabs, which came out of EMC, raised a Series B last year from Sierra Ventures, Mission Ventures and others. It provides a studio and an easy-to-assemble set of standard machine learning and analytics algorithms.
  • Oregon-based BigML raised $1.2 million last year to provide an easy-to-use machine learning cloud service.
  • RevolutionAnalytics, which got $37 million (in total), makes R algorithms work on MapReduce.
  • and the list goes on

There is an interesting machine learning project called Vowpal Wabbit (VW) that started at Yahoo and continued at Microsoft. Interestingly, however, instead of VW, Microsoft is making the R language and its algorithms available on the Azure cloud.

Anyway, the trend of making machine learning services easy to run on Big Data and on the cloud will continue. But having the tools and algorithms available will not be enough to solve the problem. We need qualified people who understand which algorithms to use for which cases and how to use (parameterize) them. Moreover, what we really need is applications that use such algorithms to solve business problems without users even needing to understand the algorithms. In my opinion, what we will see in the future is vertical applications and services that abstract (use but hide) machine learning or prediction algorithms to serve domain-specific business needs.


Nebula, RightScale and Scalr from their pitches at CloudCamp

Yesterday, at CloudCamp, I had an opportunity to hear pitches from Nebula, RightScale and Scalr.

Scalr’s young CEO Sebastian positioned Scalr for auto scaling, recovery and server management for apps. In a lively 5-minute pitch, he covered how Scalr concentrates on app management and handles seamless recovery and auto scaling.

Sean Chuo from RightScale focused on RightScale’s positioning as a layer between apps and lower-level infrastructure such as servers and storage. With its templates, RightScale is a great tool for implementing your own cloud.

The most interesting presentation was from Nebula’s Chris Kemp. He covered the history of the Nebula project at NASA and how it helped to tame a huge $7B investment with cloud infrastructure on which scientists can get computing quickly and on demand. Since he is out of NASA now, he told how some senators resisted this optimization for fear of losing jobs. He also described how the project got support from Vivek Kundra, and how President Obama’s transparent government project ran on the cloud, with Mr. President himself as a consumer! He had a picture showing President Obama using the site. BTW, the Nebula project is one of the key contributions to OpenStack.

Cloud Based Services Outages Becoming An Issue

Last week was a bad week for Microsoft and Google cloud apps. Microsoft’s online services infrastructure experienced an outage affecting some customers in North America. It caused interruptions in Office 365 and various Windows Live services for a few hours. Coincidentally, on the same day, Google’s cloud productivity service, Google Docs, went offline for some time.

Google on the outage: The [Google Docs] outage was caused by a change designed to improve real time collaboration within the document list. Unfortunately this change exposed a memory management bug which was only evident under heavy usage … We have assembled a list of steps which will both reduce the chance of a future event, decrease the time required to notice and resolve a problem, and limit the scope which any single problem can affect.

Microsoft on the outage: Microsoft became aware of a Domain Name Service (DNS) problem causing service degradation for multiple cloud-based services. A tool that helps balance network traffic was being updated, and for a currently unknown reason, the update did not work correctly. As a result, the configuration was corrupted, which caused service disruption. We are continuing to review the incident.

Amazon: A few months back, Amazon faced similar outages. In the second week of August, an Amazon EC2 and RDS outage impacted Netflix. In April, Amazon was hit with serious outages for which it had to apologize.

Solution? So far, the outages have been few and quickly addressed. They are being discussed, but so far they are not adversely affecting adoption of the cloud. However, they do raise the need for a solution to mitigate such risk. Standardization of cloud platforms and having a standby on a private cloud or with a secondary provider could be a possible solution.
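To make the standby idea a bit more concrete, here is a minimal sketch in Python of picking the first healthy provider from an ordered list. The endpoint names and the health check are purely hypothetical placeholders for illustration; a real setup would probe the actual services and would probably rely on DNS or a load balancer rather than application code.

```python
from typing import Callable, Sequence


def first_healthy(endpoints: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first endpoint that passes the health check.

    `endpoints` is ordered by preference: primary provider first,
    then the standby (private cloud or secondary provider).
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


if __name__ == "__main__":
    # Hypothetical names; the lambda simulates an outage at the primary provider.
    providers = ["primary-public-cloud", "standby-private-cloud"]
    print(first_healthy(providers, lambda name: name != "primary-public-cloud"))
```

The point of the sketch is only that the application asks "which provider is up?" instead of hard-wiring one provider, which is also why standardized cloud platforms matter: the standby has to be able to run the same app.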

Computer Manufacturers Rushing To Provide Clouds

After years of delay since Amazon pioneered the cloud market, there is a stampede now. One after another, hardware manufacturers are setting up software, infrastructure and services to tap into the cloud market. HP launched a private beta program for its public / private cloud. According to the HP Cloud Blog, HP Cloud Compute allows you to deploy secure, reliable compute instances on demand to dynamically adjust computing capacity. The HP Cloud platform also comes with Cloud Object Storage, ideal for archival and backup, serving static content for web applications, and storage of large public or private data sets, such as online files and media. Underneath, HP Cloud runs OpenStack, on top of which HP provides a web-based User Interface (UI) along with open, RESTful APIs.

Last week at VMworld, Dell announced TheDellCloud, a public / private / hybrid cloud platform based on VMware vCloud Datacenter Services. This is a result of Dell’s $1 billion investment in cloud computing earlier this year. In the last few years, Dell has been making such strategic investments to diversify from its traditional PC business. It is in line with what others are doing: IBM divested its PC business by selling it to Lenovo.

The story of PC vendors rushing into the cloud is not limited to the US. Recently, in July 2011, Acer, the leader in netbooks, paid $320 million for the US firm iGware, which provides cloud platform software and infrastructure tools to customers like Japan’s Nintendo. With this acquisition, Acer is targeting the cloud market in China and Taiwan. Within a month of this announcement, a couple of weeks back on Aug 28, 2011, news of Acer unveiling its first cloud-based solution, ‘Acer Cloud Enabler’, in China flashed in the Chinese media. Intel is advising Acer on the cloud. Intel is also investing heavily in the Chinese cloud market through investments in Chinese companies.

Intel has been focusing on the cloud for a while. Back in May 2011, Intel announced AppUp, a subscription-based “hybrid” cloud computing model designed especially for small businesses. AppUp addresses concerns about data security by allowing SMBs to keep their data onsite without having to ante up for server hardware or software. AppUp includes a server reference architecture called the Intel Hybrid Cloud Platform, a catalog of applications that small businesses can subscribe to, and a software platform that provides management and tracking of the applications’ usage.

In summary, Intel, HP, Dell, Acer and other PC hardware makers are all now making moves into the cloud.



Hadoop Based Products Recently Launched

At EMC World this week, there were a few announcements of Hadoop-based products. From the number of announcements, it is apparent that Hadoop’s popularity is growing among enterprises for processing very large volumes of data, typically unstructured data like web logs, social media chatter, emails and similar text analytics. Hadoop can scale to a very large number of nodes, which are typically commodity hardware. Hadoop is based on the MapReduce architecture, which splits a job into map tasks spread across the nodes and then combines their results in the reduce phase. Though Hadoop nodes are separate machines, pre- and post-processing of input and output data in a traditional SQL form is still needed. That’s where many solutions, as well as Hadoop ecosystem projects like Pig, Flume and HBase, come into the picture.
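For readers new to the model, here is a minimal single-process sketch of the map, shuffle and reduce steps in Python, using the classic word count over log lines. It does not use Hadoop’s API at all; the function names and sample data are my own illustration, and in a real cluster each input split would be mapped on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(splits):
    """Run map over every line of every split, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(line) for split in splits for line in split
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two input splits; on a cluster each would be handled by a separate node.
    splits = [["error disk full", "login ok"], ["error timeout", "login ok"]]
    print(word_count(splits))
```

Because each map call looks at only one line and each reduce call at only one key, both phases can be spread across many machines, which is where the scalability comes from.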

EMC itself announced Greenplum HD as a distribution and an appliance. The EMC Hadoop distribution will be available around the 3rd quarter in both community and commercial editions. The Greenplum HD appliance will combine the Greenplum database and the Enterprise Edition distribution of Hadoop on a single appliance. EMC announced this direction a few months back, and very recently it partnered with Cloudera. Of course, with this announcement, the partnership with Cloudera would come under a cloud.

SnapLogic also announced Hadoop integration via SnapReduce, which runs SnapLogic’s data integration pipelines as MapReduce tasks. This is a good way to offer Hadoop’s scalability to SnapLogic’s customers. Also, nice name, Gaurav: SnapReduce! SnapLogic is an open source solution for ETL; of course, there is also a commercial offering of training, support and consulting from SnapLogic itself. The following video gives a good introduction to SnapReduce.

Since there is good synergy for Hadoop on the cloud, Mellanox, a data center connectivity company, announced accelerators for Hadoop and Memcached. It announced a Hadoop product called Hadoop-Direct on Mellanox’s InfiniBand adapters and switches. For more information

There was another product release, Brisk, from a startup, DataStax. This is an interestingly controversial product. It combines Hadoop with a competing open source NoSQL product called Cassandra. Traditional pure-play Hadoopers like Cloudera criticized the integration. However, it will be interesting to watch its adoption. The following diagram shows the integration.

BTW, I am eagerly waiting to hear more from Yahoo about its Hadoop spinoff. Maybe it will be announced during next month’s Hadoop Summit, which is organized mainly by Yahoo!

GITPRO’s tech talk on Cloud Computing

Yesterday was a beautiful day to spend a weekend with family and play outside, and there were many programs in the Indian community in the Bay Area, including a concert by Sanjeev Abhyankar arranged by Swarasudha. However, many Indian techies sacrificed all that to attend GITPRO’s session on Cloud Computing. The attendance would have been huge had there not been another free event on the same topic at Microsoft, with free food, free fun and many speakers. Still, the GITPRO tech talk was the more successful one thanks to two prominent speakers: Anant Jhingran, CTO-VP of Managing IS and Co-chair of IBM Cloud Computing Initiatives, and Sheng Liang, CEO of

Anant walked the enthusiastic audience through the concepts and value propositions of Cloud Computing, covering the various types of cloud, including IaaS, PaaS and SaaS as well as public, private and hybrid clouds. He focused on the impact of these types of cloud on large enterprises, where the number of apps varies from 2,000 to 10,000 and where CIOs and IT departments need to carefully analyze which type suits which app. He also covered the stack with great clarity. While covering basic initiatives in the cloud, he emphasized standardization at various levels. At the end, he gave many suggestions to help professionals and businesses leverage the cloud appropriately and navigate through it. Anant regularly blogs his thoughts on this and other topics.

Sheng focused more on open source in Cloud Computing, along with the market-leading proprietary players. He started from networking open source projects such as HAProxy for load balancing, OpenFlow for core switching and Vyatta for firewalling, then moved on to Xen for the hypervisor; CloudStack, OpenStack Nova and Eucalyptus for IaaS; and RightScale, enStratus and Tivoli for cloud management (commercial rather than open source, since there are fewer open source options for cloud management). He then outlined Microsoft Azure and GAE for PaaS. BTW, Sheng covered cloud open source and Asia in an interview.


Maybe we will be able to host the presentation on the GITPRO site later.