Big Data and Evolution in Storage

Big Data has been a disruptive movement that has shaken up the storage world. This disruption has created a big opportunity for innovation and profit in the storage industry, giving birth to many startups.

One of the most obvious things about Big Data is that big data needs big storage. The performance of big data processing also depends on data storage and data movement.

There are three well-discussed, key data-related aspects of big data:

1. Commodity hardware: Data is stored on commodity hardware, including commodity storage.

2. Data locality: One of the key architectural aspects is to move the processing code to the data instead of moving the data to the computing node.

3. Replication-based fault tolerance: A typical replication factor is three, which results in a need for three times the storage space.
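To make the arithmetic behind the third point concrete, here is a minimal sketch. The function names are my own and purely illustrative, and the calculation ignores real-world overheads such as file system metadata and temporary job output.

```python
def raw_capacity_needed(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw storage (in TB) needed to hold logical_tb of data when every
    block is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_tb * replication_factor


def usable_capacity(raw_tb: float, replication_factor: int = 3) -> float:
    """Logical data (in TB) that fits into raw_tb of raw storage."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb / replication_factor


if __name__ == "__main__":
    # 100 TB of data with the typical replication factor of three
    print(raw_capacity_needed(100))   # raw TB to provision
    print(usable_capacity(300))       # logical TB that 300 raw TB can hold
```

With the typical factor of three, 100 TB of data needs 300 TB of raw disk, which is exactly why storage optimization matters so much for big data deployments.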

These three principles of Big Data have created challenges that need to be met with innovation. Established vendors have reacted with two approaches: in-house R&D as well as acquisitions.

Typical problems associated with data storage and movement are size, access speed, the data movement pipe, etc. The size problem is dealt with through optimized compression and de-duplication. For example, to deal with the volume of data, Dell bought Ocarina, which does storage optimization with compression and de-duplication.
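To give a feel for how compression and de-duplication reduce the size problem, here is a toy sketch. It is not how Ocarina or any other product works; the class name, the fixed chunk size and the in-memory dictionary are all my own assumptions for illustration. Each chunk is identified by its SHA-256 hash, stored only once, and compressed with zlib.

```python
import hashlib
import zlib
from typing import Dict, List


class DedupStore:
    """Toy chunk store: fixed-size chunks, stored once per unique content."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}  # digest -> compressed chunk

    def put(self, data: bytes) -> List[str]:
        """Store data and return the list of chunk digests (the 'recipe')."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:       # de-duplication
                self.chunks[digest] = zlib.compress(chunk)  # compression
            recipe.append(digest)
        return recipe

    def get(self, recipe: List[str]) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in recipe)

    def stored_bytes(self) -> int:
        """Bytes actually held after de-duplication and compression."""
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore(chunk_size=4096)
    data = b"A" * 4096 * 10 + b"B" * 4096 * 10  # highly repetitive input
    recipe = store.put(data)
    assert store.get(recipe) == data
    print(len(data), "logical bytes ->", store.stored_bytes(), "stored bytes")
```

Real products use smarter techniques (variable-size chunking, content-aware compression), but the principle of storing each unique piece of data only once is the same.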

Storage access performance is dealt with through faster media technologies such as SSD and flash. With the active interest in in-memory databases like SAP HANA, as well as in computing frameworks like Spark that depend heavily on memory, there is active interest in SSD and flash-based memory. Last month, in May 2014, EMC acquired the privately funded DSSD, which makes rack-scale flash storage that is better suited for IO-intensive operations like in-memory databases. EMC had invested in this startup early on. Some notable startups in the area of flash-based technologies are Nimble Storage, Amplidata, VelocityIO, Coraid, etc.

Many accelerators or faster access pipes/switches deal with the data movement issue. A couple of years back, NetApp bought CacheIQ, a NAS accelerator built specifically for caching. Last year, Violin Memory acquired GridIron, a flash-cache-based SAN accelerator.

To take this further, innovative startups like Nutanix, Tintri, etc. are providing software-defined (virtualized) storage. The existing players have quickly followed these startups. In March this year, VMware announced VSAN (Virtual SAN), based on the principles of Software Defined Storage. Earlier, EMC had also acquired ScaleIO.

Hadoop’s HDFS itself has been enhanced to take advantage of this variety of storage types. The Hadoop 2.3 release (February 2014) has been significant in this respect. From this release, HDFS supports in-memory caching and a heterogeneous storage hierarchy. We will continue to see innovation in storage technologies. Choosing the right combination, then configuring and managing it, will be an important task for big data deployments.


BTW, this post does not claim to be a comprehensive survey of storage technologies or startups. The names I mentioned are just examples to make a point. I have not covered all the players, and the names I mentioned are not necessarily preferred choices.

(Also posted on LinkedIn)

Machine Learning on Big Data gets Big Momentum

Big Data without algorithms is dumb data. Algorithms like machine learning, text processing, and data mining extract knowledge out of the data and make it smart data. These algorithms make the data consumable or actionable for humans and businesses. Such actionable data can drive business decisions or predict the products that customers are most likely to buy next. Amazon and Netflix are popular examples of how learnings from data can be used to influence customer decisions. Hence, machine learning algorithms are very important in the era of Big Data. BTW, in the field of Big Data, ‘machine learning’ is considered more broadly (than what machine learning professionals really mean by it) and includes pure statistical algorithms as well as other algorithms that are not based on ‘learning’.

Earlier today, on 16th June, Microsoft announced a preview of a machine learning service called AzureML on its Azure cloud platform. With this service, business analysts can easily apply machine learning algorithms, such as those related to predictive analytics, to their data.

Machine learning itself has been popular for the last few years. Microsoft has recognized the trend and jumped on it. When it comes to big players offering machine learning services in the cloud, Google pioneered this a few years back with its Prediction API.

Traditionally, data scientists use tools like Matlab, R, Python (NumPy, scikit-learn) and others for analyzing data. Programmers use open source libraries like Apache Mahout and Weka for developing application services using machine learning algorithms. However, having machine learning algorithms is not enough; scaling them to big data is very important.

Last year, Cloudera acqui-hired Myrrix and open sourced its machine learning on Hadoop as Oryx. Berkeley’s AMPLab has open sourced its big data machine learning work, called MLbase, in Apache Spark, an open source big data stack that is rapidly becoming popular.

The momentum in machine learning has already fueled a good amount of venture funding in this area.

  • Skytree got $18 million in funding from U.S. Venture Partners, UPS and Scott McNealy.
  • Nutonian grabbed $4 million from Atlas Ventures for Big Data analytics.
  • Another startup, wise.io, raised $2.5 million from VCs led by Voyager Ventures. Wise.io makes it easy to predict customer behavior using machine learning.
  • AlpineLabs, which came out of EMC, raised its series B last year from Sierra Ventures, Mission Ventures and others. It provides a studio and an easy-to-assemble set of standard machine learning and analytics algorithms.
  • Oregon-based BigML raised $1.2 million last year to provide an easy-to-use machine learning cloud service.
  • RevolutionAnalytics, which got $37 million (in total), makes R algorithms work on MapReduce.
  • and the list goes on

There is an interesting machine learning project called Vowpal Wabbit (VW) that started at Yahoo and continued at Microsoft. Interestingly, however, instead of VW, Microsoft is making the R language and its algorithms available on the Azure cloud.

Anyway, the trend of making machine learning services easy to run on Big Data and in the cloud will continue. But having the tools and algorithms available will not be enough to solve the problem. We need qualified people who understand which algorithms to use for which cases and how to use (parameterize) them. Moreover, what we really need are applications that use such algorithms to solve business problems without users even needing to understand the algorithms. In my opinion, what we will see in the future are vertical applications and services that abstract (use but hide) machine learning or prediction algorithms to serve domain-specific business needs.

GITPRO World 2012 : Best Conference for Technology Professionals and Entrepreneurs

22 Jan 2012, Cupertino, CA, USA

http://www.gitpro.org

Global Indian Technology Professionals Association (GITPRO) is hosting a conference on “Emerging Technologies and Opportunities for Professionals and Entrepreneurs” on 18th Feb 2012 in Palo Alto, CA. With three parallel tracks focused on Technology, Career & Leadership, and Startups, this conference is well suited for everybody from technology professionals to entrepreneurs.

Steve Blank, the iconic serial entrepreneur and entrepreneurship coach at Stanford University, and Anand Deshpande, the CEO of Persistent Systems, will deliver keynotes at the conference.

The Technology track is full of experts on Big Data, Hadoop, Cloud, Mobile and Social Computing. They come from Greenplum, Cloudera, Hortonworks, Microsoft, IBM, ThisMoment, AdMaxim, and GloMantra.

The Startup Bootcamp at GITPRO World 2012 will cover everything that an entrepreneur should know, from launching a startup to a successful exit. Successful startup entrepreneurs, VCs, and sales & marketing executives will guide aspiring entrepreneurs.

GITPRO World 2012 has sessions specially focused on career and leadership topics, covering aspects like managing with influence, evolving from an individual role to manager and leader, mid-career accelerators and mid-career switches, and job opportunities in 2012.

This event will provide a unique opportunity for Indian technology professionals to network with fellow professionals, technical experts, industry leaders, entrepreneurs, career coaches, and VCs.

GITPRO is a global networking platform for Indian technology professionals, supporting their professional and personal development and their contribution back to the profession, society, and the people of the US and India. GITPRO, started in 2009, has chapters in Silicon Valley, Contra Costa Valley, Seattle, DC, and Denver in the US, and in Bangalore, Hyderabad, and Pune in India.

Team Matters for the Success of Startup

The TechCrunch blog post “How To Found, Grow and Sell” captures startup wisdom from GamesThatGive’s Adam Archer. Though the post stresses the importance of a co-founder, advisers, investors and a supportive significant other, it can be summarized in one point: it is the team that matters. And in a startup, the team is not limited to founders and employees but also includes:

  • advisors
  • investors
  • spouses
  • lawyers
  • designers
  • financial advisors
  • and the wider network, etc.
