Big Data has been a disruptive movement in the storage world. This disruption has created a big opportunity for innovation and profit in the storage industry, giving birth to many startups.
One obvious thing about Big Data is that big data needs big storage. The performance of big data processing also depends on how the data is stored and moved.
There are three well-discussed, data-related aspects of big data:
1. Commodity hardware: data is stored and processed on commodity hardware, including commodity storage.
2. Data locality: a key architectural principle is to move the processing code to the data instead of moving the data to the computing node.
3. Replication-based fault tolerance: a typical replication factor is three, which means three times the storage space is needed.
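To make the cost of replication concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and it assumes plain full-copy replication with no compression or other overhead.

```python
def raw_storage_needed(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw capacity (TB) needed to hold `logical_tb` of data when every
    block is kept as `replication_factor` full copies."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_tb * replication_factor


def usable_capacity(raw_tb: float, replication_factor: int = 3) -> float:
    """Logical data (TB) that fits into `raw_tb` of raw disk space."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb / replication_factor
```

With the typical factor of three, 100 TB of data needs 300 TB of raw disk, and a 300 TB cluster holds only 100 TB of data.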
These three principles of Big Data created challenges that had to be met with innovation. Established vendors reacted with two approaches: in-house R&D and acquisitions.
Typical problems associated with data storage and movement are size, access speed, and the data movement pipe. The size problem is addressed with optimized compression and de-duplication. For example, to deal with the volume of data, Dell bought Ocarina, which does storage optimization through compression and de-duplication.
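To give a feel for how compression and de-duplication shrink stored data, here is a toy sketch. It is not how Ocarina or any other product works. It assumes fixed-size chunks, SHA-256 fingerprints, and zlib compression, and all function names are made up for illustration.

```python
import hashlib
import zlib


def dedup_and_compress(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks, keep one compressed copy of each
    unique chunk, and record the order of chunks needed to rebuild it."""
    store = {}   # fingerprint -> compressed chunk (stored only once)
    recipe = []  # ordered fingerprints describing the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = zlib.compress(chunk)
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original bytes from the chunk store and the recipe."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)


def stored_size(store) -> int:
    """Bytes actually kept after de-duplication and compression."""
    return sum(len(chunk) for chunk in store.values())
```

For highly repetitive input, such as the same two 4 KB blocks repeated fifty times, only two compressed chunks are kept, while `restore` still returns the original bytes.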
Storage access performance is addressed with faster media technologies such as SSD and flash. With the active interest in in-memory databases like SAP HANA, as well as in computing frameworks like Spark that depend heavily on memory, there is active interest in SSD and flash-based memory. Last month, May 2014, EMC acquired the privately funded DSSD, which makes rack-scale flash storage better suited to IO-intensive operations such as in-memory databases. EMC had invested in this startup early on. Some notable startups in the area of flash-based technologies are Nimble Storage, Amplidata, VelocityIO, Coraid, etc.
Accelerators and faster access pipes or switches deal with the data movement issue. A couple of years back NetApp bought CacheIQ, a NAS accelerator built specifically for caching. Last year Violin Memory acquired GridIron, a flash-cache-based SAN accelerator.
Taking this further, innovative startups like Nutanix and Tintri are providing software-defined (virtualized) storage. The existing players are quickly following these startups. In March this year, VMware announced VSAN (Virtual SAN), based on the principles of software-defined storage. Earlier, EMC had also acquired ScaleIO.
Hadoop’s HDFS itself has been enhanced to take advantage of this variety of storage types. The Hadoop 2.3 release (February 2014) was significant in this respect: from this release on, HDFS supports in-memory caching and a heterogeneous storage hierarchy. We will continue to see innovation in storage technologies. Choosing the right combination, then configuring and managing it, will be an important task for big data deployments.
BTW, this post does not claim to be a comprehensive survey of storage technologies or startups. The names I mentioned are only examples to make a point. I have not covered all the players, and the names I mentioned are not necessarily preferred choices.
(Also posted on LinkedIn)