Incompatibility of Data Lake Big Data Architecture with the Internet of Things (IoT)

Social media, mobile, and application logs have been the three main contributors to Big Data. To handle such data, many stakeholders and experts promote a Data Lake architecture, in which collected data is brought to a central platform where it is stored, processed, and made available for interactive analysis as well as for applications.


Recently, the Internet of Things (IoT) joined the party. It is clear that IoT will take the data explosion to the next level. According to one report, there will be 26 billion IoT devices installed by 2020. Wearable devices, sensors, home appliances, and more will produce huge volumes of machine data.


The synergy between IoT and Big Data is obvious. However, there is an incompatibility too, especially when it comes to applying the Data Lake architecture to IoT. IoT data would be structured data streamed in continuously, but not all data tuples would be interesting enough for long-term storage. Machine-generated data would not only be huge; its utility when stored centrally would be limited.


Two types of processing would commonly be done on IoT data:

1. Actionable processing: often local and immediate.

2. Batch processing for patterns, aggregates, and learning: both local and central.


Local processing is needed to identify anomalies or changes in the state of the observables: for example, a machine failure, or, in the case of Google's Nest, the occupants leaving home, indicating that temperature control should be stopped and lights switched off. Another type of local processing is learning patterns from usage. Either way, there is not much value in bringing the entire data stream centrally for processing.
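As a minimal sketch of this kind of actionable edge processing, consider a rule that reacts immediately to a state change or an out-of-range reading and ignores everything else. The field names and threshold here are illustrative assumptions, not part of any real device API.

```python
def detect_anomaly(reading, previous_state, temp_limit=85.0):
    """Return an action string if the reading indicates a fault or a
    state change of the observable; return None for uninteresting tuples."""
    if reading["temperature"] > temp_limit:
        return "alert: possible machine failure"
    if reading["occupied"] != previous_state.get("occupied"):
        # e.g. occupants left home: stop temperature control, switch off lights
        return "adjust: occupancy changed"
    return None  # no need to store or forward this tuple


state = {"occupied": True}
print(detect_anomaly({"temperature": 90.0, "occupied": True}, state))
# prints "alert: possible machine failure"
```

The point is that the decision is made locally and instantly; only the resulting action (or a summary of it) needs to travel beyond the edge.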


Hence, an IoT data processing architecture would need a combination of local 'smart' processors and a central Big Data processor. Data from local sensors can be processed by the local processors, which then send aggregated data, insights, and, in some cases, sample data to the central Big Data platform. The processors at the edge can collect data from local sensors, process it, learn from it, and, if needed, act on the insights.
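The edge-side reduction described above can be sketched as follows: a window of raw readings is collapsed into an aggregate plus a small random sample, and only that summary is forwarded centrally. The summary fields chosen here are an assumption for illustration.

```python
import random
import statistics


def summarize_window(readings, sample_size=3):
    """Reduce a window of raw sensor readings to an aggregate plus a small
    random sample; this summary is the only payload sent to the central
    Big Data platform, not the raw stream itself."""
    return {
        "count": len(readings),
        "mean": statistics.mean(readings),
        "max": max(readings),
        "sample": random.sample(readings, min(sample_size, len(readings))),
    }


window = [21.3, 21.5, 22.0, 35.9, 21.4]  # raw local readings
summary = summarize_window(window)       # compact payload for the center
```

Whatever the exact fields, the design choice is the same: raw tuples stay at the edge, and the central platform sees orders of magnitude less data.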


In this architecture, the central Big Data platform would receive processed data and some sample data from all local processors at regular intervals. It would then perform further aggregation, pattern recognition, and machine learning across the data from all sources. Such processing can not only identify and notify local anomalies but also detect and learn from patterns across the entire network. The central processing would be able to identify clusters of behavioral patterns and trends. Some of the insights identified centrally would also be useful for edge processing, particularly for comparing local behavior with global behavior.
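The local-versus-global comparison mentioned above can be sketched with a simple rule: the central platform computes a global statistic across all edge summaries and flags edges whose local behavior deviates beyond a tolerance. The summary shape and tolerance are assumptions made for the sketch.

```python
import statistics


def flag_outlier_edges(edge_summaries, tolerance=5.0):
    """Compare each edge's local mean against the global mean across all
    edges and flag edges whose behavior deviates beyond the tolerance."""
    global_mean = statistics.mean(s["mean"] for s in edge_summaries.values())
    return {
        edge: s["mean"]
        for edge, s in edge_summaries.items()
        if abs(s["mean"] - global_mean) > tolerance
    }


summaries = {
    "edge-a": {"mean": 21.4},
    "edge-b": {"mean": 21.9},
    "edge-c": {"mean": 30.5},  # behaves differently from the rest
}
print(flag_outlier_edges(summaries))
# prints {'edge-c': 30.5}
```

A real platform would use richer models than a mean-and-tolerance rule, but the flow is the same: edge summaries in, network-wide patterns out, and the global baseline can be pushed back down to the edges.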

Hence, IoT architecture would involve 'intelligent' data processing both on local networks, where the data will be small, and centrally, where the data will be huge, not because the entire data set is brought in but because data arrives from a large number of small networks. A key thing to remember is that local processing may work on small data, but it would still be intelligent processing. Google's Nest is a classic example of such processing.