Internet of Things and Event Stream Processing
It’s a common theme that IoT is all about data, but it is the analytics applied to that data that make an IoT solution intelligent. IoT represents a massive increase in data rates from multiple sources, all of which need to be processed and analyzed. A variety of heterogeneous sensors send a continuous stream of information back and forth, which requires real-time processing and smart visualization of the data, typically through something like a business dashboard.
This shift in data flow and volume can easily mean thousands to millions of events per second. It is the biggest kind of “big data” and will produce considerably more data than we have seen from the Internet of humans. The ability to quickly process large amounts of data from multiple sources in real time is a key requirement for most IoT solutions. Data transmitted between things provides instructions on how they should act and react to certain conditions and thresholds. Analysis turns these streams of data into meaningful events, offering unique situational awareness and insight into the thing transmitting the data. This type of analysis gives engineers and data science specialists the ability to track formerly immeasurable processes.
All of this new device information enables valuable insights into what is really happening on our planet, making accurate and quick decisions possible. However, analytics and data handling are challenging. Everything is now distributed to the edge, and new ways of handling data are coming to the forefront. To meet this challenge, IoT uses emerging technologies such as stream data processing with in-stream analytics, predictive analytics, and machine learning techniques.
IoT devices generate huge amounts of data, putting pressure on the internet infrastructure. This is where cloud computing comes in useful. Cloud computing assists in storing, processing, and transferring data in the cloud instead of on connected devices. The advent of cloud computing has also led to fast file sharing for businesses that need to access documents securely and quickly from various locations around the globe.
Distributed to the Edge
IoT represents a distributed architecture. Analytics are distributed from the IoT platform, whether cloud or on-premises, out to the network edges, which makes analytics that much more complicated. A lot of the filtering and analysis is carried out on the gateways and/or on the actual things themselves. These edge devices process sensor event data locally. Some things can execute immediate local responses without even contacting the gateway or remote IoT platform. If a device has sufficient memory and processing power, it can run a lightweight version of an Event Stream Processing (ESP) platform, like those offered by Ververica. For example, the Raspberry Pi supports complex-event processing (CEP). Gateways ingest event streams from sensors and usually carry out more sophisticated stream processing than the actual thing. Some have the ability to send an immediate response via a control signal to actuators, causing a change in state.
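The local filtering and immediate local response described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Reading` type, the two thresholds, and the `shutdown:` command string are hypothetical names invented for the example, not part of any particular ESP product.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Reading:
    sensor_id: str
    temperature_c: float


# Hypothetical thresholds, chosen only for illustration.
FORWARD_DELTA_C = 0.5   # forward a reading only if it moved at least this much
ALARM_LIMIT_C = 80.0    # act locally above this temperature


def process_at_edge(readings: Iterable[Reading]) -> Tuple[List[Reading], List[str]]:
    """Return (readings to forward upstream, local actuator commands)."""
    last_forwarded = {}
    forwarded = []
    commands = []
    for r in readings:
        # Immediate local response: no round trip to the gateway or platform.
        if r.temperature_c > ALARM_LIMIT_C:
            commands.append(f"shutdown:{r.sensor_id}")
        # Filtering: drop readings that barely changed since the last one sent.
        previous = last_forwarded.get(r.sensor_id)
        if previous is None or abs(r.temperature_c - previous) >= FORWARD_DELTA_C:
            forwarded.append(r)
            last_forwarded[r.sensor_id] = r.temperature_c
    return forwarded, commands


sample = [
    Reading("pump-1", 70.0),
    Reading("pump-1", 70.1),
    Reading("pump-1", 70.2),
    Reading("pump-1", 85.0),
]
forwarded, commands = process_at_edge(sample)
```

In this sketch only two of the four readings would travel upstream, while the over-temperature reading triggers a command on the device itself. A real deployment would tune both thresholds per sensor type.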
The technical side is only one part of the puzzle; data ownership and governance are another.
Time Series Data – Data in Motion
In certain IoT solutions, such as traffic light monitoring in smart cities, the reaction must be immediate. This requires a different type of big data solution, one that processes data while it’s in motion. In some IoT solutions there is simply too much data to store, so the data streams must be analyzed on the fly while they are being transferred.
It’s not just about capturing and storing as much data as possible anymore. The ability to use the data while it is still in motion is the essence of IoT. Applying analytical models to data streams before they are forwarded enables accurate detection of patterns and anomalies while they are occurring. This type of analysis offers immediate insight into events, enabling quicker reaction times and business decisions.
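As a rough illustration of analyzing data while it is still in motion, the sketch below flags anomalies in a stream without storing the stream. It assumes a simple z-score rule on top of Welford's online mean and variance algorithm; the threshold, warm-up length, and sample readings are made-up values, and a production model would likely be more sophisticated.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance without storing the stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self) -> float:
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def detect_anomalies(stream, z_threshold=3.0, warmup=5):
    """Yield values that deviate strongly from the baseline seen so far."""
    stats = RunningStats()
    for value in stream:
        sd = stats.stddev()
        if stats.count >= warmup and sd > 0 and abs(value - stats.mean) / sd > z_threshold:
            yield value  # flagged before it is forwarded or stored
        else:
            stats.update(value)  # only normal values update the baseline


# Hypothetical temperature readings with one obvious outlier.
readings = [20.1, 19.9, 20.0, 20.2, 19.8, 20.1, 35.0, 20.0]
anomalies = list(detect_anomalies(readings))
```

Because the detector keeps only three numbers of state, it fits the case where there is too much data to store. Excluding flagged values from the baseline is a design choice made here so that one spike does not mask the next.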
Traditional analytical models are applied to stored data, offering analytics for historic events only. IoT requires the examination of patterns before data is stored, not after. The traditional store-and-process model cannot meet the real-time analysis demands of IoT data streams. In response to new data handling requirements, new analytical architectures are emerging. The volume and handling of IoT traffic require new types of platform known as Event Stream Processing (ESP) and Distributed Stream Computing Platforms (DSCP).
Event Stream Processing (ESP)
ESP is an in-memory, real-time processing technique for analyzing continuously flowing events held in streams of data. Assessing events while they are in motion is known as event stream processing. This not only reveals what is happening now, but can also be used in combination with historical data to predict future events. To predict future events, predictive models are embedded into the data streams. This type of processing represents a shift in how data is handled. Data is no longer simply stored and then processed; it is analyzed while it is still being transferred, with models applied along the way.
ESP applies sophisticated predictive analytics models to data streams and then takes action based on the resulting scores or on business rules. It is becoming popular in IoT solutions for predictive asset maintenance and real-time detection of fault conditions. You can create models that signal a future unplanned condition. These models can then be applied within ESP, allowing quick detection of upcoming failures and interruptions. ESP is also commonly used in network optimization for power grids and traffic control systems.
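The idea of embedding a predictive model in the stream and acting on its score can be sketched as follows. Everything specific here is an assumption for illustration: `failure_score` is a hypothetical stand-in for a trained model with made-up coefficients, and the window size, alert threshold, and `schedule_maintenance` action are invented names.

```python
import math
from collections import deque


def failure_score(window):
    """Hypothetical stand-in for a trained model: maps window features to 0..1."""
    mean = sum(window) / len(window)
    trend = window[-1] - window[0]
    # Coefficients are made up for illustration, not fitted to real data.
    return 1.0 / (1.0 + math.exp(-(0.8 * (mean - 5.0) + 1.5 * trend)))


def process_events(events, window_size=4, alert_threshold=0.9):
    """Score each event in memory as it arrives and emit actions."""
    window = deque(maxlen=window_size)  # sliding window held in RAM
    actions = []
    for timestamp, vibration in events:
        window.append(vibration)
        if len(window) < window_size:
            continue  # not enough history yet
        score = failure_score(list(window))
        if score > alert_threshold:  # business rule applied to the model score
            actions.append((timestamp, "schedule_maintenance"))
    return actions


# Hypothetical (timestamp, vibration) events with a rising fault signature.
events = [(0, 2.0), (1, 2.1), (2, 2.0), (3, 2.2), (4, 4.0), (5, 6.5), (6, 9.0)]
actions = process_events(events)
```

With these sample events, the score stays low while vibration is flat and crosses the threshold once the readings climb steeply, so maintenance is requested before the data ever reaches storage.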
ESP is in-memory, meaning all data is loaded into RAM. It does not use hard drives or other substitutes, which results in very fast processing and enhanced scale and analytics. In-memory processing has the capability to analyze terabytes of data in just a few seconds, with the ability to ingest from millions of sources in milliseconds. All the processing happens at the edge of the system before data is passed to storage.
How you define real time really depends on the context. Your time horizon will dictate whether you need the full power of ESP or not. ESP suits events that happen frequently and close together in time. However, if your time horizon is relatively long and events are not close together, then your requirements might be fulfilled with batch processing.
Batch vs Real Time Processing
With batch processing, files are gathered over a period of time and sent together as a batch. It is commonly used when fast response times are not critical. Batch jobs can be stored for a long period of time and then executed; for example, an end-of-day report is suited to batch processing because it does not need to be done in real time. Although batch systems can scale, their batch orientation limits real-time decision making and makes them a poor fit for IoT stream requirements. Real-time processing involves continual input, processing, and output of data. Data is processed in a relatively small time period. When your solution requires immediate action, real-time processing is the one for you. Examples of the two approaches are Hadoop for batch processing and Apache Spark for real-time computation.
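The difference between the two models can be shown with a small Python sketch. It uses a plain average over made-up readings as the workload; the function names are hypothetical and chosen only for this example.

```python
def batch_average(collected):
    """Batch: wait until all readings are gathered, then compute once."""
    return sum(collected) / len(collected)


def streaming_averages(stream):
    """Real time: update the result on every event as it arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # an up-to-date answer after each event


# Hypothetical sensor readings.
readings = [10.0, 20.0, 30.0, 40.0]
end_of_day = batch_average(readings)       # one answer, available at the end
live = list(streaming_averages(readings))  # one answer per incoming event
```

Both approaches arrive at the same final figure. The streaming version simply has a usable answer after every event, which is what an immediate action requires.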
Hadoop vs Apache Spark
Hadoop is a distributed data infrastructure that distributes data collections across nodes in a cluster. It includes a storage component called Hadoop Distributed File System (HDFS), and a processing component called MapReduce. With the new requirements for IoT, MapReduce is not the answer to everything.
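For readers unfamiliar with the MapReduce model, the sketch below imitates its three phases (map, shuffle, reduce) in a single Python process, counting readings per sensor. It does not use the Hadoop API; the function names and sample records are hypothetical, and on a real cluster each phase would run in parallel across nodes.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: emit a (key, 1) pair for each input record."""
    sensor_id, _value = record
    return (sensor_id, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Reduce: combine the values for one key into a single result."""
    return reduce(lambda a, b: a + b, values)


# Hypothetical (sensor_id, reading) records already collected in storage.
records = [("s1", 20.5), ("s2", 18.0), ("s1", 21.0), ("s1", 19.5)]
pairs = [map_phase(r) for r in records]
counts = {key: reduce_phase(vals) for key, vals in shuffle(pairs).items()}
```

Note that the whole input must be collected before the job starts, which is why this model fits batch work better than sensor streams.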
MapReduce is fine if your data operation requirements are static and you can wait for batch processing. But if your solution requires analytics on streaming sensor data, then you are better off using Apache Spark. Spark was created in response to the limitations of MapReduce. Apache Spark does not come with its own file system and may be integrated with HDFS or a cloud-based data platform such as Amazon S3 or OpenStack Swift. It operates in memory, is much faster than MapReduce, and can process streams in near real time. It also has machine learning libraries for gaining insights from the data and identifying patterns. Machine learning can be as simple as a Python script used for event detection and anomaly detection.