Stream IoT Data into Applications Easily

February 21, 2018

IoT applications need to continuously ingest data, process that data, make decisions, and then act. The decision-making pattern typically starts with an ingestion phase of streaming raw data from the edge to a storage medium like the Hadoop Distributed File System (HDFS). Then, data engineers and data scientists iteratively wrangle the data to get it into a form that can be used by learning, planning, and operational systems downstream.

  • Learning systems continuously analyze and build prediction models based on historical data. For example, 1) how likely is a field component to fail? 2) how late will that order be? or 3) how likely is it that the patient in room 1011 is going to have a heart attack? Learning systems typically allow you to transform data into feature sets for statistical and numerical processing.
  • Planning systems explore what the future might hold given certain predictions. For example, 1) do we have enough inventory to replace the likely malfunctioning components next Tuesday? or 2) should we re-allocate inventory across the supply chain given the likelihood of a late shipment?
  • Operational systems, such as customer service, inventory management, and health care advisory systems, support multiple concurrent users performing everyday tasks.

Until now, IT staff and developers had to duct-tape these systems together themselves: engines like Apache Spark or Apache Hive for analysis; MLlib, R, or TensorFlow for machine learning; and HBase, Redis, or Cassandra for the planning and operational systems. This is complicated and requires low-level distributed systems work, taking developers away from more strategic application development. Most importantly, moving data between systems introduces latency into the overall system.

Splice Machine provides a better platform for powering IoT applications because it is a single, integrated SQL RDBMS platform for all these processes.

Step 1: Streaming Ingestion

In future blogs, we will show how all of these steps can be done with Splice Machine. Today, we will start with the first step, and demonstrate how easy it is to ingest streams of IoT data into Splice Machine tables. Once ingested, that data can be used for any purpose – including learning, planning and operations – without the cost, complexity, and latency of extracting and loading data into different systems for each phase.

Streaming Weather Data

Let’s review a concrete example involving weather data, and look at the code that makes it work. Suppose we have a domain that needs to consider the weather forecast to make its decisions. Perhaps its supply-chain planning system is trying to predict late orders or a predictive maintenance system is planning future service calls.

For this example, we stream weather data from a public weather data source. We assume there exists an external process extracting forecast weather data and publishing it in a JSON format to a Kafka queue:

{"CITY_ID":5506956,"AS_OF_TS":"2018-02-17 16:02:10","FORECAST_TS":"2018-02-21 00:00:00","WEATHER_CODE_ID":800}
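To make the payload concrete, here is a minimal Python sketch of parsing one such message into typed values. The field names come from the sample message above; the `parse_forecast` helper itself is hypothetical, shown only to illustrate the structure of the data.

```python
import json
from datetime import datetime

def parse_forecast(message: str) -> dict:
    """Parse one JSON weather message from the queue into typed Python values."""
    raw = json.loads(message)
    return {
        "city_id": int(raw["CITY_ID"]),
        "as_of_ts": datetime.strptime(raw["AS_OF_TS"], "%Y-%m-%d %H:%M:%S"),
        "forecast_ts": datetime.strptime(raw["FORECAST_TS"], "%Y-%m-%d %H:%M:%S"),
        "weather_code_id": int(raw["WEATHER_CODE_ID"]),
    }

msg = '{"CITY_ID":5506956,"AS_OF_TS":"2018-02-17 16:02:10","FORECAST_TS":"2018-02-21 00:00:00","WEATHER_CODE_ID":800}'
record = parse_forecast(msg)
print(record["city_id"], record["weather_code_id"])  # 5506956 800
```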

Splice Machine will subscribe to that weather data and stream it into our tables.

Below, the incoming stream is defined in an integrated Apache Zeppelin notebook:

The code above sets up a Kafka stream that designates a particular external Kafka broker, and subscribes to the “WeatherForecast” topic.

The code below reads from the queue and inserts the data into Splice Machine:

This code reads data from the stream and uses Splice Machine’s Spark Adapter (via the spliceMachineContext variable) to insert data into a Splice table. It parses the JSON weather data off the queue and transforms it into DataFrames; the Spark Adapter then inserts each DataFrame into the “WEATHER.FORECAST” table.
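The notebook code itself runs on Spark, but the parse-and-insert logic it performs can be sketched with the standard library alone. In the sketch below, `forecast_table` is a hypothetical in-memory stand-in for the WEATHER.FORECAST table (the real code inserts Spark DataFrames through the Spark Adapter), and the second city ID is an invented example value.

```python
import json

# Hypothetical in-memory stand-in for the WEATHER.FORECAST table;
# the real notebook inserts DataFrames via the Splice Spark Adapter.
forecast_table = []

def insert_batch(messages):
    """Parse a micro-batch of JSON messages and append the resulting rows."""
    rows = [
        (m["CITY_ID"], m["AS_OF_TS"], m["FORECAST_TS"], m["WEATHER_CODE_ID"])
        for m in (json.loads(s) for s in messages)
    ]
    forecast_table.extend(rows)

batch = [
    '{"CITY_ID":5506956,"AS_OF_TS":"2018-02-17 16:02:10","FORECAST_TS":"2018-02-21 00:00:00","WEATHER_CODE_ID":800}',
    '{"CITY_ID":4684888,"AS_OF_TS":"2018-02-17 16:02:10","FORECAST_TS":"2018-02-21 00:00:00","WEATHER_CODE_ID":502}',
]
insert_batch(batch)
print(len(forecast_table))  # 2
```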

The Spark Adapter here is key: it is a concise, transactionally consistent API between Spark DataFrames and the Splice Machine RDBMS, obeying all ACID properties. It supports other database operations as well, including upsert and delete. The API enables high-performance inserts into Splice Machine's durable store without serializing data over JDBC, and it can also return result sets as DataFrames for efficient transformation and learning pipelines.

Now that we have this weather data streamed into our forecast table, we can demonstrate some ways to leverage it. Suppose that we know that there is a direct correlation between orders that arrive late and severe weather in their destination cities. With the streamed data above, we are able to look ahead in the forecast for severe weather to predict shipment delays.

Below, we query the forecast data and see that there is a forecast for heavy rain in Dallas and Chicago:

Knowing that we have key shipments scheduled for Dallas, we can look at the details of the Dallas forecast over the next few days:

We can cross-reference the time window of the severe weather to our Dallas shipment dates, and project potential late deliveries. With that information, we can perform what-if scenarios on our supply chain, given the likelihood that those deliveries will be late.
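The cross-referencing step above can be sketched in a few lines of Python. Everything in this sketch is illustrative: the rows, order IDs, city ID, the 12-hour risk window, and the set of "severe" codes (which assumes OpenWeatherMap-style condition codes, where the 5xx group denotes rain) are all assumptions, not values from the original notebook.

```python
from datetime import datetime, timedelta

DALLAS = 4684888  # hypothetical city ID for Dallas

# Hypothetical forecast rows: (city_id, forecast_ts, weather_code_id)
forecast = [
    (DALLAS, datetime(2018, 2, 21, 0, 0), 502),  # heavy rain
    (DALLAS, datetime(2018, 2, 22, 0, 0), 800),  # clear
]

# Hypothetical shipments: (order_id, city_id, scheduled_delivery)
shipments = [
    ("ORD-1001", DALLAS, datetime(2018, 2, 21, 9, 0)),
    ("ORD-1002", DALLAS, datetime(2018, 2, 22, 9, 0)),
]

# Heavy-rain codes we treat as severe (OpenWeatherMap-style convention).
SEVERE = {502, 503, 504, 522}

def at_risk(shipments, forecast, window=timedelta(hours=12)):
    """Flag orders whose delivery falls within `window` of a severe forecast."""
    return [
        order for order, city, due in shipments
        if any(c == city and code in SEVERE and abs(due - ts) <= window
               for c, ts, code in forecast)
    ]

print(at_risk(shipments, forecast))  # ['ORD-1001']
```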

To summarize, we have shown how straightforward and performant it is to ingest streaming Kafka data into Splice Machine using the Splice Spark Adapter. Look for future blogs that drill into the possibilities this capability enables.

Try Streaming Yourself in Minutes

[btn url="" target="_blank"]Get Free Trial[/btn]