
Looking at Real Time Streaming in Depth

In a previous article, we talked about the growing importance of real-time streaming with AI.  But what exactly is real-time streaming, and how does it work?  These are common questions from CMOs when discussing ways to increase marketing revenue through AI.  As AI transforms every facet of our society, the speed at which it is delivered will become more important.  This is especially true in high-volume situations where data is processed in real time to make decisions.  In these situations every millisecond counts, and a single second of latency can potentially cost a company millions of dollars annually.

Use Case

Use cases like this are common in business.  Imagine a marketplace where thousands of online applicants are matched with financial lenders in real time every day.  The quality of each applicant varies from poor to excellent, and there are hundreds of lenders, each with different risk thresholds and payouts.  This is an optimization problem that may seem daunting, but it is very doable with machine learning and real-time streaming.

First, you need to understand the quality of each applicant.  By modeling customer data points such as monthly income, employment status, owned assets, and previous funding, you can predict the quality of each applicant.  Next, you need to understand the risk threshold of each lender.  By developing a machine learning model for each lender, you can accurately measure that threshold.  With these two approaches, what seemed like an abstract problem suddenly becomes tangible.  Now comes the development.
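
To make this concrete, here is a minimal sketch of how an applicant-quality model might be trained with Spark MLlib.  The table and column names (applications, monthly_income, employment_status, owned_assets, previous_funding, funded) are hypothetical placeholders, not the actual schema.

```python
# A minimal sketch of the applicant-quality model with PySpark MLlib.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("applicant-quality").getOrCreate()

# Historical applications labeled 1 (funded) or 0 (not funded)
df = spark.table("applications")

# Encode the categorical employment column and assemble a feature vector
indexer = StringIndexer(inputCol="employment_status", outputCol="employment_idx")
assembler = VectorAssembler(
    inputCols=["monthly_income", "employment_idx", "owned_assets", "previous_funding"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="funded")

model = Pipeline(stages=[indexer, assembler, lr]).fit(df)

# The fitted pipeline scores new applicants with a probability of funding
scored = model.transform(df)
scored.select("probability").show(5)
```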

Real Time Streaming Architecture

To start, let’s discuss one of the core technologies behind real-time streaming: Spark.  Spark is an open-source big data framework developed at UC Berkeley.  Much can be said about the technical infrastructure, but in essence the architecture consists of five areas: data ingestion, storage, machine learning, messaging, and processing.

[Figure: real-time streaming architecture, showing Kafka, Spark, batch and micro-batch processing, and the data warehouse]

Data Ingestion

In any big data environment, data is the currency that makes it valuable, and data ingestion is the first step in setting up a real-time environment.  To achieve this, we used both Spark and Flume to create data pipelines from internal data sources.  Flume is often used for this type of service because it is an easy and fast way to aggregate and transform raw data.
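
As an illustration, here is a minimal sketch of a Flume-to-Spark pipeline using the Spark Streaming Flume integration available at the time of writing.  The host, port, batch interval, and HDFS path are assumptions for illustration.

```python
# A minimal sketch of Flume pushing events into Spark Streaming
# (Spark 2.x-era Flume integration). Host, port, and paths are assumptions.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils

sc = SparkContext(appName="ingestion")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Flume pushes events to this host:port; each event is a (headers, body) pair
stream = FlumeUtils.createStream(ssc, "localhost", 4545)

# Keep just the raw body of each event and land it in HDFS
stream.map(lambda event: event[1]).saveAsTextFiles("hdfs:///raw/applications")

ssc.start()
ssc.awaitTermination()
```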

Storage

Once you have the data, where to store it is the next requirement.  Traditionally, businesses have used relational SQL data warehouses to store their data.  However, as companies have gained access to more and more data, SQL data warehouses have run into speed limitations.  As a result, companies have turned to Hadoop and HDFS-based data warehouses for better performance and redundancy.  Specifically, Hadoop distributes storage and processing across multiple computers, giving companies the ability to scale from two to thousands of machines in a cluster and achieve virtually unlimited throughput.
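
As a simple sketch, ingested events might be landed in HDFS as Parquet with Spark; the paths and the event_date partition column are assumptions for illustration.

```python
# A minimal sketch of landing ingested data in HDFS as Parquet.
# Paths and the partition column ("event_date") are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage").getOrCreate()

raw = spark.read.json("hdfs:///raw/applications")

# Columnar, compressed storage partitioned by day keeps scans fast
(raw.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("hdfs:///warehouse/applications"))
```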

Machine Learning

Once data is ingested and stored in a data warehouse, data scientists can start to build machine learning models.  Typically, this is an iterative process in which multiple algorithms are trained and tested, ranging from simple regressions to complex neural networks.  Results can vary significantly, not only in accuracy but also in speed, and these are usually the two tradeoffs data scientists must weigh in a real-time streaming environment.  For this case, you can test models such as logistic regression, random forests, XGBoost, and neural networks.  While a neural network will most likely yield higher accuracy, it will be relatively slow.  In the past, we have implemented simple models such as logistic regression purely for speed, delivering response times of 200 milliseconds using as few as 6 nodes.
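
As a rough way to see the accuracy-versus-latency tradeoff, the sketch below trains two of the candidate models on synthetic data with scikit-learn and times a single-row prediction.  It is a local illustration, not the production benchmark.

```python
# A rough, local sketch of the accuracy-vs-latency tradeoff using
# scikit-learn on synthetic data; the real comparison would run on
# the production feature set and cluster.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200)):
    model.fit(X_train, y_train)
    start = time.perf_counter()
    model.predict(X_test[:1])  # latency of a single real-time scoring call
    latency_ms = (time.perf_counter() - start) * 1000
    acc = model.score(X_test, y_test)
    print(f"{type(model).__name__}: accuracy={acc:.3f}, latency={latency_ms:.2f} ms")
```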

Messaging and Processing

Once the models are trained, data needs to be delivered to them to run predictions.  For this purpose, you can use services such as Apache Kafka.  Kafka is typically used to build real-time data pipelines and streaming apps, and in the past we have used it as a delivery mechanism with great results.  Once the data is made available, there needs to be a mechanism to trigger the models.  This entails embedding the machine learning algorithm within a processing service such as Apache Storm or Spark Streaming.  While there are pros and cons to each, we have found Storm to give us the best chance of achieving sub-second latency.
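
To illustrate the delivery side, here is a minimal sketch of a consumer that pulls application events from a Kafka topic and scores them with a pre-trained model, using the kafka-python client.  The topic, broker address, event fields, and load_model helper are hypothetical, and in production this logic would live inside a Storm bolt or Spark Streaming job rather than a standalone loop.

```python
# A minimal sketch of scoring Kafka events with a pre-trained model
# (kafka-python client). Topic, broker, event fields, and load_model
# are hypothetical; production code would sit in a Storm bolt or a
# Spark Streaming job.
import json
from kafka import KafkaConsumer

model = load_model("applicant_quality.pkl")  # hypothetical helper

consumer = KafkaConsumer(
    "applications",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    features = message.value["features"]      # assumed event layout
    score = model.predict([features])[0]
    print(f"applicant {message.value['id']}: score={score}")
```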

Competitive Edge

Real-time streaming will be a game changer.  Whether it is improving the match between customers and lenders or presenting the right products for personalization, real-time streaming enables companies to use data and AI to deliver smart, fast decisions.  Companies that adopt the technology will gain a competitive edge as AI enhances the customer experience and maximizes the chances of conversion.

Eric Kim
Founder of Invertible.io, ML/Data Scientist | Marketing Analytics Leader