Prevent Data Breaches: How to Build Your AI/ML Data Pipeline

Identity platforms like ForgeRock are the backbone of an enterprise, with a view of all apps, identities, devices, and resources attempting to connect with each other. This also makes them perfectly positioned to gather rich identity log data to use for preventing data breaches. In my previous blogs (1 and 2), I discussed how we detect data breaches using identity logs and how we test the accuracy of our anomaly detection engine. Now I’m back to talk about the most important step in the process: gathering good, clean data, quickly and at scale. ForgeRock leverages lean serverless data pipelines to process data faster and more efficiently because, in our business, data is incredibly time-sensitive. In this post, I discuss how we built our data pipelines and used Apache Beam, Google Dataflow, and a distributed architecture to process 100K identity events accurately in a few seconds.

Designing a Data Pipeline to Meet Your Business Requirements 

Designing a data pipeline is a critical step to get right. The design should always start with your business use case: you need to ask yourself, “What problem am I trying to solve?” In our case, we need anomaly detection on massive amounts of logs in real time, with virtually zero latency, and we need the data to be anonymized (to protect PII). Our ML models train and predict on embeddings that are extracted using deep neural models for NLP.

For this purpose, we had to build two pipelines: one for training and one for prediction. We separate training from prediction so that our training engine can take a little more time, while our prediction engine remains lightning fast. Our training pipeline doesn’t need to be real-time because we require a human in the loop to maintain the fairness and quality of the models. Hence, our training data pipelines are batch pipelines that work through massive amounts of raw data in a few minutes. Our prediction data pipeline, on the other hand, is real-time. An astute reader may be confused here and ask why the same raw-data processing lives in two different pipelines. Well, we don’t duplicate functionality across pipelines. Our architecture is designed to let us create logical data pipelines and reuse modular code in more than one of them.
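To make the reuse idea concrete, here is a minimal sketch in plain Python rather than Apache Beam, so it runs without any dependencies. The field names (`action`, `outcome`) and the function names are hypothetical and are not taken from our production code. The point is only that the batch path and the streaming path call the same processing steps.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, str]
Features = Dict[str, object]


def normalize(event: Event) -> Event:
    """Shared step: trim and lower-case every raw log field."""
    return {key: value.strip().lower() for key, value in event.items()}


def to_features(event: Event) -> Features:
    """Shared step: turn a normalized event into a small feature record."""
    return {
        "action": event["action"],
        "is_failure": event["outcome"] != "success",
    }


def process(event: Event) -> Features:
    """The modular unit that both pipelines reuse."""
    return to_features(normalize(event))


def run_batch(events: List[Event]) -> List[Features]:
    """Training-style pipeline: process a bounded collection in one go."""
    return [process(event) for event in events]


def run_streaming(source: Iterable[Event]) -> Iterator[Features]:
    """Prediction-style pipeline: emit one result per event as it arrives."""
    for event in source:
        yield process(event)
```

Because both entry points delegate to `process`, a fix to the shared steps reaches training and prediction at the same time.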


Micro Data Lakes Help with Privacy and Cost Reduction 

Our design philosophy centers on micro data lakes. The main design criterion for our data lake is storing identity event messages securely, with no Personally Identifiable Information (PII). To achieve this, we chose a micro data lake architecture. Rather than bringing the data to a central location, we have elastic on-demand pipelines that are spun up to process streaming data and store the extracted features in a feature lake, along with the raw information that we deem necessary for explainability. This gives us great flexibility to consume different sources of data while keeping our cost footprint small.
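One common way to keep PII out of stored events is to replace identifying fields with keyed hashes before anything is written. The sketch below shows that idea under stated assumptions: the list of PII fields and the `_token` suffix are hypothetical, and this post does not say which technique our pipelines actually use. Keyed hashing is pseudonymization, so whether it is sufficient depends on your own privacy requirements.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical list of fields treated as PII in this sketch.
PII_FIELDS = ("user_id", "ip_address")


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_pii(event: Dict[str, str], key: bytes) -> Dict[str, str]:
    """Return a copy of the event with PII fields swapped for tokens."""
    cleaned: Dict[str, str] = {}
    for field, value in event.items():
        if field in PII_FIELDS:
            # The same input and key always give the same token, so events
            # from one identity can still be grouped downstream.
            cleaned[field + "_token"] = pseudonymize(value, key)
        else:
            cleaned[field] = value
    return cleaned
```

The tokens are stable for a given key, which keeps per-identity features usable without storing the original identifier.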

Elastic On-Demand Pipelines Provide Massive Scale 

Thanks to our micro data lake architecture, our data pipelines can be launched on demand depending on the flow and volume of data. We leverage Apache Beam, Google Dataflow, GKE, and our own homegrown metadata eventing system to trigger, process, and shut down our data pipelines efficiently. We can run our data pipelines in different runner environments, both on-premises and in the cloud.

Which is Better: Horizontal or Vertical Scaling? 

Decisions around scaling are completely data-driven for us. We use custom metrics around the functional sections of our pipelines to record CPU, memory usage, and other system- and app-level metrics. This helps us scale our pipelines vertically, which we achieve by refactoring our pipeline code, introducing async processing using advanced threading techniques, and leveraging data buffering techniques. One such optimization led us to embed our embedding model in our pipeline, which cut our processing time by a factor of 3.5.
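The combination of buffering, threaded processing, and custom metrics can be sketched with the Python standard library alone. Everything named here is illustrative: `embed_batch` is a stand-in that counts words instead of calling a real embedding model, and the metric names are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def buffered(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group a stream into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def embed_batch(texts: List[str]) -> List[int]:
    """Stand-in for a model call: returns the word count of each text."""
    return [len(text.split()) for text in texts]


def process_concurrently(
    texts: Iterable[str], batch_size: int, workers: int
) -> Tuple[List[int], Dict[str, float]]:
    """Run buffered batches on a thread pool and record simple metrics."""
    metrics: Dict[str, float] = {"batches": 0, "seconds": 0.0}
    results: List[int] = []
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, one list per batch.
        for output in pool.map(embed_batch, buffered(texts, batch_size)):
            metrics["batches"] += 1
            results.extend(output)
    metrics["seconds"] = time.perf_counter() - start
    return results, metrics
```

Threads pay off mainly when each batch spends its time waiting on I/O or in native code, so timings like `metrics["seconds"]` are what should drive the choice of batch size and worker count.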

We use horizontal scaling to increase our capacity based on the volume of data. This is easy to achieve with a serverless design using runners like Google Dataflow. Google Dataflow takes hints from our configuration files and scales horizontally as required. This kind of on-demand scaling lets us add processing capacity as data traffic increases.
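As an illustration of what such hints can look like, the sketch below turns a small configuration dictionary into command-line style pipeline arguments. The flag names (`runner`, `streaming`, `num_workers`, `max_num_workers`, `autoscaling_algorithm`) are options that the Apache Beam Python SDK accepts for Dataflow. The dictionary layout, the values, and the helper `to_pipeline_args` are assumptions made for this example and do not reflect our actual configuration files.

```python
from typing import Dict, List, Union

ConfigValue = Union[str, int, bool]


def to_pipeline_args(config: Dict[str, ConfigValue]) -> List[str]:
    """Convert a config mapping into '--name=value' style arguments."""
    args: List[str] = []
    for name, value in config.items():
        if isinstance(value, bool):
            # Boolean options become bare flags and are omitted when False.
            if value:
                args.append(f"--{name}")
        else:
            args.append(f"--{name}={value}")
    return args


# Hypothetical settings for a streaming prediction pipeline.
streaming_config: Dict[str, ConfigValue] = {
    "runner": "DataflowRunner",
    "streaming": True,
    "num_workers": 2,
    "max_num_workers": 20,
    "autoscaling_algorithm": "THROUGHPUT_BASED",
}
```

In a Beam program, a list like this would typically be passed to `PipelineOptions` from `apache_beam.options.pipeline_options`; `max_num_workers` then caps how far the service scales out.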

The Future of Data Pipelines 

In this post, we discussed how ForgeRock leverages lean serverless data pipelines to process data faster and more efficiently. We have successfully used modern data processing techniques and the cloud to build on-demand elastic data pipelines that process massive amounts of identity event data. We continue to pioneer advanced features and techniques to make our platform better and faster. We love partnering with ForgeRock customers to build support for additional data sources. If you are a current customer and would like to explore our data pipelines further, we’d love to collaborate with you! Please reach out to your ForgeRock representative.