Prevent Data Breaches: Identity Logs and Machine Learning
An identity platform like ForgeRock sits right in the heart of an enterprise, with a view of all apps, identities, devices, and resources attempting to connect with each other. This turns out to be an ideal position from which to gather rich identity log data and use it to prevent data breaches.
Prevent Data Breaches? It's Hard.
An attacker has the luxury of finding the easiest way to break in, whereas a defense team has to secure every possible attack surface. There were 12,440 new breaches in 2018, an increase of 424% over the known breach count in 2017. A total of 14.9 billion identity records were found to have been exposed during the year, up from 8.7 billion in 2017. Some of the hardest breaches to find are micro data breaches, in which data leaks out through micro transactions spread over a long period of time. These breaches are becoming more prevalent and are very hard to detect.
Identity Logs and Machine Learning: How To Approach the Problem
We are in the right position: All authentication (AuthN) and authorization (AuthZ) requests and identity behavior events are tracked and logged by our IAM products.
We stream raw logs into a big data store and retain a few months of data.
We analyze behavioral patterns in the logs generated by identities. Once these patterns are represented in a latent space, we can use them to train models that detect anomalous behavior.
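As a rough illustration of the first step, the sketch below groups raw log records into one time-ordered event sequence per identity, which is the shape of data that sequence-based models usually expect. The record fields (`identity`, `timestamp`, `event`) and the function name are assumptions made for this example, not the actual log schema of any IAM product.

```python
from collections import defaultdict

def build_sequences(log_records):
    """Group log records by identity and order each group's events by time.

    Each record is assumed (hypothetically) to be a dict with the keys
    'identity', 'timestamp', and 'event'.
    """
    by_identity = defaultdict(list)
    for rec in log_records:
        by_identity[rec["identity"]].append((rec["timestamp"], rec["event"]))
    # Sort each identity's events by timestamp and keep only the event names.
    return {
        ident: [event for _, event in sorted(pairs, key=lambda p: p[0])]
        for ident, pairs in by_identity.items()
    }

# Hypothetical records, deliberately out of order.
logs = [
    {"identity": "alice", "timestamp": 3, "event": "read_resource"},
    {"identity": "alice", "timestamp": 1, "event": "authn_success"},
    {"identity": "bob", "timestamp": 2, "event": "authn_failure"},
    {"identity": "alice", "timestamp": 2, "event": "authz_granted"},
]
sequences = build_sequences(logs)
print(sequences["alice"])
```

Each resulting sequence can then be treated much like a sentence, with events playing the role of words.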
Machine Learning Algorithms Showing Promise
We leveraged word embeddings to learn temporal contextual information. This helped us learn which events naturally occur together for an identity and group them in a latent space. After further experimentation using a customized version of Non Contrastive Loss, we converged on a 50-dimensional temporal representation of an identity's behavior in the latent space.
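To make the word-embedding idea concrete, here is a minimal sketch of how (target, context) training pairs could be generated from an identity's event sequence, in the style of skip-gram word embeddings. The window size, event names, and function name are assumptions for illustration; the post does not specify the embedding variant or the details of its customized loss.

```python
def skipgram_pairs(sequence, window=2):
    """Return (target, context) pairs for events within `window` steps of each other."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # an event is not its own context
                pairs.append((target, sequence[j]))
    return pairs

# Hypothetical event sequence for one identity.
events = ["authn_success", "authz_granted", "read_resource"]
for target, context in skipgram_pairs(events, window=1):
    print(target, "->", context)
```

An embedding model trained on such pairs places events that tend to occur near each other in time close together in the latent space.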
We used a stacked autoencoder to compress the log embeddings, adding artificial Bayesian noise to the input. The bottleneck layer compressed the higher-dimensional log embeddings into a lower-dimensional principal representation. The decoder learned to reconstruct the input from that lower-dimensional representation. We used simple reverse-indexing methods to map back to the log entries and extract information from them.
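The encode, compress, and reconstruct loop can be sketched with a deliberately tiny denoising autoencoder. This is a simplification under stated assumptions, not the model described above: it uses a single linear bottleneck layer rather than a stacked network, plain Gaussian noise in place of the Bayesian noise mentioned in the post, and synthetic 2-dimensional points instead of 50-dimensional log embeddings. It also scores inputs by reconstruction error, which is one common way to use an autoencoder for anomaly detection; the post does not state how its anomaly scores are derived.

```python
import random

def train_denoising_autoencoder(data, hidden=1, noise_std=0.05,
                                lr=0.01, epochs=50, seed=0):
    """Train a linear autoencoder to rebuild clean inputs from noisy copies."""
    rng = random.Random(seed)
    dim = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(dim)]
    for _ in range(epochs):
        for x in data:
            # Corrupt the input; the reconstruction target stays the clean x.
            noisy = [v + rng.gauss(0.0, noise_std) for v in x]
            code = [sum(w * n for w, n in zip(row, noisy)) for row in enc]
            recon = [sum(w * c for w, c in zip(row, code)) for row in dec]
            err = [r - v for r, v in zip(recon, x)]
            # Gradients of 0.5 * squared error, computed before any update.
            grad_code = [sum(err[d] * dec[d][h] for d in range(dim))
                         for h in range(hidden)]
            for d in range(dim):
                for h in range(hidden):
                    dec[d][h] -= lr * err[d] * code[h]
            for h in range(hidden):
                for d in range(dim):
                    enc[h][d] -= lr * grad_code[h] * noisy[d]
    return enc, dec

def reconstruction_error(x, enc, dec):
    """Squared error between x and its reconstruction through the bottleneck."""
    code = [sum(w * v for w, v in zip(row, x)) for row in enc]
    recon = [sum(w * c for w, c in zip(row, code)) for row in dec]
    return sum((r - v) ** 2 for r, v in zip(recon, x))

# Synthetic "normal" behavior: points along the direction (1, 2).
data_rng = random.Random(1)
normal_points = []
for _ in range(200):
    a = data_rng.uniform(-1.0, 1.0)
    normal_points.append([a, 2.0 * a])

enc, dec = train_denoising_autoencoder(normal_points)
typical_error = reconstruction_error([1.0, 2.0], enc, dec)   # on the learned pattern
unusual_error = reconstruction_error([2.0, -1.0], enc, dec)  # off the learned pattern
print(typical_error, unusual_error)
```

Because the bottleneck can only represent the dominant pattern in the training data, inputs that deviate from that pattern reconstruct poorly, and a high reconstruction error can be treated as a signal of anomalous behavior.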
We achieve over 90% accuracy in predicting anomalies, and the predictions are served through a GraphQL API to flag micro data breaches. Our t-SNE visualization corroborates these results.
In Part 2 of this blog series on how to prevent data breaches, which will appear next month, we will delve into metrics, derived metrics, A/B testing, back-testing, and how we improved on this model.