No Protection From Bad Data
Beware what the experts claim about turning noise into security signal
- By Jonathan Sander
- Aug 01, 2015
Walking the expo floor at the most recent RSA
conference, it was hard to miss how many
companies were talking about big data. Maybe
they used the word analytics, or perhaps
they called it machine learning. The claims
were all similar. Give me your tired, your
poor, your huddled masses of log and incident data that you’re
yearning to make actionable—or something along those lines.
They all wanted to sell us on the notion that big data can take
operational noise and turn it into security signal.
Of course, the next line of the famous poem I misquote refers
to “wretched refuse,” and that’s a pretty good description
for most of the data people would feed into these systems. The
belief that such systems can take refuse and spin it into gold is
magical thinking. Big data, data science, machine learning and
other novel approaches do have true promise for security leaders
and practitioners looking for better results. There are ways to get
better data to feed these new systems if you know where to look.
When you combine good data and big data, you get results that
can seem like magic without having to wear the funny hats.
The need for good data starts right at the beginning of the big data lifecycle. The real power of these analytics systems is in the models they build. A good model will allow you to go from a reactive to a predictive approach to security incidents. It’s like having hindsight for the future: you’re always seeing the signs 20/20. But how can you tell if you’ve got a good model? The model needs to be tried out on a dataset you know a lot about.
This is the practice test where you
have the answers and you want to see how
well you may do on the real test later. Of
course, a lot of the vendors have done the
dirty work for you here. They have models
that they feel should work well.
Best practice would be to take their
model for a spin on your well-known data
and see what happens. Maybe the model
needs some adjustments for the particular
configuration of your IT infrastructure.
People paying attention often get stuck
right at this phase. Do you have a good
backlog of data you could use to run tests
like this? Are you archiving the critical logs
and other data sources that these models
would employ? Have you done the forensics to identify where the data you have corresponds to breaches or incidents you’ve experienced, so you know what the models ought to turn up? Too often the answers to these questions are no, no and no.
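If you do have that backlog, the practice test itself is simple to run. Here’s a minimal sketch of the idea in Python, assuming you’ve archived events and labeled them with the outcomes your forensics established; the detect() callable and the event fields are hypothetical stand-ins for whatever your analytics platform exposes.

```python
# Score a candidate model against archived events whose outcomes you
# already know: the "practice test where you have the answers."
# detect() and the event fields below are illustrative assumptions.
from typing import Callable, Iterable

def practice_test(detect: Callable[[dict], bool],
                  labeled_events: Iterable[tuple[dict, bool]]) -> dict:
    """Compare model verdicts against ground truth from past forensics."""
    tp = fp = fn = 0
    for event, was_incident in labeled_events:
        flagged = detect(event)
        tp += int(flagged and was_incident)
        fp += int(flagged and not was_incident)
        fn += int(not flagged and was_incident)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarms": fp,
        "missed_incidents": fn,
    }

# A toy rule-based "model" tried out on two archived, labeled events.
history = [
    ({"failed_logins": 40, "source": "10.0.0.8"}, True),   # known breach
    ({"failed_logins": 2,  "source": "10.0.0.9"}, False),  # a normal day
]
print(practice_test(lambda e: e["failed_logins"] > 25, history))
```

A model that looked great on the vendor’s demo data but scores poorly here is exactly the adjustment signal you’re after.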
Let’s say you get to the stage where
you have models that will work. The next
challenge on the path to big data nirvana
will be having these analytics cough up the
answers to your burning questions. Here’s
where another problem in the data often
emerges. You’ve set up your shiny new data-science-driven machine, but suddenly you find the questions you’re being asked and the data you have don’t line up. It’s often a level-of-abstraction mismatch.
The executives ask questions about
people, but your data is about machines
and IP addresses. This can also affect your
modeling. If you only have a hammer, everything looks like a nail. Maybe you go find a bunch of data about people, but if you don’t adjust the models so that people, machines, IP addresses and everything else are treated as they ought to be and placed in the right relationships, then you’ve made the problem worse, not better. A bad model that lets data at different levels of abstraction (e.g. a person and an IP address) fall into accidental correlation will crank out tons of bad conclusions whenever one entity does something perfectly normal for its level (e.g. a person quits) that has nothing to do with the entity it happens to be correlated with.
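One way out of that trap is to make the abstraction levels explicit in the data model itself, so that signals only flow along relationships you actually know to be true. Here’s a rough Python sketch of the idea; the entity types and the person-to-laptop link are illustrative assumptions, not any vendor’s schema.

```python
# Keep abstraction levels explicit: entities carry a type, and signals
# only flow along relationships you have declared (e.g. the IAM system
# says this person is assigned this laptop). All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Entity:
    kind: str                                  # "person", "machine", "ip"
    name: str
    related: list["Entity"] = field(default_factory=list)

def link(a: Entity, b: Entity) -> None:
    """Declare a known, explicit relationship between two entities."""
    a.related.append(b)
    b.related.append(a)

def relevant_events(entity: Entity, events: list[dict]) -> list[dict]:
    """Gather events only for the entity and things explicitly tied to it.
    An event about an unrelated IP never touches this person's profile,
    even if the timestamps happen to line up."""
    names = {entity.name} | {r.name for r in entity.related}
    return [e for e in events if e.get("subject") in names]

alice = Entity("person", "alice")
laptop = Entity("machine", "alice-laptop")
link(alice, laptop)                            # fact from the IAM system

events = [
    {"subject": "alice-laptop", "action": "login_failure"},
    {"subject": "10.9.9.9", "action": "port_scan"},  # unrelated; excluded
]
print(relevant_events(alice, events))
```

The point of the explicit link() step is that correlation becomes something you declare from known facts rather than something the model stumbles into.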
An excellent example of good modeling
for these disparate types of data in the
security space is how Securonix has used
peer grouping from Identity and Access
Management (IAM) systems to enrich
their data. Like many in the User Behavior
Analytics (UBA) space, Securonix will
look at lots of different data sources to
get you the insight you want about what
your users are up to. They could have been
prone to the troubles of modeling with different
layers of abstraction and correlating
at the wrong levels, but seem to have
gotten it right.
When they take in data about people and their organizational relationships from IAM systems, they use it to influence the outcomes not by directly correlating it with activity at the network and system layer, but rather by adjusting expectations of what is normal based on each user’s peer group. A peer group might be people in the same building, managers at the same level of the hierarchy, or employees with the same job role.
You would expect these people to have
notably similar activity patterns, and you
can use that to learn what they do and
point out anomalies. That’s getting the
analytics game right by pulling in the right
data and using it well.
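To make that concrete, here’s a minimal Python sketch of peer-group baselining in the same spirit; grouping by job role and scoring a single activity count are simplifying assumptions for illustration, not Securonix’s actual implementation.

```python
# Flag users whose activity is far outside their peer group's norm.
# Role-based grouping and a single "downloads per day" metric are
# simplifying assumptions; real UBA products model far more than this.
from collections import defaultdict
from statistics import mean, stdev

def peer_anomalies(activity: dict[str, float],
                   role_of: dict[str, str],
                   threshold: float = 3.0) -> list[str]:
    groups: dict[str, list[str]] = defaultdict(list)
    for user, role in role_of.items():
        groups[role].append(user)

    flagged = []
    for members in groups.values():
        for user in members:
            # Baseline from peers *excluding* this user, so one extreme
            # outlier can't inflate the very baseline used to judge it.
            peers = [activity[v] for v in members if v != user]
            if len(peers) < 2:
                continue  # too few peers to establish what "normal" is
            mu, sigma = mean(peers), stdev(peers)
            if sigma == 0:
                if activity[user] != mu:
                    flagged.append(user)  # peers identical, user differs
                continue
            if abs(activity[user] - mu) / sigma > threshold:
                flagged.append(user)
    return flagged

# Four accountants download a handful of files a day; one downloads 500.
downloads = {"ann": 6, "bob": 4, "cal": 7, "dee": 5, "eve": 500}
roles = dict.fromkeys(downloads, "accountant")
print(peer_anomalies(downloads, roles))  # -> ['eve']
```

Note the leave-one-out baseline: each user is judged against everyone else in the group, so a single insider draining the file share can’t drag the definition of normal up toward themselves.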
Sometimes you have to work with what
you’ve got, and that means bad data will
happen to you. Nothing serves as a better
example for that than the notoriously poor
quality logs on Microsoft platforms. A single event like a bad authentication appears as multiple entries in multiple event logs on multiple systems. Often, even if it’s possible to collect all the separate records, you find that key pieces of data, like an IP address, are missing, so correlating that event with events from other sources is nearly impossible.
The default level of logging doesn’t
offer enough detail for proper detection
or forensics from a security perspective.
Turning up the logging so that it becomes
nominally useful for security means sacrificing
a huge chunk of the system resources.
To add insult to injury, collecting and
parsing all these logs is immensely burdensome.
If you’re a practitioner, then none
of this is news to you. From the analytics
perspective, these problems often hide
behind the SIEM system. You don’t know
you’re dealing with such low quality data
until you start using it via the SIEM and
getting poor results.
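What the better tools effectively do is stitch those fragments back into one logical event. Here’s a simplified Python sketch of that idea; the grouping key, the five-second window, and the field names are assumptions for illustration, since real Windows event schemas vary by version and audit policy.

```python
# Stitch one logical event back together from fragments scattered across
# systems. The (user, 5-second window) grouping and field names are
# assumptions for illustration, not a real Windows event schema.
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=5)

def stitch(fragments: list[dict]) -> list[dict]:
    """Merge records for the same user that fall within a short window,
    taking each field from whichever fragment happens to carry it."""
    fragments = sorted(fragments, key=lambda f: (f["user"], f["time"]))
    merged: list[dict] = []
    for frag in fragments:
        last = merged[-1] if merged else None
        if (last and last["user"] == frag["user"]
                and frag["time"] - last["time"] <= WINDOW):
            for key, value in frag.items():
                last.setdefault(key, value)    # fill gaps, keep first-seen
        else:
            merged.append(dict(frag))
    return merged

t0 = datetime(2015, 8, 1, 9, 0, 0)
raw = [
    {"user": "jdoe", "time": t0, "result": "failure"},      # DC log: no IP
    {"user": "jdoe", "time": t0 + timedelta(seconds=2),     # member server
     "source_ip": "10.2.3.4"},
]
print(stitch(raw))  # one failed logon, now with its source IP attached
```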
This is where you will likely need to
look to other sources for good data if
you want the quest for big data analytics
to have a happy ending. The STEALTHbits
StealthINTERCEPT platform is able
to get all the security data you want from
these Microsoft platforms and overcome
these native logging issues. You can plug it
into the SIEM, feed directly from StealthINTERCEPT itself, or even plug it into something else that can consume our SYSLOG
output. This means adding another
layer to your infrastructure, but the result
is a high quality, real time stream of security
events from your Microsoft systems.
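On the receiving end, that last option can be as humble as anything that listens for syslog messages. Here’s a minimal Python sketch, assuming plain UDP syslog on a non-privileged port; a real collector would be hardened well beyond this.

```python
# A bare-bones UDP listener that accepts syslog lines and hands each one
# to your pipeline. The port and the print() stand in for real plumbing;
# a production collector would add TCP/TLS and RFC 3164/5424 parsing.
import socketserver

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        data, _sock = self.request          # UDP: (bytes, socket) tuple
        line = data.decode("utf-8", errors="replace").strip()
        print(f"event from {self.client_address[0]}: {line}")

if __name__ == "__main__":
    # Classic syslog is UDP/514 (privileged); 5514 avoids needing root.
    with socketserver.UDPServer(("0.0.0.0", 5514), SyslogHandler) as srv:
        srv.serve_forever()
```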
I’m picking on Microsoft, but it’s not as if they’re the only ones guilty of bad logging. Luckily, for many of these problem children you’ll find other solutions like ours that will help fill the gaps and get you the good data you want.
Moving from reactive controls to predictive
controls has been a goal on the horizon
for security organizations for a long
time. Solid models powered by sound data
science that processes the huge well of big
data can finally make security predictive.
That’s assuming they’re fed by streams of
good data. If the stream of data is polluted,
then all that water in the big data well
isn’t worth the trouble to drink. You’re going to need good data to get these systems going, which means you may need to improve your discipline and practices around data retention.
You will have to be sure you’re getting all the right data from sources at the right levels and putting it in just the right relationships in your models so they produce useful results even as the real world changes around them. Sometimes you’re going to need to make a tough choice: invest in better data sources or live with lower-quality results. But if you can get the data right, then big data will do right by you.
This article originally appeared in the August 2015 issue of Security Today.