This document summarizes a CrowdStrike presentation on machine learning applications in information security. It describes how CrowdStrike applies machine learning to roughly 40 billion endpoint and cloud events per day to detect threats, and walks through the practical challenges: high false positive rates driven by base rates, concept drift over time, and mismatches between training and real-world data distributions. It closes with lessons learned from classifying behavioral event data with Spark and from static file analysis for malware detection.
2. MACHINE LEARNING AT CROWDSTRIKE
2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
§ ~40 billion events per day
§ ~800 thousand events per second peak
§ ~700 trillion bytes of sample data
§ Local decisions on endpoint and large scale analysis in cloud
§ Static and dynamic analysis techniques, various rich data sources
§ Analysts generating new ground truth 24/7
3. BRIEF ML EXAMPLE
[Scatter plot: “Buttock Circumference” [mm] vs. Weight [10⁻¹ kg]]
• What’s this? http://tinyurl.com/MLprimer
• Two features
• Two classes
5. ML IN INFOSEC APPLICATIONS
§ Not a single model solving everything
§ But many models working on the data in scope
§ Endpoint vs cloud
§ Fast response vs long observation
§ Lean vs resource intensive
§ Effectiveness vs interpretability
§ Avoid ML blinders
§ The guy in your store at 2am wielding a crowbar is not a customer
7. FALSE POSITIVE RATE
§ Most events are associated with clean executions
§ Most files on a given system are clean
§ Therefore, even low FPRs cause large numbers of FPs
§ Industry expectations driven by performance of narrow signatures
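The base-rate effect behind these bullets is easy to make concrete with arithmetic. The sketch below uses the deck's ~40 billion events/day figure, but the malicious-event rate, false positive rate, and recall are invented for illustration:

```python
# Back-of-the-envelope illustration: with billions of mostly-clean events,
# even a tiny false positive rate yields a huge absolute number of false
# alarms. All rates below are hypothetical.

events_per_day = 40_000_000_000   # ~40 billion events/day (from the deck)
malicious_rate = 1e-6             # assumed fraction of malicious events
fpr = 0.001                       # a "low" 0.1% false positive rate
recall = 0.99                     # assumed true positive rate

clean = events_per_day * (1 - malicious_rate)
false_positives = clean * fpr
true_positives = events_per_day * malicious_rate * recall

print(f"false positives/day: {false_positives:,.0f}")  # ~40 million
print(f"true positives/day:  {true_positives:,.0f}")   # ~40 thousand
```

Under these assumptions false alarms outnumber real detections by roughly 1000:1, which is why narrow-signature expectations (near-zero FPs) are so hard to meet with statistical models.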
10. DIFFERENCE IN DISTRIBUTIONS
Training set distribution generally differs from…
§ Real-world distribution (customer networks)
§ Evaluations (what customers test)
§ Testing houses (various 3rd party testers with varying methodologies)
§ Community resources (e.g. user submissions to CrowdStrike scanner on VirusTotal)
11. REPEATABLE SUCCESS
Or: the second model needs to be cheaper
§ Retraining cadence
§ Concept drift
§ Changes in data content (e.g. event field definitions)
§ Changes in data distribution (e.g. event disposition)
§ Data cleansing is expensive (conventional wisdom)
§ Needs automation
§ Labeling can be expensive
§ Ephemeral instances (data content or distribution changed)
§ Lack of sufficient observations
§ Embeddings and intermediate models
§ Keep track of input data
§ Keep track of ground truth budget
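One way to automate the "changes in data distribution" check above is a drift statistic computed on tracked input data. The deck does not name a specific method; the Population Stability Index below is a common choice, and the event-disposition counts are invented:

```python
import math
from collections import Counter

def psi(expected, observed, eps=1e-6):
    """Population Stability Index between two categorical distributions.
    Values above ~0.2 are commonly read as significant drift."""
    cats = set(expected) | set(observed)
    n_e, n_o = sum(expected.values()), sum(observed.values())
    total = 0.0
    for c in cats:
        p = expected.get(c, 0) / n_e + eps
        q = observed.get(c, 0) / n_o + eps
        total += (q - p) * math.log(q / p)
    return total

# Hypothetical disposition counts at training time vs. this week:
train = Counter({"clean": 9_800, "suspicious": 150, "malicious": 50})
live  = Counter({"clean": 8_500, "suspicious": 1_200, "malicious": 300})

print(f"PSI = {psi(train, live):.3f}")  # above 0.2 -> consider retraining
```

Wiring a statistic like this into the pipeline turns the retraining cadence from a calendar decision into a data-driven one.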
13. CLASSIFYING EVENT DATA
§ Idea: global classification
§ Observe all executions for a file, not just a single one
§ Initially only behavioral event data
§ In later versions also combined with static analysis data
§ Early project, focus on the data already there
§ Events fall into various categories, mainly:
§ Process data (hub)
§ Network data
§ DNS data
§ File system data
§ Capping data at 100 seconds since process start
§ Carving out a smaller problem
§ Ignoring classes of malware that are idle initially
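The 100-second cap can be sketched as a simple filter over per-process timestamps. The field names (`pid`, `ts`, and the process-start map) are hypothetical, not CrowdStrike's actual schema:

```python
# Sketch of the 100-second cap described above: keep only events observed
# within CAP_SECONDS of their process's start. Schema is invented.

CAP_SECONDS = 100

def cap_events(events, process_starts):
    """Drop events arriving after CAP_SECONDS, or with no known process."""
    kept = []
    for ev in events:
        start = process_starts.get(ev["pid"])
        if start is not None and 0 <= ev["ts"] - start <= CAP_SECONDS:
            kept.append(ev)
    return kept

starts = {1234: 1000.0}
events = [
    {"pid": 1234, "ts": 1005.0, "type": "dns"},      # 5 s in   -> kept
    {"pid": 1234, "ts": 1150.0, "type": "network"},  # 150 s in -> dropped
    {"pid": 9999, "ts": 1010.0, "type": "file"},     # unknown  -> dropped
]
print(cap_events(events, starts))
```

As the slide notes, this deliberately carves out a smaller problem: malware that stays idle past the cap is out of scope for this model.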
19. PROCESSING WITH SPARK
Challenges & Lessons Learned
§ Issues due to data size
§ Lots of cycles sunk into tuning memory parameters to address job failures
§ Job structure and recovery considerations (reprocessing not always viable)
§ Issues due to input data model
§ Highly referential event data, spreading information across many real-time events
§ Flattened tree/graph-based data
§ Complex to handle in Spark’s RDD model (see DAG)
§ Abstractions such as GraphX may help
§ Processing overhead
§ Job based on PySpark RDDs – most time spent on serialization/deserialization
§ Initial investment in migrating to Scala would have paid off in deployment
§ Life is now better with the DataFrame API
§ Development velocity with Spark
§ Trivial to set up a local dev environment
§ Trivial to add unit tests
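The "highly referential event data" pain point above is worth making concrete: telemetry arrives as flat records that reference each other by id, and must be stitched back into a process tree before any global classification. A toy, single-machine version (the record schema is invented for illustration; the real job did this over Spark RDDs):

```python
# Stitch flat, referential records back into a tree. In Spark's RDD model
# this reassembly requires joins/shuffles, which is what made the flattened
# tree/graph data complex to handle (and why GraphX may help).

def build_process_tree(events):
    """Group flat (id, parent_id, name) records into a nested tree."""
    by_id = {ev["id"]: {"name": ev["name"], "children": []} for ev in events}
    roots = []
    for ev in events:
        parent = by_id.get(ev["parent_id"])
        if parent is None:
            roots.append(by_id[ev["id"]])       # no known parent -> root
        else:
            parent["children"].append(by_id[ev["id"]])
    return roots

flat = [
    {"id": 1, "parent_id": None, "name": "explorer.exe"},
    {"id": 2, "parent_id": 1, "name": "cmd.exe"},
    {"id": 3, "parent_id": 2, "name": "powershell.exe"},
]
tree = build_process_tree(flat)
```

In memory this is a dictionary lookup; distributed across partitions, each parent lookup becomes a join, which is where the tuning pain came from.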
23. FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway or cloud
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
– Packed files
– .Net
– Installers
– EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
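One concrete static signal that touches the "packed files" bullet above: byte entropy. This is a standard heuristic in static malware analysis generally, shown here as an illustration, not as CrowdStrike's pipeline:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte of a buffer; values near 8.0 suggest packed,
    compressed, or encrypted content."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

low = b"A" * 1024             # repetitive content -> entropy 0.0
high = bytes(range(256)) * 4  # uniform bytes      -> entropy 8.0
print(f"{shannon_entropy(low):.2f} {shannon_entropy(high):.2f}")
```

Entropy is fast and platform independent (matching "THE GOOD") but gives only a narrow view: it flags *that* a section looks packed, not *what* it contains (matching "THE BAD").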
25. LEARNED FEATURES
• Unstructured file content
• Translated into embeddings
• Vastly larger corpus (no labels needed)
30. STATIC ANALYSIS
Challenges & Lessons Learned
§ Performance
§ Acceptable results can be achieved quickly
§ State-of-the-art results require a bit more tweaking and feature engineering
§ Staying current requires a maintainable data pipeline
§ Hostile data
§ Wild outliers, e.g. PNG width is encoded in 4 bytes, so a crafted file can declare absurd dimensions
§ All sorts of obfuscations and malformations
§ PE format !(ಠ益ಠ!)
§ What the standard says, what the loader allows…
§ Layers upon layers in an electronic archeological excavation
§ Not everything is documented
§ Tons of subtypes
§ More work
§ More opportunity