This document summarizes a CrowdStrike presentation on machine learning applications in information security. It describes how CrowdStrike applies machine learning to roughly 40 billion endpoint and cloud events per day to detect threats, and walks through the practical challenges: high false positive rates driven by base rates, concept drift over time, and mismatches between training and real-world data distributions. It closes with lessons learned from classifying behavioral event data with Spark and from static file analysis for malware detection.
2. MACHINE LEARNING AT CROWDSTRIKE
2017 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
§ ~40 billion events per day
§ ~800 thousand events per second peak
§ ~700 trillion bytes of sample data
§ Local decisions on endpoint and large scale analysis in cloud
§ Static and dynamic analysis techniques, various rich data sources
§ Analysts generating new ground truth 24/7
3. BRIEF ML EXAMPLE
[Scatter plot: “Buttock Circumference” [mm] vs. Weight [10⁻¹ kg]]
• What’s this? http://tinyurl.com/MLprimer
• Two features
• Two classes
5. ML IN INFOSEC APPLICATIONS
§ Not a single model solving everything
§ But many models working on the data in scope
§ Endpoint vs cloud
§ Fast response vs long observation
§ Lean vs resource intensive
§ Effectiveness vs interpretability
§ Avoid ML blinders
§ The guy in your store at 2am wielding a crowbar is not a customer
7. FALSE POSITIVE RATE
§ Most events are associated with clean executions
§ Most files on a given system are clean
§ Therefore, even low FPRs cause large numbers of FPs
§ Industry expectations driven by performance of narrow signatures
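The base-rate effect behind these bullets is easy to make concrete with arithmetic. The sketch below uses the deck's ~40 billion events/day figure, but the malicious-event rate, false positive rate, and recall are invented for illustration:

```python
# Back-of-the-envelope illustration: with billions of mostly-clean events,
# even a tiny false positive rate yields a huge absolute number of false
# alarms. All rates below are hypothetical.

events_per_day = 40_000_000_000   # ~40 billion events/day (from the deck)
malicious_rate = 1e-6             # assumed fraction of malicious events
fpr = 0.001                       # a "low" 0.1% false positive rate
recall = 0.99                     # assumed true positive rate

clean = events_per_day * (1 - malicious_rate)
false_positives = clean * fpr
true_positives = events_per_day * malicious_rate * recall

print(f"false positives/day: {false_positives:,.0f}")  # ~40 million
print(f"true positives/day:  {true_positives:,.0f}")   # ~40 thousand
```

Under these assumptions false alarms outnumber real detections by roughly 1000:1, which is why narrow-signature expectations (near-zero FPs) are so hard to meet with statistical models.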
10. DIFFERENCE IN DISTRIBUTIONS
Training set distribution generally differs from…
§ Real-world distribution (customer networks)
§ Evaluations (what customers test)
§ Testing houses (various 3rd party testers with varying methodologies)
§ Community resources (e.g. user submissions to CrowdStrike scanner on VirusTotal)
11. REPEATABLE SUCCESS
Or: the second model needs to be cheaper
§ Retraining cadence
§ Concept drift
§ Changes in data content (e.g. event field definitions)
§ Changes in data distribution (e.g. event disposition)
§ Data cleansing is expensive (conventional wisdom)
§ Needs automation
§ Labeling can be expensive
§ Ephemeral instances (data content or distribution changed)
§ Lack of sufficient observations
§ Embeddings and intermediate models
§ Keep track of input data
§ Keep track of ground truth budget
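One way to automate the "changes in data distribution" check above is a drift statistic computed on tracked input data. The deck does not name a specific method; the Population Stability Index below is a common choice, and the event-disposition counts are invented:

```python
import math
from collections import Counter

def psi(expected, observed, eps=1e-6):
    """Population Stability Index between two categorical distributions.
    Values above ~0.2 are commonly read as significant drift."""
    cats = set(expected) | set(observed)
    n_e, n_o = sum(expected.values()), sum(observed.values())
    total = 0.0
    for c in cats:
        p = expected.get(c, 0) / n_e + eps
        q = observed.get(c, 0) / n_o + eps
        total += (q - p) * math.log(q / p)
    return total

# Hypothetical disposition counts at training time vs. this week:
train = Counter({"clean": 9_800, "suspicious": 150, "malicious": 50})
live  = Counter({"clean": 8_500, "suspicious": 1_200, "malicious": 300})

print(f"PSI = {psi(train, live):.3f}")  # above 0.2 -> consider retraining
```

Wiring a statistic like this into the pipeline turns the retraining cadence from a calendar decision into a data-driven one.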
13. CLASSIFYING EVENT DATA
§ Idea: global classification
§ Observe all executions for a file, not just a single one
§ Initially only behavioral event data
§ In later versions also combined with static analysis data
§ Early project, focus on the data already there
§ Events fall into various categories, mainly:
§ Process data (hub)
§ Network data
§ DNS data
§ File system data
§ Capping data at 100 seconds since process start
§ Carving out a smaller problem
§ Ignoring classes of malware that are idle initially
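The 100-second cap can be sketched as a simple filter over per-process timestamps. The field names (`pid`, `ts`, and the process-start map) are hypothetical, not CrowdStrike's actual schema:

```python
# Sketch of the 100-second cap described above: keep only events observed
# within CAP_SECONDS of their process's start. Schema is invented.

CAP_SECONDS = 100

def cap_events(events, process_starts):
    """Drop events arriving after CAP_SECONDS, or with no known process."""
    kept = []
    for ev in events:
        start = process_starts.get(ev["pid"])
        if start is not None and 0 <= ev["ts"] - start <= CAP_SECONDS:
            kept.append(ev)
    return kept

starts = {1234: 1000.0}
events = [
    {"pid": 1234, "ts": 1005.0, "type": "dns"},      # 5 s in   -> kept
    {"pid": 1234, "ts": 1150.0, "type": "network"},  # 150 s in -> dropped
    {"pid": 9999, "ts": 1010.0, "type": "file"},     # unknown  -> dropped
]
print(cap_events(events, starts))
```

As the slide notes, this deliberately carves out a smaller problem: malware that stays idle past the cap is out of scope for this model.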
19. PROCESSING WITH SPARK
Challenges & Lessons Learned
§ Issues due to data size
§ Lots of cycles sunk into tuning memory parameters to address job failures
§ Job structure and recovery considerations (reprocessing not always viable)
§ Issues due to input data model
§ Highly referential event data, spreading information across many real-time events
§ Flattened tree/graph-based data
§ Complex to handle in Spark’s RDD model (see DAG)
§ Abstractions such as GraphX may help
§ Processing overhead
§ Job based on PySpark RDDs – most time spent on serialization/deserialization
§ Initial investment in migrating to Scala would have paid off in deployment
§ Life is now better with the DataFrame API
§ Development velocity with Spark
§ Trivial to set up a local dev environment
§ Trivial to add unit tests
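The "highly referential event data" pain point above is worth making concrete: telemetry arrives as flat records that reference each other by id, and must be stitched back into a process tree before any global classification. A toy, single-machine version (the record schema is invented for illustration; the real job did this over Spark RDDs):

```python
# Stitch flat, referential records back into a tree. In Spark's RDD model
# this reassembly requires joins/shuffles, which is what made the flattened
# tree/graph data complex to handle (and why GraphX may help).

def build_process_tree(events):
    """Group flat (id, parent_id, name) records into a nested tree."""
    by_id = {ev["id"]: {"name": ev["name"], "children": []} for ev in events}
    roots = []
    for ev in events:
        parent = by_id.get(ev["parent_id"])
        if parent is None:
            roots.append(by_id[ev["id"]])       # no known parent -> root
        else:
            parent["children"].append(by_id[ev["id"]])
    return roots

flat = [
    {"id": 1, "parent_id": None, "name": "explorer.exe"},
    {"id": 2, "parent_id": 1, "name": "cmd.exe"},
    {"id": 3, "parent_id": 2, "name": "powershell.exe"},
]
tree = build_process_tree(flat)
```

In memory this is a dictionary lookup; distributed across partitions, each parent lookup becomes a join, which is where the tuning pain came from.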
23. FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway or cloud
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
– Packed files
– .Net
– Installers
– EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
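One concrete static signal that touches the "packed files" bullet above: byte entropy. This is a standard heuristic in static malware analysis generally, shown here as an illustration, not as CrowdStrike's pipeline:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte of a buffer; values near 8.0 suggest packed,
    compressed, or encrypted content."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

low = b"A" * 1024             # repetitive content -> entropy 0.0
high = bytes(range(256)) * 4  # uniform bytes      -> entropy 8.0
print(f"{shannon_entropy(low):.2f} {shannon_entropy(high):.2f}")
```

Entropy is fast and platform independent (matching "THE GOOD") but gives only a narrow view: it flags *that* a section looks packed, not *what* it contains (matching "THE BAD").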
25. LEARNED FEATURES
• Unstructured file content
• Translated into embeddings
• Vastly larger corpus (no labels needed)
30. STATIC ANALYSIS
Challenges & Lessons Learned
§ Performance
§ Acceptable results can be achieved quickly
§ State-of-the-art results require a bit more tweaking and feature engineering
§ Staying current requires a maintainable data pipeline
§ Hostile data
§ Wild outliers, e.g. PNG width is encoded in 4 bytes, so a crafted file can declare absurd dimensions
§ All sorts of obfuscations and malformations
§ PE format !(ಠ益ಠ!)
§ What the standard says, what the loader allows…
§ Layers upon layers in an electronic archeological excavation
§ Not everything is documented
§ Tons of subtypes
§ More work
§ More opportunity