MACHINE LEARNING IN CYBERSECURITY

BGA Bilgi Güvenliği A.Ş. | www.bgasecurity.com | @BGASecurity
Author: Ebubekir Büber – Normshield
Edition: 2017

INTRODUCTION
In recent years, attackers have been developing more sophisticated ways to attack systems, so
recognizing these attacks is becoming harder over time. Network administrators are often
unable to recognize these attacks effectively or to respond to them quickly.
For this reason, a great deal of software has been developed to help people manage and
protect their systems effectively. Initially, software was developed to handle operations,
such as mathematical calculations, that are very complex for human beings. Then more was
needed. The next step was to extend the abilities of software using artificial intelligence
and machine learning techniques. As technology advances, a huge amount of data is produced and
must be processed every day and every hour. Eventually the concept of “Big Data” was born, and
people began to need more intelligent systems to process and make sense of this data. Many
algorithms have been developed for this purpose. They are used in many research areas, such as
image processing, speech recognition, biomedicine and, of course, the cyber security domain.
Fundamentally, the main purpose of machine learning techniques is to give software a decision
mechanism similar to the one people use. The cyber security domain is one of the most
important research areas in which these techniques are applied. In 2014 the Center for
Strategic and International Studies estimated that the annual cost of cybercrime to the global
economy was between $375 billion and $575 billion. Although sources differ, the average cost
of a data breach incident to large companies is over $3 million. Researchers have developed
intelligent systems for the cyber security domain with the aim of reducing this cost.

MACHINE LEARNING
At the beginning, Artificial Intelligence (AI) focused on giving a computer the ability to act
like a human. To this end, researchers tried to develop AI applications that real users could
not identify as a computer, so the first AI applications tried to pass the Turing test. The
Turing test is a test of a machine's ability to exhibit intelligent behaviour equivalent to,
or indistinguishable from, that of a human. Researchers later discovered that it is not easy
to create an AI that works entirely like the human brain. Because of this, AI came to be
applied to more specific domains such as face recognition and object recognition.
Machine learning is a type of artificial intelligence (AI) that provides computers with the
ability to learn without being explicitly programmed. Machine learning focuses on the
development of computer programs that can change when exposed to new data. Although it has
gained great momentum in recent years, machine learning is almost as old as computing itself.
Data produced by computers or sensors has been processed, and meaning derived from it, since
the first computers were used. So why has machine learning become so popular in recent years?
Because we have more data than ever before and we need to make sense of it. This data is
referred to as BIG DATA.
Big data is being generated by everything around us at all times. Every digital process and
social media exchange produces it. Systems, sensors and mobile devices transmit it. Big data
is often characterized by 3Vs: the extreme volume of data, the wide variety of data types and
the velocity at which the data must be processed. Although big data doesn't equate to any
specific volume of data, the term is often used to describe terabytes, petabytes and even
exabytes of data captured over time. As IoT technology comes into widespread use, the data to
be processed will grow even larger.
It is impossible for humans to analyze big data directly, so people have developed intelligent
systems that use machine learning to analyze big data more easily. Big Data and Machine
Learning are two complementary components. If we want to analyze Big Data, we have to use
machine learning techniques; conversely, if we want to create an intelligent system using
machine learning, we have to use a large amount of data.

Deep Learning is one of the most popular topics in machine learning, because this technique
allows intelligent systems to achieve high accuracy with the power of big data.
A representative figure showing artificial intelligence, machine learning and deep learning,
and the chronological development of these concepts, is given below. (Source of image:
https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/ ).
Machine learning techniques are used in a wide range of application areas around the globe.
People use intelligent systems developed with machine learning techniques countless times in a
single day. When using a mobile phone, surfing the internet or buying something online, we
encounter many intelligent systems. Technology companies have spent huge amounts of money on
developing more intelligent systems. Almost all machines will be intelligent in the future,
because intelligent systems make life easier. And of course, people love applications that
make life easier.
Gartner publishes emerging technology trends every year in its hype cycle format. The hype
cycle is a representative graph of trending topics. This format assumes that a technology
passes through five phases before it is used worldwide: (1) Innovation Trigger, (2) Peak of
Inflated Expectations, (3) Trough of Disillusionment, (4) Slope of Enlightenment, (5) Plateau
of Productivity. When a technology reaches the plateau, it starts to be used worldwide.
Reaching the plateau takes a long time for some technologies, while others reach it quickly.
The hype cycle also shows the time required to reach the plateau. The Gartner Hype Cycle for
Emerging Technologies for 2016 is given below.

As shown in the figure, Machine Learning is at the Peak of Inflated Expectations, and it is
expected to reach the plateau in about 2 to 5 years. Compared with other technologies, this is
a very short time. Big companies such as Google, Facebook and Apple are already spending huge
amounts of money on improving artificial intelligence and machine learning. Most of them use
deep learning in some of their projects. Detailed examples are given in the next section.
With the widespread use of the internet, a huge amount of data flows across it. Detecting
events with malicious behavior becomes more difficult as the data flow increases. Like other
application areas, the cyber security domain needs to strengthen its defenses using machine
learning techniques.
The rest of this chapter first gives examples of machine learning application areas to aid
understanding. It then gives a brief introduction to deep learning and examples of its
applications around the globe. Finally, it gives a technical review of machine learning
techniques.

Application Areas in Daily Life
Deep Blue is one of the most important milestones in AI history. Deep Blue was a chess-playing
computer developed by IBM. It is known as the first computer chess-playing system to win a
chess match against a reigning world champion. Deep Blue won its first game against world
champion Garry Kasparov on February 10, 1996. However, Kasparov won three and drew two of the
following five games, defeating Deep Blue by a score of 4–2. Deep Blue was then heavily
upgraded, and played Kasparov again in May 1997. Deep Blue won game six, therefore winning the
six-game rematch 3½–2½ and becoming the first computer system to defeat a reigning world
champion in a match under standard chess tournament time controls. (Deep Blue is shown in the
accompanying figure.)
Chess was thought to be a game of intelligence. Playing chess well is a very hard task even
for humans. Because of this, the first match won by a computer against a world champion was
widely discussed at the time.
How was this possible? Let’s take a deeper look.
The Shannon number is a conservative lower bound (not an estimate) of the game-tree complexity
of chess of 10^120, based on an average of about 10^3 possibilities for a pair of moves
consisting of a move for White followed by one for Black, and a typical game lasting about 40
such pairs of moves. Shannon calculated it to demonstrate the impracticality of solving chess
by brute force, in his 1950 paper "Programming a Computer for Playing Chess". The number
10^120 is enormous; for comparison, the total number of atoms in the universe is estimated at
between 10^79 and 10^81. If calculating one variation takes 1 microsecond, calculating every
variation takes more than 10^90 years. Searching the full tree would therefore need far more
processing capacity than exists. Since the technology of the time did not allow it, the depth
of the tree that was searched was limited.
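The arithmetic behind these figures can be checked with a short Python sketch. The rate of one
variation per microsecond comes from the text; the 365.25-day year and all names in the code
are assumptions made for illustration.

```python
# Rough check of the Shannon-number arithmetic (illustrative sketch only).
OPTIONS_PER_MOVE_PAIR = 10**3   # about 10^3 possibilities per White+Black move pair
MOVE_PAIRS_PER_GAME = 40        # a typical game lasts about 40 such pairs

# Game-tree size: (10^3)^40 = 10^120
game_tree_size = OPTIONS_PER_MOVE_PAIR ** MOVE_PAIRS_PER_GAME

# One variation per microsecond means 10^6 variations per second.
VARIATIONS_PER_SECOND = 10**6
SECONDS_PER_YEAR = 31_557_600   # 365.25 days; an assumption for this sketch

years_needed = game_tree_size // (VARIATIONS_PER_SECOND * SECONDS_PER_YEAR)


def order_of_magnitude(n: int) -> int:
    """Return the exponent e such that 10^e <= n < 10^(e+1)."""
    return len(str(n)) - 1


print("game tree size: 10^%d" % order_of_magnitude(game_tree_size))
print("years needed:   about 10^%d" % order_of_magnitude(years_needed))
```

Under these assumptions an exhaustive search comes out far beyond 10^90 years, which supports
the point that brute force is out of reach and the search depth has to be limited.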
As explained before, AI aims to create systems that work like the human brain. In practice,
AI can be applied only to specific application areas. Chess is one of the fields in which
artificial intelligence has been applied successfully.
Go is another mind game that has challenged artificial intelligence. Go, which was invented in
China, is an abstract strategy board game for two players, in which the aim is to surround
more territory than the opponent. In the years when Deep Blue defeated Kasparov, some people
thought that a computer could not defeat humans at Go, or that it would take a very long time.
Almost twenty years after 1996, AI defeated humans at Go. Let’s look at the numbers. The board
size can vary, so people can play Go on a 7x7, 9x9, 19x19 or 21x21 board. In our example, we
play against a computer on a 19x19 board. Assume that the average number of moves in a game
between experts is about 200 (as research suggests) and that the average number of choices
for every move is about 250. The total number of variations that the computer must calculate
is 3×10^511 for a 19x19 board. This number is far larger than that of chess. (Do not forget
that the total number of atoms in the universe is between 10^79 and 10^81.) Professional games
can run to 350 moves. The total number of variations that the computer must calculate for this
move count is 1.3×10^895. You can see why this problem is so hard to solve. (If you want to
explore more numbers on this topic, check this link).
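As a rough check on these magnitudes, the sketch below counts move sequences with Python's
arbitrary-precision integers. It assumes, purely for illustration, that every one of the 361
points of a 19x19 board is a candidate at each move. Under that assumption the counts match
the two figures quoted above, while 250 choices per move gives a smaller (but still
astronomically large) number. The helper name is invented for this example.

```python
# Counting Go move sequences with exact integers (illustrative sketch only).
BOARD_POINTS = 19 * 19          # 361 points on a full-sized board
CHOICES_PER_MOVE = 250          # average number of choices quoted in the text


def magnitude(n: int) -> tuple:
    """Return (leading digit, power of ten) of a positive integer."""
    digits = str(n)
    return int(digits[0]), len(digits) - 1


# Assumption: any of the 361 points may be chosen at every move.
games_200_moves = BOARD_POINTS ** 200
games_350_moves = BOARD_POINTS ** 350

# Alternative count using the average of 250 choices per move.
games_250_choices = CHOICES_PER_MOVE ** 200

for label, value in [
    ("361^200", games_200_moves),
    ("361^350", games_350_moves),
    ("250^200", games_250_choices),
]:
    lead, exponent = magnitude(value)
    print("%s is roughly %dx10^%d" % (label, lead, exponent))
```

Whichever counting rule is used, the result is hundreds of orders of magnitude larger than the
10^120 figure for chess.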
Humans were defeated at Go by an artificial intelligence application named AlphaGo. AlphaGo is
a computer program developed by Google DeepMind in London to play the board game Go. In
October 2015, it became the first computer Go program to beat a professional human Go player
without handicaps on a full-sized 19×19 board. In March 2016, it beat Lee Sedol (one of the
best players in the world) in a five-game match, the first time a computer Go program had
beaten a 9-dan professional without handicaps. This is another of the most important
milestones in AI history. AlphaGo was trained with deep learning. How a program trained with
deep learning achieved this success is explained in the following sections.
The two examples above focus on defeating humans in areas that require intelligence. Most AI
applications, however, focus on supporting humans rather than defeating them. Such
applications use machine learning techniques to learn a specific problem in order to support
people. Almost everyone uses many applications developed with machine learning in daily life,
knowingly or not. Some examples of these applications are given below.
Recommendation systems are one of the best-known machine learning topics in the literature and
in business. Recommender systems are software tools and techniques providing suggestions for
items to be of use to a user. The suggestions are aimed at supporting users in various
decision-making processes, such as what items to buy, what music to listen to, what movie to
watch or what news to read. Recommender systems have proven to be valuable means for online
users to cope with information overload and have become one of the most powerful and popular
tools in electronic commerce. Correspondingly, various techniques for recommendation
generation have been proposed and, during the last decade, many of them have also been
successfully deployed in commercial environments.
The interesting thing is that the system computes information about items and estimates
roughly how a user would rate an item he or she has not seen before. Amazon recommends books
or other products that the user will probably like. Facebook shows advertisements and
recommends friends or events that the user will probably like. YouTube recommends videos and
Spotify recommends music. There are countless examples.
Recommendation systems are in widespread use across the globe. According to a report published
by Netflix in 2014, ⅔ of the movies watched on Netflix are watched as a result of a
recommendation. Recommendations generate 38% more click-through for Google News. Similarly,
35% of Amazon sales are made through recommendation systems. YouTube and other firms also rely
heavily on recommendation systems. Recently, recommendation systems have been developed using
deep learning. Big companies such as YouTube and Facebook make extensive use of the power of
deep learning.
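To make the idea of estimating a rating for an unseen item concrete, here is a minimal sketch
of user-based collaborative filtering, one common recommendation technique. It is not the
method of any company named above; the users, items and ratings are invented, and cosine
similarity is just one possible similarity measure.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale; all names are invented for illustration.
ratings = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob":   {"film_a": 4, "film_b": 5, "film_c": 2, "film_d": 5},
    "carol": {"film_a": 1, "film_b": 2, "film_c": 5, "film_d": 1},
}


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_rating(user: str, item: str):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        similarity_sum += similarity
    return weighted_sum / similarity_sum if similarity_sum else None


predicted = predict_rating("alice", "film_d")
print("predicted rating of film_d for alice: %.2f" % predicted)
```

Because alice's past ratings resemble bob's more than carol's, the estimate is pulled towards
bob's high rating of the unseen film.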
Another well-known machine learning application area is activity recognition. The main purpose
of this type of application is to detect which activity the user is performing at a given
time. This can be done on a mobile phone or on an external device such as a smartwatch. Big
mobile phone manufacturers research this topic heavily. Companies such as Apple and Samsung
ship an activity recognition application as one of the default applications on their phones.
Developing an intelligent system for activity recognition requires information produced by
sensors. The accelerometer, gyroscope and GPS are the most commonly used sensors in this area.
Machine learning techniques are used to detect which activity the user is performing. Such
applications can tell us how many calories were burned, how many kilometers were walked or how
healthy the user's daily life is.
Machine learning can also be used to make predictions about the future. For example, weather
forecasting applications process current and past weather data to derive information about
future weather conditions. Another example of prediction is ATM cash optimization. The money
held in an ATM is of no use to a bank while customers are not withdrawing it. In this
situation the money is useful to neither the customer nor the bank. If an intelligent system
is developed to predict the optimum amount of money for an ATM weekly or monthly, banks can
use the remaining money for other purposes. According to a recent study, banks using an
intelligent system that estimates the optimum amount of money in ATMs could double the number
of ATMs without changing the total amount of money held across all ATMs. Another example is
house price prediction. In this type of problem, the system tries to predict the actual value
of a house using information such as the house itself, its location, nearby transport links
and land value. There are many other examples of forecasting.
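As an illustration of this kind of prediction, the sketch below fits a straight line to a
handful of house prices using ordinary least squares. The data are invented, a single feature
(floor area) stands in for the richer information mentioned above, and a real system would use
many more features and samples.

```python
# Hypothetical training data: floor area (m^2) and sale price (arbitrary units).
areas = [50.0, 80.0, 100.0, 120.0, 150.0]
prices = [110.0, 168.0, 215.0, 250.0, 310.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)


def predict_price(area: float) -> float:
    """Predict the price of a house from its floor area."""
    return slope * area + intercept


print("predicted price for 110 m^2: %.1f" % predict_price(110.0))
```

The prediction for an unseen 110 m^2 house falls between the prices of the 100 m^2 and
120 m^2 houses in the training data, as one would expect from a linear trend.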
Image processing is one of the fields in which machine learning techniques are used most
often. In imaging science, image processing is processing of images using mathematical
operations by using any form of signal processing techniques. The input may be an image, a
series of images, or a video, such as a photograph or video frame. The output of image
processing may be either an image or a set of characteristics or parameters related to the
image. Examples of image processing using machine learning techniques include face
recognition, fingerprint recognition, moving object recognition, information retrieval from
images and medical applications. Moving object recognition is widely used for military
purposes and in applications such as traffic density detection. Many studies in this field
have achieved high accuracy using deep learning. Machine learning can also be used for
text-based applications such as real-time language translation and detecting the main idea of
an article.
Another trending topic in machine learning is the autonomous car. An autonomous car
(driverless car, self-driving car, robotic car) is a vehicle that is capable of sensing its
environment and navigating without human input. Autonomous cars can detect surroundings using
a variety of techniques such as radar, lidar, GPS, odometry, and computer vision. Google's
self-driving car is an autonomous car project. To create an autonomous car, the system must be
equipped with strong artificial intelligence. (Image source: https://waymo.com/ )
The project started in 2009 and in 2015 completed its first driverless ride on public roads.
It is now being tested in Austin, Texas. In December 2016, Google transitioned the project
into a new company called Waymo, housed under Google’s parent company Alphabet. Alphabet
describes Waymo as “a self-driving tech company with a mission to make it safe and easy for
people and things to move around.” The new company plans to make self-driving cars available
to the public in 2020 (image source McKinsey & Company). Google’s self-driving car is designed
for autonomous driving, so it has no pedals or steering wheel. Everything is done using
sensor input.
There is no direct input from a human. Sensor inputs are processed by machine learning
techniques. Google is not the only autonomous car developer in the sector. Many of the big
companies in the automobile industry are doing research on driverless cars.
What is Deep Learning?
Deep learning (also known as deep structured learning, hierarchical learning or deep machine
learning) is a branch of machine learning based on a set of algorithms that attempt to model
high level abstractions in data. Deep Learning is a subfield of machine learning concerned with
algorithms inspired by the structure and function of the brain called artificial neural networks.
(Source of figure: http://fortune.com/ai-artificial-intelligence-deep-machine-learning/ ).

It was developed following the early Perceptron learning algorithm, which was limited: a
single-layer perceptron cannot learn the exclusive-or (XOR) function. To resolve this problem,
several layers of learning algorithms needed to be developed. A deep learning model may have
many layers, depending on the complexity of the problem, and a large amount of data can be
used to train it. Processing a large amount of data with a large number of neurons and layers
requires high processing capacity. CPUs alone are inadequate for this job. A system that runs
deep learning needs much more processing power.
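The classic illustration of the perceptron's limitation is the exclusive-or (XOR) function: a
single perceptron can learn OR, which is linearly separable, but not XOR, while a network with
one hidden layer can represent it. The sketch below shows both facts. The training settings
and the hand-picked hidden-layer weights are assumptions chosen for illustration, not learned
values.

```python
def step(z: float) -> int:
    """Threshold activation used by the classic perceptron."""
    return 1 if z > 0 else 0


INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
OR_DATA = [(x, x[0] | x[1]) for x in INPUTS]
XOR_DATA = [(x, x[0] ^ x[1]) for x in INPUTS]


def train_perceptron(data, epochs=100, lr=0.1):
    """Perceptron learning rule for a single unit with two inputs."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            error = target - step(w1 * x1 + w2 * x2 + b)
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
    return w1, w2, b


def accuracy(params, data) -> float:
    """Fraction of samples the single perceptron classifies correctly."""
    w1, w2, b = params
    hits = sum(step(w1 * x1 + w2 * x2 + b) == t for (x1, x2), t in data)
    return hits / len(data)


def two_layer_xor(x1: int, x2: int) -> int:
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)   # fires for OR but not AND


or_accuracy = accuracy(train_perceptron(OR_DATA), OR_DATA)
xor_accuracy = accuracy(train_perceptron(XOR_DATA), XOR_DATA)
print("single perceptron on OR: ", or_accuracy)
print("single perceptron on XOR:", xor_accuracy)
print("two-layer network on XOR:", [two_layer_xor(*x) for x in INPUTS])
```

No choice of weights lets the single unit classify all four XOR inputs correctly, because no
straight line separates the two classes; adding a layer removes that restriction.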
This is where GPUs come into play.
GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU
to accelerate deep learning, analytics, and engineering applications. GPUs play a huge role in
accelerating applications in platforms ranging from artificial intelligence to cars, drones, and
robots. (See more at: http://www.nvidia.com/object/what-is-gpu-computing.html).
A simple way to understand the difference between a GPU and a CPU is to compare how they
process tasks. A CPU consists of a few cores optimized for sequential serial processing while a
GPU has a massively parallel architecture consisting of thousands of smaller, more efficient
cores designed for handling multiple tasks simultaneously.
(Source: http://www.nvidia.com/object/what-is-gpu-computing.html).
The core of deep learning is that we now have fast enough computers and enough data to
actually train large neural networks. As we construct larger neural networks and train them
with more and more data, their performance continues to increase. This is generally different
from other machine learning techniques, which reach a plateau in performance. This is the key
reason why deep learning has become such a trending topic today. A representative figure is
given below. (Source of image: Andrew Ng)

Deep learning methods aim at learning feature hierarchies with features from higher levels of
the hierarchy formed by the composition of lower level features. Automatically learning
features at different levels of abstraction permits a system to learn complex functions
mapping the input to the output directly from data, without depending completely on
human-crafted features. An example of the working mechanism of deep learning is given below.
(Source of image: http://fortune.com/ai-artificial-intelligence-deep-machine-learning/ ).

Many companies have already started to use deep learning. The most famous ones are described
in the rest of this section. (Source of “four tech giants get serious about deep learning”:
http://fortune.com/ai-artificial-intelligence-deep-machine-learning/ ).

Startup Deep Genomics, which is backed by Bloomberg Beta and True Ventures among others,
has fed deep learning machines tons of existing cellular information in order to teach
machines to predict outcomes from alterations to the genome, whether naturally occurring
or through medical treatment. The technology could provide the most precise understanding
of an individual’s specific disease or abnormality and how that person’s well being can best be
advanced.
As more devices become internet-enabled, hackers have an increasing number of entry points to
infiltrate systems and cloud infrastructure. The best cybersecurity practices not only create
more secure systems but can predict where the next attack will come from. This is critical
since hackers are always on the hunt for the next vulnerable endpoint, so protecting against
cyber attack requires “thinking” like a hacker. Companies like Israel-based, Blumberg
Capital-backed Deep Instinct aim to use deep learning in order to recognize new threats that
have never been detected before and thus keep organizations one step ahead of cyber criminals.
There are already plenty of cars on the road with driver-assistance
capabilities, but these cars still rely on users to take over when an
unforeseen event occurs that the car isn’t programmed to respond
to. As Sameep Tandon of startup Drive.ai notes, the challenge with
self-driving cars is handling the “edge cases,” such as weather. This
is why, using deep learning, Drive.ai plans to help the car build up experience through
simulations of many kinds of driving conditions. Nvidia is also working on self-driving car
technology. Nvidia says it has used deep learning to train a car to drive on marked and
unmarked roads and along the highway in various
weather conditions, without the need to program
every possible “if, then, else” statement. In this sector, Google and many of the big
companies in the automobile industry are doing research on driverless cars.
Since deep learning has already seen widespread experimentation
and refinement for textual analysis, it’s no surprise that Google, the
leader in search, has made widespread deep learning-based
updates to its search technology. Google’s deep learning-based
RankBrain technology was added to how Google manages and fills
search queries back in 2015. The technology helps handle queries that have not
been seen before.
Apple moved Siri voice recognition to a neural-network-based system for US users in late July
2014 (it went worldwide on August 15, 2014). Some of the previous techniques remained
operational, but the system now leverages machine learning techniques, including types of deep
learning. When users made the upgrade, Siri still looked the same, but now it was supercharged
with deep learning.
(Some of these examples are taken from this link. If you want to read more, you may check this
link also.)

Technical Review
The rest of this document explains the cyber security subjects that machine learning has made
more powerful. Briefly, these subjects are spam filters, IDS/IPS systems, false alarm rate
reduction, fraud detection, cyber security rating, incident forecasting, secure user
authentication, and botnet detection systems. Finally, there is one main section on bypassing
security mechanisms, which covers techniques developed for offensive purposes.
Before explaining these subjects, we want to give a brief introduction to the technical
background of machine learning so that the following topics are easier to understand. To
begin, let's take a quick look at the general structure of machine learning in the figure
below. (Source of image: http://www.isaziconsulting.co.za/machinelearning.html)
Machine learning problems can be divided into three main categories according to the
characteristics of the problem: supervised learning, unsupervised learning and reinforcement
learning. Supervised learning techniques are divided into two subcategories, classification
and regression. In a classification problem, we have completely separate classes and the main
task is to find the class to which a test sample actually belongs. When the target is not a
set of separate classes but a continuous value, the problem is called a regression problem.
Unsupervised learning techniques are divided into two subcategories, clustering and
dimensionality reduction. Clustering groups samples according to their similarities,
regardless of class information. Recommendation systems are another unsupervised learning
technique. They are used to recommend something to users; it can be a movie, music or
something sold in a marketplace.
There is one main difference between supervised and unsupervised approaches. Supervised
learning techniques use labeled samples to train the model, whereas unsupervised techniques
use unlabeled samples in the training step. In general, the quality of the dataset used in the
training phase is one of the most important factors in achieving high accuracy. Once the model
is complete, it is applied to samples drawn from real-time data.
Reinforcement learning is an area of machine learning concerned with how software agents
ought to take actions in an environment so as to maximize some notion of cumulative reward.
These techniques are generally used in robotics applications.
Using machine learning techniques involves two phases: the training phase and the test phase.
In the training phase, the system learns a model using the chosen algorithm. This model
represents the solution to the problem we want to solve. In the test phase, the model learned
in the first phase is tested, so that we can analyze how successful it is.
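The two phases can be sketched with a very small supervised classifier. The example below
uses a nearest-centroid rule, chosen only because it fits in a few lines; the samples, labels
and the even/odd split into training and test sets are all invented for illustration.

```python
# Hypothetical labeled samples: (feature vector, class label).
samples = [
    ((1.0, 1.2), "benign"), ((0.8, 1.0), "benign"),
    ((1.1, 0.9), "benign"), ((0.9, 1.1), "benign"),
    ((3.0, 3.2), "malicious"), ((3.1, 2.9), "malicious"),
    ((2.9, 3.0), "malicious"), ((3.2, 3.1), "malicious"),
]

# Keep the two phases separate: every other sample is held out for testing.
train_set = samples[::2]
test_set = samples[1::2]


def train(data):
    """Training phase: learn one centroid (mean point) per class."""
    sums, counts = {}, {}
    for (x, y), label in data:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lab: (sx / counts[lab], sy / counts[lab]) for lab, (sx, sy) in sums.items()}


def predict(model, point):
    """Assign the point to the class with the nearest centroid."""
    x, y = point
    return min(model, key=lambda lab: (x - model[lab][0]) ** 2 + (y - model[lab][1]) ** 2)


def evaluate(model, data) -> float:
    """Test phase: fraction of held-out samples classified correctly."""
    correct = sum(predict(model, point) == label for point, label in data)
    return correct / len(data)


model = train(train_set)
test_accuracy = evaluate(model, test_set)
print("accuracy on held-out test samples:", test_accuracy)
```

Measuring accuracy on samples the model never saw during training is what makes the test
phase a meaningful estimate of how successful the model is.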

Finally, there is one more thing we want to explain about learning. Learning can be done all
at once (batch learning) or continuously (incremental learning). Which one to use depends
entirely on the definition of the problem. Incremental learning can be considered a version of
batch learning that is updated over time. The accompanying figure shows a classification of
cyber security subjects that use machine learning. We strongly recommend looking at this
figure before reading the topics below.
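The difference between batch and incremental learning can be seen even with the simplest
possible "model", an average. In the sketch below the batch version needs the whole dataset at
once, while the incremental version updates its estimate as each new value arrives. The packet
sizes and all names are hypothetical.

```python
def batch_mean(values):
    """Batch learning: the estimate is computed once from the full dataset."""
    return sum(values) / len(values)


class IncrementalMean:
    """Incremental learning: the estimate is updated one sample at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        self.count += 1
        # Move the current estimate towards the new value by 1/count of the gap.
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical stream of packet sizes in bytes.
packet_sizes = [500.0, 520.0, 480.0, 510.0, 1500.0]

learner = IncrementalMean()
for size in packet_sizes:
    learner.update(size)

print("batch estimate:      ", batch_mean(packet_sizes))
print("incremental estimate:", learner.mean)
```

Both approaches arrive at the same estimate here, but the incremental learner never has to
store or reprocess earlier samples, which matters when data arrives continuously.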

SPAM FILTER
Spam mail (also known as junk mail) is a type of electronic spam in which unsolicited messages
are sent by email. Most spam messages are sent for commercial purposes, but some contain
malicious content, for example a phishing message that imitates a popular website. Malicious
content may include malware, scripts or executable file attachments. When users recognize a
spam mail, they can easily add its source to a blacklist, but some emails are crafted
professionally and ordinary users often cannot recognize them as spam. For this reason, every
mail service provider uses spam filter applications developed with machine learning
techniques. One of the best-known algorithms for spam detection is the Naive Bayes algorithm,
which is based on a statistical approach. In this section, we explain how the Naive Bayes
algorithm works.
The spam filtering problem can be solved using supervised learning approaches, and Naive Bayes
is one of the best-known supervised algorithms. As explained before, every machine learning
algorithm has two phases: training and testing. Because the problem is supervised, the Naive
Bayes algorithm uses a dataset of labeled samples.
Basically, the Naive Bayes algorithm uses the frequency of words in the email text. For every
sample, the training dataset holds the words, the count of each word and the class label. A
basic example of the dataset is given below. Every row represents a single mail.
[id1, ham, word1, word1_Count, word2, word2_Count.............]
[id2, spam, word1, word1_Count, word2, word2_Count.............]
[id3, ham, word1, word1_Count, word2, word2_Count.............]
The Naive Bayes algorithm is based on Bayes' theorem. In the training phase it performs the
following two steps.
a. First, it calculates the prior probabilities of the ham and spam classes.
\[ P(\text{ham}) = \frac{\text{number of mails labeled "ham"}}{\text{total number of mails}} \]
\[ P(\text{spam}) = \frac{\text{number of mails labeled "spam"}}{\text{total number of mails}} \]
b. Next, for each word W_i, it calculates the probability of the word given each class.
\[ P(W_i \mid \text{ham}) = \frac{P(\text{ham} \mid W_i)\, P(W_i)}{P(\text{ham})} \]
\[ P(W_i \mid \text{spam}) = \frac{P(\text{spam} \mid W_i)\, P(W_i)}{P(\text{spam})} \]

After the training phase, in the test phase we calculate a ham score and a spam score for
every sample x using the words w_i in that sample. Under the naive assumption that words are
independent given the class, the per-word probabilities are multiplied:
\[ p_{\text{Ham}}(x) = P(\text{ham}) \prod_i P(w_i \mid \text{ham}) \]
\[ p_{\text{Spam}}(x) = P(\text{spam}) \prod_i P(w_i \mid \text{spam}) \]
Finally, pHam and pSpam are compared, and the test sample is assigned to the class with the
higher score.
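The training and test phases described above can be sketched in a few lines of Python. This
is a minimal illustration, not a production filter: the four training mails are invented, the
per-word probabilities are estimated directly from word counts, add-one (Laplace) smoothing is
an added assumption to avoid zero probabilities, and scores are summed in log space, which is
equivalent to multiplying the probabilities but avoids numerical underflow.

```python
import math
from collections import Counter

# Hypothetical labeled training mails: (class label, text).
training_mails = [
    ("ham", "meeting schedule for project review"),
    ("ham", "please review the project report"),
    ("spam", "win free money now"),
    ("spam", "free prize claim your money"),
]


def train(mails):
    """Training phase: class priors and per-class word counts."""
    class_counts = Counter()
    word_counts = {"ham": Counter(), "spam": Counter()}
    for label, text in mails:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["ham"]) | set(word_counts["spam"])
    total = sum(class_counts.values())
    priors = {label: class_counts[label] / total for label in class_counts}
    return priors, word_counts, vocabulary


def log_score(text, label, model) -> float:
    """log P(label) + sum of log P(word | label) over the known words in text."""
    priors, word_counts, vocabulary = model
    total_words = sum(word_counts[label].values())
    score = math.log(priors[label])
    for word in text.lower().split():
        if word not in vocabulary:
            continue  # words never seen in training are ignored
        # Add-one smoothing (an assumption of this sketch).
        probability = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(probability)
    return score


def classify(text, model) -> str:
    """Test phase: pick the class with the higher score."""
    priors = model[0]
    return max(priors, key=lambda label: log_score(text, label, model))


model = train(training_mails)
print(classify("free money prize", model))
print(classify("project review meeting", model))
```

With so little training data the result is only indicative, but it shows how word frequencies
learned from labeled mails drive the ham/spam decision.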
Traditional filters, such as the Naive Bayes algorithm explained above, rely on textual
information. In response, attackers started to send spam as images instead of words. The
images contain text, but text-based machine learning algorithms cannot process it in that
format. Researchers responded by using image processing techniques to extract the words from
images before running the machine learning algorithms. Of course, attackers responded to this
too: they placed the text in images at different angles to make it harder to recognize. As you
can imagine, researchers solved this problem as well. Next, attackers used images with angled
words whose letters have different colors. The war continues in this way. New techniques
emerge day by day to bypass filter mechanisms, and in response new techniques are developed
day by day to block spam.
SpamAssassin is an open source mail filter for identifying spam. It is an intelligent email
filter which uses a diverse range of tests to identify undesirable email messages, more
commonly known as spam. These tests are applied to email headers and contents in order to
classify emails using advanced statistical methods of machine learning. In addition,
SpamAssassin has a modular architecture that allows other technologies to be quickly utilized
against spam, and it is designed for easy integration into virtually any email system.
New techniques are developed over time to strengthen the filtering mechanism, but spammers are
also able to bypass spam filtering systems by generating more sophisticated spam. For this
reason the tool is updated on a regular basis. Anyone can download and use it freely.

SpamAssassin uses the combined score from multiple types of checks to determine whether
a given message is spam or not. Its primary features are:
● Header tests
● Body phrase tests (SpamAssassinRules.)
● Bayesian filtering (BayesFaq)
● Automatic address whitelist/blacklist (AutoWhitelist)
● Automatic sender reputation system (TxRep)
● Manual address whitelist/blacklist (ManualWhitelist)
● Collaborative spam identification databases (DCC, Pyzor, Razor2) (UsingNetworkTests)
● DNS Blocklists, also known as "RBLs" or "Realtime Blackhole Lists" (DnsBlocklists)
● Character sets and locales
Any single test may misidentify a ham or spam message on its own, but with the combined
score of all tests it is hard to be mistaken.
Since version 3.0.0, SpamAssassin has used a perceptron model to perform the same task
faster. The perceptron is one of the neural network techniques. In the new algorithm, the
training phase is performed with the Stochastic Gradient Descent method. It uses a single
perceptron with a logsig (logistic sigmoid) activation function and maps the weights to the
SpamAssassin score space.
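To make the idea concrete, a single perceptron with a logistic sigmoid activation trained by stochastic gradient descent can be sketched as follows. This is a generic illustration and not SpamAssassin's actual training code: the feature matrix (which of three hypothetical tests fired on each message), the learning rate and the epoch count are all assumptions, and the final mapping of weights to the SpamAssassin score space is not attempted.

```python
import math
import random

def logsig(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def train_perceptron(X, y, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # stochastic: one sample at a time, in random order
        for i in order:
            p = logsig(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log loss w.r.t. the pre-activation
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict(x, w, b):
    return 1 if logsig(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0

# Hypothetical data: columns are tests (two spam indicators, one ham indicator),
# rows are messages, 1 means the test fired. Label 1 = spam, 0 = ham.
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_perceptron(X, y)
print(w, b)
```

The learned weights play the role of per-test scores: tests that indicate spam end up with positive weights, and the test that indicates ham ends up with a negative one.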

IDS/IPS SYSTEMS WITH ML
Intrusion Detection and Intrusion Prevention Systems (IDS/IPS) basically analyze data packets
and determine whether they belong to an attack or not. After the analysis, the system can
take precautions according to the result. IDS/IPSs fall into two main categories based on
their operational logic: (1) Signature Based IDS, (2) Anomaly Based IDS.
Signature Based IDS works with attack signatures, which are created from information about
known vulnerabilities. Signatures contain detailed information about attacks. This type of
system has a high accuracy rate for known attacks, but it cannot detect unknown attacks.
Because of this, a new signature must be created whenever a new attack is discovered, and it
must be imported into the system immediately. Whereas these systems are not resistant to
0-day attacks, Anomaly Based IDS is able to detect 0-day attacks, but it also has a high
false alarm rate.
The operational logic of Signature Based IDS is a basic classification problem. Incoming
events are compared with the signatures; if a match is found, an alert is raised, otherwise
the event is considered non-malicious. Signature Based IDS therefore has low flexibility and
uses low-level machine learning structures. Conversely, Anomaly Based IDS has high
flexibility and uses high-level machine learning structures. For this reason, this chapter
concentrates on Anomaly Based IDS and gives more detailed information about it.
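The compare-and-alert logic of a signature-based system can be illustrated with a toy matcher. The signature names and byte patterns below are invented for the example and are far simpler than the signatures of any real IDS; the sketch only shows why an exact match is accurate for known payloads and blind to small variations.

```python
# Hypothetical signatures: name -> byte pattern that must appear in the payload.
SIGNATURES = {
    "sql-injection": b"' OR '1'='1",
    "path-traversal": b"../../",
}

def match_signatures(payload, signatures=SIGNATURES):
    """Return the names of all signatures found in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]

known = b"GET /item.php?id=1' OR '1'='1 HTTP/1.1"
variant = b"GET /item.php?id=1' oR '1' = '1 HTTP/1.1"  # slightly altered attack
benign = b"GET /index.html HTTP/1.1"

print(match_signatures(known))    # alert: the known pattern is present
print(match_signatures(variant))  # no alert although the intent is the same
print(match_signatures(benign))   # no alert
```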
In general, anomalies are divided into three main categories: point anomalies, contextual
anomalies, and collective anomalies. In addition, four types of attack are defined in
academic research, and each type has specific behavior. The characteristic behaviors of
these attacks are given in the figure on the right-hand side.
For academic purposes, many datasets are publicly available on the internet. KDD99, which
was first created in 1998 and last updated in 2008, is one of the best-known datasets in the
academic literature. It contains 7 weeks of connection-based network traffic. Both
supervised and unsupervised approaches can be applied to build these systems.
To date, a lot of academic research has been carried out using both supervised and
unsupervised techniques. In recent years researchers have also combined these techniques and
reported high accuracy rates. These results are debatable, because the dataset used in the
training phase, mostly KDD99, is out of date. New attacks discovered after the dataset was
created cannot easily be added to it, so researchers cannot tell whether these new attacks
would be recognized or not.

In supervised approaches, the system works with labeled events that occurred in the network.
These approaches are similar to Signature Based IDS but not the same; the only difference is
that the attack events used in the training phase are derived from network flow data. As
mentioned before, Signature Based IDS/IPS uses attack signatures, whereas Anomaly Based
IDS/IPS uses network flow data. Many supervised techniques have been used in the literature;
the best-known algorithms are Support Vector Machine, Bayesian Network, Artificial Neural
Network, Decision Tree, and k-Nearest Neighbor. The biggest advantage of these approaches is
that they recognize well-known malicious activities with high accuracy and a low false alarm
rate. Their disadvantage is a weak ability to recognize 0-day attacks.
In unsupervised approaches, the dataset does not contain any class information. Such
approaches rest on two main assumptions: the user profile does not change significantly in a
short time, and malicious activity causes an abnormal change in the network flow. The
operational logic is to cluster the whole network activity data, so that the algorithm
produces a certain number of classes (clusters). Some of these clusters contain a huge
number of events, whereas others contain very few. According to the assumptions above, the
clusters with a huge number of events represent normal user activity such as web browsing or
e-mail traffic, while the other clusters represent malicious activities produced by
attackers. The advantage of these approaches is a strong ability to detect 0-day attacks.
One disadvantage is that attackers can craft network traffic intelligently and bypass the
IDS/IPS; another is a high false alarm rate, which means that normal user activity can be
recognized as malicious. This problem is very important, and a lot of academic research has
been carried out to overcome it.
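The clustering logic described above can be sketched with a plain k-means implementation. Everything specific here is an assumption made for illustration: the two features per event, the toy values, the fixed starting centroids and the 25% size threshold below which a cluster is treated as suspicious. Real systems use many more features and more robust clustering.

```python
from collections import Counter

def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means on tuples of numbers; returns one cluster index per point."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        assignment = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centroids[c])))
            for pt in points
        ]
        # Move every centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

def small_clusters(assignment, max_share=0.25):
    """Clusters holding at most max_share of all events are treated as suspicious."""
    sizes = Counter(assignment)
    return {c for c, n in sizes.items() if n / len(assignment) <= max_share}

# Hypothetical events: (packets per second, distinct destination ports).
events = [(10, 2), (12, 3), (11, 2), (9, 1), (13, 2), (10, 3), (12, 1), (11, 3),
          (200, 90), (210, 95)]
assignment = kmeans(events, initial_centroids=[events[0], events[-1]])
flagged = small_clusters(assignment)
suspicious = [e for e, a in zip(events, assignment) if a in flagged]
print(suspicious)
```

The sketch also shows the weakness mentioned in the text: an attacker who generates traffic close to the large cluster would not be flagged, and an unusual but legitimate burst would be.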
Both techniques have advantages and disadvantages. Hybrid approaches have been developed to
combine the advantages efficiently and to eliminate the disadvantages. One part of the
detection mechanism works with a supervised algorithm, and another part works with an
unsupervised algorithm. In recent years most research has focused on hybrid detection
approaches.
Snort is a free and open source network intrusion prevention system
(NIPS) and network intrusion detection system (NIDS) that is used all
around the world. Snort's open source network-based intrusion
detection system (NIDS) has the ability to perform real-time traffic
analysis and packet logging on Internet Protocol (IP) networks. Snort
performs protocol analysis, content searching, and matching. These
basic services have many purposes including application-aware
triggered quality of service, to de-prioritize bulk traffic when latency-
sensitive applications are in use. Snort can be configured in three
main modes: sniffer, packet logger, and network intrusion detection. In sniffer mode, the
program will read network packets and display them on the console. In packet logger mode,
the program will record packets to the disk. In intrusion detection mode, the program will
monitor network traffic and analyze it against a rule set defined by the user. The program will
then perform a specific action based on what has been identified (Source: Wikipedia).

A signature is defined as any detection method that relies on distinctive marks or
characteristics being present in exploits. These signatures are specifically designed to detect
known exploits as they contain distinctive marks; such as ego strings, fixed offsets, debugging
information, or any other unique marking that may or may not be actually related to exploiting
a vulnerability. In this type of detection system, events are classified only after the
first detection, since actual public exploits are necessary for such systems to work.
Anti-virus companies use this type of technology to protect their customers from virus
outbreaks. As we have seen over the years, this type of protection has only limited
capabilities, since a signature can be written only after a system has been infected by a
virus. (Source: Snort)
Rule-based approaches use a different methodology for detection and have the advantage of
0-day detection, which makes them more capable. Unlike signatures, rules are based on
detecting the actual vulnerability, not an exploit or a unique piece of data. Developing a
rule requires a strong understanding of how the vulnerability actually works. (Source: Snort)
Traditional signature-based IDS/IPSs use attack signatures to detect attacks. But detecting
only well-known attacks cannot keep systems completely safe. An intelligent IDS/IPS must
detect 0-day attacks.
Attacks change with small variations over time, so the attacks which we call new are often
not really new; they differ only slightly from older attacks. Rules provide a flexible
definition of attacks, so we can detect 0-day attacks which vary only slightly from older
attacks.
There is one more thing about this topic. Attackers develop new techniques day by day to
bypass IDS/IPS systems. There are tools that generate malicious network activity which
appears to be produced by real users. Of course, in response, IDS/IPS systems are being
updated to recognize this type of attack.

False Alarm Rate Reduction
IDS/IPS systems may classify an event correctly or incorrectly. In the literature,
classified events are evaluated in four categories.
1. True Positives (TP): intrusive and anomalous,
2. True Negatives (TN): not intrusive and not anomalous,
3. False Positives (FP): not intrusive but anomalous,
4. False Negatives (FN): intrusive but not anomalous.
TP and TN represent correctly classified events; FP and FN represent wrongly classified
events. Recognizing an FN (intrusive but not anomalous) is a very hard task that the system
cannot accomplish by itself; the human factor must be involved in the mechanism to recognize
this type of event. An FP (not intrusive but anomalous) is an event classified as intrusive
although it is actually a normal user's event. This is a very common occurrence in today's
systems. False alarm rate reduction is one of the challenging problems, especially for
IDS/IPS systems used for commercial purposes.
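As a small worked example, the four categories can be counted from a list of events for which the ground truth is known, using the standard definitions (a positive is a raised alarm). The event lists are invented for illustration, and the false alarm rate is taken here as FP / (FP + TN), the usual definition of the false positive rate.

```python
def confusion_counts(actual, flagged):
    """actual[i]: event i is really intrusive; flagged[i]: the IDS raised an alarm."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for is_intrusive, is_flagged in zip(actual, flagged):
        if is_flagged:
            counts["TP" if is_intrusive else "FP"] += 1
        else:
            counts["FN" if is_intrusive else "TN"] += 1
    return counts

def false_alarm_rate(counts):
    """Share of normal events that triggered an alarm."""
    negatives = counts["FP"] + counts["TN"]
    return counts["FP"] / negatives if negatives else 0.0

# Hypothetical ground truth and IDS output for eight events.
actual = [True, True, False, False, False, False, True, False]
flagged = [True, False, True, False, False, False, True, True]

counts = confusion_counts(actual, flagged)
print(counts, false_alarm_rate(counts))
```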
In general, to reduce the false alarm rate, an extra module (also known as the filter) must
be placed in front of the IDS/IPS output. In this way, false alarms are eliminated from the
output, and the network administrator only has to handle a small number of alarms which may
really be intrusion attempts. Thus, time and manpower are saved. This chapter explains how
the filter module works and how it reduces false alarms.
The majority of researchers have provided alarm correlation solutions for anomaly
techniques, since purely anomaly-based techniques trigger more alarms than other techniques.
Although the hybrid approach optimizes the visibility and performance of the system, it makes
alarm correlation more complicated. Researchers' attention needs to be drawn to alarm
management solutions for the recently adopted hybrid detection methods.
There are two main assumptions behind Anomaly Based IDSs: first, intrusion events show
anomalous behavior, and second, the user profile does not change much in a short amount of
time. False alarms occur when the boundaries of these assumptions are not well defined.
Basically, the output of the IDS/IPS consists of two classes of events. The first is attack
events which are classified correctly, and the other is normal events which are falsely
classified as attacks. In reality, both attack events and normal events consist of many
classes. However, since we want to separate them into real alarms and false alarms, we treat
the output data as having two classes.

Now we have the output data, but we do not know which alarms are real and which are not. In
machine learning terminology this means that the data has no labels. Because of this, we can
use unsupervised techniques (also known as clustering techniques) to create two clusters.
Many algorithms have been developed for clustering. In general, clustering algorithms use
distance metrics to evaluate the similarity between samples. Every sample is clustered with
similar samples, so every cluster contains samples which are similar to each other. After
the algorithm runs, we have two clusters of alarm data: one represents normal events, and the
other represents attack events. Based on the two main assumptions explained above, we can
infer that the small cluster represents attack events.
The approach explained in this chapter is a basic one, presented to give a good
understanding of the methodology. Many different, more complex and successful approaches
have been developed in the literature. In recent studies, researchers have combined more
than one technique instead of using a single algorithm to reduce the false alarm rate. For
example, in two-layered clustering, the first layer separates suspicious events from
non-suspicious events, and the second layer gives the final decision for the clusters. Many
hybrid approaches like this have been developed in the literature.

FRAUD DETECTION
Fraud is one of the oldest phenomena in human history. As long as there have been
fraudsters, there have been people who are defrauded. Money, e.g. in the form of credit
cards, is a well-known target of fraudulent activities. With the development of the
e-marketing sector, the number of fraudulent activities is rising day by day. Users' credit
card information is stored in the databases of companies such as banks, online shopping
companies or online service providers. With the widespread use of the internet, we witness a
growing presence of fraud in online transactions. As a consequence, a need has emerged for
automatic systems which are able to detect and fight fraudsters.
Fraud detection is a notably challenging problem because:
● Fraud strategies change over time, and customers' spending habits evolve as well.
● Few examples of fraud are available, so it is hard to create a model of fraudulent
behaviour.
● Not all frauds are reported, and some are reported with a large delay.
● Few transactions can be investigated in a timely manner.
With the large number of transactions we witness every day:
● We cannot ask a human analyst to check every transaction one by one.
● We wish to automate the detection of fraudulent transactions.
● We want accurate predictions, i.e. to minimise missed frauds and false alarms.
These difficulties can be overcome with systems built on machine learning techniques. Such
systems can learn complex fraud patterns by examining large volumes of data, and they can
create an optimal model for fraudulent activities with complex shapes. Thus, new types of
fraud can be predicted successfully, and the system can adapt itself to a distribution that
changes over time (fraud evolution). However, these systems need enough samples to learn
successfully.
Basically, a user profile is created for every user in the detection system. This profile
must be updated regularly. Once the system has been trained with enough samples, it has
detailed information about each user's monthly, weekly or daily spending habits. For
example, suppose that a student spends $100 a week, while a businessman spends $1000 a week.
A single transaction of $400 has a high fraud probability for the student, but a very low
fraud probability for the businessman. Of course, special days such as New Year, birthdays
or weekends must be considered when designing a fraud detection algorithm, because students
may also spend a lot on these days. Generally, a fraudster does not know the victim's
spending habits, so the fraudulent activity presumably does not match the user profile. But
if the fraudulent activity fits the user profile, it may be hard to detect.
In the machine learning literature, fraud detection systems can be built with supervised,
unsupervised or mixed approaches. Each type of approach differs slightly in its working
logic. The approaches and their working logic are given in the figure below.
In supervised learning, labeled historical fraud data is used to create the user profile in
the training phase. These approaches are similar to Signature Based IDS/IPS: they can detect
well-known fraudulent activities, but they cannot detect new types of fraud. Systems trained
with unsupervised algorithms can detect unknown fraudulent activities. These approaches are
similar to Anomaly Based IDS/IPS: although they can detect unknown fraudulent activities, in
some cases they may miss well-known ones. To overcome the disadvantages of both techniques,
mixed approaches have been developed. In these approaches, supervised and unsupervised
algorithms work together, so both well-known and unknown fraudulent activities can be
detected efficiently.

CYBER SECURITY RATING and INCIDENT FORECASTING
Before explaining how rating and forecasting mechanisms work and which machine learning
algorithms can be used in them, we want to give a brief introduction to why cyber security
ratings are needed, where they can be used in the real world, and how this information can
be useful for companies.
Purpose of the Cyber Security Rating
According to a report published in 2014 by “The Centre for Strategic and International
Studies”, the annual cost of cyber security breaches is nearly $500 billion, and the average
cost for large companies is $3 million. Because of that, cyber security infrastructure has
vital importance for all companies, especially those which store valuable information in
virtual environments and on the internet. Companies have been spending huge amounts of money
to improve their cyber security infrastructure. But a company's CEO or IT directors cannot
effectively evaluate how much improvement their expenditure has achieved. This is called
return on investment in business terminology. To overcome this problem, metrics must be
calculated that show how strongly the cyber security infrastructure is built. Cyber security
ratings stand right in the middle of this calculation. With this score, everyone can easily
understand how well a company's cyber security infrastructure is designed. So, understanding
return on investment is the first main purpose of the cyber security rating mechanism.
This score must be applicable on a global scale, so that companies can evaluate their own
infrastructure in comparison with other companies in the same sector, or gain an
understanding of the general situation across the globe. Making cyber security
infrastructure comparable with that of others is the second main purpose of cyber security
ratings.
Managers such as CEOs make tactical decisions at certain times for the future of the
company. The risk factor is vitally important for these decisions. Cyber security ratings
support managers in making tactical decisions for the future. Managers can also determine
policies for their vendors using those vendors' cyber security ratings. So, supporting
tactical decision making is the third main purpose of cyber security ratings.
Cyber security ratings can also be used for Vendor Risk Management (VRM). Large companies
work with many third-party suppliers (also known as vendors) and share valuable company
information with them. On the principle of “YOU ARE ONLY AS STRONG AS YOUR WEAKEST POINT”,
incidents which happen to vendors can easily affect other firms. For a really strong
infrastructure, companies must work with companies which have a strong cyber security
infrastructure. In addition, large companies can watch the ratings of their vendors change
dynamically and take precautions if vendors have low ratings. So, analysing risk and
supporting Vendor Risk Management is the fourth main purpose of the cyber security rating.

Insurance firms insure companies for several purposes. Recently, managers have started to
insure their companies against the risk of cyber security breaches. Insurers must have
information about a client's security infrastructure before insuring that company.
Previously, insurers asked the client a list of questions about its security infrastructure.
This list is too long and too hard to answer quickly and accurately enough to give reliable
feedback. Insurers also ask for penetration test reports about their clients. But none of
this information is continuous, so it can change over time. A cyber security rating created
from continuous data is very useful for this problem. Finally, helping cyber insurers to see
dynamic changes in firms' security infrastructures is the fifth main purpose of the cyber
security rating.
In summary, the main purposes of cyber security ratings are listed below.
1. Understanding return on investment
2. Creating a comparable cyber security mechanism
3. Supporting tactical decision making
4. Analysing risks and supporting VRM
5. Supporting cyber insurers in seeing dynamic changes
Calculating Cyber Security Rating
To create a global standard for cyber security ratings, the information used in the rating
calculation must be collected online and passively; in other words, the data must be
collected without creating a direct connection to the target systems. Rating systems must
work continuously, and continuous active scanning could exhaust or crash the targeted
systems. Besides, active scanning may only be done with permission; otherwise it is illegal
and constitutes a crime. Because of these requirements, data is collected on the internet
from public databases, reputation sites, blacklists, and similar sources. More information
about useful data is given in the table below.
(Note that the source of this table is Bitsight Tech.)

Cyber security infrastructure depends on many criteria, so it is more reasonable to have
more than one score instead of a single score. The scoring mechanism can be divided into
subcategories, and the score of each subcategory is calculated using entirely different
criteria. Examples of these subcategories are DNS Health, SSL Strength, Asset Reputation,
Leaked Email, SMTP Controls, Hacktivist Shares, etc. The score of a subcategory is
calculated from the data associated with that category. After the scores of all
subcategories have been calculated, the overall score is calculated according to the
importance of the subcategories.
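One simple way to combine subcategory scores according to their importance is a weighted average. The sketch below is only an assumption about how such a combination could look; the text does not specify the formula, and the scores and weights are invented for illustration.

```python
def overall_score(scores, weights):
    """Weighted average of subcategory scores; weights express importance."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical subcategory scores (0-100) and importance weights.
scores = {"DNS Health": 80, "SSL Strength": 60, "Leaked Email": 40}
weights = {"DNS Health": 1, "SSL Strength": 2, "Leaked Email": 1}

print(overall_score(scores, weights))
```

Keeping the subcategory scores alongside the overall score preserves the detail that a single number would hide, for example a good overall rating that masks a weak SSL configuration.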

Cyber Incident Forecasting
To calculate a cyber security rating, an algorithm must be implemented. When the algorithm
runs, it processes this information and derives knowledge such as a rating score or incident
forecasts. In this chapter we give general information about “Forecasting Cyber Security
Incidents”. If you want more detailed information about this topic, we recommend taking a
close look at this paper. In this part we have benefited greatly from this work.
Predicting an incident before it occurs is a very useful way of preventing the costs caused
by incidents. In real-world applications, this type of prediction can save money or human
lives. For example, predicting an earthquake before it occurs can give people time to take
precautions. Thus, deaths due to the earthquake can be greatly reduced. As another example,
in the old days coal miners would carry caged canaries (birds) down into the mine tunnels
with them. If dangerous gases such as carbon monoxide collected in the mine, the gases would
kill the canary before killing the miners, thus providing a warning to exit the tunnels
immediately. Similarly, in the cyber security domain a forecasting mechanism can save money,
reputation or valuable information such as the source code of an important application or
the chemical formula of a product.
To predict a hacking incident before it occurs using machine learning algorithms, we need a
dataset for the training phase which includes incident reports and externally observable
features of the firms.
The referenced paper defines two main categories for describing the security posture of
companies. The first is Mismanagement Symptoms and the second is Malicious Activity Data.
Mismanagement Symptoms has five features, each of which indicates a misconfiguration on a
network. These features do not directly show whether a system is vulnerable or not, but
there is a correlation between them and hacking incidents. The features are defined as (1)
Open Recursive Resolver, (2) DNS Source Port Randomization, (3) BGP Misconfiguration, (4)
Untrusted HTTPS Certificates, (5) Open SMTP Mail Relays.
Malicious Activity Data is separated into three types: (1) Spam Activities, (2) Phishing and
Malware Activities, (3) Scanning Activities. This data is collected over time, for the most
recent 14 days and the most recent 60 days. One final dataset is used in the paper; it
contains incident reports from three different sources: (1) VERIS Community Database, (2)
Hackmageddon, (3) Web Hacking Incident Reports. This dataset is used for labeling the
security posture data of companies.
The information about mismanagement symptoms and malicious activity data is mapped to the
companies which appear in the public datasets used in the paper. In this way, a training
dataset is created from externally observed data, labeled as ‘hacked’ or ‘not hacked’
according to the incident reports.
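The labeling step can be pictured as a simple join between posture data and incident reports. The sketch below is illustrative only: the company names, feature names and values are hypothetical, and the paper's real pipeline is considerably more involved.

```python
def label_companies(posture, incident_reports):
    """Attach a 'hacked' / 'not hacked' label to every company's feature set.

    posture: dict mapping company name -> dict of externally observed features
    incident_reports: iterable of company names that appear in incident reports
    """
    hacked = set(incident_reports)
    return [
        (company, features, "hacked" if company in hacked else "not hacked")
        for company, features in sorted(posture.items())
    ]

# Hypothetical posture data: mismanagement symptoms and malicious activity counts.
posture = {
    "alpha.example": {"open_resolver": 0, "untrusted_https": 0, "spam_14d": 1},
    "beta.example": {"open_resolver": 1, "untrusted_https": 1, "spam_14d": 40},
}
incident_reports = ["beta.example"]

training_set = label_companies(posture, incident_reports)
for row in training_set:
    print(row)
```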
All datasets used in the paper are given below.

Finally, the dataset created by combining security posture data and incident reports is
separated into two parts. One part is used in the training phase, and the other part is used
in the test phase. Random Forest and Support Vector Machine algorithms were implemented. As
a result, a 90% true positive rate, a 10% false positive rate and 90% overall accuracy were
achieved with the Random Forest algorithm. (If you want detailed information about how they
achieved this success rate, we recommend reading the paper linked above.)
There is one more very important thing which was not taken into consideration in the
referenced academic paper: the POPULARITY of COMPANIES. Generally, people think that if a
company is hacked, the reason is a weak cyber security infrastructure. In fact, most of the
time this is wrong. For example, research shows that financial companies have a stronger
cyber security infrastructure than companies in other sectors (health, education etc.). But
financial companies have faced many more hacking incidents than companies in other business
sectors.
Let me explain why this is so.
Attackers need a motivation to hack companies. The motivation can be getting attention,
stealing valuable information, hacktivism, etc.
Now, consider a company with a very weak cyber security infrastructure and the lowest score
in almost all cyber security rating subcategories, but which is very small, not widely
known, and has no valuable information in its network. In the real world, this company may
live long without suffering a hacking incident, although it could be hacked easily. The
reason is that hackers have no motivation to hack it.
On the other hand, consider a very large company with a very strong cyber security
infrastructure and the highest score in almost all cyber security rating subcategories. This
company may suffer a hacking incident in a short time, because hackers have a very good
motivation to hack it.
Although companies have spent huge amounts of money to strengthen their security, they are
nevertheless hacked. Banks, governments and the largest tech companies in the world (Apple,
Adobe, LinkedIn, Yahoo...) have all been targeted by hackers. The reason is the strong
motivation that the popularity of these companies gives to hackers. Because of this,
popularity is a very important feature for cyber incident forecasting, and this information
must be added to the forecasting mechanism.
How can we define popularity? Many kinds of data are available for defining popularity, such
as the number of employees, sector, annual income, company value on the stock market, number
of customers, value of the data stored in the company's database, location (country,
state...), the number of times the company name appears in daily news, etc.

We want to give some examples of this situation.
1. 2014 JPMorgan Chase data breach was a cyber-attack against American bank
JPMorgan Chase that is believed to have compromised data associated with over 83
million accounts – 76 million households (approximately two out of three households
in the country) and 7 million small businesses. The data breach is considered one of
the most serious intrusions into an American corporation's information system and
one of the largest data breaches in history. (This paragraph is copied from Wikipedia.)
2. Dropbox Data Breach: A huge cache of personal data from Dropbox that contains the
usernames and passwords of nearly 70 million account holders has been discovered
online. The information, believed to have been stolen in a hack that occurred several
years ago, includes the passwords and email addresses of 68.7 million users of the
cloud storage service. (This paragraph is copied from this link.)
3. Yahoo Says 1 Billion User Accounts Were Hacked: Yahoo, already reeling from its
September disclosure that 500 million user accounts had been hacked in 2014,
disclosed Wednesday that a different attack in 2013 compromised more than 1 billion
accounts. The two attacks are the largest known security breaches of one company’s
computer network. The newly disclosed 2013 attack involved sensitive user
information, including names, telephone numbers, dates of birth, encrypted
passwords and unencrypted security questions that could be used to reset a password.
(This paragraph is copied from this link.)
(Click this link if you want to read more about very famous companies which have been
hacked.)

The Difference Between Rating and Forecasting
Strictly speaking, rating and forecasting cannot be described as similar or dissimilar, because
the two topics are not in the same category; they complement each other. The main purpose of
a rating mechanism is to evaluate a cyber security infrastructure against a set of metrics,
using data collected passively from the internet. The main purpose of forecasting, on the other
hand, is to predict hacking incidents before they occur. A forecasting mechanism must also use
passively collected information to characterize the cyber security infrastructure, so rating
can be considered a step within forecasting. In other words, a forecasting mechanism must
evaluate some metrics to determine how strong the infrastructure is before it can predict
incidents.
In the paper explained earlier in this chapter, this evaluation is performed inside the machine
learning algorithm itself. We therefore cannot see an explicit rating, because the algorithm
goes straight to the prediction after learning internally how strongly the infrastructure is
built. Once the algorithm has run, we can assess the features by their importance, that is, by
how much each one affects cyber incidents. The rating mechanism is thus dispersed throughout
the algorithm. This is the approach used in the academic literature.
However, a rating score can give us a great deal of valuable information about a cyber security
infrastructure, so a second approach has been developed. In this approach the rating mechanism
is split into a separate layer from the machine learning algorithm. The score calculated by the
rating mechanism becomes an input value for the machine learning algorithm. Representative
figures are given below.
A cyber security rating gives us valuable information about the infrastructure even if it is
not used in a forecasting mechanism, because it shows how strong the infrastructure is against
cyber threats.
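
The layered approach described above can be sketched as follows. This is a minimal illustration,
not the method of the referenced paper: the metric names, the weights and the 0-100 scale are
all hypothetical, chosen only to show a rating that is computed in its own layer and then passed
to a learning algorithm as one input feature.

```python
# Hypothetical metrics and weights, for illustration only.
# Each metric is a risk value in [0, 1] derived from passively collected data.
WEIGHTS = {"open_ports": 0.3, "expired_certs": 0.3, "leaked_creds": 0.4}


def rating_score(metrics):
    """Combine per-metric risk values into a 0-100 rating (higher means stronger)."""
    risk = sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)
    return round(100 * (1 - risk), 1)


def build_feature_vector(metrics, extra_features):
    """Rating layer first, then the score becomes one input of the forecasting model."""
    return [rating_score(metrics)] + list(extra_features)
```

The rating stays visible on its own (for reporting) and is also available to the forecasting
model, which is the difference from the first approach, where it exists only implicitly.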

SECURE USER AUTHENTICATION
As a dictionary term, verification refers to independent procedures that are used together to
check that a product, service, user or system meets its requirements and specifications and
fulfills its intended purpose. User verification (authentication) is the mechanism that confirms
a user's identity and permits the user to log in to an application or system. In an ideal
system, no one except the real user can access a user account. In general, a username and
password are used for authentication when the target system is an online service. These fields
are vulnerable to brute force attacks if no preventive measures are taken: an attacker can try
every combination (trial and error) to crack a user's password.
It is strongly recommended to use secure passwords that contain numbers, letters and special
characters and that meet a minimum length. Security-conscious companies maintain password
creation policies to make sure that every employee's password is safe. If a user takes these
precautions, cracking his or her password through online brute force may take years.
Security-aware companies store user passwords in the database as hashes, so that even if their
systems are hacked, the passwords cannot easily be recovered. Of course, the hash algorithm
must be strong, such as an adaptive hash algorithm (for example bcrypt). Besides these
precautions, additional security mechanisms such as captcha and two-factor authentication are
used to prevent unauthorized access to systems.
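
The two ideas in the paragraph above, the size of the search space and salted, deliberately slow
hashing, can be sketched as follows. bcrypt is not part of the Python standard library, so this
sketch swaps in PBKDF2-HMAC-SHA256 from `hashlib` as a stand-in; the iteration count and the
function names are assumptions made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune it for your own hardware


def hash_password(password, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a random per-user salt."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


def keyspace(alphabet_size, length):
    """Number of candidate passwords an exhaustive brute force attack must cover."""
    return alphabet_size ** length
```

As a worked example, an 8-character password drawn from the 94 printable ASCII symbols gives
`keyspace(94, 8)`, about 6.1 x 10^15 candidates. At an assumed rate of 1,000 online guesses per
second, exhausting that space would take on the order of 10^5 years.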
Captcha is an additional security layer for authentication that uses captcha images to prevent
brute force or dictionary attacks. Simple automatic brute force tools cannot recognize these
images, so they cannot proceed once the image is shown by the authentication mechanism.
Two-factor authentication is another additional security layer for authentication that prevents
unauthorized access. This type of mechanism uses additional information that only the real user
knows or holds. This information can be an OTP (one-time password) sent to a pre-registered
cell phone or, in real-world applications, biometric information.
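
The text does not say how the one-time password is generated. One standard scheme is HOTP
(RFC 4226), sketched below as an illustration rather than as the mechanism the author has in
mind. The server and the user's device share a secret and a counter, and both derive the same
short code from them.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password as specified in RFC 4226."""
    message = struct.pack(">Q", counter)            # counter as 8-byte big-endian
    mac = hmac.new(secret, message, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                         # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

With the RFC 4226 test secret `b"12345678901234567890"`, counter 0 yields `755224` and counter 1
yields `287082`, matching the test vectors published in the RFC.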

Authentication mechanisms are not used only in web applications. In real-world applications,
for example when entering a secure facility such as a military building, additional precautions
such as biometric verification should be used.
In this chapter, we explain ways to make an authentication mechanism more powerful using
various machine learning techniques. The most popular way to authenticate a user is to use
information that is as unique as possible, such as biometric data. However, a biometric
verification mechanism requires physical access to enter the authentication information, and it
needs specialized sensors, which cost money. Such systems can therefore be used in highly
critical real-world applications but are not practical for classic web applications. Some
authentication techniques have been developed that use a user's unique information to
authenticate to systems such as web applications without additional sensors.
There are two main categories of authentication mechanisms that utilize machine learning:
(1) Biometric Verification and (2) Activity Based Verification.

Biometric Verification
Biometric verification is an authentication approach that uses unique human characteristics.
No two people in the world share the same characteristics. Physical access is needed to enter
the authentication information, so this type of verification is used in real-world security
mechanisms. The most commonly known biometric verification systems are based on the following:
1. Fingerprint Recognition
2. Finger Vein Recognition
3. Retina and Iris Recognition
4. Hand/Palm Recognition
5. Voice Recognition
6. Signature Recognition
7. Face Recognition
These verification techniques are commonly used around the world. For example, fingerprint
recognition systems are often seen in ATMs, at secure facility entrances and at employee
entrances to work areas. Voice recognition is commonly used in call centers to identify
customers. Palm recognition is commonly used in medical services to verify patients with a high
accuracy rate. Retina and iris recognition is commonly used at secure facility entrances, and
so on.
In this part, we explain how a system can recognize these patterns using machine learning
techniques. Fingerprint recognition, retina and iris recognition, and hand/palm recognition,
which are the most commonly used biometric systems, are explained in some detail. The other
techniques are explained briefly.

Fingerprint Recognition
Fingerprint recognition is one of the most commonly used types of biometrics in the world. To
recognize a fingerprint, the finger must first be scanned with a fingerprint scanner. The
output of the scanner is a single black-and-white image of the finger surface. This image is
then processed with image processing techniques, and features are extracted from it.
In the training phase, fingerprint images are collected more than once for every user, and the
calculated feature information is saved to a database.
In the test phase, the user's fingerprint image is collected in real time through the
fingerprint scanner sensor, and feature information is calculated for this test image. Finally,
in the most basic approach, this information is compared with the information stored in the
database for the real user. Distance metrics are used for the comparison. If the test image is
similar enough to the user's training images, authentication succeeds; otherwise it fails. The
choice of features, of the machine learning algorithm used for the decision mechanism, and of
the distance metric directly influences the accuracy rate.
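
The "most basic approach" described above, comparing a test feature vector with stored training
vectors through a distance metric, can be sketched as follows. Feature extraction from the
fingerprint image is out of scope here; the vectors, the Euclidean metric and the threshold value
are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def enroll(db, user, samples):
    """Training phase: store several feature vectors per user."""
    db[user] = [list(sample) for sample in samples]


def authenticate(db, user, test_vector, threshold=0.5):
    """Test phase: accept if the closest stored template is within the threshold."""
    templates = db.get(user, [])
    if not templates:
        return False
    return min(euclidean(t, test_vector) for t in templates) <= threshold
```

A lower threshold rejects more impostors but also more genuine users, which is one reason the
choice of metric and threshold affects the accuracy rate.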
Fingerprint detection is one of the most commonly used authentication techniques in personal
life, and mobile phone producers build fingerprint detection systems into their phones. These
systems make authentication easier for the phone's owner and harder for other people. This is
a good example of the widespread use of machine learning based user authentication in daily
life.

Retina and Iris Recognition
Retina recognition is a biometric technique that uses the unique patterns on a person's retina
for identification. The retina is the layer of blood vessels situated at the back of the eye.
The eye is positioned in front of the system at a capture distance ranging from 8 cm to one
meter. The output of the eye scanner sensor is an image of the blood vessels of the retina.
Every human's blood vessel pattern is unique. Features are extracted from the image with image
processing techniques, and a decision is then made using machine learning techniques. An
example retina image is given on the right.
Another eye-based biometric verification technique is iris recognition. The iris is the colored
part of the eye, and it is responsible for controlling the amount of light entering the eye.
The iris has a veined structure that is unique for every human in the world. The structure of
the iris is extracted with image processing techniques, and a decision mechanism is built using
machine learning techniques.

Hand/Palm Recognition
Palm detection is based on two different ideas: the first scans the surface of the hand, like
fingerprint scanning, and the second extracts the blood vessels of the palm. In hand
recognition systems, the hand is scanned by a visual scanner and information about the hand
surface is extracted. In palm detection systems, the palm is scanned by an infrared sensor, and
the output is an image of the palm's blood vessels. In both techniques, features are extracted
by image processing algorithms and a decision mechanism is built using machine learning
algorithms.
Voice recognition is a type of signal processing. Every human has a unique voice, and this
information is detectable by machine learning techniques. Biometric verification can also be
done using finger vein information, signature shapes and face recognition. The accuracy rates
of biometric verification techniques are given in the figure below.

Activity Based Verification
Identity theft is a crime in which hackers perpetrate fraudulent activity under stolen identities
by using credentials, such as passwords and smartcards, unlawfully obtained from legitimate
users or by using logged-on computers that are left unattended. User verification methods
provide a security layer in addition to the username and password by continuously validating
the identity of logged-on users based on their physiological and behavioral characteristics.
Every person uses an authentication mechanism to log in countless times in a single day. In
general, only usernames and passwords are used for authentication to web applications. It is
well known that even the largest companies can still be hacked, and that a significant number
of them already have been. An individual user's username and password may already have leaked
to the internet or the dark web in clear text without the user's knowledge.
Now, think about a new type of verification method in which the credential is created by the
user unconsciously. Even the users themselves cannot state their own passwords, because these
passwords are based on behavioral knowledge about the user. An example makes this clearer. In
the example, the activity based verification mechanism is not used for an online service, but
the main idea is the same.
The example comes from the movie Mission: Impossible - Rogue Nation. Briefly, in the movie
there is a secure facility, and our heroes want to enter it and steal valuable information. The
facility has a multi-layered security mechanism. We are interested in its final step, because
that step uses an activity based verification technique. In this step, the user walks through a
tunnel that is monitored by cameras and other sensors. These sensors analyze the individual's
walking behaviour. Every person's walking behaviour is, of course, unique. This step cannot be
passed unless the attacker copies this behavioral information, which is nearly impossible
because the information is abstract. In physical biometric verification techniques, attackers
know what must be copied to bypass the authentication mechanism, because the things to copy are
physical parts of a human such as a fingerprint or an iris. Behavioral information is abstract
knowledge about a human and is very hard to copy. Even if attackers kidnap the real user, they
cannot copy this information: you cannot copy something when you do not know what it is or how
it works.
In general, the mechanism used in the movie works as described above. If you want to see how it
works, we recommend watching the movie. (Note that our heroes get in by changing the user's
behavioral information stored in the database.)
For the reasons described above, Activity Based Verification is, in our opinion, more relevant
to cyber security than physical biometric verification, because this type of verification is
used in online services that hackers can target directly. The general structure of the
mechanism is given in the figure below.

Feature acquisition – captures the events generated by the various input devices used for the
interaction (e.g. keyboard, mouse) via their drivers.
Feature extraction – constructs a signature which characterizes the behavioral biometrics of
the user.
Classifier – Consists of a machine learning algorithm (e.g. Support Vector Machines, Artificial
Neural Networks, etc.) that is used to build the user verification model by training on past
behavior, often given by samples. During verification, the induced model is used to classify
new samples acquired from the user.
Signature database – A database of behavioral signatures that were used to train the model.
Upon entry of a username, the signature of the user is retrieved for the verification process.
Although this topic is titled Activity Based Verification, the technique can be used in two
different ways for the same purpose. One way is to check the user before login to the system;
the other is to check the user after login. Up to this point in the chapter, we have explained
checking the user before login. But the other way is also interesting. Its general idea is to
monitor logged-in users and detect whether the logged-in user is the real user or not. Google
has a patent for this purpose (if you want to check it, click this link). The patent uses the
social network activity of logged-in users to detect fraudulent activity.
In the rest of this section, we explain how systems like these can be built using machine
learning techniques. The most common behavioral verification techniques are based on:
(a) mouse dynamics, which are derived from the user-mouse interaction;
(b) keystroke dynamics, which are derived from the keyboard activity; and
(c) software interaction (such as game playing), which rely on features extracted from the
interaction of a user with a specific software tool.
Behavioral methods can also be characterized according to the learning approach that they
employ. Explicit learning methods monitor user activity while performing a predefined task
such as playing a memory game. Implicit learning techniques, on the other hand, monitor the
user during general day-to-day computer activity. Implicit learning is the best way to learn
unique user behavior characteristics such as frequently performed actions.

Keystroke Dynamics
Keystroke dynamics features are based on measuring how long keys are pressed, as in the example
given on the right. This information is unique for every human. Keyboard dynamics features also
include, for example, the latency between consecutive keystrokes, flight time and dwell time,
all based on the key down/press/up events.
Keyboard-based methods are divided into
methods that analyze the user behavior
during an initial login attempt and
methods that continuously verify the user
throughout the session. The former
typically construct classification models according to feature vectors that are extracted while
the users type a predefined text (such as a password) while the latter extract feature vectors
from free text that the users type. In a recent paper, Stefan et al. evaluated the security of
keystroke-dynamics authentication against synthetic forgery attacks. The results showed that
keystroke dynamics are robust against the two specific types of synthetic forgery attacks that
were used. Although being effective, keyboard-based verification is less suitable for web
browsers since they are mostly interacted with via the mouse.
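
The timing features named above can be derived from raw key events as in the following sketch.
The event format `(key, kind, t_ms)` is hypothetical, and definitions vary between papers; here
dwell time is taken as the time a key is held down and flight time as the gap between releasing
one key and pressing the next.

```python
def keystroke_features(events):
    """Compute dwell and flight times from (key, kind, t_ms) events ordered by time.

    kind is "down" or "up". Returns (dwell_times, flight_times) in milliseconds.
    """
    down_at = {}
    presses = []  # (key, down_time, up_time)
    for key, kind, t in events:
        if kind == "down":
            down_at[key] = t
        elif kind == "up" and key in down_at:
            presses.append((key, down_at.pop(key), t))
    presses.sort(key=lambda p: p[1])  # order presses by key-down time
    dwell = [up - down for _, down, up in presses]
    flight = [presses[i + 1][1] - presses[i][2] for i in range(len(presses) - 1)]
    return dwell, flight
```

The resulting lists, or statistics over them such as mean and variance, form the feature vector
that is passed to the classifier.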
Mouse Dynamics
People can surf the internet to read a newspaper, watch a video or perform any other action
that requires only mouse interaction. This technique is therefore useful both before login and
after login. The features that can be used for mouse dynamics based authentication are given
below. This information is fed to machine learning algorithms in order to recognize the real
user.
● Mousemove Event (m) – occurs when the user moves the mouse from one location to
another. Many events of this type occur during the entire movement – their quantity
depends on the mouse resolution/sensitivity, mouse driver and operating system
settings.
● Mouse Left Button Down Event (ld) – occurs when the left mouse button is pressed.
● Mouse Right Button Down Event (rd) – occurs when the right mouse button is pressed.
● Mouse Left Button Up Event (lu) – occurs after the left mouse button is released.
● Mouse Right Button Up Event (ru) – occurs after the right mouse button is released.
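
A few simple features can be computed from the events listed above, as in this sketch. The
event tuple format `(event_type, x, y, t_ms)` and the particular features (path length, average
speed, left-click duration) are assumptions for illustration; real systems use richer feature
sets.

```python
import math


def mouse_features(events):
    """Summarize (event_type, x, y, t_ms) events; types are m, ld, lu, rd, ru."""
    moves = [(x, y, t) for kind, x, y, t in events if kind == "m"]
    # Total path length over consecutive mousemove events.
    distance = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1, _), (x2, y2, _) in zip(moves, moves[1:])
    )
    duration = moves[-1][2] - moves[0][2] if len(moves) > 1 else 0

    # Time between each left button down and the following left button up.
    left_down = None
    click_durations = []
    for kind, _, _, t in events:
        if kind == "ld":
            left_down = t
        elif kind == "lu" and left_down is not None:
            click_durations.append(t - left_down)
            left_down = None

    return {
        "distance_px": distance,
        "avg_speed_px_per_ms": distance / duration if duration else 0.0,
        "left_click_ms": click_durations,
    }
```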

Software Interaction
Finally, several types of software have been suggested in the academic literature to
characterize behavioral biometrics of users for authentication and verification purposes.
These include board games, memory games, web browsers, email clients, programming
development tools, command line shells and drawing applications. These behavioral biometric
features may be partially incorporated in user verification systems.
One interesting fact about this topic concerns the reCaptcha system developed by Google.
Basically, this system decides whether a connection was created by a real user or by a bot. To
do so, it asks a question that is easy for humans but tough for bots. The interesting thing is
that in some cases real users pass the captcha mechanism without encountering any question,
because reCaptcha applies behavioral verification methods from the moment the user enters the
site, so real users can get in easily. The behavioral verification technique used is based on
software interaction. The system collects the cookie information and browser characteristics
available in the browser, analyzes them, and then decides whether or not the connection was
created by a real user. (Detailed information about the captcha mechanism is given in the
Captcha Bypassing section.) When the analysis result is suspicious or the cookie information is
insufficient, the system asks the user a question before allowing entry.
Conclusion of User Authentication
Recently, due to the limitations of user authentication systems that employ a single user
characteristic such as mouse dynamics or iris patterns, a multi-modal approach has been
proposed in various papers. Many studies in the literature combine several authentication
techniques, because relying on a single technique is often not sufficient.
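
One simple way to combine techniques is score-level fusion, sketched below. This is an
illustrative assumption, not a method taken from the papers mentioned: each modality is
presumed to output a match score in [0, 1], and the modality names, weights and threshold are
hypothetical.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality match scores, each in [0, 1]."""
    total_weight = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total_weight


def accept(scores, weights, threshold=0.7):
    """Accept the user when the fused score reaches the threshold."""
    return fuse_scores(scores, weights) >= threshold
```

A weak score from one modality (for example, too few mouse events) can then be compensated by a
strong score from another, which is the motivation for the multi-modal approach.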

BYPASSING SECURITY MECHANISM
Up to this chapter, we have discussed preventing attackers using machine learning techniques.
This chapter is different: its main topic is how security mechanisms can be bypassed with
machine learning techniques, in order to understand the attacker's perspective. Attackers can
already bypass most security mechanisms easily using existing techniques and tools, and these
tools are capable of very complex jobs. In this chapter, we explain how security mechanisms can
be bypassed using intelligent systems.

Captcha Bypassing
Before explaining how a captcha mechanism can be bypassed, we give a brief introduction to what
a captcha mechanism is and how it works.
The main purpose of a captcha mechanism is to protect authentication by asking questions that
are easy for humans but tough for bots. It is imperative to render the process of solving a
captcha challenge as effortless as possible for legitimate users, while remaining robust
against automated solvers. In this way, bots cannot attempt to enter systems automatically.
The first captcha mechanisms used a single image and asked the user to type the numbers or
characters shown in the image into a textbox. Sample images are given to the right of the
paragraph. You may have noticed the line noise in the images. There is a reason for those
lines: clean digits and characters are easy to detect using image processing techniques. The
images used in captcha mechanisms are therefore transformed into more complex forms to make
them more difficult to break. At first, these transformations were done by adding noise to the
image. Such images are given below.
Nevertheless, these images were broken over time. All images of this type can now be cracked
with a 100 percent success rate. Naturally, more complex images then came into use in captcha
mechanisms. Such images are given below. Cracking these images using image processing is harder
than cracking the images above.
As the complexity of the images increases, image processing techniques advance, too. As systems
are improved to make authentication more secure, cracking systems improve in step in order to
break them.
The reCaptcha mechanism developed by Google is one of the best-known systems. With its new
technique, this system took captchas one step further. The rest of this chapter gives a brief
introduction to how reCaptcha works and then explains ways to crack it. A news article on this
topic was published in April 2016. If you want to read more about how reCaptcha works and how
it can be cracked, we recommend taking a close look at this paper. This section draws heavily
on that work.

The reCaptcha mechanism has two main verification modules. The first requires only a single
click from the user. In this module, the system analyzes the user's cookies and the browser
characteristics available in the browser. During this analysis, a confidence score is
calculated for every user. The score indicates whether the request originates from an honest,
non-suspicious user or from a bot. For high confidence scores, the user is only required to
click within a checkbox. For lower scores, the user may be presented with a new challenge. In
the second module, the user has to deal with difficult questions based on images or text.
There are three types of challenge, which vary from user to user: (1) No captcha reCaptcha,
(2) Image reCaptcha and (3) Text Based reCaptcha. No captcha reCaptcha is used in the first
module. In the second module, users with a low confidence score encounter one of the two other
types of challenge: Image reCaptcha or Text Based reCaptcha.
No captcha reCaptcha (Checkbox Captcha): The new user-friendly version is designed to remove
the difficulty of solving captchas completely. Upon clicking the checkbox in the widget, if
the advanced risk analysis system considers that the user has a high reputation, the challenge
is considered solved and no further action is required from the user.
Image reCaptcha: This new version is built on identifying images
with similar content. The challenge contains a sample image and
9 candidate images, and the user is requested to select those that
are similar to the sample. The challenge usually contains a keyword
describing the content of the images that the user is required to
select. The number of correct images varies between 2 and 4.
Text reCaptcha: Examples of this type are given below. These distorted texts are returned when
the advanced risk analysis considers the user to have a lower reputation. (e) is the fallback
captcha: when the User-Agent fails certain browser checks, the widget automatically fetches and
presents a challenge of this type before the checkbox is clicked. Over the period of the
following 6 months, text captchas appeared to be gradually “phased out”, with the image captcha
now being the default type returned, as text captchas are harder for humans to solve despite
being solvable by bots. Text based reCaptchas can be cracked with nearly 100% accuracy using
deep learning techniques. (If you want to read how this is possible, take a look at this blog.)
(Note: The definitions of the captcha types are copied from the referenced paper.)

Recall that reCaptcha has two completely different modules. The first works by tracking cookies
and checking browser characteristics; the second asks the user a question based on an image or
text. Because two different modules operate in the system, captcha cracking methods can be
applied in two different ways, one for each module. In short, two components have been
developed to crack the captcha mechanism. The first component creates artificial cookies and
browser characteristics to mislead module one (the checkbox captcha) and thereby influence the
risk analysis process. The referenced paper states that building up a cookie for 9 days is
enough to bypass module one. Of course, the cookie must look like the product of normal user
activity and must be undetectable, so it has to be created intelligently.
Text based captchas are no longer widely used; image based captchas have taken their place. For
this reason, the second component was designed to crack only the image based reCaptcha. Before
explaining how this module can be cracked, take a quick look at the sample image and try to
figure out what the system wants from the user.
In the example image given above, the question is “Select all wine below.”. A real user can
easily understand what the system wants and select all the pictures related to wine. The
question now is: how can an automated system do this? Processing an image to identify objects
and assign semantic information to it is considered a complex computer vision problem. To do
this job, the system first has to understand what is being asked. Keywords for the question can
be identified with NLP (Natural Language Processing), or the sample image can be processed by a
tool that detects what it contains. Google has a tool for this purpose called GRIS (Google
Reverse Image Search), and there are many successful online tools for information retrieval
from images. After the keywords are detected, the same process is applied to all the question
images. As a result, we have images with tags. Finally, the sample image tags and the question
image tags are compared. Relevant question images are marked, and the other images are left
unmarked.
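
The final comparison step described above can be sketched as follows. The tags are assumed to
have been produced already by some image annotation tool; the function names, the example tags
and the `min_overlap` parameter are hypothetical, and the referenced paper's actual matching
logic may differ.

```python
def normalize(tags):
    """Lower-case and strip tags so that 'Wine ' and 'wine' compare equal."""
    return {tag.strip().lower() for tag in tags}


def select_matching(hint_tags, candidate_tags, min_overlap=1):
    """Return indices of candidate images sharing at least min_overlap tags with the hint."""
    hint = normalize(hint_tags)
    return [
        index
        for index, tags in enumerate(candidate_tags)
        if len(hint & normalize(tags)) >= min_overlap
    ]
```

For a hint tagged `["wine", "glass"]`, candidates tagged `["Wine", "bottle"]` and
`["red", "glass"]` would be marked, while candidates tagged `["car"]` would be left unmarked.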
The success rate of this type of system depends strongly on the image processing tools. As
noted above, information retrieval from images is a difficult task in image processing.
Academic studies show that deep learning based approaches achieve significantly high accuracy
on information retrieval problems. Many online tools offer information retrieval from images
using deep learning. There are also several free online services and libraries that offer
relevant functionality, ranging from assigning tags (keywords) to providing free-form
descriptions of images. Some example outputs from these tools are given in the figure below.
(If you want to try these tools, here are the links: GRIS, Alchemy, Clarifai, TDL, NeuralTalk,
Caffe.)
Brief information about the tools is given below:
GRIS can conduct a search based on an image. If the search is successful, it may return a “best
guess” description of the image. Alchemy is also built upon deep learning, and offers an API
for image recognition. For each submitted image, the service returns a set of tags and a
confidence score for each tag. Clarifai is built on deconvolutional neural networks (so it uses
deep learning), and returns a set of 20 tags describing the image. TDL has been released as an
app for demonstrating the image classification capabilities of their deep learning system.
NeuralTalk was developed for generating free-form descriptions of an image’s contents using a
Recurrent Neural Network architecture. Caffe has been released as a deep learning framework,
which the authors of the referenced paper also use for processing images locally. Caffe returns
a set of 10 labels; 5 with the highest confidence scores and 5 that are more specific as
keywords but may have lower confidence scores.
As a result, using the methods explained above, the referenced study automatically solved
70.78% of image reCaptcha challenges. The same system was also applied to Facebook image
captcha challenges, where it achieved a success rate of 83.5%.
