Congrats! You got your Data Science Job

Finish this Book and get your
Data Scientist Job

Data Science Introduction
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
What is Data Science?
Data Science is about data gathering, analysis and decision-making.
Data Science is about finding patterns in data through analysis and making future
predictions.
By using Data Science, companies are able to make:
Better decisions (should we choose A or B)
Predictive analysis (what will happen next?)
Pattern discoveries (find patterns, or maybe hidden information, in the data)

Where is Data Science Needed?
 Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.
 Examples of where Data Science is needed:
 For route planning: To discover the best routes to ship
 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
 To forecast next year's revenue for a company
 To analyze the health benefits of training
 To predict who will win elections

 Data Science can be applied in nearly every part of a business where
data is available. Examples are:
 Consumer goods
 Stock markets
 Industry
 Politics
 Logistics companies
 E-commerce

 How Does a Data Scientist Work?
 A Data Scientist requires expertise in several fields:
 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases
 A Data Scientist must find patterns within the data. Before he/she can find
the patterns, he/she must organize the data in a standard format.

Here is how a Data Scientist works:
Ask the right questions - To understand the business problem.
Explore and collect data - From database, web logs, customer feedback,
etc.
Extract the data - Transform the data to a standardized format.
Clean the data - Remove erroneous values from the data.
Find and replace missing values - Check for missing values and replace
them with a suitable value (e.g. an average value).
Normalize data - Scale the values to a practical range (e.g. 140 cm is
smaller than 1.8 m. However, the number 140 is larger than 1.8, so scaling
is important).
Analyze data, find patterns and make future predictions.
Represent the result - Present the result with useful insights in a way the
"company" can understand.

 What is Data?
 Data is a collection of information.
 One purpose of Data Science is to structure data, making it interpretable
and easy to work with.
 Data can be categorized into two groups:
 Structured data
 Unstructured data

Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.

 How to Structure Data?
We can use an array or a database table to structure or present data.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
The following example shows how to create an array in Python:
#Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
o/p: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

Data Science - Database Table:
What is a Database Table?
A database table is a table with structured data.
The following table shows a database table with health data extracted from a sports watch:
This dataset contains information about a typical training session, such as duration, average pulse,
calorie burnage etc.

Database Table Structure
A database table consists of column(s) and row(s):
A row is a horizontal representation of data.
A column is a vertical representation of data.

Variables
A variable is defined as something that can be measured or counted.
Examples can be characters, numbers or time.
In the example below, we can observe that each column represents a variable.

There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse,
Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep).
There are 11 rows, meaning that each variable has 10 observations.
But if there are 11 rows, how come there are only 10 observations?
It is because the first row is the label, meaning that it is the name of the variable.

Data Science & Python
Python
Python is a programming language widely used by Data Scientists.
Python has built-in mathematical libraries and functions, making it easier to solve
mathematical problems and to perform data analysis.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
In this course, we will use the following libraries:
 Pandas - This library is used for structured data operations, such as importing CSV files, creating
data frames, and preparing data
 NumPy - This is a mathematical library. It has a powerful N-dimensional array object, linear
algebra, Fourier transform, etc.
 Matplotlib - This library is used for visualization of data.
 SciPy - This library has linear algebra modules

Data Science - Python DataFrame
Create a DataFrame with Pandas
A data frame is a structured representation of data.
Let's define a data frame with 3 columns and 5 rows with fictional numbers:
Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
Example Explained
Import the Pandas library as pd
Define data with columns and rows in a variable named d
Create a data frame using the function pd.DataFrame()
The data frame contains 3 columns and 5 rows
Print the data frame output with the print() function

We write pd. in front of DataFrame() to let Python know that we want to activate the DataFrame()
function from the Pandas library.
Be aware of the capital D and F in DataFrame!
Interpreting the Output
This is the output:
   col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6    12
3     4     9     1
4     7     5    11
We see that "col1", "col2" and "col3" are the names of the columns.
Do not be confused by the vertical numbers ranging from 0-4. They tell us the position of the rows.
In Python, the numbering of rows starts with zero.
Now, we can use Python to count the columns and rows.
We can use df.shape[1] to find the number of columns:
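As a short sketch, reusing the fictional data frame defined earlier (the variable names count_column and count_row are only illustrative), df.shape[1] gives the number of columns and df.shape[0] the number of rows:

```python
import pandas as pd

# Same fictional data frame as in the example above.
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# df.shape is a (rows, columns) tuple.
count_column = df.shape[1]
count_row = df.shape[0]
print(count_column)  # 3
print(count_row)     # 5
```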

Data Science Functions
The Sports Watch Data Set
The data set above consists of 6 variables,
each with 10 observations:
 Duration - How long did the training session last, in minutes?
 Average_Pulse - What was the average pulse of the training session? This is measured in beats per minute
 Max_Pulse - What was the max pulse of the training session?
 Calorie_Burnage - How many calories were burned during the training session?
 Hours_Work - How many hours did we work at our job before the training session?
 Hours_Sleep - How much did we sleep the night before the training session?
We use an underscore (_) to separate the words in each name because Python variable names cannot
contain spaces.

The min() function
The Python min() function is used to find the lowest value in an array.
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_min)
o/p: 80
The max() function
The Python max() function is used to find the highest value in an array.
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print(Average_pulse_max)
o/p: 125
The mean() function
The NumPy mean() function is used to find the average value of an array.
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
o/p: 285.0

Data Science - Data Preparation
Extract and Read Data With Pandas
• Before analyzing data, a Data Scientist must import (extract) the data and
make it clean and valuable.
In the example below, we show you how to import data using Pandas in Python.
We use the read_csv() function to import a CSV file with the health data:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)

Example Explained
• Import the Pandas library
• Name the data frame as health_data.
• header=0 means that the headers for the variable names are to be found in the first row
(note that 0 means the first row in Python)
• sep="," means that "," is used as the separator between the values. This is because we are
using the file type .csv (comma separated values)
• Tip: If you have a large CSV file, you can use the head() function to show only the top
5 rows:
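A minimal sketch of head(), assuming a small inline CSV as a stand-in for data.csv (the Duration values are made up so that the example runs on its own):

```python
import io
import pandas as pd

# Stand-in for data.csv: a few rows so the example is self-contained.
csv_text = """Duration,Average_Pulse,Calorie_Burnage
30,80,240
30,85,250
45,90,260
45,95,270
45,100,280
60,105,290
60,110,300
"""
health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# head() returns the first 5 rows by default; head(n) returns the first n rows.
print(health_data.head())
```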

Data Cleaning
Look at the imported data. As you can see, the data are "dirty", with wrong or missing
values:
• There are some blank fields
• Average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space separator
• One observation of max pulse is denoted as "AF", which does not make sense
• So, we must clean the data in order to perform the analysis.

Remove Blank Rows
We see that the non-numeric values (9 000 and AF) are in the same rows as the missing values.
Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically converted into
"NaN" values.
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we want to
remove all rows that have a NaN value:
import pandas as pd
health_data.dropna(axis=0,inplace=True)
print(health_data)
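To see the effect in isolation, here is a small sketch with made-up values, assuming an inline CSV with two blank cells in place of data.csv. It uses the non-inplace form of dropna(), which returns a new data frame:

```python
import io
import pandas as pd

# Made-up CSV with two blank cells (second and fourth data row).
csv_text = """Duration,Average_Pulse,Max_Pulse
30,80,120
45,,125
45,90,130
60,95,
60,100,140
"""
health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")
print(health_data.isna().sum())   # number of NaN cells per column

# dropna(axis=0) returns a new frame without the rows that contain a NaN.
cleaned = health_data.dropna(axis=0)
print(cleaned)
```

Keeping the result in a separate variable (instead of inplace=True) leaves the original frame available for comparison.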

Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data can be split into three main categories:
o Numerical - Contains numerical values. Can be divided into two categories:
Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5 sessions, it is either
2 or 3
Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes
and 20 seconds, or 7.506 hours
o Categorical - Contains values that cannot be measured up against each other. Example: A color or
a type of training
o Ordinal - Contains categorical data that can be measured up against each other. Example: School
grades where A is better than B and so on
By knowing the type of your data, you will know which technique to use when analyzing
it.

Data Types
We can use the info() function to list the data types within our data set:
import pandas as pd
print(health_data.info())
o/p:
We see that this data set has two different types of data:
float64
object
We cannot use objects to calculate and perform analysis here. We must convert the type object
to float64 (float64 is a number with a decimal in Python).
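One possible way to do the conversion is the astype() function. The sketch below assumes a small made-up data frame whose columns hold numbers stored as text; it is an illustration, not the full health data set:

```python
import pandas as pd

# Made-up frame whose columns are stored as text (dtype object).
health_data = pd.DataFrame({
    "Average_Pulse": ["80", "85", "90"],
    "Max_Pulse": ["120", "125", "130"],
}, dtype=object)
print(health_data.dtypes)

# astype(float) converts each column to float64.
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print(health_data.dtypes)
```

Note that astype(float) raises an error if a column still contains values such as "AF", so the cleaning step has to come first.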

Analyze the Data
When we have cleaned the data set, we can start analyzing the data.
We can use the describe() function in Python to summarize data:
import pandas as pd
pd.set_option('display.max_columns',None)
print(health_data.describe())
Count - Counts the number of observations
Mean - The average value
Std - Standard deviation (explained in the statistics chapter)
Min - The lowest value
25%, 50% and 75% are percentiles (explained in the statistics chapter)
Max - The highest value
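As a self-contained sketch, describe() can be tried on the two columns of the small data set whose values appear in the earlier examples (the other columns are left out here because their values are not listed):

```python
import pandas as pd

# Two columns of the small sports watch data set (values from the examples above).
health_data = pd.DataFrame({
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

summary = health_data.describe()
print(summary)

# Individual statistics can be read by row label and column name.
print(summary.loc["mean", "Average_Pulse"])
print(summary.loc["max", "Calorie_Burnage"])
```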

Data Science - Linear Functions
DS Math
Mathematical functions are important to know as a data scientist, because we want to make
predictions and interpret them.
Linear Functions
In mathematics a function is used to relate one variable to another variable.
Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable
to assume that, in general, the calorie burnage will change as the average pulse changes - we say
that the calorie burnage depends upon the average pulse.
Furthermore, it may be reasonable to assume that as the average pulse increases, so will the calorie
burnage. Calorie burnage and average pulse are the two variables being considered.
Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the
dependent variable and the average pulse is the independent variable.
The relationship between a dependent and an independent variable can often be expressed
mathematically using a formula (function).

A linear function has one independent variable (x) and one dependent variable (y), and has the following form:
y = f(x) = ax + b
This function is used to calculate a value for the dependent variable when we choose a value for the
independent variable.
Explanation:
o f(x) = the output (the dependent variable)
o x = the input (the independent variable)
o a = slope = the coefficient of the independent variable. It gives the rate of change of the dependent
variable
o b = intercept = the value of the dependent variable when x = 0. It is also the point where the diagonal line
crosses the vertical axis.

Linear Function With One Explanatory Variable
A function with one explanatory variable means that we use one variable for prediction.
Let us say we want to predict calorie burnage using average pulse. We have the following formula:
f(x) = 2x + 80
Here, the numbers and variables mean:
o f(x) = The output. This number is where we get the predicted value of Calorie_Burnage
o x = The input, which is Average_Pulse
o 2 = Slope = Specifies how much Calorie_Burnage increases if Average_Pulse increases by one. It
tells us how "steep" the diagonal line is
o 80 = Intercept = A fixed value. It is the value of the dependent variable when x = 0

Plotting a Linear Function
The term linearity means a "straight line". So, if you show a linear function graphically, the line
will always be a straight line. The line can slope upwards or downwards, and in some cases may be
horizontal. Here is a graphical representation of the mathematical function above:
Graph Explanations:
o The horizontal axis is generally called the x-axis. Here, it represents Average_Pulse.
o The vertical axis is generally called the y-axis. Here, it represents Calorie_Burnage.
o Calorie_Burnage is a function of Average_Pulse, because Calorie_Burnage is assumed to be dependent
on Average_Pulse.
o In other words, we use Average_Pulse to predict Calorie_Burnage.
o The blue (diagonal) line represents the structure of the mathematical function that predicts calorie
burnage.

Data Science - Plotting Linear Functions
Take a look at our health data set:
Plot the Existing Data in Python
Now, we can first plot the values of Average_Pulse against Calorie_Burnage using the matplotlib
library.
The plot() function is used to draw a 2D plot of the points x,y (here, as a line plot):

# Import the libraries we need for plotting
import pandas as pd
import matplotlib.pyplot as plt
# Draw Calorie_Burnage against Average_Pulse as a line
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
# Let both axes start at zero
plt.ylim(ymin=0)
plt.xlim(xmin=0)
plt.show()
Example Explained
o Import the pyplot module of the matplotlib library
o Plot the data from Average_Pulse against Calorie_Burnage
o kind='line' specifies which type of plot we want. Here, we want a line plot
o plt.ylim() and plt.xlim() set the value we want each axis to start on. Here, we want the axes to begin
from zero
o plt.show() shows us the output
The code above will produce the following result:

The Graph Output
As we can see, there is a relationship between Average_Pulse and Calorie_Burnage. Calorie_Burnage
increases linearly with Average_Pulse. It means that we can use Average_Pulse to predict
Calorie_Burnage.

Why is The Line Not Fully Drawn Down to The y-axis?
The reason is that we do not have observations where Average_Pulse or Calorie_Burnage are
equal to zero. 80 is the first observation of Average_Pulse and 240 is the first observation of
Calorie_Burnage.
Look at the line. What happens to calorie burnage if average pulse increases from 80 to 90?

We can use the diagonal line to find the mathematical function to predict
calorie burnage.
As it turns out:
o If the average pulse is 80, the calorie burnage is 240
o If the average pulse is 90, the calorie burnage is 260
o There is a pattern. If average pulse increases by 10, the calorie burnage increases by 20.

Data Science - Slope and Intercept
Slope and Intercept
Now we will explain how we found the slope and intercept of our function:
f(x) = 2x + 80
The image points to the Slope, which indicates how steep the line is, and the Intercept, which is the
value of y when x = 0 (the point where the diagonal line crosses the vertical axis). The red line is the
continuation of the blue line from the previous page.

Find The Slope
The slope is defined as how much calorie burnage increases if average pulse increases by one. It tells
us how "steep" the diagonal line is.
We can find the slope by using the difference between two points from the graph.
If the average pulse is 80, the calorie burnage is 240
If the average pulse is 90, the calorie burnage is 260
We see that if average pulse increases by 10, the calorie burnage increases by 20.
Slope = 20/10 = 2
The slope is 2.

Mathematically, Slope is Defined as:
Slope = (f(x2) - f(x1)) / (x2 - x1)
f(x2) = Second observation of Calorie_Burnage = 260
f(x1) = First observation of Calorie_Burnage = 240
x2 = Second observation of Average_Pulse = 90
x1 = First observation of Average_Pulse = 80
Slope = (260 - 240) / (90 - 80) = 2
Be consistent and define the observations in the same order in the numerator and the denominator! If not,
the slope gets the wrong sign and the prediction will not be correct!
Use Python to Find the Slope
Calculate the slope with the following code:
def slope(x1, y1, x2, y2):
    # Change in y divided by change in x
    s = (y2-y1)/(x2-x1)
    return s
print(slope(80,240,90,260))
o/p: 2.0

Find The Intercept
The intercept is used to fine-tune the function's ability to predict Calorie_Burnage.
The intercept is where the diagonal line crosses the y-axis, if it were fully drawn.
The intercept is the value of y, when x = 0.
Here, we see that if average pulse (x) is zero, then the calorie burnage (y) is 80.
So, the intercept is 80.
Sometimes, the intercept has a practical meaning. Sometimes not.
Does it make sense that average pulse is zero?
No, you would be dead and you certainly would not burn any calories.
However, we need to include the intercept in order to complete the mathematical function's ability to predict
Calorie_Burnage correctly.
Other examples where the intercept of a mathematical function can have a practical meaning:
Predicting next year's revenue by using marketing expenditure (How much revenue will we have next year, if
marketing expenditure is zero?). It is reasonable to assume that a company will still have some revenue even if
it does not spend money on marketing.
Fuel usage with speed (How much fuel do we use if speed is equal to 0 mph?). A car that uses gasoline will still
use fuel when it is idle.
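One way to see where the intercept value 80 comes from is to rearrange f(x) = ax + b into b = f(x) - ax and insert the slope together with any observed point. A minimal sketch (the helper name intercept is only illustrative):

```python
def intercept(x1, y1, slope):
    # Rearranged from y1 = slope * x1 + b
    return y1 - slope * x1

# First observation: Average_Pulse = 80, Calorie_Burnage = 240; slope = 2
print(intercept(80, 240, 2))   # 80
# Any other point on the line gives the same intercept
print(intercept(90, 260, 2))   # 80
```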

Find the Slope and Intercept Using Python
The np.polyfit() function returns the slope and intercept.
If we proceed with the following code, we can get both the slope and the intercept from the function.
import pandas as pd
import numpy as np
x = health_data["Average_Pulse"]
y = health_data["Calorie_Burnage"]
slope_intercept = np.polyfit(x,y,1)
print(slope_intercept)
o/p: [ 2. 80.]
Example Explained:
Isolate the variables Average_Pulse (x) and Calorie_Burnage (y) from health_data.
Call the np.polyfit() function.
The last parameter of the function specifies the degree of the function, which in this case is "1".

We have now calculated the slope (2) and the intercept (80). We can write the
mathematical function as follows:
Predict Calorie_Burnage by using a mathematical expression:
f(x) = 2x + 80

Task:
Now, we want to predict calorie burnage if average pulse is 135.
Remember that the intercept is a constant. A constant is a number that does not change.
We can now substitute the input x with 135:
f(135) = 2 * 135 + 80 = 350
If average pulse is 135, the calorie burnage is 350.
Define the Mathematical Function in Python
Here is the exact same mathematical function, but in Python. The function returns 2*x + 80, with x
as the input:
#Try to replace x with 140 and 150.
def my_function(x):
    return 2*x + 80
print(my_function(135))
o/p: 350

Plot a New Graph in Python
Here, we plot the same graph as earlier, but format the axes a little.
The max value of the y-axis is now 400 and of the x-axis 150:
import matplotlib.pyplot as plt
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
plt.ylim(ymin=0, ymax=400)
plt.xlim(xmin=0, xmax=150)
plt.show()
Example Explained
Import the pyplot module of the matplotlib library
Plot the data from Average_Pulse against Calorie_Burnage
kind='line' specifies which type of plot we want. Here, we want a line plot
plt.ylim() and plt.xlim() set the values we want each axis to start and stop on.
plt.show() shows us the output

Introduction to Statistics
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and
presentation of data. When we have created a model for prediction, we must assess the prediction's
reliability.
Statistics is a method of interpreting, analyzing and summarizing data.
Statistics is commonly divided into two types: descriptive and inferential statistics.
We analyse and interpret the data based on how it is represented, for example in pie charts, bar
graphs, or tables.
DS- Statistics

Descriptive Statistics
import pandas as pd
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
print (full_health_data.describe())

Statistics Percentiles
25%, 50% and 75% - Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given percent of
the values are lower than.

Let us try to explain it by some examples, using Average_Pulse.
The 25% percentile of Average_Pulse means that 25% of all of the training sessions have an average
pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of all of the
training sessions have an average pulse of 100 beats per minute or higher
The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average pulse
of 111 or lower. If we flip the statement, it means that 25% of all of the training sessions have an
average pulse of 111 beats per minute or higher.
Task: Find the 10% percentile for Max_Pulse
The following example shows how to do it in Python:

import pandas as pd
import numpy as np
Max_Pulse= full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)
o/p: 120.00
Max_Pulse = full_health_data["Max_Pulse"] - Isolate the variable Max_Pulse from the full health
data set.
np.percentile() is used to define that we want the 10% percentile from Max_Pulse.
The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a
Max_Pulse of 120 or lower.
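The same function can be tried without the CSV file. The sketch below uses the ten Average_Pulse values of the small data set; note that np.percentile() interpolates between neighbouring observations by default, so a percentile does not have to be one of the observed values:

```python
import numpy as np

# Average_Pulse values of the small data set with 10 observations.
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

p25 = np.percentile(average_pulse, 25)
p50 = np.percentile(average_pulse, 50)   # the median
p75 = np.percentile(average_pulse, 75)
print(p25, p50, p75)
```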

Statistics Standard Deviation
Standard Deviation
Standard deviation is a number that describes how spread out the observations are.
A mathematical function will have difficulties in predicting precise values, if the observations are
"spread". Standard deviation is a measure of uncertainty.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.

Standard Deviation
import pandas as pd
import numpy as np
std = np.std(full_health_data)
print(std)

Coefficient of Variation
The coefficient of variation is used to get an idea of how large the standard deviation is.
Mathematically, the coefficient of variation is defined as:
Coefficient of Variation = Standard Deviation / Mean
We can do this in Python if we proceed with the following code:
import numpy as np
cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)
o/p:
We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation
(relative to their mean) compared to Max_Pulse, Average_Pulse and Hours_Sleep.
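For a single variable, the two numbers can be computed as follows. This is a small sketch on the ten Average_Pulse values of the small data set, not on the full data set:

```python
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

std = np.std(average_pulse)      # population standard deviation (ddof=0)
mean = np.mean(average_pulse)
cv = std / mean                  # coefficient of variation

print(round(std, 2))   # 14.36
print(round(cv, 2))    # 0.14
```

A coefficient of variation of about 0.14 means that the standard deviation is roughly 14% of the mean.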

Data Science - Statistics Variance
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation. Or the other way around, if
you multiply the standard deviation by itself, you get the variance!
We will first use the data set with 10 observations to give an example of how we can calculate the variance:
Tip: Variance is often represented by the symbol Sigma Square: σ^2

Variance
Step 1 to Calculate the Variance: Find the Mean
We want to find the variance of Average_Pulse.
1. Find the mean:
(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5
The mean is 102.5
Step 2: For Each Value - Find the Difference From the Mean
2. Find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5

Step 3: For Each Difference - Find the Square Value
3. Find the square value for each difference:
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Note: We must square the values to get the total spread. Otherwise, the negative and positive
differences would cancel each other out.
Step 4: The Variance is the Average Number of These Squared Values
4. Sum the squared values and find the average:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
The variance is 206.25.
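The four steps above can be sketched in plain Python, without any library (the variable names are only illustrative):

```python
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Step 1: find the mean
mean = sum(average_pulse) / len(average_pulse)

# Step 2: difference from the mean for each value
differences = [x - mean for x in average_pulse]

# Step 3: square each difference
squared = [d ** 2 for d in differences]

# Step 4: the variance is the average of the squared differences
variance = sum(squared) / len(squared)

print(mean)       # 102.5
print(variance)   # 206.25
```

Dividing by the number of observations gives the population variance, which is also what np.var() computes by default.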

Variance
Use Python to Find the Variance of health_data
We can use the var() function from NumPy to find the variance (remember that we now use the first data set
with 10 observations):
import numpy as np
var = np.var(health_data)
print(var)
o/p:
Here we calculate the variance for each column for the
full data set:
import numpy as np
var_full = np.var(full_health_data)
print(var_full)
o/p:

Correlation
Correlation measures the relationship between two variables.
We mentioned that a function has a purpose to predict a value, by converting input (x) to output
(f(x)). We can also say that a function uses the relationship between two variables for
prediction.
DS - Statistics Correlation
Correlation Coefficient
The correlation coefficient measures the relationship between two variables.
The correlation coefficient can never be less than -1 or higher than 1.
o 1 = there is a perfect linear relationship between the variables (like Average_Pulse against
Calorie_Burnage)
o 0 = there is no linear relationship between the variables
o -1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours
worked leads to higher calorie burnage during a training session)
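As a quick check of the two extreme cases, the coefficient can be computed with NumPy's corrcoef() function. The first pair of lists is the small sports watch data set; the second pair is fictional and only serves to show a falling line:

```python
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient between the two inputs.
r_positive = np.corrcoef(average_pulse, calorie_burnage)[0, 1]

# Fictional data where y falls as x rises
hours_work = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
calories = [220, 240, 260, 280, 300, 320, 340, 360, 380, 400]
r_negative = np.corrcoef(hours_work, calories)[0, 1]

print(round(r_positive, 2))
print(round(r_negative, 2))
```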

Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
We will use a scatter plot to visualize the relationship between Average_Pulse and Calorie_Burnage
(we have used the small data set of the sports watch with 10 observations).
This time we want scatter plots, so we change kind to "scatter":
import matplotlib.pyplot as plt
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')
plt.show()

Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
We have plotted fictional data here. The x-axis represents the number of hours worked at our
job before a training session. The y-axis is Calorie_Burnage.
If we work longer hours, we tend to have lower calorie burnage because we are exhausted
before the training session.
The correlation coefficient here is -1.

import pandas as pd
import matplotlib.pyplot as plt
negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
                 'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}
negative_corr = pd.DataFrame(data=negative_corr)
negative_corr.plot(x='Hours_Work_Before_Training', y='Calorie_Burnage', kind='scatter')
plt.show()

Example of No Linear Relationship (Correlation coefficient = 0)
Here, we have plotted Max_Pulse against Duration from the full_health_data set.
As you can see, there is no linear relationship between the two variables. It means that a longer
training session does not lead to a higher Max_Pulse.
The correlation coefficient here is 0.
full_health_data.plot(x='Duration', y='Max_Pulse', kind='scatter')
plt.show()

Correlation Matrix
A matrix is an array of numbers arranged in rows and columns.
A correlation matrix is simply a table showing the correlation coefficients between variables.
Here, the variables are represented in the first row, and in the first column:
The table here has used data from the full health data set.
Observations:
We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of 0.89.
This makes sense as the longer we train, the more calories we burn.
We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage
(correlation coefficient of 0.02).
Can we conclude that Average_Pulse does not affect Calorie_Burnage? No. We will come back to answer this
question later!

Correlation Matrix in Python
We can use the corr() function in Python to create a correlation matrix. We also use the round()
function to round the output to two decimals:
import pandas as pd
Corr_Matrix = round(full_health_data.corr(),2)
print(Corr_Matrix)
o/p:

Using a Heatmap
We can use a heatmap to visualize the correlation between variables:
The closer the correlation coefficient is to 1, the greener the squares get.
The closer the correlation coefficient is to -1, the browner the squares get.

Use Seaborn to Create a Heatmap
We can use the Seaborn library to create a correlation heat map (Seaborn is a
visualization library based on matplotlib):
import matplotlib.pyplot as plt
import seaborn as sns
correlation_full_health = full_health_data.corr()
axis_corr = sns.heatmap(
    correlation_full_health,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(50, 500, n=500),
    square=True
)
plt.show()

Example Explained:
o Import the library seaborn as sns.
o Use the full_health_data set.
o Use sns.heatmap() to tell Python that we want a heatmap to visualize the correlation matrix.
o Use the correlation matrix. Define the maximal and minimal values of the heatmap. Define that 0 is the center.
o Define the colors with sns.diverging_palette. n=500 means that we want 500 types of color in the same
color palette.
o square = True means that we want to see squares.

Correlation Does Not Imply Causality
Correlation measures the numerical relationship between two variables.
A high correlation coefficient (close to 1) does not mean that we can conclude with certainty
that there is an actual causal relationship between two variables.
A classic example:
During the summer, the sale of ice cream at a beach increases.
At the same time, drowning accidents also increase.
Does this mean that an increase in ice cream sales is a direct cause of increased drowning
accidents?
The Beach Example in Python

Here, we constructed a fictional data set for you to try:
import pandas as pd
import matplotlib.pyplot as plt

# Fictional data: both variables rise together
Drowning_Accident = [20,40,60,80,100,120,140,160,180,200]
Ice_Cream_Sale = [20,40,60,80,100,120,140,160,180,200]
Drowning = pd.DataFrame({"Drowning_Accident": Drowning_Accident,
                         "Ice_Cream_Sale": Ice_Cream_Sale})
# Scatter plot of the two variables
Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()
# Correlation matrix of the two variables
correlation_beach = Drowning.corr()
print(correlation_beach)

o/p:

Correlation vs Causality - The Beach Example
In other words: can we use ice cream sales to predict drowning accidents?
The answer is: probably not.
It is likely that these two variables correlate without one causing the other, for example
because both increase during the summer.
What causes drowning then?
Unskilled swimmers
Waves
Cramp
Seizure disorders
Lack of supervision
Alcohol (mis)use
etc.
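One common reason for such a correlation is a third variable that drives both. The sketch below simulates that situation with made-up numbers: a hypothetical temperature variable raises both ice cream sales and drowning accidents, so the two correlate strongly even though neither influences the other in the simulation. The variable names, coefficients and noise levels are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical daily temperature (the assumed common cause)
temperature = rng.uniform(10, 35, size=200)

# Both variables depend on temperature plus independent noise
ice_cream_sale = 5.0 * temperature + rng.normal(0, 10, size=200)
drowning_accident = 0.5 * temperature + rng.normal(0, 1.5, size=200)

# Correlation between two variables that never influence each other here
r = np.corrcoef(ice_cream_sale, drowning_accident)[0, 1]
print(round(r, 2))
```

The printed coefficient should be high, although the code contains no link from ice cream sales to drowning accidents.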

Let us reverse the argument:
Does a low correlation coefficient (close to zero) mean that change in x does not affect y?
Back to the question:
Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation
coefficient?
The answer is no.
There is an important difference between correlation and causality:
Correlation is a number that measures how closely the data are related
Causality is the conclusion that x causes y.
Tip: Always critically reflect over the concept of causality when doing predictions!
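A small sketch of why a low correlation coefficient does not rule out an effect: the correlation coefficient only measures linear relationships. In the made-up data below, y is completely determined by x, yet the coefficient is essentially zero because the relationship is a symmetric curve rather than a line. The data are an illustrative assumption, not the health data set.

```python
import numpy as np

x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2             # y depends entirely on x, but not linearly

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

A scatter plot of such data would show a clear pattern, which is one reason to plot the data and not rely on the coefficient alone.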

Statistics gives us methods of gaining knowledge from data.
What is Statistics Used for?
Statistics is used in all kinds of science and business
applications.
Statistics gives us more accurate knowledge which
helps us make better decisions.
Statistics can focus on making predictions about
what will happen in the future. It can also focus on
explaining how different things are connected.

Typical Steps of Statistical Methods
 The typical steps are:
 Gathering data
 Describing and visualizing data
 Making conclusions
It is important to keep all three steps in mind for any questions we want more knowledge
about.
Knowing which types of data are available can tell you what kinds of questions you can
answer with statistical methods.
Knowing which questions you want to answer can help guide what sort of data you need. A
lot of data might be available, and knowing what to focus on is important.

How is Statistics Used?
Statistics can be used to explain things in a precise way. You can use it to understand and make
conclusions about the group that you want to know more about. This group is called
the population.
• A population could be many different kinds of groups. It could be:
• All of the people in a country
• All the businesses in an industry
• All the customers of a business
• All people who play football and are older than 45
and so on. It just depends on what you want to know about.
Gathering data about the population will give you a sample. This is a part of the whole
population. Statistical methods are then used on that sample.
The results of the statistical methods from the sample are used to make conclusions about the
population.

Important Concepts in Statistics
o Predictions and Explanations
o Populations and Samples
o Parameters and Sample Statistics
o Sampling Methods
o Data Types
o Measurement Level
o Descriptive Statistics
o Random Variables
o Univariate and Multivariate Statistics
o Probability Calculation
o Probability Distributions
o Statistical Inference
o Parameter Estimation
o Hypothesis Testing
o Correlation
o Regression Analysis
o Causal Inference

Statistics and Programming:
Statistical analysis is typically done with computers. Small amounts of data can be
analyzed reasonably well without computers.
Historically, all data analysis was performed manually. It was time-consuming and
prone to errors.
Nowadays, programming and software are typically used for data analysis.
In this course, we will see examples of code to do statistics with the programming
languages Python and R.

Statistics - Describing Data
Describing data is typically the second step of statistical analysis after gathering data.
Descriptive Statistics
The information (data) from your sample or population can be visualized with graphs
or summarized by numbers. This will show key information in a simpler way than just
looking at raw data. It can help us understand how the data is distributed.
Graphs can visually show the data distribution.
Examples of graphs include:
o Histograms
o Pie charts
o Bar graphs
o Box plots

Some graphs have a close connection to numerical summary statistics. Calculating those gives
us the basis of these graphs.
For example, a box plot visually shows the quartiles of a data distribution.
Quartiles are the values that split the data into four equal-sized parts, or quarters. A quartile is one type of
summary statistic.
Summary statistics take a large amount of information and sum it up in a few key values.
Numbers calculated from the data can also describe the shape of the distribution.
These are individual 'statistics'.

Statistics - Making Conclusions
Using statistics to make conclusions about a population is called statistical
inference.
Statistics from the data in the sample are used to make conclusions about the
whole population. This is a type of statistical inference.
Probability theory is used to calculate the certainty that those statistics also
apply to the population.
When using a sample, there will always be some uncertainty about what the
data looks like for the population.
Uncertainty is often expressed as confidence intervals.

Confidence intervals are numerical ways of showing how likely it is that the true
value of this statistic is within a certain range for the population.
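As a rough sketch of how a confidence interval can be computed, the example below estimates a 95% interval for a mean. The sample values are made up, and the critical value 1.96 assumes a normal approximation; for a sample this small a t-distribution value would give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical sample of heights in cm (made-up values)
sample = [172, 168, 181, 175, 169, 178, 174, 171, 177, 173]

n = len(sample)
sample_mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation divided by sqrt(n)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Approximate 95% interval using the normal critical value 1.96 (an assumption)
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error
print(f"mean = {sample_mean:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```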
Hypothesis testing is another way of checking if a statement about a
population is true. More precisely, it checks how well the sample data support
a hypothesis.
Some examples of statements or questions that can be checked with hypothesis
testing:
Are people in the Netherlands taller than people in Denmark?
Do people prefer Pepsi or Coke?
Does a new medicine cure a disease?

Causal Inference
Causal inference is used to investigate if something causes another thing.
For example: Does rain make plants grow?
If we think two things are related we can investigate to see if they correlate.
Statistics can be used to find out how strong this relation is.
Even if things are correlated, finding out if something is caused by other things
can be difficult. It can be done with good experimental design or other special
statistical techniques.
Note: Good experimental design is often difficult to achieve because of ethical concerns or other practical
reasons.

Statistics - Prediction and Explanation
Some types of statistical methods are focused on predicting what will happen.
Other types of statistical methods are focused on explaining how things are connected.
Prediction
Some statistical methods are not focused on explaining how things are connected. Only the
accuracy of prediction is important.
Many statistical methods are successful at predicting without giving insight into how things are
connected.
Some types of machine learning let computers do the hard work, but the way they predict is
difficult to understand. These approaches can also be vulnerable to mistakes if the circumstances
change, since how they work is less clear.

Note: Predictions about future events are called forecasts. Not all predictions are about the future.
Some predictions can be about something else that is unknown, even if it is not in the future.

Explanation
Different statistical methods are often used for explaining how things are connected. These
statistical methods may not make good predictions.
These statistical methods often explain only small parts of the whole situation. But, if you only
want to know how a few things are connected, the rest might not matter.
If these methods accurately explain how all the relevant things are connected, they will also be
good at prediction. But managing to explain every detail is often challenging.
Sometimes we are specifically interested in figuring out if one thing causes another. This is called
causal inference.
If we are looking at complicated situations, many things are connected. To figure out what causes
what, we need to untangle every way these things are connected.

Statistics - Population and Samples
Population: Everything in the group that we want to learn about.
Sample: A part of the population.
For good statistical analysis, the sample needs to be as "similar" as possible to the
population. If they are similar enough, we say that the sample is representative of the
population.
The sample is used to make conclusions about the whole population. If the sample is not
similar enough to the whole population, the conclusions could be useless.

Statistics - Parameters and Statistics
The terms 'parameter' and (sample) 'statistic' refer to key concepts that are closely related in
statistics.
They are also directly connected to the concepts of populations and samples.
Parameter: A number that describes something about the whole population.
Sample statistic: A number that describes something about the sample.
The parameters are the key things we want to learn about. The parameters are usually
unknown.
Sample statistics give us estimates for parameters.
There will always be some uncertainty about how accurate estimates are. More certainty
gives us more useful knowledge.
For every parameter we want to learn about we can get a sample and calculate a sample
statistic, which gives us an estimate of the parameter.
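The relationship between a parameter and a sample statistic can be sketched in a few lines. Here the population is a made-up list of ages, so its mean (the parameter) can be computed directly; in practice the parameter is usually unknown and only the sample statistic is available. All names and values are illustrative assumptions.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 10,000 people (made-up values)
population = [rng.randint(18, 90) for _ in range(10_000)]
population_mean = statistics.mean(population)   # the parameter

# A random sample of 100 people from that population
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)           # the sample statistic

print(population_mean, sample_mean)
```

The two numbers should be close but not identical, which is the uncertainty the text describes.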

Some Important Examples
Mean, median and mode are different types of
averages (typical values in a population).
For example:
The typical age of people in a country
The typical profits of a company
The typical range of an electric car
Variance and standard deviation are two
types of values describing how spread out
the values are.
A single class of students in a school would
usually be about the same age. The age of
the students will have low variance and
standard deviation.
A whole country will have people of all kinds
of different ages. The variance and standard
deviation of age in the whole country would
then be bigger than in a single school grade.
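The comparison above can be sketched with two small made-up lists of ages: one for a single school class and one spread across a whole country. The values are assumptions for illustration only.

```python
import statistics

# Hypothetical ages (made-up values)
class_ages = [15, 15, 16, 15, 16, 16, 15, 16]
country_ages = [2, 15, 23, 34, 45, 56, 67, 78, 89]

for name, ages in [("class", class_ages), ("country", country_ages)]:
    variance = statistics.pvariance(ages)   # population variance
    std_dev = statistics.pstdev(ages)       # population standard deviation
    print(name, round(variance, 2), round(std_dev, 2))
```

The class should show a much smaller variance and standard deviation than the country.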

Statistics - Study Types
A statistical study can be a part of the process of gathering data.
There are different types of studies. Some are better than others, but they might be harder to do.
Main Types of Statistical Studies
The main types of statistical studies are observational and experimental studies.
We are often interested in knowing if something is the cause of another thing.
Experimental studies are generally better than observational studies for investigating this, but
usually require more effort.
An observational study is when we observe and gather data without changing anything.

Experimental Studies
In an experimental study, the circumstances around the sample are changed. Usually, we compare
two groups from a population and these two groups are treated differently.
One example can be a medical study to see if a new medicine is effective.
One group receives the medicine and the other does not. These are the different circumstances
around those samples.
We can compare the health of both groups afterwards and see if the results are different.
Experimental studies can allow us to investigate causal relationships. A well designed experimental
study can be useful since it can isolate the relationship we are interested in from other effects.
Then we can be more confident that we are measuring the true effect.

Statistics - Sample Types
A study needs participants and there are different ways of gathering them.
Some methods are better than others, but they might be more difficult.
Different Types of Sampling Methods:
Random Sampling
A random sample is where every member of the population has an equal chance to be chosen.
Random sampling is the best. But, it can be difficult, or impossible, to make sure that it is completely
random.
Note: Every other sampling method is compared to how close it is to a random sample - the closer,
the better.
Convenience Sampling
A convenience sample is where the participants that are the easiest to reach are chosen.
Note: Convenience sampling is the easiest to do.
In many cases this sample will not be similar enough to the population, and the conclusions can
potentially be useless.

Systematic Sampling
A systematic sample is where the participants are chosen by some regular system.
For example:
The first 30 people in a queue
Every third on a list
The first 10 and the last 10
Stratified Sampling
A stratified sample is where the population is split into smaller groups called 'strata'.
The 'strata' can, for example, be based on demographics, like:
Different age groups
Professions
Stratification of a sample is the first step. Another sampling method (like random sampling) is used for
the second step of choosing participants from all of the smaller groups (strata).

Clustered Sampling
A clustered sample is where the population is split into smaller groups called 'clusters'.
The clusters are usually natural, like different cities in a country.
The clusters are chosen randomly for the sample.
All members of the clusters can participate in the sample, or members can be chosen
randomly from the clusters in a third step.
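The sampling methods described above can be sketched with Python's standard random module. The population below is hypothetical (60 people with a made-up age group and city), and the sample sizes are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 60 people with an age group and a city number
population = [{"id": i,
               "age_group": ("young", "middle", "senior")[i // 20],
               "city": i // 10}
              for i in range(60)]

# Random sampling: every member has an equal chance of being chosen
random_sample = rng.sample(population, 12)

# Systematic sampling: every third person on the list
systematic_sample = population[::3]

# Stratified sampling: split into strata, then sample randomly within each
strata = {}
for person in population:
    strata.setdefault(person["age_group"], []).append(person)
stratified_sample = [p for group in strata.values() for p in rng.sample(group, 4)]

# Clustered sampling: choose whole clusters (cities) at random
clusters = {}
for person in population:
    clusters.setdefault(person["city"], []).append(person)
chosen_cities = rng.sample(sorted(clusters), 2)
cluster_sample = [p for city in chosen_cities for p in clusters[city]]

print(len(random_sample), len(systematic_sample),
      len(stratified_sample), len(cluster_sample))
```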

Statistics - Data Types
Data can be of different types, which require different types of statistical methods to analyze.
Different Types of Data
There are two main types of data: Qualitative (or 'categorical') and quantitative (or 'numerical').
These main types also have different sub-types depending on their measurement level.
Qualitative Data
Information about something that can be sorted into different categories that can't be described
directly by numbers.
Examples:
• Brands
• Nationality
• Professions

With categorical data we can calculate statistics like proportions. For example, the proportion of
Indian people in the world, or the percent of people who prefer one brand to another.
Quantitative Data
Information about something that is described by numbers.
Examples:
• Income
• Age
• Height
With numerical data we can calculate statistics like the average income in a country, or the range
of heights of players in a football team.

Statistics - Measurement Levels
Different data types have different measurement levels.
Measurement levels are important for what types of statistics can be calculated and how to best
present the data.
The main types of data are Qualitative (categories) and Quantitative (numerical). These are further
split into the following measurement levels.
These measurement levels are also called measurement 'scales'.
Nominal Level
Categories (qualitative data) without any order.
Examples:
• Brand names
• Countries
• Colors

Ordinal level
Categories that can be ordered (from low to high), but the precise "distance" between each is not
meaningful.
Examples:
• Letter grade scales from F to A
• Military ranks
• Level of satisfaction with a product
Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And, is the grade B
also twice as good as C?
Exactly how much distance there is between grades is not clear and precise. If the grades are based on
amounts of points on a test, you can say that there is a precise "distance" on the point scale, but not
between the grades themselves.

Interval Level
Data that can be ordered and the distance between them is objectively meaningful. But there is no
natural 0-value where the scale originates.
Examples:
Years in a calendar
Temperature measured in Fahrenheit
Note: Interval scales are usually invented by people, like degrees of temperature.
0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between each degree (for
every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the scales do not agree on where 0 degrees
is.
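A short sketch of what this means in practice, using the standard Celsius to Fahrenheit conversion: differences between temperatures are meaningful on both scales, but ratios are not, because neither scale has a natural zero. The function name is a hypothetical helper for this example.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32

f10 = celsius_to_fahrenheit(10)   # 50.0
f20 = celsius_to_fahrenheit(20)   # 68.0

# Equal steps in Celsius give equal steps in Fahrenheit (1.8 per degree Celsius)
print(f20 - f10)            # 18.0

# Ratios are not preserved: 20 is twice 10, but 68 is not twice 50
print(20 / 10, f20 / f10)   # 2.0 1.36
```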

Statistics - Descriptive Statistics
Descriptive statistics gives us insight into data without having to look at all of it in
detail.
Key Features to Describe about Data:
Getting a quick overview of how the data is distributed is an important step in statistical methods.
We calculate key numerical values about the data that tell us about the distribution of the data.
We also draw graphs showing visually how the data is distributed.
Key Features of Data:
• Where is the center of the data? (location)
• How much does the data vary? (scale)
• What is the shape of the data? (shape)
These can be described by summary statistics (numerical values).

The Center of the Data
The center of the data is where most of the values are concentrated.
Different kinds of averages, like mean, median and mode, are measures of the
center.
Note: Measures of the center are also called location parameters, because they tell us something
about where data is 'located' on a number line.
The Variation of the Data
The variation of the data is how spread out the data are around the center.
Statistics like standard deviation, range and quartiles are measures of variation.
Note: Measures of variation are also called scale parameters.

The Shape of the Data
The shape of the data can refer to how the data are bunched up on either side
of the center.
Statistics like skew describe if the right or left side of the center is bigger. Skew is
one type of shape parameter.
Frequency Tables
One typical way of presenting data is with frequency tables.
A frequency table counts and orders data into a table. Typically, the data will need
to be sorted into intervals.
Frequency tables are often the basis for making graphs to visually present the data.

Visualizing Data
Different types of graphs are used for different kinds of data. For example:
• Pie charts for qualitative data
• Histograms for quantitative data
• Scatter plots for bivariate data
Graphs often have a close connection to numerical summary statistics.
For example, box plots show where the quartiles are.
Quartiles also tell us where the minimum and maximum values, range, interquartile
range, and median are.

Statistics - Frequency Tables
We can see that there is only
one winner from ages 10 to
19. And that the highest
number of winners are in their
60s.

Relative Frequency Tables
Relative frequency means the number of
times a value appears in the data
compared to the total amount. A
percentage is a relative frequency.
Here are the relative frequencies of ages
of Nobel Prize winners. Now, all the
frequencies are divided by the total (934)
to give percentages.

Cumulative Frequency Tables
Cumulative frequency counts up to a
particular value.
Here are the cumulative frequencies of
ages of Nobel Prize winners. Now, we can
see how many winners have been younger
than a certain age.
Cumulative frequency tables can also be
made with relative frequencies
(percentages).
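The three kinds of frequency table can be sketched in plain Python. The ages below are made-up values, not the Nobel Prize data, and the 10-year intervals are one possible choice.

```python
from collections import Counter

# Hypothetical ages (made-up values, not the Nobel Prize data)
ages = [17, 25, 31, 38, 42, 45, 47, 51, 53, 55,
        58, 60, 61, 62, 64, 66, 67, 69, 72, 88]

# Frequency: count how many ages fall in each 10-year interval
frequency = Counter((age // 10) * 10 for age in ages)

total = len(ages)
cumulative = 0
print("interval  freq  relative  cumulative")
for start in sorted(frequency):
    count = frequency[start]
    cumulative += count   # running total up to and including this interval
    print(f"{start}-{start + 9}  {count:>4}  {count / total:>8.0%}  {cumulative:>10}")
```

The last cumulative value equals the total number of observations, which is a quick check that no value was dropped.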

Statistics - Histograms
A histogram visually presents quantitative data.
A histogram is a widely used graph to show the
distribution of quantitative (numerical) data.
It shows the frequency of values in the data,
usually in intervals of values. Frequency is the
amount of times that value appeared in the data.
Each interval is represented with a bar, placed
next to the other intervals on a number line.
The height of the bar represents the frequency of
values in that interval.
Here is a histogram of the age of all 934 Nobel
Prize winners up to the year 2020:
This histogram uses age intervals from 10 to 19, 20
to 29, and so on.
Note: Histograms are similar to bar graphs, which
are used for qualitative data.

Bin Width
The intervals of values are often called 'bins'. And
the length of an interval is called 'bin width'.
We can choose any width. It is best with a bin width
that shows enough detail without being confusing.
Here is a histogram of the same Nobel Prize winner
data, but with bin widths of 5 instead of 10:
This histogram uses age intervals from 15 to
19, 20 to 24, 25 to 29, and so on.
Smaller intervals give a more detailed look at the
distribution of the age values in the data.
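The effect of bin width can be sketched with NumPy's histogram function, which returns the counts per bin without drawing anything. The ages are made-up values, not the Nobel Prize data.

```python
import numpy as np

# Hypothetical ages (made-up values, not the Nobel Prize data)
ages = [17, 25, 31, 38, 42, 45, 47, 51, 53, 55,
        58, 60, 61, 62, 64, 66, 67, 69, 72, 88]

# Bin edges every 10 years and every 5 years, covering ages 10 to 100
counts_10, edges_10 = np.histogram(ages, bins=np.arange(10, 101, 10))
counts_5, edges_5 = np.histogram(ages, bins=np.arange(10, 101, 5))

print(counts_10)   # 9 bins of width 10
print(counts_5)    # 18 bins of width 5
```

Both results count the same 20 values; the narrower bins only split them more finely.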

Statistics - Bar Graphs
A bar graph visually presents qualitative data.
Bar graphs are used to show the distribution of
qualitative (categorical) data.
It shows the frequency of values in the data.
Frequency is the amount of times that value
appeared in the data.
Each category is represented with a bar. The height
of the bar represents the frequency of values from
that category in the data.
Here is a bar graph of the number of people who
have won a Nobel Prize in each category up to the
year 2020:
Some of the categories have existed longer
than others. Multiple winners are also more
common in some categories. So there is a
different number of winners in each
category.
Note: Bar graphs are similar to histograms,
which are used for quantitative data

Statistics - Pie Charts
A pie chart visually presents qualitative data.
Pie graphs are used to show the distribution of
qualitative (categorical) data.
It shows the frequency or relative frequency of values
in the data.
Frequency is the amount of times that value appeared
in the data. Relative frequency is the percentage of the
total.
Each category is represented with a slice in the 'pie'
(circle). The size of each slice represents the frequency
of values from that category in the data.
Here is a pie chart of the number of people who have
won a Nobel Prize in each category up to the year 2020:
This pie chart shows relative frequency. So
each slice is sized by the percentage for
each category.
Some of the categories have existed
longer than others. Multiple winners are
also more common in some categories. So
there is a different number of winners in
each category.

Statistics - Box Plots
A box plot is a graph used to show key features of quantitative data.
A box plot is a good way to show many important features of quantitative
(numerical) data.
It shows the median of the data. This is the middle value of the data and one
type of an average value.
It also shows the range and the quartiles of the data. This tells us something
about how spread out the data is.
Note: Box plots are also called 'box and whiskers plots'.
Here is a box plot of the age of all the Nobel Prize winners up to the year
2020:

The median is the red line through
the middle of the 'box'. We can see
that this is just above the number
60 on the number line below. So
the middle value of age is 60 years.
The left side of the box is the
1st quartile. This is the value that
separates the first quarter, or 25%
of the data, from the rest. Here,
this is 51 years.
The right side of the box is the
3rd quartile. This is the value that
separates the first three quarters,
or 75% of the data, from the rest.
Here, this is 69 years.

The distance between the sides of
the box is called the inter-quartile
range (IQR). This tells us where the
'middle half' of the values are.
Here, half of the winners were
between 51 and 69 years.
The ends of the lines from the box
at the left and the right are the
minimum and maximum values in
the data. The distance between
these is called the range.
The youngest winner was 17 years
old, and the oldest was 97 years
old. So the range of the age of
winners was 80 years.

Statistics - Average
An average is a measure of where most of the values in the data are located.
The center of the data is where most of the values in the data are located. Averages are
measures of the location of the center.
There are different types of averages. The most commonly used are:
o Mean
o Median
o Mode
Note: In statistics, averages are often referred to as 'measures of central tendency'.
For example, using the values:
40, 21, 55, 21, 48, 13, 72

Median
The median is the 'middle value' of the data.
The median is found by ordering all the values in the data and picking the middle value:
13, 21, 21, 40, 48, 55, 72
The median is less influenced by extreme values in the data than the mean.
Changing the last value to 356 does not change the median:
13, 21, 21, 40, 48, 55, 356
The median is still 40.
Changing the last value to 356 changes the mean a lot:
(13 + 21 + 21 + 40 + 48 + 55 + 72)/7 = 38.57
(13 + 21 + 21 + 40 + 48 + 55 + 356)/7 = 79.14
Note: Extreme values are values in the data that are much smaller or larger than the average values in
the data.
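The comparison above can be checked with Python's standard statistics module, using the same two lists of values:

```python
import statistics

values = [13, 21, 21, 40, 48, 55, 72]
with_extreme = [13, 21, 21, 40, 48, 55, 356]

# The median stays the same when the last value becomes extreme
print(statistics.median(values), statistics.median(with_extreme))   # 40 40

# The mean changes a lot
print(round(statistics.mean(values), 2),
      round(statistics.mean(with_extreme), 2))                      # 38.57 79.14
```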

Mode
The mode is the value(s) that appears most often in the data:
40, 21, 55, 21, 48, 13, 72
Here, 21 appears two times, and the other values only once. The mode of this data is 21.
The mode is also used for categorical data, unlike the median and mean. Categorical data can't
be described directly with numbers, like names:
Alice, John, Bob, Maria, John, Julia, Carol
Here, John appears two times, and the other values only once. The mode of this data is John.
Note: There can be more than one mode if multiple values appear the same number of times in the
data.

Statistics - Mean
The mean is a type of average value, which describes where the center of the data is located.
Mean
The mean is usually referred to as 'the average'.
The mean is the sum of all the values in the data divided by the total number of values in the data.
The mean is calculated for numerical variables. A variable is something in the data that can vary,
like:
Age
Height
Income
Note: There are multiple types of mean values. The most common type of mean is the arithmetic mean.
In this tutorial 'mean' refers to the arithmetic mean.

Calculating the Mean
You can calculate the mean for both the population and the sample.
The formulas are the same but use different symbols to refer to the population mean (μ) and the sample mean
(x̄).
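Written out, the definition of the mean (the sum of all the values divided by the number of values) can be expressed as follows. The symbols N (population size), n (sample size) and x_i (the i-th value) are conventional notation assumed here, not taken from the original text.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```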

Calculation with Programming
The mean can easily be calculated with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating by hand becomes difficult.
Example
With Python use the NumPy library mean() method to find the mean of the values 4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.mean(values)
print(x)
o/p: 9.0

Use the R mean() function to find the mean of the
values 4,11,7,14:
values <- c(4,11,7,14)
mean(values)
o/p:
[1] 9

Statistics - Median
The median is a type of average value, which describes where the center of the data is located.
The median is the middle value in a data set ordered from low to high.
Finding the Median
The median can only be calculated for numerical variables.
The formula for finding the position of the middle value is:
(n + 1) / 2
Where n is the total number of observations.
If the total number of observations is an odd number, the formula gives a whole number and the
value of this observation is the median.
13, 21, 21, 40, 48, 55, 72
Here, there are 7 total observations, so the median is the 4th value:
(7 + 1) / 2 = 4
The 4th value in the ordered list is 40, so that is the median.

If the total number of observations is an even number, the formula gives a
decimal number between two observations.
13, 21, 21, 40, 42, 48, 55, 72
Here, there are 8 total observations, so the median is between the 4th and 5th values:
(8 + 1) / 2 = 4.5
The 4th and 5th values in the ordered list are 40 and 42, so the median is the mean of these two
values. That is, the sum of those two values divided by 2:
(40 + 42) / 2 = 41
Note: It is important that the numbers are ordered before you can find the median.

Finding the Median with Programming
The median can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library median() method to find the median of the values 13, 21, 21,
40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.median(values)
print(x)
o/p: 41.0

Statistics - Mode
The mode is a type of average value, which describes where most of the data is located.
Mode
The mode is the value(s) that are the most common in the data.
A dataset can have multiple values that are modes.
A distribution of values with only one mode is called unimodal.
A distribution of values with two modes is called bimodal. In general, a distribution with more than
one mode is called multimodal.
Mode can be found for both categorical and numerical data.

Finding the Mode
Here is a numerical example:
4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12
Both 7 and 12 appear two times each, and the other values only once. The modes of this data are 7
and 12.
Here is a categorical example with names:
Alice, John, Bob, Maria, John, Julia, Carol
John appears two times, and the other values only once. The mode of this data is John.

Finding the Mode with Programming
The mode can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating manually becomes difficult.
Example
With Python use the statistics library multimode() method to find the modes of the values
4,7,3,8,11,7,10,19,6,9,12,12:
from statistics import multimode
values = [4,7,3,8,11,7,10,19,6,9,12,12]
x = multimode(values)
print(x)
o/p: [7, 12]

Statistics - Variation
Variation is a measure of how spread out the data is around the center of the data.
Measures of variation are statistics of how far away the values in the observations (data points) are
from each other.
There are different measures of variation. The most commonly used are:
o Range
o Quartiles and Percentiles
o Interquartile Range
o Standard Deviation
Measures of variation combined with an average (measure of center) give a good picture of the
distribution of the data.
Note: These measures of variation can only be calculated for numerical data.

Range
The range is the difference between the smallest and the largest value of the data.
Range is the simplest measure of variation.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
range:
The youngest winner was 17
years and the oldest was 97
years. The range of ages for
Nobel Prize winners is then 80
years.

Quartiles and Percentiles
Quartiles and percentiles are ways of separating equal numbers of values in the data into parts.
Quartiles are values that separate the data into four equal parts.
Percentiles are values that separate the data into 100 equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
quartiles:
The quartiles (Q0, Q1, Q2, Q3, Q4) are the values that separate each quarter.
Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.
o Q0 is the smallest value in the data.
o Q2 is the middle value (median).
o Q4 is the largest value in the data.

Interquartile Range
Interquartile range is the difference between the first and third quartiles (Q1 and Q3).
The 'middle half' of the data is between the first and third quartile.
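Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
interquartile range (IQR):
Here, the middle half of the ages is between 51 and 69 years. The interquartile range for
Nobel Prize winners is then 18 years.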

Standard Deviation
Standard deviation is the most used measure of variation.
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).
Standard deviation is important for many statistical methods.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard
deviations:
Note: Values within one standard deviation (σ) are considered to be typical. Values outside three
standard deviations are considered to be outliers.

Statistics - Range
The range is a measure of variation, which describes how spread out the data is.

Calculating the Range
The range can only be calculated for numerical data.
First, find the smallest and largest values of this example:
13, 21, 21, 40, 48, 55, 72
Calculate the difference by subtracting the smallest from the largest:
72 - 13 = 59

Calculating the Range with Programming
The range can easily be found with many programming languages.
Example
With Python use the NumPy library ptp() method to find the range of the values 13, 21, 21, 40, 48,
55, 72:
import numpy
values = [13,21,21,40,48,55,72]
x = numpy.ptp(values)
print(x)
o/p: 59
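The manual calculation can also be written with Python's built-in max() and min() functions. This is only a sketch of the same subtraction, without NumPy:

```python
values = [13, 21, 21, 40, 48, 55, 72]

# Range = largest value minus smallest value
value_range = max(values) - min(values)
print(value_range)
```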

Statistics - Quartiles and Percentiles
Quartiles and percentiles are measures of variation, which describe how spread out the data is.
Quartiles and percentiles are both types of quantiles.
Quartiles
Quartiles are values that separate the data into four equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
quartiles:
The quartiles (Q0, Q1, Q2, Q3, Q4) are the values that separate each quarter.
Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.
o Q0 is the smallest value in the data.
o Q1 is the value separating the first quarter from the second quarter of the data.
o Q2 is the middle value (median), separating the bottom from the top half.
o Q3 is the value separating the third quarter from the fourth quarter.
o Q4 is the largest value in the data.

Calculating Quartiles with Programming
Quartiles can easily be found with many programming languages.
Example
With Python use the NumPy library quantile() method to find the quartiles of the values 13, 21, 21,
40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
O/P : [13. 21. 41. 49.75 72. ]

Statistics - Interquartile Range
Interquartile range is a measure of variation, which describes how spread out the data is.
Interquartile Range is the difference between the first and third quartiles (Q1 and Q3).
The 'middle half' of the data is between the first and third quartile.
The first quartile is the value in the data that separates the bottom 25% of values from the top
75%.
The third quartile is the value in the data that separates the bottom 75% of the values from the top
25%.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
interquartile range (IQR):
Here, the middle half of the ages is between 51 and 69 years. The interquartile range for Nobel Prize
winners is then 18 years.

Calculating the Interquartile Range with Programming
The interquartile range can easily be found with many programming languages.
Example
With Python use the SciPy library iqr() method to find the interquartile range of the values 13, 21,
21, 40, 42, 48, 55, 72:
from scipy import stats
values = [13,21,21,40,42,48,55,72]
x = stats.iqr(values)
print(x)
O/P : 28.75
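To see where this number comes from, the interquartile range can also be computed as Q3 minus Q1. The sketch below uses the standard library statistics.quantiles() function instead of SciPy. The method="inclusive" setting is chosen here because it reproduces the quartiles NumPy returned for these values earlier; other quartile conventions can give slightly different results.

```python
from statistics import quantiles

values = [13, 21, 21, 40, 42, 48, 55, 72]

# n=4 splits the data into four parts and returns the cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# Interquartile range = third quartile minus first quartile
iqr = q3 - q1
print(q1, q3, iqr)
```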

Statistics - Standard Deviation
Standard deviation is the most commonly used measure of variation, which describes how spread
out the data is.
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).
Standard deviation is important for many statistical methods.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard
deviations:
Each dotted line in the histogram shows a shift of one extra standard deviation.
If the data is normally distributed:
o Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
o Roughly 95.4% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
o Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
Note: A normal distribution has a "bell" shape and spreads out equally on both sides.

Calculating the Standard Deviation
You can calculate the standard deviation for both the population and the sample.
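The formulas are almost the same and use different symbols to refer to the population standard
deviation (σ) and the sample standard deviation (s).
Calculating the standard deviation (σ) is done with this formula:
σ = √( Σ(xᵢ − μ)² / n )
Calculating the sample standard deviation (s) is done with this formula:
s = √( Σ(xᵢ − x̄)² / (n − 1) )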

Calculating the Standard Deviation with Programming
The standard deviation can easily be calculated with many programming languages.
Population Standard Deviation
Example
With Python use the NumPy library std() method to find the standard deviation of the values
4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.std(values)
print(x)
o/p:
3.8078865529319543

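Sample Standard Deviation
Example
With Python use the NumPy library std() method with ddof=1 to find the sample standard deviation
of the values 4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.std(values, ddof=1)
print(x)
o/p: 4.396968652757639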
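The NumPy results can be reproduced step by step from the definitions. This is a minimal sketch using only the standard library; the variable names are illustrative. The only difference between the two results is whether the sum of squared differences is divided by n (population) or by n - 1 (sample).

```python
import math

values = [4, 11, 7, 14]
n = len(values)
mean = sum(values) / n

# Sum of squared distances from the mean
squared_diffs = sum((x - mean) ** 2 for x in values)

population_sd = math.sqrt(squared_diffs / n)    # divide by n
sample_sd = math.sqrt(squared_diffs / (n - 1))  # divide by n - 1

print(population_sd)
print(sample_sd)
```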

Statistics - Statistical Inference
Statistical Inference
Using data analysis and statistics to make conclusions about a population is called statistical
inference.
The main types of statistical inference are:
• Estimation
• Hypothesis testing
Estimation
Statistics from a sample are used to estimate population parameters.
The most likely value is called a point estimate.
There is always uncertainty when estimating.
The uncertainty is often expressed as confidence intervals defined by a likely lowest and highest
value for the parameter.
An example could be a confidence interval for the number of bicycles a Dutch person owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."

Hypothesis Testing
Hypothesis testing is a method to check if a claim about a population is true. More precisely, it
checks how likely it is that a hypothesis is true, based on the sample data.
There are different types of hypothesis testing.
The steps of the test depend on:
Type of data (categorical or numerical)
If you are looking at:
o A single group
o Comparing one group to another
o Comparing the same group before and after a change
Some examples of claims or questions that can be checked with hypothesis testing:
o 90% of Australians are left-handed
o Is the average weight of dogs more than 40kg?
o Do doctors make more money than lawyers?

Statistics - Normal Distribution
The normal distribution is an important probability distribution used in statistics.
Many real world examples of data are normally distributed.
Normal Distribution
The normal distribution is described by the mean (μ) and the standard deviation (σ).
The normal distribution is often referred to as a 'bell curve' because of its shape:
Most of the values are around the center (μ)
The median and mean are equal
It has only one mode
It is symmetric, meaning it decreases the same amount on the left and the right of the center
The area under the curve of the normal distribution represents probabilities for the data.
The area under the whole curve is equal to 1, or 100%
Here is a graph of a normal distribution with probabilities between standard deviations (σ):

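• Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
• Roughly 95.4% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
• Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)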
Note: Probabilities of the normal distribution can only be calculated for intervals (between two values).
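These interval probabilities can be checked numerically. The sketch below uses the standard library NormalDist class (available from Python 3.8) and computes the area between -k and +k standard deviations as a difference of two cumulative probabilities. The variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

shares = {}
for k in (1, 2, 3):
    # Probability of landing between -k and +k standard deviations
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(shares[k] * 100, 1))
```

Because the interval is measured in standard deviations, the same shares apply to any normal distribution, whatever its mean and standard deviation.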

Different Mean and Standard Deviations
The mean describes where the center of the normal distribution is.
Here is a graph showing three different normal distributions with the same standard deviation but
different means.
The standard deviation describes how spread out the normal distribution is.
Here is a graph showing three different normal distributions with the same mean but different
standard deviations.

The purple curve has the biggest standard deviation and the black curve has the smallest standard
deviation.
The area under each of the curves is still 1, or 100%.

A Real Data Example of Normally Distributed Data
Real world data is often normally distributed.
Here is a histogram of the age of Nobel Prize winners when they won the prize:
The normal distribution drawn on top of the histogram is based on the population mean (μ) and
standard deviation (σ) of the real data.
We can see that the histogram is close to a normal distribution.
Examples of real world variables that can be normally distributed:
• Test scores
• Height
• Birth weight

Probability Distributions
Probability distributions are functions that calculate the probabilities of the outcomes of random
variables.
Typical examples of random variables are coin tosses and dice rolls.
Here is a graph showing the results of a growing number of coin tosses and the expected values
of the results (heads or tails).
The expected values of the coin toss are the probability distribution of the coin toss.
Notice how the result of random coin tosses gets closer to the expected values (50%) as the number
of tosses increases.
Similarly, here is a graph showing the results of a growing number of dice rolls and the expected
values of the results (from 1 to 6).

Notice again how the result of random dice rolls gets closer to the expected values (1/6, or
16.666%) as the number of rolls increases.
When the random variable is a sum of dice rolls, the results and expected values take a different
shape.
The different shape comes from there being more ways of getting a sum near the middle than a small
or large sum.

As we keep increasing the number of dice for a sum, the shape of the results and expected
values looks more and more like a normal distribution.
Many real world variables follow a similar pattern and naturally form normal distributions.
Normally distributed variables can be analyzed with well-known techniques.
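The behaviour described in this section can be tried out with a small simulation. This is only a sketch using the standard library random module; the seed, the number of tosses and the number of rolls are arbitrary choices made for the illustration.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Share of heads for a growing number of coin tosses
heads_share = {}
for tosses in (10, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(tosses))
    heads_share[tosses] = heads / tosses
    print(tosses, heads_share[tosses])

# Sum of two dice: middle sums can be made in more ways than small or large sums
rolls = 60_000
sums = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(rolls))
for total in sorted(sums):
    print(total, round(sums[total] / rolls, 3))
```

With many tosses the share of heads settles near 50%, and the sums of two dice pile up around 7, which is the start of the bell-like shape.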

Statistics - Standard Normal Distribution
The standard normal distribution is a normal distribution where the mean is 0 and the standard
deviation is 1.
Normally distributed data can be transformed into a standard normal distribution.
Standardizing normally distributed data makes it easier to compare different sets of data.
The standard normal distribution is used for:
• Calculating confidence intervals
• Hypothesis tests
Here is a graph of the standard normal distribution with probability values (p-values) between the
standard deviations:

Standardizing makes it easier to calculate probabilities.
The functions for calculating probabilities are complex and difficult to calculate by hand.
Typically, probabilities are found by looking up tables of pre-calculated values, or by using
software and programming.
The standard normal distribution is also called the 'Z-distribution' and the values are called
'Z-values' (or Z-scores).

Z-Values
Z-values express how many standard deviations from the mean a value is.
The formula for calculating a Z-value is:
Z=(x−μ)/σ
x is the value we are standardizing, μ is the mean, and σ is the standard deviation.
For example, if we know that:
The mean height of people in Germany is 170 cm (μ)
The standard deviation of the height of people in Germany is 10 cm (σ)
Bob is 200 cm tall (x)
Bob is 30 cm taller than the average person in Germany.
30 cm is 3 times 10 cm. So Bob's height is 3 standard deviations larger than the mean height in
Germany.
Using the formula:
Z = (x − μ)/σ = (200 − 170)/10 = 30/10 = 3
The Z-value of Bob's height (200 cm) is 3.

Finding the P-value of a Z-Value
Using a Z-table or programming we can calculate how many people in Germany are shorter than
Bob and how many are taller.
Example
With Python use the Scipy Stats library norm.cdf() function to find the probability of getting less
than a Z-value of 3:
import scipy.stats as stats
print(stats.norm.cdf(3))
O/P: 0.9986501019683699
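Putting Bob's example together, the sketch below standardizes his height and then reads off the shares of people shorter and taller than him. It uses the standard library NormalDist class instead of SciPy and assumes, as the example does, that heights are normally distributed. The variable names are illustrative.

```python
from statistics import NormalDist

mean_height = 170  # μ, in cm
sd_height = 10     # σ, in cm
bob_height = 200   # x, in cm

# Z = (x − μ) / σ
z = (bob_height - mean_height) / sd_height

# Area to the left of z is the share of people shorter than Bob
share_shorter = NormalDist().cdf(z)
share_taller = 1 - share_shorter

print(z)
print(share_shorter, share_taller)
```

Under these assumptions about 99.87% of people are shorter than Bob and about 0.13% are taller.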

Statistics - Student's T Distribution
The student's t-distribution is similar to a normal distribution and used in statistical inference to
adjust for uncertainty. It is used for estimation and hypothesis testing of a population mean
(average).
The t-distribution is adjusted for the extra uncertainty of estimating the mean.
If the sample is small, the t-distribution is wider. If the sample is big, the t-distribution is narrower.
The bigger the sample size is, the closer the t-distribution gets to the standard normal
distribution.

Notice how some of the curves have bigger tails. This is due to the uncertainty from a smaller
sample size.
The green curve has the smallest sample size.
For the t-distribution this is expressed as 'degrees of freedom' (df), which is calculated by
subtracting 1 from the sample size (n).
For example, a sample size of 30 will make 29 degrees of freedom for the t-distribution.
The t-distribution is used to find critical t-values and p-values (probabilities) for estimation
and hypothesis testing.
Note: Finding the critical t-values and p-values of the t-distribution is similar to finding
z-values and p-values of the standard normal distribution. But make sure to use the correct
degrees of freedom.

Finding the P-Value of a T-Value
You can find the p-values of a t-value by using a t-table or with programming.
Example
With Python use the Scipy Stats library t.cdf() function to find the probability of getting less
than a t-value of 2.1 with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.cdf(2.1, 29))
O/P: 0.9777290209818548
Finding the T-value of a P-Value
You can find the t-values of a p-value by using a t-table or with programming.
Example
With Python use the Scipy Stats library t.ppf() function to find the t-value separating the top 25%
from the bottom 75% with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.ppf(0.75, 29))
O/P: 0.6830438592467808

Statistics - Estimation
A point estimate is the most likely value for a population parameter.
Confidence intervals express the uncertainty of an estimated population parameter.
A point estimate is calculated from a sample.
The point estimate depends on the type of data:
• Categorical data: the number of occurrences divided by the sample size.
• Numerical data: the mean (the average) of the sample.
One example could be:
The point estimate for the average height of people in Denmark is 180 cm.
Estimates are always uncertain. This uncertainty can be expressed with a confidence interval.
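The two kinds of point estimate can be sketched in a few lines of Python. The sample values below are hypothetical and were invented only for this illustration; they are not real survey data.

```python
# Hypothetical sample data, invented only for this illustration
heights_cm = [178, 182, 175, 185, 180]
handedness = ["right", "left", "right", "right", "right",
              "right", "right", "right", "left", "right"]

# Numerical data: the point estimate is the sample mean
mean_height = sum(heights_cm) / len(heights_cm)

# Categorical data: number of occurrences divided by the sample size
share_left = handedness.count("left") / len(handedness)

print(mean_height)
print(share_left)
```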

Confidence Intervals
The confidence interval is defined by a lower bound and an upper bound.
This gives us a range of values that the true parameter is likely to be between.
For example:
The average height of people in Denmark is between 170 cm and 190 cm.
Here, 170 cm is the lower bound, and 190 cm is the upper bound.
The lower and upper bounds of a confidence interval are based on the confidence level.
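As a rough sketch of how the confidence level turns into bounds, the code below builds an interval for a mean as "point estimate plus or minus a margin of error". The sample size, mean and standard deviation are hypothetical numbers invented for the illustration. The sketch also assumes a critical value from the standard normal distribution; for small samples a critical t-value with the correct degrees of freedom would normally be used instead.

```python
import math
from statistics import NormalDist

# Hypothetical sample summary, invented only for this illustration
n = 100              # sample size
sample_mean = 180.0  # cm
sample_sd = 10.0     # cm

confidence_level = 0.95
alpha = 1 - confidence_level

# Critical z-value that leaves alpha/2 in each tail
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Margin of error = critical value times the standard error of the mean
margin = z_critical * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin
print(round(lower, 2), round(upper, 2))
```

A higher confidence level gives a larger critical value and therefore a wider interval.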

Statistics - Hypothesis Testing
Hypothesis testing is a formal way of checking if a hypothesis about a population is true or not.
A hypothesis is a claim about a population parameter.
A hypothesis test is a formal procedure to check if a hypothesis is true or not.
Examples of claims that can be checked:
The average height of people in Denmark is more than 170 cm.
The share of left-handed people in Australia is not 10%.
The average income of dentists is less than the average income of lawyers.

The Null and Alternative Hypothesis
Hypothesis testing is based on making two different claims about a population parameter.
The null hypothesis (H0) and the alternative hypothesis (H1) are the claims.
The two claims need to be mutually exclusive, meaning only one of them can be true.
The alternative hypothesis is typically what we are trying to prove.
For example, we want to check the following claim:
"The average height of people in Denmark is more than 170 cm."
In this case, the parameter is the average height of people in Denmark (μ).
The null and alternative hypothesis would be:
Null hypothesis: The average height of people in Denmark is 170 cm.
Alternative hypothesis: The average height of people in Denmark is more than 170 cm.

The claims are often expressed with symbols like this:
H0: μ = 170
H1: μ > 170
If the data supports the alternative hypothesis, we reject the null hypothesis and accept the
alternative hypothesis.
If the data does not support the alternative hypothesis, we keep the null hypothesis.
Note: The alternative hypothesis is also referred to as HA.

The Significance Level
The significance level (α) is the uncertainty we accept when rejecting the null hypothesis in
the hypothesis test.
The significance level is a percentage probability of accidentally making the wrong
conclusion.
Typical significance levels are:
• α=0.1 (10%)
• α=0.05 (5%)
• α=0.01 (1%)
A lower significance level means that the evidence in the data needs to be stronger to reject the null
hypothesis.
There is no "correct" significance level - it only states the uncertainty of the conclusion.
Note: A 5% significance level means that when we reject a null hypothesis:
We expect to reject a true null hypothesis 5 out of 100 times.

The Critical Value and P-Value Approach
There are two main approaches used for hypothesis tests:
The critical value approach compares the test statistic with the critical value of the significance
level.
The p-value approach compares the p-value of the test statistic with the significance level.
The Critical Value Approach
The critical value approach checks if the test statistic is in the rejection region.
The rejection region is an area of probability in the tails of the distribution.
The size of the rejection region is decided by the significance level (α).
The value that separates the rejection region from the rest is called the critical value.

Here is a graphical illustration:
If the test statistic is inside this rejection region, the null hypothesis is rejected.
For example, if the test statistic is 2.3 and the critical value is 2 for a significance level
(α=0.05):
We reject the null hypothesis (H0) at a 0.05 significance level (α)

The P-Value Approach
The p-value approach checks if the p-value of the test statistic is smaller than the significance level (α).
The p-value of the test statistic is the area of probability in the tails of the distribution from the
value of the test statistic.
Here is a graphical illustration:

If the p-value is smaller than the significance level, the null hypothesis is rejected.
The p-value directly tells us the lowest significance level where we can reject the null
hypothesis.
For example, if the p-value is 0.03:
We reject the null hypothesis (H0) at a 0.05 significance level (α)
We keep the null hypothesis (H0) at a 0.01 significance level (α)
Note: The two approaches are only different in how they present the conclusion.
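Both decision rules are simple comparisons, which the sketch below writes out as two small helper functions. The function names are hypothetical and chosen for the illustration, and the critical value rule is written for a right-tailed test, which is an assumption since the tail is not stated in the example.

```python
def reject_by_critical_value(test_statistic, critical_value):
    # Right-tailed test: reject when the statistic falls in the rejection region
    return test_statistic > critical_value


def reject_by_p_value(p_value, alpha):
    # Reject when the p-value is smaller than the significance level
    return p_value < alpha


print(reject_by_critical_value(2.3, 2))  # test statistic 2.3, critical value 2
print(reject_by_p_value(0.03, 0.05))     # p-value 0.03 at α = 0.05
print(reject_by_p_value(0.03, 0.01))     # p-value 0.03 at α = 0.01
```

True means the null hypothesis is rejected and False means it is kept, matching the two examples in the text.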

Steps for a Hypothesis Test
The following steps are used for a hypothesis test:
1. Check the conditions
2. Define the claims
3. Decide the significance level
4. Calculate the test statistic
5. Conclusion
One condition is that the sample is randomly selected from the population.
The other conditions depend on what type of parameter you are testing the hypothesis for.
Common parameters to test hypotheses are:
o Proportions (for qualitative data)
o Mean values (for numerical data)

We are missing one important variable that affects Calorie_Burnage, which is the
Duration of the training session.
Duration in combination with Average_Pulse will together explain Calorie_Burnage
more precisely.

Data Science - Linear Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning and in statistical modeling, that relationship is used to predict the
outcome of events.
In this module, we will cover the following questions:
Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage?
Can we use Average_Pulse and Duration to predict Calorie_Burnage?

Least Square Method
Linear regression uses the least square method.
The concept is to draw a line through the plotted data points. The line is positioned so that the
sum of the squared distances to the data points is as small as possible.
The distances are called "residuals" or "errors".
The red dashed lines represent the distance from the data points to the drawn mathematical
function.
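To make the idea concrete, here is a minimal sketch that fits a straight line with the closed-form least squares formulas and then computes the residuals. The five data points are hypothetical and invented only for this illustration; they are not taken from the health data set.

```python
# Hypothetical data points, invented only for this illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least squares solution for a line y = slope * x + intercept
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals: vertical distances between the observed points and the line
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

print(slope, intercept)
print(sum(r ** 2 for r in residuals))
```

No other straight line gives a smaller sum of squared residuals for these points, which is what "least square" refers to.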

Linear Regression Using One Explanatory Variable
In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear
Regression:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
# Load the health data set from data.csv
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]
# Fit a straight line with the least square method
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    # Predict Calorie_Burnage for a given Average_Pulse
    return slope * x + intercept
mymodel = list(map(myfunc, x))
# Plot the observations and the fitted line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()

Do you think that the line is able to predict Calorie_Burnage precisely?
We will show that the variable Average_Pulse alone is not enough to make a precise prediction of
Calorie_Burnage.
