Finish this Book and get your
Data Scientist Job
Data Science Introduction
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
What is Data Science?
Data Science is about data gathering, analysis and decision-making.
Data Science is about finding patterns in data through analysis and using them to make future
predictions.
By using Data Science, companies are able to make:
Better decisions (should we choose A or B)
Predictive analysis (what will happen next?)
Pattern discoveries (find pattern, or maybe hidden information in the data)
Where is Data Science Needed?
 Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.
 Examples of where Data Science is needed:
 For route planning: To discover the best routes to ship
 To foresee delays for flight/ship/train etc. (through predictive analysis)
 To create promotional offers
 To find the best suited time to deliver goods
  To forecast the next year's revenue for a company
  To analyze the health benefits of training
 To predict who will win elections
 Where is Data Science Needed?
 Data Science can be applied in nearly every part of a business where
data is available. Examples are:
 Consumer goods
 Stock markets
 Industry
 Politics
 Logistic companies
 E-commerce
 How Does a Data Scientist Work?
 A Data Scientist requires expertise in several backgrounds:
 Machine Learning
 Statistics
 Programming (Python or R)
 Mathematics
 Databases
 A Data Scientist must find patterns within the data. Before he/she can find
the patterns, he/she must organize the data in a standard format.
Here is how a Data Scientist works:
Ask the right questions - To understand the business problem.
Explore and collect data - From database, web logs, customer feedback,
etc.
Extract the data - Transform the data to a standardized format.
Clean the data - Remove erroneous values from the data.
Find and replace missing values - Check for missing values and replace
them with a suitable value (e.g. an average value).
Normalize data - Scale the values to a common, practical range (e.g. 140 cm is
shorter than 1.8 m, yet the number 140 is larger than 1.8, so scaling
is important).
Analyze data, find patterns and make future predictions.
Represent the result - Present the result with useful insights in a way the
"company" can understand.
 What is Data?
 Data is a collection of information.
 One purpose of Data Science is to structure data, making it interpretable
and easy to work with.
 Data can be categorized into two groups:
 Structured data
 Unstructured data
Unstructured Data
Unstructured data is not organized.
We must organize the data for analysis
purposes.
Structured Data
Structured data is
organized and
easier to work with.
Data Types?
 How to Structure Data?
We can use an array or a database table to structure or present data.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
The following example shows how to create an array in Python:
#Example
Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(Array)
o/p: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
Data Science - Database Table:
What is Database Table?
A database table is a table with structured data.
The following table shows a database table with health data extracted from a sports watch:
This dataset contains
information of a typical
training session such as
duration, average pulse,
calorie burnage etc.
Data Science - Database Table:
Database Table Structure
A database table consists of column(s) and row(s):
A row is a
horizontal
representation of
data.
A column is a
vertical
representation of
data.
Data Science - Database Table:
Variables
A variable is defined as something that can be measured or counted.
Examples can be
characters,
numbers or time.
In the example
under, we can
observe that each
column represents
a variable.
Data Science - Database Table:
Variables
A variable is defined as something that can be measured or counted.
There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse,
Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep).
There are 11 rows, meaning that
each variable has 10 observations.
But if there are 11 rows, how come
there are only 10 observations?
It is because the first row is the
label, meaning that it is the name
of the variable.
Data Science & Python
Python
Python is a programming language widely used by Data Scientists.
Python has in-built mathematical libraries and functions, making it easier to calculate
mathematical problems and to perform data analysis.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
In this course, we will use the following libraries:
 Pandas - This library is used for structured data operations, such as importing CSV files, creating
dataframes, and preparing data
 NumPy - This is a mathematical library. It has a powerful N-dimensional array object, linear
algebra, Fourier transform, etc.
 Matplotlib - This library is used for visualization of data.
 SciPy - This library has linear algebra modules
Data Science - Python DataFrame
Create a DataFrame with Pandas
A data frame is a structured representation of data.
Let's define a data frame with 3 columns and 5 rows with fictional numbers:
Example
import pandas as pd
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)
print(df)
Example Explained
Import the Pandas library as pd
Define data with column and rows in a variable named d
Create a data frame using the function pd.DataFrame()
The data frame contains 3 columns and 5 rows
Print the data frame output with the print() function
Data Science - Python DataFrame
We write pd. in front of DataFrame() to let Python know that we want to activate the DataFrame()
function from the Pandas library.
Be aware of the capital D and F in DataFrame!
Interpreting the Output : This is the output:
We see that "col1", "col2" and "col3" are the names of the
columns.
Do not be confused about the vertical numbers ranging from 0-
4. They tell us the information about the position of the rows.
In Python, the numbering of rows starts with zero.
Now, we can use Python to count the columns and rows.
We can use df.shape[1] to find the number of columns:
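As a short sketch of the counting step, the snippet below rebuilds the same data frame and reads both dimensions from df.shape. The names count_column and count_row are chosen here for illustration only.

```python
import pandas as pd

# Same fictional data frame as in the example above
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# df.shape is a tuple (rows, columns)
count_column = df.shape[1]  # index 1 holds the number of columns
count_row = df.shape[0]     # index 0 holds the number of rows

print("Number of columns:", count_column)
print("Number of rows:", count_row)
```

The row count does not include the header line, because the column names are labels rather than observations.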
Data Science Functions
The Sports Watch Data Set
The data set above consists of 6 variables,
each with 10 observations:
 Duration - How long did the training
session last, in minutes?
 Average_Pulse - What was the average
pulse of the training session? This is
measured by beats per minute
 Max_Pulse - What was the max pulse of
the training session?
 Calorie_Burnage - How many calories
were burned during the training session?
 Hours_Work - How many hours did we
work at our job before the training
session?
 Hours_Sleep - How much did we sleep
the night before the training session?
We use underscores (_) instead of spaces in
the variable names because a name containing
a space is awkward to work with in Python.
Data Science Functions
The min() function
The Python min() function is used to find the lowest value in an array.
Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print (Average_pulse_min)
o/p: 80
Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
print (Average_pulse_max)
o/p: 125
Data Science Functions
The max() function
The Python max() function is used to find the highest value in an array.
The mean() function
The NumPy mean() function is used to find the average value of an array.
import numpy as np
Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
Average_calorie_burnage = np.mean(Calorie_burnage)
print(Average_calorie_burnage)
o/p: 285.0
Extract and Read Data With Pandas
• Before analyzing data, a Data Scientist must extract the data, and
make it clean and valuable.
• Before data can be analyzed, it must be imported/extracted.
In the example below, we show you how to import data using Pandas in Python.
We use the read_csv() function to import a CSV file with the health data:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data)
Data Science - Data Preparation
Example Explained
• Import the Pandas library
• Name the data frame as health_data.
• header=0 means that the headers for the variable names are to be found in the first row
(note that 0 means the first row in Python)
• sep="," means that "," is used as the separator between the values. This is because we are
using the file type .csv (comma separated values)
• Tip: If you have a large CSV file, you can use the head() function to only show the top
5 rows:
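A minimal sketch of the head() tip follows. Since data.csv itself is not reproduced here, the text in csv_text is a stand-in built from the Average_Pulse and Calorie_Burnage values used elsewhere in this material, and io.StringIO lets read_csv treat that text like a file.

```python
import io
import pandas as pd

# Stand-in for data.csv (an assumption for this sketch): two columns, ten observations
csv_text = """Average_Pulse,Calorie_Burnage
80,240
85,250
90,260
95,270
100,280
105,290
110,300
115,310
120,320
125,330
"""

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# head() returns the first 5 rows by default
top_rows = health_data.head()
print(top_rows)
```

Passing a number, for example head(3), changes how many rows are returned.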
Data Science - Data Preparation
Data Cleaning
Look at the imported data. As you can see, the data are "dirty" with wrongly or unregistered
values:
• There are some blank fields
• Average pulse of 9 000 is not possible
• 9 000 will be treated as non-numeric, because of the space separator
• One observation of max pulse is denoted as "AF", which does not make sense
• So, we must clean the data in order to perform the analysis.
Data Science - Data Preparation
Data Cleaning
Remove Blank Rows
We see that the non-numeric values (9 000 and AF) are in the same rows with missing values.
Solution: We can remove the rows with missing observations to fix this problem.
When we load a data set using Pandas, all blank cells are automatically converted into
"NaN" values.
So, removing the NaN cells gives us a clean data set that can be analyzed.
We can use the dropna() function to remove the NaNs. axis=0 means that we want to
remove all rows that have a NaN value:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
health_data.dropna(axis=0,inplace=True)
print(health_data)
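To see the effect of dropna() without the original file, here is a small sketch with made-up file contents. The three rows in csv_text are hypothetical; the second one has a blank Max_Pulse field, which pandas reads as NaN.

```python
import io
import pandas as pd

# Hypothetical file contents: the second data row has a blank Max_Pulse field
csv_text = """Duration,Average_Pulse,Max_Pulse
30,80,120
45,85,
45,90,130
"""

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# Count the NaN cells per column before cleaning
print(health_data.isna().sum())

# Without inplace=True, dropna() returns a new data frame and leaves the original unchanged
cleaned = health_data.dropna(axis=0)
print(cleaned)
```

Returning a new data frame instead of using inplace=True keeps the raw data available for comparison; either style removes the same rows.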
Data Science - Data Preparation
Data Categories
To analyze data, we also need to know the types of data we are dealing with.
Data can be split into three main categories:
o Numerical - Contains numerical values. Can be divided into two categories:
Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5 sessions, it is either
2 or 3
Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes
and 20 seconds, or about 7.506 hours
o Categorical - Contains values that cannot be measured up against each other. Example: A color or
a type of training
o Ordinal - Contains categorical data that can be measured up against each other. Example: School
grades where A is better than B and so on
o By knowing the type of your data, you will be able to know what technique to use when analyzing
them.
Data Science - Data Preparation
Data Types
We can use the info() function to list the data types within our data set:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
print(health_data.info())
o/p:
Data Science - Data Preparation
We see that this data set has two different
types of data:
Float64
Object
We cannot use objects to calculate and perform
analysis here. We must convert the type object
to float64 (float64 is a number with a decimal in
Python).
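One possible way to do the conversion is the astype() method of a pandas column. The sketch below assumes the non-numeric values have already been removed, and it uses a small hypothetical data frame whose numbers are stored as text.

```python
import pandas as pd

# Hypothetical cleaned columns that pandas still stores as text
# (shown as dtype object in most pandas versions)
health_data = pd.DataFrame({
    "Average_Pulse": ["80", "85", "90"],
    "Max_Pulse": ["120", "120", "130"],
})
print(health_data.dtypes)

# Convert each text column to float64 so it can be used in calculations
health_data["Average_Pulse"] = health_data["Average_Pulse"].astype(float)
health_data["Max_Pulse"] = health_data["Max_Pulse"].astype(float)
print(health_data.dtypes)
```

astype(float) raises an error if a value such as "AF" is still present, which is why the cleaning step comes first.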
Analyze the Data
When we have cleaned the data set, we can start analyzing the data.
We can use the describe() function in Python to summarize data:
import pandas as pd
health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns',None)
print(health_data.describe())
Count - Counts the number of
observations
Mean - The average value
Std - Standard deviation
(explained in the statistics
chapter)
Min - The lowest value
25%, 50% and 75% are
percentiles (explained in the
statistics chapter)
Max - The highest value
Data Science - Linear Functions
DS Math
Mathematical functions are important to know as a data scientist, because we want to make
predictions and interpret them.
Linear Functions
In mathematics a function is used to relate one variable to another variable.
Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable
to assume that, in general, the calorie burnage will change as the average pulse changes - we say
that the calorie burnage depends upon the average pulse.
Furthermore, it may be reasonable to assume that as the average pulse increases, so will the calorie
burnage. Calorie burnage and average pulse are the two variables being considered.
Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the
dependent variable and the average pulse is the independent variable.
The relationship between a dependent and an independent variable can often be expressed
mathematically using a formula (function).
Data Science - Linear Functions
DS Math
Mathematical functions are important to know as a data scientist, because we want to make
predictions and interpret them.
A linear function has one independent variable (x) and one dependent variable (y), and has the following form:
y = f(x) = ax + b
This function is used to calculate a value for the dependent variable when we choose a value for the
independent variable.
Explanation:
o f(x) = the output (the dependent variable)
o x = the input (the independent variable)
o a = slope = is the coefficient of the independent variable. It gives the rate of change of the dependent
variable
o b = intercept = is the value of the dependent variable when x = 0. It is also the point where the diagonal line
crosses the vertical axis.
Data Science - Linear Functions
DS Math
Linear Function With One Explanatory Variable
A function with one explanatory variable means that we use one variable for prediction.
Let us say we want to predict calorie burnage using average pulse. We have the following formula:
f(x) = 2x + 80
Here, the numbers and variables mean:
o f(x) = The output. This number is where we get the predicted value of Calorie_Burnage
o x = The input, which is Average_Pulse
o 2 = Slope = Specifies how much Calorie_Burnage increases if Average_Pulse increases by one. It
tells us how "steep" the diagonal line is
o 80 = Intercept = A fixed value. It is the value of the dependent variable when x = 0
Data Science - Linear Functions
DS Math
Plotting a Linear Function
The term linearity means a "straight line". So, if you show a linear function graphically, the line
will always be a straight line. The line can slope upwards, downwards, and in some cases may be
horizontal or vertical. Here is a graphical representation of the mathematical function above:
Graph Explanations:
o The horizontal axis is generally called the x-axis. Here,
it represents Average_Pulse.
o The vertical axis is generally called the y-axis. Here, it
represents Calorie_Burnage.
o Calorie_Burnage is a function of Average_Pulse,
because Calorie_Burnage is assumed to be dependent
on Average_Pulse.
o In other words, we use Average_Pulse to predict
Calorie_Burnage.
o The blue (diagonal) line represents the structure of
the mathematical function that predicts calorie
burnage.
Data Science - Plotting Linear Functions
DS Math
The Sports Watch Data Set
Take a look at our health data set:
Plot the Existing Data in Python
Now, we can first plot the values of
Average_Pulse against
Calorie_Burnage using the matplotlib
library.
The plot() function is used to draw the
points x,y as a 2D plot. With
kind='line', it connects them with a line:
import pandas as pd
import matplotlib.pyplot as plt
# Load the sports watch data (the first row holds the column names)
health_data = pd.read_csv("data.csv", header=0, sep=",")
# Draw Average_Pulse (x) against Calorie_Burnage (y) as a line
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
# Start both axes at zero
plt.ylim(ymin=0)
plt.xlim(xmin=0)
plt.show()
Example Explained
o Import the pandas library and the pyplot module of the matplotlib library
o Plot the data from Average_Pulse against
Calorie_Burnage
o kind='line' specifies which type of plot we want.
Here, we want a line
o plt.ylim() and plt.xlim() set the value each axis
starts on. Here, we want both axes to begin
from zero
o plt.show() shows us the output
The code above will produce the following result:
The Graph Output
As we can see, there is a
relationship between
Average_Pulse and
Calorie_Burnage. Calorie_Burnage
increases linearly with
Average_Pulse. It means that we
can use Average_Pulse to predict
Calorie_Burnage.
Data Science - Plotting Linear Functions
DS Math
Why is The Line Not Fully Drawn Down to The y-axis?
The reason is that we do not have observations where Average_Pulse or Calorie_Burnage are
equal to zero. 80 is the first observation of Average_Pulse and 240 is the first observation of
Calorie_Burnage.
Look at the line.
What happens to
calorie burnage if
average pulse
increases from 80 to
90?
Data Science - Plotting Linear Functions
We can use the diagonal line to find the mathematical function to predict
calorie burnage.
DS Math
As it turns out:
o If the average pulse is 80, the calorie burnage is 240
o If the average pulse is 90, the calorie burnage is 260
o If the average pulse is 100, the calorie burnage is 280
o There is a pattern. If average pulse increases by 10, the calorie burnage increases by 20.
Data Science - Slope and
Intercept
Slope and Intercept
Now we will explain how we found
the slope and intercept of our
function:
f(x) = 2x + 80
The image points to the Slope -
which indicates how steep the line
is, and the Intercept - which is the
value of y, when x = 0 (the point
where the diagonal line crosses the
vertical axis). The red line is the
continuation of the blue line from
previous page.
DS Math
Data Science - Slope and Intercept
Find The Slope
The slope is defined as how much calorie burnage increases, if average pulse increases by one. It tells
us how "steep" the diagonal line is.
We can find the slope by dividing the change in calorie burnage by the change in average pulse between two points on the graph.
If the average pulse is 80, the calorie burnage is 240
If the average pulse is 90, the calorie burnage is 260
We see that if average pulse increases by 10, the calorie burnage increases by 20.
Slope = 20/10 = 2
The slope is 2.
DS Math
Data Science - Slope and Intercept
Find The Slope
Mathematically, Slope is Defined as:
Slope = (f(x2) - f(x1)) / (x2 - x1)
f(x2) = Second observation of Calorie_Burnage = 260
f(x1) = First observation of Calorie_Burnage = 240
x2 = Second observation of Average_Pulse = 90
x1 = First observation of Average_Pulse = 80
Slope = (260-240) / (90 - 80) = 2
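Keep the order of the observations consistent: subtract the first observation from the second
in both the numerator and the denominator. If not, the slope gets the wrong sign and
the prediction will not be correct!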
DS Math
Use Python to Find the Slope
Calculate the slope with the following
code:
def slope(x1, y1, x2, y2):
    s = (y2-y1)/(x2-x1)
    return s
print(slope(80,240,90,260))
o/p: 2.0
Data Science - Slope and Intercept
Find The Intercept
The intercept is used to fine-tune the function's ability to predict Calorie_Burnage.
The intercept is where the diagonal line crosses the y-axis, if it were fully drawn.
The intercept is the value of y, when x = 0.
Here, we see that if average pulse (x) is zero, then the calorie burnage (y) is 80.
So, the intercept is 80.
Sometimes, the intercept has a practical meaning. Sometimes not.
Does it make sense that average pulse is zero?
No, you would be dead and you certainly would not burn any calories.
However, we need to include the intercept in order to complete the mathematical function's ability to predict
Calorie_Burnage correctly.
Other examples where the intercept of a mathematical function can have a practical meaning:
Predicting next year's revenue by using marketing expenditure (How much revenue will we have next year, if
marketing expenditure is zero?). It is reasonable to assume that a company will still have some revenue even if
it does not spend money on marketing.
Fuel usage with speed (How much fuel do we use if speed is equal to 0 mph?). A car that uses gasoline will still
use fuel when it is idle.
DS Math
Data Science - Slope and Intercept
Find the Slope and Intercept Using Python
The np.polyfit() function returns the slope and intercept.
If we proceed with the following code, we can both get the slope and intercept from the function.
import pandas as pd
import numpy as np
health_data = pd.read_csv("data.csv", header=0, sep=",")
x = health_data["Average_Pulse"]
y = health_data["Calorie_Burnage"]
slope_intercept = np.polyfit(x,y,1)
print(slope_intercept)
o/p: [ 2. 80.]
DS Math
Example Explained:
Isolate the variables Average_Pulse (x) and
Calorie_Burnage (y) from health_data.
Call the np.polyfit() function.
The last parameter of the function
specifies the degree of the function, which
in this case is "1".
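The same call can be tried without the CSV file. The sketch below types in the ten Average_Pulse and Calorie_Burnage values used in this material as arrays; unpacking the result into slope and intercept is a choice made here, relying on np.polyfit() returning the coefficients from the highest power down.

```python
import numpy as np

# The ten observations of the small sports watch data set
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])      # Average_Pulse
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])  # Calorie_Burnage

# Degree 1 fits a straight line; the coefficients come back highest power first,
# so the first value is the slope and the second is the intercept
slope, intercept = np.polyfit(x, y, 1)

print(round(slope, 2), round(intercept, 2))
```

Because the fit is computed numerically, the raw values can differ from 2 and 80 by a tiny rounding error, which is why the sketch rounds before printing.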
DS Math
We have now calculated the slope (2)
and the intercept (80). We can write the
mathematical function as follows:
Predict Calorie_Burnage by using a
mathematical expression:
f(x) = 2x + 80
Data Science - Slope and Intercept
Task:
Now, we want to predict calorie burnage if average pulse is 135.
Remember that the intercept is a constant. A constant is a number that does not change.
We can now substitute the input x with 135:
f(135) = 2 * 135 + 80 = 350
If average pulse is 135, the calorie burnage is 350.
Define the Mathematical Function in Python
Here is the exact same mathematical function, but in Python. The function returns 2*x + 80, with x
as the input:
#Try to replace x with 140 and 150.
def my_function(x):
    return 2*x + 80
print(my_function(135))
o/p: 350
DS Math
Data Science - Slope and Intercept
Plot a New Graph in Python
Here, we plot the same graph as earlier, but with the axes
formatted a little.
The maximum value of the y-axis is now 400, and that of the x-axis is 150:
import pandas as pd
import matplotlib.pyplot as plt
health_data = pd.read_csv("data.csv", header=0, sep=",")
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
plt.ylim(ymin=0, ymax=400)
plt.xlim(xmin=0, xmax=150)
plt.show()
DS Math
Example Explained
Import the pyplot module of the matplotlib library
Plot the data from Average_Pulse against Calorie_Burnage
kind='line' tells us which type of plot we want. Here, we want to
have a straight line
plt.ylim() and plt.xlim() tells us what value we want the axis to start
and stop on.
plt.show() shows us the output
Introduction to Statistics
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and
presentation of data. When we have created a model for prediction, we must assess the prediction's
reliability.
Statistics is a method of interpreting, analyzing, and summarizing data.
It is commonly divided into two types:
descriptive and inferential statistics.
We analyze and interpret data based on how it is represented, for example in pie charts, bar graphs,
or tables.
DS- Statistics
Descriptive Statistics
import pandas as pd
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
print (full_health_data.describe())
DS- Statistics
Statistics Percentiles
25%, 50% and 75% - Percentiles
Percentiles are used in statistics to give you a number that describes the value that a given percent of
the values are lower than.
DS- Statistics
Statistics Percentiles
Let us try to explain it by some examples, using Average_Pulse.
The 25% percentile of Average_Pulse means that 25% of all of the training sessions have an average
pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of all of the
training sessions have an average pulse of 100 beats per minute or higher
The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average pulse
of 111 or lower. If we flip the statement, it means that 25% of all of the training sessions have an
average pulse of 111 beats per minute or higher
Task: Find the 10% percentile for Max_Pulse
The following example shows how to do it in Python:
DS- Statistics
Statistics Percentiles
import pandas as pd
import numpy as np
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
Max_Pulse= full_health_data["Max_Pulse"]
percentile10 = np.percentile(Max_Pulse, 10)
print(percentile10)
o/p: 120.00
Max_Pulse = full_health_data["Max_Pulse"] - Isolate the variable Max_Pulse from the full health
data set.
np.percentile() is used to define that we want the 10% percentile from Max_Pulse.
The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a
Max_Pulse of 120 or lower.
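To experiment with percentiles without the full data set, the sketch below applies np.percentile() to the ten Average_Pulse values of the small data set. These results therefore differ from the figures quoted for the full data set, and they rely on NumPy's default linear interpolation between neighboring values.

```python
import numpy as np

# Average_Pulse from the small data set with 10 observations
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# np.percentile() interpolates linearly between neighboring values by default
p25 = np.percentile(average_pulse, 25)
p50 = np.percentile(average_pulse, 50)
p75 = np.percentile(average_pulse, 75)

print(p25, p50, p75)
```

The 50% percentile is the median, so half of the observations lie at or below it.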
DS- Statistics
Statistics Standard Deviation
Standard Deviation
Standard deviation is a number that describes how spread out the observations are
A mathematical function will have difficulties in predicting precise values, if the observations are
"spread". Standard deviation is a measure of uncertainty.
A low standard deviation means that most of the numbers are close to the mean (average) value.
A high standard deviation means that the values are spread out over a wider range.
DS- Statistics
Statistics Standard Deviation
Standard Deviation
import pandas as pd
import numpy as np
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
std = np.std(full_health_data)
print(std)
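For a single variable, the calculation can be sketched on a plain list. The example uses the ten Average_Pulse values of the small data set. Note that np.std() computes the population standard deviation (it divides by the number of observations), whereas the pandas std() method divides by one less than that by default.

```python
import numpy as np

# Average_Pulse from the small data set with 10 observations
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Population standard deviation (ddof=0 is the NumPy default)
std = np.std(average_pulse)

print(round(std, 2))
```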
DS- Statistics
Statistics Standard Deviation
Coefficient of Variation
The coefficient of variation is used to get an idea of how large the standard deviation is.
Mathematically, the coefficient of variation is defined as:
Coefficient of Variation = Standard Deviation / Mean
We can do this in Python if we proceed with the following code:
import numpy as np
cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)
o/p:
DS- Statistics
We see that the variables Duration, Calorie_Burnage and
Hours_Work have a high standard deviation relative to their mean, compared to
Max_Pulse, Average_Pulse and Hours_Sleep.
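As a small sketch of the formula, the coefficient of variation can be computed column by column with pandas. The data frame below contains only the two columns of the small data set whose values appear in this material, and ddof=0 is passed so that the result matches np.std().

```python
import pandas as pd

# Two columns of the small data set with 10 observations
health_data = pd.DataFrame({
    "Average_Pulse": [80, 85, 90, 95, 100, 105, 110, 115, 120, 125],
    "Calorie_Burnage": [240, 250, 260, 270, 280, 290, 300, 310, 320, 330],
})

# Coefficient of variation = standard deviation / mean, computed per column
cv = health_data.std(ddof=0) / health_data.mean()

print(cv.round(3))
```

Because the result is a ratio, it has no unit, which makes it possible to compare the spread of variables measured on different scales.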
Data Science - Statistics Variance
Variance
Variance is another number that indicates how spread out the values are.
In fact, if you take the square root of the variance, you get the standard deviation. Or the other way around, if
you multiply the standard deviation by itself, you get the variance!
We will first use the data set with 10 observations to give an example of how we can calculate the variance:
DS- Statistics
Tip: Variance is often represented by the symbol Sigma Square: σ^2
Data Science - Statistics Variance
Variance
Step 1 to Calculate the Variance: Find the
Mean
We want to find the variance of
Average_Pulse.
1. Find the mean:
(80+85+90+95+100+105+110+115+120+125) / 10
= 102.5
The mean is 102.5
DS- Statistics
Step 2: For Each Value - Find the Difference
From the Mean
2. Find the difference from the mean for each
value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5
Step 3: For Each Difference - Find the
Square Value
3. Find the square value for each
difference:
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Note: We must square the values to get
the total spread.
DS- Statistics
Step 4: The Variance is the Average Number
of These Squared Values
4. Sum the squared values and find the
average:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25
+ 56.25 + 156.25 + 306.25 + 506.25) / 10 =
206.25
The variance is 206.25.
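The four steps above can be retraced in plain Python, without any library. This is only a sketch of the manual calculation; the variable names are chosen here for illustration.

```python
# Average_Pulse from the small data set with 10 observations
values = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Step 1: find the mean
mean = sum(values) / len(values)

# Step 2: find the difference from the mean for each value
differences = [v - mean for v in values]

# Step 3: square each difference
squares = [d ** 2 for d in differences]

# Step 4: the variance is the average of the squared differences
variance = sum(squares) / len(squares)

print(mean, variance)  # 102.5 206.25
```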
Data Science - Statistics Variance
Variance
Use Python to Find the Variance of health_data
We can use the var() function from Numpy to find the variance (remember that we now use the first data set
with 10 observations):
import numpy as np
var = np.var(health_data)
print(var)
o/p:
DS- Statistics
Here we calculate the variance for each column for the
full data set:
import numpy as np
var_full = np.var(full_health_data)
print(var_full)
o/p:
Correlation
Correlation measures the relationship between two variables.
We mentioned that the purpose of a function is to predict a value by converting an input (x) to an output
(f(x)). We can also say that a function uses the relationship between two variables for
prediction.
DS - Statistics Correlation
Correlation Coefficient
The correlation coefficient measures the relationship between two variables.
The correlation coefficient can never be less than -1 or higher than 1.
o 1 = there is a perfect linear relationship between the variables (like Average_Pulse against
Calorie_Burnage)
o 0 = there is no linear relationship between the variables
o -1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours
worked leads to higher calorie burnage during a training session)
Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
We will use a scatter plot to visualize the relationship between Average_Pulse and Calorie_Burnage
(we have used the small data set of the sports watch with 10 observations).
This time we want scatter plots, so we change kind to "scatter":
DS - Statistics Correlation
import matplotlib.pyplot as plt
health_data.plot(x ='Average_Pulse',
y='Calorie_Burnage', kind='scatter')
plt.show()
Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
DS - Statistics Correlation
We have plotted fictional data here. The x-axis
represents the amount of hours worked at our
job before a training session. The y-axis is
Calorie_Burnage.
If we work longer hours, we tend to have lower
calorie burnage because we are exhausted
before the training session.
The correlation coefficient here is -1.
import pandas as pd
import matplotlib.pyplot as plt
negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1],
'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]}
negative_corr = pd.DataFrame(data=negative_corr)
negative_corr.plot(x ='Hours_Work_Before_Training', y='Calorie_Burnage',
kind='scatter')
plt.show()
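The coefficient for this fictional data can also be checked numerically. The sketch below uses np.corrcoef(), which returns a 2x2 matrix; the value at position [0, 1] is the correlation between the two variables.

```python
import numpy as np

# The same fictional data as in the scatter plot example
hours_work = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
calorie_burnage = [220, 240, 260, 280, 300, 320, 340, 360, 380, 400]

# np.corrcoef() returns the full correlation matrix; pick the off-diagonal entry
r = np.corrcoef(hours_work, calorie_burnage)[0, 1]

print(round(r, 2))
```

Every hour less of work adds exactly 20 calories in this made-up data, so the points lie on one straight, downward-sloping line.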
Example of No Linear Relationship (Correlation coefficient = 0)
DS - Statistics Correlation
Here, we have plotted Max_Pulse against Duration
from the full_health_data set.
As you can see, there is no linear relationship
between the two variables. It means that a longer
training session does not lead to a higher Max_Pulse.
The correlation coefficient here is 0.
import matplotlib.pyplot as plt
full_health_data.plot(x ='Duration', y='Max_Pulse', kind='scatter')
plt.show()
o/p:
Correlation Matrix
A matrix is an array of numbers arranged in rows and columns.
A correlation matrix is simply a table showing the correlation coefficients between variables.
Here, the variables are represented in the first row, and in the first column:
D S - Statistics Correlation Matrix
The table here has used data from the full health data set.
Observations:
We observe that Duration and Calorie_Burnage are closely
related, with a correlation coefficient of 0.89. This makes
sense, as the longer we train, the more calories we burn.
We observe that there is almost no linear relationship
between Average_Pulse and Calorie_Burnage (correlation
coefficient of 0.02).
Can we conclude that Average_Pulse does not affect
Calorie_Burnage? No. We will come back to answer this
question later!
Correlation Matrix in Python
We can use the pandas corr() method on a DataFrame to create a correlation matrix. We also use the round()
function to round the output to two decimals:
import pandas as pd
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
Corr_Matrix = round(full_health_data.corr(),2)
print(Corr_Matrix)
Using a Heatmap
We can use a Heatmap to Visualize the Correlation Between Variables:
The closer the correlation coefficient is to 1, the
greener the squares get.
The closer the correlation coefficient is to -1, the
browner the squares get.
Use Seaborn to Create a Heatmap
We can use the Seaborn library to create a correlation heat map (Seaborn is a
visualization library based on matplotlib):
import matplotlib.pyplot as plt
import seaborn as sns
correlation_full_health = full_health_data.corr()
axis_corr = sns.heatmap(
correlation_full_health,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(50, 500, n=500),
square=True
)
plt.show()
o Example Explained:
o Import the library seaborn as sns.
o Use the full_health_data set.
o Use sns.heatmap() to tell Python that we want a heatmap to
visualize the correlation matrix.
o Use the correlation matrix. Define the maximal and minimal
values of the heatmap. Define that 0 is the center.
o Define the colors with sns.diverging_palette. n=500 means
that we want 500 types of color in the same color palette.
o square = True means that we want to see squares.
Correlation Does Not Imply Causality
Correlation measures the numerical relationship between two variables.
A high correlation coefficient (close to 1) does not mean that we can conclude with certainty
that there is an actual relationship between the two variables.
A classic example:
During the summer, the sale of ice cream at a beach increases
Simultaneously, drowning accidents also increase
Does this mean that an increase in ice cream sales is a direct cause of increased drowning
accidents?
The Beach Example in Python
Here, we constructed a fictional data set for you to try:
D S - Statistics Correlation vs. Causality
import pandas as pd
import matplotlib.pyplot as plt
Drowning_Accident = [20,40,60,80,100,120,140,160,180,200]
Ice_Cream_Sale = [20,40,60,80,100,120,140,160,180,200]
Drowning = {"Drowning_Accident": Drowning_Accident,
"Ice_Cream_Sale": Ice_Cream_Sale}
Drowning = pd.DataFrame(data=Drowning)
Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()
correlation_beach = Drowning.corr()
print(correlation_beach)
Correlation vs Causality - The Beach Example
In other words: can we use ice cream sales to predict drowning accidents?
The answer is - Probably not.
It is likely that these two variables are correlated only because both increase during the summer,
when more people are at the beach, and not because one causes the other.
What causes drowning then?
Unskilled swimmers
Waves
Cramp
Seizure disorders
Lack of supervision
Alcohol (mis)use
etc.
Let us reverse the argument:
Does a low correlation coefficient (close to zero) mean that change in x does not affect y?
Back to the question:
Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation
coefficient?
The answer is no.
There is an important difference between correlation and causality:
Correlation is a number that measures how closely the data are related
Causality is the conclusion that x causes y.
Tip: Always reflect critically on the concept of causality when making predictions!
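A low correlation coefficient only rules out a linear relationship. The sketch below uses made-up values where y depends entirely on x (y is x squared), yet the coefficient is 0. The helper name pearson_r and the data are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data: y is fully determined by x, but the relationship is not linear.
x = list(range(-5, 6))        # -5, -4, ..., 5
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(round(r, 2))
```

The falling left half and the rising right half of the curve cancel each other out, so the coefficient says nothing about how strongly x affects y here.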
Statistics gives us methods of gaining knowledge from data.
What is Statistics Used for?
Statistics is used in all kinds of science and business
applications.
Statistics gives us more accurate knowledge which
helps us make better decisions.
Statistics can focus on making predictions about
what will happen in the future. It can also focus on
explaining how different things are connected.
Typical Steps of Statistical Methods
The typical steps are:
o Gathering data
o Describing and visualizing data
o Making conclusions
It is important to keep all three steps in mind for any questions we want more knowledge
about.
Knowing which types of data are available can tell you what kinds of questions you can
answer with statistical methods.
Knowing which questions you want to answer can help guide what sort of data you need. A
lot of data might be available, and knowing what to focus on is important.
How is Statistics Used?
Statistics can be used to explain things in a precise way. You can use it to understand and make
conclusions about the group that you want to know more about. This group is called
the population.
• A population could be many different kinds of groups. It could be:
• All of the people in a country
• All the businesses in an industry
• All the customers of a business
• All people who play football and are older than 45
And so on. It just depends on what you want to know about.
Gathering data about the population will give you a sample. This is a part of the whole
population. Statistical methods are then used on that sample.
The results of the statistical methods from the sample are used to make conclusions about the
population.
Important Concepts in Statistics
o Predictions and Explanations
o Populations and Samples
o Parameters and Sample Statistics
o Sampling Methods
o Data Types
o Measurement Level
o Descriptive Statistics
o Random Variables
o Univariate and Multivariate Statistics
o Probability Calculation
o Probability Distributions
o Statistical Inference
o Parameter Estimation
o Hypothesis Testing
o Correlation
o Regression Analysis
o Causal Inference
Statistics and Programming:
Statistical analysis is typically done with computers. Small amounts of data can be
analyzed reasonably well without computers.
Historically, all data analysis was performed manually. It was time-consuming and
prone to errors.
Nowadays, programming and software are typically used for data analysis.
In this course, we will see examples of code to do statistics with the programming
languages Python and R.
Statistics - Describing Data
Describing data is typically the second step of statistical analysis after gathering data.
Descriptive Statistics
The information (data) from your sample or population can be visualized with graphs
or summarized by numbers. This will show key information in a simpler way than just
looking at raw data. It can help us understand how the data is distributed.
Graphs can visually show the data distribution.
Examples of graphs include:
o Histograms
o Pie charts
o Bar graphs
o Box plots
Some graphs have a close connection to numerical summary statistics. Calculating those gives
us the basis of these graphs.
For example, a box plot visually shows the quartiles of a data distribution.
Quartiles are the values that split the ordered data into four equal-sized parts, or quarters. A quartile is one type of
summary statistic.
Summary statistics take a large amount of information and sum it up in a few key values.
Numbers are calculated from the data which also describe the shape of the distributions.
These are individual 'statistics'.
Statistics - Making Conclusions
Using statistics to make conclusions about a population is called statistical
inference.
Statistics from the data in the sample are used to make conclusions about the
whole population. This is a type of statistical inference.
Probability theory is used to calculate the certainty that those statistics also
apply to the population.
When using a sample, there will always be some uncertainty about what the
data looks like for the population.
Uncertainty is often expressed as confidence intervals.
Confidence intervals are numerical ways of showing how likely it is that the true
value of this statistic is within a certain range for the population.
Hypothesis testing is another way of checking if a statement about a
population is true. More precisely, it checks how likely it is that a hypothesis is
true, based on the sample data.
Some examples of statements or questions that can be checked with hypothesis
testing:
Are people in the Netherlands taller than people in Denmark?
Do people prefer Pepsi or Coke?
Does a new medicine cure a disease?
Causal Inference
Causal inference is used to investigate if something causes another thing.
For example: Does rain make plants grow?
If we think two things are related we can investigate to see if they correlate.
Statistics can be used to find out how strong this relation is.
Even if things are correlated, finding out if something is caused by other things
can be difficult. It can be done with good experimental design or other special
statistical techniques.
Note: Good experimental design is often difficult to achieve because of ethical concerns or other practical
reasons.
Statistics - Prediction and Explanation
Some types of statistical methods are focused on predicting what will happen.
Other types of statistical methods are focused on explaining how things are connected.
Prediction
Some statistical methods are not focused on explaining how things are connected. Only the
accuracy of prediction is important.
Many statistical methods are successful at predicting without giving insight into how things are
connected.
Some types of machine learning let computers do the hard work, but the way they predict is
difficult to understand. These approaches can also be vulnerable to mistakes if the circumstances
change, since how they work is less clear.
Note: Predictions about future events are called forecasts. Not all predictions are about the future.
Some predictions can be about something else that is unknown, even if it is not in the future.
Explanation
Different statistical methods are often used for explaining how things are connected. These
statistical methods may not make good predictions.
These statistical methods often explain only small parts of the whole situation. But, if you only
want to know how a few things are connected, the rest might not matter.
If these methods accurately explain how all the relevant things are connected, they will also be
good at prediction. But managing to explain every detail is often challenging.
Sometimes we are specifically interested in figuring out if one thing causes another. This is called
causal inference.
If we are looking at complicated situations, many things are connected. To figure out what causes
what, we need to untangle every way these things are connected.
Statistics - Population and Samples
Population: Everything in the group that we want to learn about.
Sample: A part of the population.
For good statistical analysis, the sample needs to be as "similar" as possible to the
population. If they are similar enough, we say that the sample is representative of the
population.
The sample is used to make conclusions about the whole population. If the sample is not
similar enough to the whole population, the conclusions could be useless.
Statistics - Parameters and Statistics
The terms 'parameter' and (sample) 'statistic' refer to key concepts that are closely related in
statistics.
They are also directly connected to the concepts of populations and samples.
Parameter: A number that describes something about the whole population.
Sample statistic: A number that describes something about the sample.
The parameters are the key things we want to learn about. The parameters are usually
unknown.
Sample statistics give us estimates for parameters.
There will always be some uncertainty about how accurate estimates are. More certainty
gives us more useful knowledge.
For every parameter we want to learn about we can get a sample and calculate a sample
statistic, which gives us an estimate of the parameter.
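As a minimal sketch of this idea, the code below builds a made-up population, draws a random sample from it, and compares the population mean (the parameter) with the sample mean (the sample statistic). The ages, the sizes and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 people (made-up values).
population = [random.randint(18, 90) for _ in range(10_000)]

# A random sample of 200 people drawn from that population.
sample = random.sample(population, 200)

parameter = statistics.mean(population)     # population mean, usually unknown
sample_statistic = statistics.mean(sample)  # estimate of the parameter

print(round(parameter, 1), round(sample_statistic, 1))
```

In practice we only see the sample, so the second number is all we have. The two values should be close but not identical, which is the uncertainty described above.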
Some Important Examples
Mean, median and mode are different types of
averages (typical values in a population).
For example:
The typical age of people in a country
The typical profits of a company
The typical range of an electric car
Variance and standard deviation are two
types of values describing how spread out
the values are.
A single class of students in a school would
usually be about the same age. The age of
the students will have low variance and
standard deviation.
A whole country will have people of all kinds
of different ages. The variance and standard
deviation of age in the whole country would
then be bigger than in a single school grade.
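The comparison between a school class and a whole country can be sketched with Python's statistics module. Both lists of ages are made up for illustration, and each list is treated as a complete population, which is why pvariance() and pstdev() are used.

```python
import statistics

# Hypothetical ages (made-up values for illustration).
class_ages = [15, 15, 16, 15, 16, 16, 15, 16]
country_ages = [2, 9, 17, 25, 34, 41, 48, 56, 63, 71, 80, 88]

for name, ages in [("class", class_ages), ("country", country_ages)]:
    variance = statistics.pvariance(ages)  # population variance
    std_dev = statistics.pstdev(ages)      # population standard deviation
    print(name, round(variance, 2), round(std_dev, 2))
```

The class values are all within one year of each other, so both numbers are small. The country values are spread over many decades, so both numbers are much larger.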
Statistics - Study Types
A statistical study can be a part of the process of gathering data.
There are different types of studies. Some are better than others, but they might be harder to do.
Main Types of Statistical Studies
The main types of statistical studies are observational and experimental studies.
We are often interested in knowing if something is the cause of another thing.
Experimental studies are generally better than observational studies for investigating this, but
usually require more effort.
An observational study is when we observe and gather data without changing anything.
Experimental Studies
In an experimental study, the circumstances around the sample are changed. Usually, we compare
two groups from a population and these two groups are treated differently.
One example can be a medical study to see if a new medicine is effective.
One group receives the medicine and the other does not. These are the different circumstances
around those samples.
We can compare the health of both groups afterwards and see if the results are different.
Experimental studies can allow us to investigate causal relationships. A well designed experimental
study can be useful since it can isolate the relationship we are interested in from other effects.
Then we can be more confident that we are measuring the true effect.
Statistics - Sample Types
A study needs participants and there are different ways of gathering them.
Some methods are better than others, but they might be more difficult.
Different Types of Sampling Methods:
Random Sampling
A random sample is where every member of the population has an equal chance to be chosen.
Random sampling is the best. But, it can be difficult, or impossible, to make sure that it is completely
random.
Note: Every other sampling method is compared to how close it is to a random sample - the closer,
the better.
Convenience Sampling
A convenience sample is where the participants that are the easiest to reach are chosen.
Note: Convenience sampling is the easiest to do.
In many cases this sample will not be similar enough to the population, and the conclusions can
potentially be useless.
Systematic Sampling
A systematic sample is where the participants are chosen by some regular system.
For example:
The first 30 people in a queue
Every third on a list
The first 10 and the last 10
Stratified Sampling
A stratified sample is where the population is split into smaller groups called 'strata'.
The 'strata' can, for example, be based on demographics, like:
Different age groups
Professions
Stratification of a sample is the first step. Another sampling method (like random sampling) is used for
the second step of choosing participants from all of the smaller groups (strata).
Clustered Sampling
A clustered sample is where the population is split into smaller groups called 'clusters'.
The clusters are usually natural, like different cities in a country.
The clusters are chosen randomly for the sample.
All members of the clusters can participate in the sample, or members can be chosen
randomly from the clusters in a third step.
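The sampling methods above can be sketched side by side in Python. Everything in this example is hypothetical: the population of 100 people, the city and age group labels, and the sample sizes are made up for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 people (all values are made up).
cities = ["City A", "City B", "City C", "City D"]
age_groups = ["18-29", "30-49", "50+"]
population = [
    {"id": i, "city": cities[i % 4], "age_group": age_groups[i % 3]}
    for i in range(100)
]

# Random sampling: every member has an equal chance of being chosen.
random_sample = random.sample(population, 12)

# Convenience sampling: take whoever is easiest to reach (here, the first 12).
convenience_sample = population[:12]

# Systematic sampling: follow a regular rule (here, every third person on the list).
systematic_sample = population[::3]

# Stratified sampling: split into strata by age group, then sample randomly in each.
stratified_sample = []
for group in age_groups:
    stratum = [p for p in population if p["age_group"] == group]
    stratified_sample.extend(random.sample(stratum, 4))

# Clustered sampling: choose whole clusters (cities) at random and keep all their members.
chosen_cities = random.sample(cities, 2)
cluster_sample = [p for p in population if p["city"] in chosen_cities]

samples = {
    "random": random_sample,
    "convenience": convenience_sample,
    "systematic": systematic_sample,
    "stratified": stratified_sample,
    "clustered": cluster_sample,
}
for name, sample in samples.items():
    print(name, len(sample))
```

Only the random, stratified and clustered samples depend on chance. The convenience and systematic samples are fully determined by the order of the list, which is one reason they can differ from the population.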
Statistics - Data Types
Data can be of different types, which require different types of statistical methods to analyze.
Different types of data
There are two main types of data: Qualitative (or 'categorical') and quantitative (or 'numerical').
These main types also have different sub-types depending on their measurement level.
Qualitative Data
Information about something that can be sorted into different categories that can't be described
directly by numbers.
Examples:
• Brands
• Nationality
• Professions
With categorical data we can calculate statistics like proportions. For example, the proportion of
Indian people in the world, or the percent of people who prefer one brand to another.
Quantitative Data
Information about something that is described by numbers.
Examples:
• Income
• Age
• Height
With numerical data we can calculate statistics like the average income in a country, or the range
of heights of players in a football team.
Statistics - Measurement Levels
Different data types have different measurement levels.
Measurement levels are important for what types of statistics can be calculated and how to best
present the data.
The main types of data are Qualitative (categories) and Quantitative (numerical). These are further
split into the following measurement levels.
These measurement levels are also called measurement 'scales'.
Nominal Level
Categories (qualitative data) without any order.
Examples:
• Brand names
• Countries
• Colors
Ordinal level
Categories that can be ordered (from low to high), but the precise "distance" between each is not
meaningful.
Examples:
• Letter grade scales from F to A
• Military ranks
• Level of satisfaction with a product
Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And, is the grade B
also twice as good as C?
Exactly how much distance there is between grades is not clear and precise. If the grades are based on
amounts of points on a test, you can say that there is a precise "distance" on the point scale, but not
the grades themselves.
Interval Level
Data that can be ordered and the distance between them is objectively meaningful. But there is no
natural 0-value where the scale originates.
Examples:
Years in a calendar
Temperature measured in Fahrenheit
Note: Interval scales are usually invented by people, like degrees of temperature.
0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between each degree (for
every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the two scales do not agree on where 0 degrees
is.
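The relationship between the two scales can be sketched with the standard conversion formula, F = C * 9 / 5 + 32. The function name is an assumption made for illustration. The example shows that distances are consistent on both scales, while ratios are not, because neither scale has a natural zero.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

for c in (0, 10, 20):
    print(c, celsius_to_fahrenheit(c))

# Distances are consistent: a 10-degree step in Celsius is always an 18-degree step in Fahrenheit.
step_low = celsius_to_fahrenheit(10) - celsius_to_fahrenheit(0)
step_high = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)

# Ratios are not: 20 is twice 10 in Celsius, but 68 is not twice 50 in Fahrenheit.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

print(step_low, step_high, ratio_celsius, ratio_fahrenheit)
```

This is why it is meaningful to say that one day was 10 degrees warmer than another, but not that it was "twice as warm".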
Statistics - Descriptive Statistics
Descriptive statistics gives us insight into data without having to look at all of it in
detail.
Key Features to Describe about Data:
Getting a quick overview of how the data is distributed is an important step in statistical methods.
We calculate key numerical values about the data that tell us about the distribution of the data.
We also draw graphs showing visually how the data is distributed.
Key Features of Data:
• Where is the center of the data? (location)
• How much does the data vary? (scale)
• What is the shape of the data? (shape)
These can be described by summary statistics (numerical values).
The Center of the Data
The center of the data is where most of the values are concentrated.
Different kinds of averages, like mean, median and mode, are measures of the
center.
Note: Measures of the center are also called location parameters, because they tell us something
about where data is 'located' on a number line.
The Variation of the Data
The variation of the data is how spread out the data are around the center.
Statistics like standard deviation, range and quartiles are measures of variation.
Note: Measures of variation are also called scale parameters.
The Shape of the Data
The shape of the data can refer to how the data are bunched up on either side
of the center.
Statistics like skew describe if the right or left side of the center is bigger. Skew is
one type of shape parameter.
Frequency Tables
One typical way of presenting data is with frequency tables.
A frequency table counts and orders data into a table. Typically, the data will need
to be sorted into intervals.
Frequency tables are often the basis for making graphs to visually present the data.
Visualizing Data
Different types of graphs are used for different kinds of data. For example:
• Pie charts for qualitative data
• Histograms for quantitative data
• Scatter plots for bivariate data
Graphs often have a close connection to numerical summary statistics.
For example, box plots show where the quartiles are.
Quartiles also tell us where the minimum and maximum values, range, interquartile
range, and median are.
Statistics - Frequency Tables
We can see that there is only
one winner from ages 10 to
19, and that the highest
number of winners are in their
60s.
Relative Frequency Tables
Relative frequency means the number of
times a value appears in the data
compared to the total amount. A
percentage is a relative frequency.
Here are the relative frequencies of ages
of Nobel Prize winners. Now, all the
frequencies are divided by the total (934)
to give percentages.
Cumulative Frequency Tables
Cumulative frequency counts up to a
particular value.
Here are the cumulative frequencies of
ages of Nobel Prize winners. Now, we can
see how many winners have been younger
than a certain age.
Cumulative frequency tables can also be
made with relative frequencies
(percentages).
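The three kinds of frequency table can be sketched in a few lines of Python. The ages below are made up for illustration and are not the Nobel Prize data. Each age is sorted into a 10-year interval, then the counts are turned into relative and cumulative frequencies.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ages (made-up values, not the Nobel Prize data).
ages = [19, 34, 45, 47, 52, 55, 58, 61, 63, 64, 66, 68, 72, 75, 81]

# Sort each age into a 10-year interval, labelled by its lower bound (10, 20, ...).
frequency = Counter((age // 10) * 10 for age in ages)
intervals = sorted(frequency)
total = len(ages)

counts = [frequency[start] for start in intervals]
relative = [count / total for count in counts]  # share of the total
cumulative = list(accumulate(counts))           # running total of the counts

for start, count, share, running in zip(intervals, counts, relative, cumulative):
    print(f"{start}-{start + 9}: {count}  {share:.1%}  {running}")
```

The last cumulative value always equals the total number of observations, and the relative frequencies always add up to 100%.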
Statistics - Histograms
A histogram visually presents quantitative data.
A histogram is a widely used graph to show the
distribution of quantitative (numerical) data.
It shows the frequency of values in the data,
usually in intervals of values. Frequency is the
number of times that value appeared in the data.
Each interval is represented with a bar, placed
next to the other intervals on a number line.
The height of the bar represents the frequency of
values in that interval.
Here is a histogram of the age of all 934 Nobel
Prize winners up to the year 2020:
This histogram uses age intervals from 10 to 19, 20
to 29, and so on.
Note: Histograms are similar to bar graphs, which
are used for qualitative data.
Bin Width
The intervals of values are often called 'bins'. And
the length of an interval is called 'bin width'.
We can choose any width. It is best with a bin width
that shows enough detail without being confusing.
Here is a histogram of the same Nobel Prize winner
data, but with bin widths of 5 instead of 10:
This histogram uses age intervals from 15 to
19, 20 to 24, 25 to 29, and so on.
Smaller intervals give a more detailed look at the
distribution of the age values in the data.
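The effect of the bin width can be sketched with a small text histogram. The ages and the helper name bin_counts are assumptions made for illustration; the same made-up values are counted once with a bin width of 10 and once with a bin width of 5.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower bound."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))

# Hypothetical ages (made-up values, not the Nobel Prize data).
ages = [19, 34, 45, 47, 52, 55, 58, 61, 63, 64, 66, 68, 72, 75, 81]

for width in (10, 5):
    print(f"bin width {width}")
    # Only bins that contain at least one value are listed.
    for start, count in bin_counts(ages, width).items():
        print(f"{start:>3}-{start + width - 1:<3} {'#' * count}")
```

With a bin width of 10, the five values in the 60s form a single bar. With a bin width of 5, the same values split into 60 to 64 and 65 to 69, which shows more detail.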
Statistics - Bar Graphs
A bar graph visually presents qualitative data.
Bar graphs are used to show the distribution of
qualitative (categorical) data.
They show the frequency of values in the data.
Frequency is the number of times that value
appeared in the data.
Each category is represented with a bar. The height
of the bar represents the frequency of values from
that category in the data.
Here is a bar graph of the number of people who
have won a Nobel Prize in each category up to the
year 2020:
Some of the categories have existed longer
than others. Multiple winners are also more
common in some categories. So there is a
different number of winners in each
category.
Note: Bar graphs are similar to histograms,
which are used for quantitative data.
Statistics - Pie Charts
A pie chart visually presents qualitative data.
Pie charts are used to show the distribution of
qualitative (categorical) data.
They show the frequency or relative frequency of values
in the data.
Frequency is the number of times that value appeared
in the data. Relative frequency is the percentage of the
total.
Each category is represented with a slice in the 'pie'
(circle). The size of each slice represents the frequency
of values from that category in the data.
Here is a pie chart of the number of people who have
won a Nobel Prize in each category up to the year 2020:
This pie chart shows relative frequency. So
each slice is sized by the percentage for
each category.
Some of the categories have existed
longer than others. Multiple winners are
also more common in some categories. So
there is a different number of winners in
each category.
Statistics - Box Plots
A box plot is a graph used to show key features of quantitative data.
A box plot is a good way to show many important features of quantitative
(numerical) data.
It shows the median of the data. This is the middle value of the data and one
type of an average value.
It also shows the range and the quartiles of the data. This tells us something
about how spread out the data is.
Note: Box plots are also called 'box and whiskers plots'.
Here is a box plot of the age of all the Nobel Prize winners up to the year
2020:
The median is the red line through
the middle of the 'box'. We can see
that this is just above the number
60 on the number line below. So
the middle value of age is 60 years.
The left side of the box is the
1st quartile. This is the value that
separates the first quarter, or 25%
of the data, from the rest. Here,
this is 51 years.
The right side of the box is the
3rd quartile. This is the value that
separates the first three quarters,
or 75% of the data, from the rest.
Here, this is 69 years.
The distance between the sides of
the box is called the inter-quartile
range (IQR). This tells us where the
'middle half' of the values are.
Here, half of the winners were
between 51 and 69 years.
The ends of the lines from the box
at the left and the right are the
minimum and maximum values in
the data. The distance between
these is called the range.
The youngest winner was 17 years
old, and the oldest was 97 years
old. So the range of the age of
winners was 80 years.
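The numbers behind a box plot can be calculated with Python's statistics module. The ages below are made up for illustration and are not the Nobel Prize data; they were chosen so that the results match the values read from the plot above. The method="inclusive" setting is an assumption about how the quartiles should be interpolated, and other methods can give slightly different values.

```python
import statistics

# Hypothetical ages (made-up values, not the Nobel Prize data).
ages = [17, 35, 44, 51, 55, 58, 60, 62, 66, 69, 73, 80, 97]

# quantiles() with n=4 returns the three cut points: 1st quartile, median, 3rd quartile.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

iqr = q3 - q1                        # inter-quartile range: width of the 'box'
data_range = max(ages) - min(ages)   # distance between the ends of the 'whiskers'

print(q1, median, q3, iqr, data_range)
```

The box would be drawn from q1 to q3 with a line at the median, and the whiskers would reach out to the minimum and maximum values.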
Statistics - Average
An average is a measure of where most of the values in the data are located.
The center of the data is where most of the values in the data are located. Averages are
measures of the location of the center.
There are different types of averages. The most commonly used are:
o Mean
o Median
o Mode
Note: In statistics, averages are often referred to as 'measures of central tendency'.
For example, using the values:
40, 21, 55, 21, 48, 13, 72
Median
The median is the 'middle value' of the data.
The median is found by ordering all the values in the data and picking the middle value:
13, 21, 21, 40, 48, 55, 72
The median is less influenced by extreme values in the data than the mean.
Changing the last value to 356 does not change the median:
13, 21, 21, 40, 48, 55, 356
The median is still 40.
Changing the last value to 356 changes the mean a lot:
(13 + 21 + 21 + 40 + 48 + 55 + 72)/7 = 38.57
(13 + 21 + 21 + 40 + 48 + 55 + 356)/7 = 79.14
Note: Extreme values are values in the data that are much smaller or larger than the average values in
the data.
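The comparison above can be reproduced with Python's statistics module, using the same values with and without the extreme value 356:

```python
import statistics

values = [13, 21, 21, 40, 48, 55, 72]
with_extreme = [13, 21, 21, 40, 48, 55, 356]

for data in (values, with_extreme):
    # Print the mean (rounded to two decimals) followed by the median.
    print(round(statistics.mean(data), 2), statistics.median(data))

# 38.57 40
# 79.14 40
```

The mean more than doubles, while the median stays at 40.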
Mode
The mode is the value(s) that appears most often in the data:
40, 21, 55, 21, 48, 13, 72
Here, 21 appears two times, and the other values only once. The mode of this data is 21.
The mode is also used for categorical data, unlike the median and mean. Categorical data can't
be described directly with numbers, like names:
Alice, John, Bob, Maria, John, Julia, Carol
Here, John appears two times, and the other values only once. The mode of this data is John.
Note: There can be more than one mode if multiple values appear the same number of times in the
data.
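The mode examples can be checked with statistics.multimode() (available from Python 3.8), which returns a list so that more than one mode can be reported. The list named tied is a made-up example with two modes.

```python
import statistics

numbers = [40, 21, 55, 21, 48, 13, 72]
names = ["Alice", "John", "Bob", "Maria", "John", "Julia", "Carol"]
tied = [1, 1, 2, 2, 3]  # hypothetical data where two values are equally common

print(statistics.multimode(numbers))  # [21]
print(statistics.multimode(names))    # ['John']
print(statistics.multimode(tied))     # [1, 2]
```

The same function works for numbers and for names, because finding the mode only requires counting, not calculating.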
Statistics - Mean
The mean is a type of average value, which describes where the center of the data is located.
Mean
The mean is usually referred to as 'the average'.
The mean is the sum of all the values in the data divided by the total number of values in the data.
The mean is calculated for numerical variables. A variable is something in the data that can vary,
like:
Age
Height
Income
Note: There are multiple types of mean values. The most common type of mean is the arithmetic mean.
In this tutorial 'mean' refers to the arithmetic mean.
Calculating the Mean
You can calculate the mean for both the population and the sample.
The formulas are the same, but use different symbols to refer to the population mean (μ) and the sample mean
(x̄).
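Written out, "the sum of all the values divided by the total number of values" can be sketched as follows. Using N for the number of values in the population and n for the number of values in the sample is a common convention and an assumption here. The last expression works through the values 4, 11, 7, 14 as an example.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\text{e.g.}\quad \bar{x} = \frac{4 + 11 + 7 + 14}{4} = \frac{36}{4} = 9
```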
Calculation with Programming
The mean can easily be calculated with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating by hand becomes difficult.
Example
With Python, use the NumPy library's mean() function to find the mean of the values 4, 11, 7, 14:
import numpy
values = [4,11,7,14]
x = numpy.mean(values)
print(x)
o/p: 9.0
With R use the mean() function to find the mean of the values 4,11,7,14:
values <- c(4,7,11,14)
mean(values)
o/p:
[1] 9
Statistics - Median
The median is a type of average value, which describes where the center of the data is located.
The median is the middle value in a data set ordered from low to high.
Finding the Median
The median can only be calculated for numerical variables.
The formula for finding the position of the middle value is:
(n+1)/2
Where n is the total number of observations.
If the total number of observations is an odd number, the formula gives a whole number and the
value of this observation is the median.
13, 21, 21, 40, 48, 55, 72
Here, there are 7 total observations, so the median is the 4th value:
(7+1)/2 = 4
The 4th value in the ordered list is 40, so that is the median.
If the total number of observations is an even number, the formula gives a decimal number between two observations.
13, 21, 21, 40, 42, 48, 55, 72
Here, there are 8 total observations, so the median is between the 4th and 5th values:
(8+1)/2 = 4.5
The 4th and 5th values in the ordered list are 40 and 42, so the median is the mean of these two values. That is, the sum of those two values divided by 2:
(40+42)/2 = 41
Note: It is important that the numbers are ordered before you can find the median.
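The hand calculation above can be sketched in Python as follows. The function name median is a hypothetical helper written for this illustration, not a library call. It sorts the values first and then applies the (n+1)/2 position rule to both the odd and the even case.

```python
def median(values):
    """Return the median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)           # the values must be ordered first
    n = len(ordered)
    position = (n + 1) / 2             # 1-based position of the middle value
    if n % 2 == 1:
        # Odd count: the position is a whole number.
        return ordered[int(position) - 1]
    # Even count: the position falls between two observations.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

odd_median = median([13, 21, 21, 40, 48, 55, 72])
even_median = median([13, 21, 21, 40, 42, 48, 55, 72])
print(odd_median, even_median)  # 40 41.0
```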
Finding the Median with Programming
The median can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library median() method to find the median of the values 13, 21, 21,
40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.median(values)
print(x)
o/p: 41.0
Statistics - Mode
The mode is a type of average value, which describes where most of the data is located.
Mode
The mode is the value(s) that are the most common in the data.
A dataset can have multiple values that are modes.
A distribution of values with only one mode is called unimodal.
A distribution of values with two modes is called bimodal. In general, a distribution with more than
one mode is called multimodal.
Mode can be found for both categorical and numerical data.
Finding the Mode
Here is a numerical example:
4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12
Both 7 and 12 appear two times each, and the other values only once. The modes of this data are 7 and 12.
Here is a categorical example with names:
Alice, John, Bob, Maria, John, Julia, Carol
John appears two times, and the other values only once. The mode of this data is John.
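For categorical data, counting how often each value appears is enough to find the mode. The following is a small sketch using collections.Counter from the Python standard library on the names above. The variable names are chosen for this illustration only.

```python
from collections import Counter

names = ["Alice", "John", "Bob", "Maria", "John", "Julia", "Carol"]

counts = Counter(names)            # how often each name appears
highest = max(counts.values())     # the largest count
# Every name that reaches the largest count is a mode.
modes = [name for name, count in counts.items() if count == highest]
print(modes)  # ['John']
```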
Finding the Mode with Programming
The mode can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating manually becomes difficult.
Example
With Python use the statistics library multimode() method to find the modes of the values
4,7,3,8,11,7,10,19,6,9,12,12:
from statistics import multimode
values = [4,7,3,8,11,7,10,19,6,9,12,12]
x = multimode(values)
print(x)
o/p: [7, 12]
Statistics - Variation
Variation is a measure of how spread out the data is around the center of the data.
Measures of variation are statistics of how far away the values in the observations (data points) are
from each other.
There are different measures of variation. The most commonly used are:
o Range
o Quartiles and Percentiles
o Interquartile Range
o Standard Deviation
Measures of variation combined with an average (measure of center) give a good picture of the distribution of the data.
Note: These measures of variation can only be calculated for numerical data.
Range
The range is the difference between the smallest and the largest value of the data.
Range is the simplest measure of variation.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
range:
The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.
Quartiles and Percentiles
Quartiles and percentiles are ways of separating equal numbers of values in the data into parts.
Quartiles are values that separate the data into four equal parts.
Percentiles are values that separate the data into 100 equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
quartiles:
The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.
Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.
o Q0 is the smallest value in the data.
o Q2 is the middle value (median).
o Q4 is the largest value in the data.
Interquartile Range
Interquartile range is the difference between the first and third quartiles (Q1 and Q3).
The 'middle half' of the data is between the first and third quartile.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
interquartile range (IQR):
Here, the middle half of the ages is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.
Standard Deviation
Standard deviation is the most used measure of variation.
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).
Standard deviation is important for many statistical methods.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard
deviations:
Note: Values within one standard deviation (σ) are considered to be typical. Values outside three standard deviations are considered to be outliers.
Statistics - Range
The range is a measure of variation, which describes how spread out the data is.
Calculating the Range
The range can only be calculated for numerical data.
First, find the smallest and largest values of this example:
13, 21, 21, 40, 48, 55, 72
Calculate the difference by subtracting the smallest from the largest:
72 - 13 = 59
Calculating the Range with Programming
The range can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library ptp() method to find the range of the values 13, 21, 21, 40, 48,
55, 72:
import numpy
values = [13,21,21,40,48,55,72]
x = numpy.ptp(values)
print(x)
o/p: 59
Statistics - Quartiles and Percentiles
Quartiles and percentiles are measures of variation, which describe how spread out the data is.
Quartiles and percentiles are both types of quantiles.
Quartiles
Quartiles are values that separate the data into four equal parts.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
quartiles:
The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter.
Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.
Q0 is the smallest value in the data.
Q1 is the value separating the first quarter from the second quarter of the data.
Q2 is the middle value (median), separating the bottom from the top half.
Q3 is the value separating the third quarter from the fourth quarter.
Q4 is the largest value in the data.
Calculating Quartiles with Programming
Quartiles can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the NumPy library quantile() method to find the quartiles of the values 13, 21, 21,
40, 42, 48, 55, 72:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)
O/P : [13. 21. 41. 49.75 72. ]
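Percentiles work the same way as quartiles but split the data into 100 parts. As a hedged sketch, NumPy's percentile() function takes the positions as percentages (0 to 100) instead of fractions. The three percentiles chosen here are arbitrary examples, and NumPy interpolates linearly between observations by default, so other tools may give slightly different results.

```python
import numpy

values = [13, 21, 21, 40, 42, 48, 55, 72]

# 10th, 50th and 90th percentiles; the 50th percentile is the median (Q2).
p10, p50, p90 = numpy.percentile(values, [10, 50, 90])
print(p10, p50, p90)
```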
Statistics - Interquartile Range
Interquartile range is a measure of variation, which describes how spread out the data is.
Interquartile Range is the difference between the first and third quartiles (Q1 and Q3).
The 'middle half' of the data is between the first and third quartile.
The first quartile is the value in the data that separates the bottom 25% of values from the top
75%.
The third quartile is the value in the data that separates the bottom 75% of the values from the top 25%.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the
interquartile range (IQR):
Here, the middle half of the ages is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.
Calculating the Interquartile Range with Programming
The interquartile range can easily be found with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
finding it manually becomes difficult.
Example
With Python use the SciPy library iqr() method to find the interquartile range of the values 13, 21,
21, 40, 42, 48, 55, 72:
from scipy import stats
values = [13,21,21,40,42,48,55,72]
x = stats.iqr(values)
print(x)
O/P : 28.75
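To connect this result to the definition, the interquartile range can also be computed directly as Q3 minus Q1. This sketch uses NumPy's quantile() function with its default linear interpolation, which is assumed here to match the SciPy default.

```python
import numpy

values = [13, 21, 21, 40, 42, 48, 55, 72]

q1, q3 = numpy.quantile(values, [0.25, 0.75])  # first and third quartiles
iqr = q3 - q1                                   # width of the 'middle half'
print(q1, q3, iqr)
```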
Statistics - Standard Deviation
Standard deviation is the most commonly used measure of variation, which describes how spread
out the data is.
Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ).
Standard deviation is important for many statistical methods.
Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard
deviations:
Each dotted line in the histogram shows a shift of one extra standard deviation.
If the data is normally distributed:
o Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
o Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
o Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
Note: A normal distribution has a "bell" shape and spreads out equally on both sides.
Calculating the Standard Deviation
You can calculate the standard deviation for both the population and the sample.
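The formulas are almost the same but use different symbols: σ for the population standard deviation and s for the sample standard deviation.
Calculating the population standard deviation (σ) is done with this formula:
σ = √( Σ(xᵢ − μ)² / n )
Calculating the sample standard deviation (s) is done with this formula:
s = √( Σ(xᵢ − x̄)² / (n − 1) )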
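As a sketch of the calculation by hand, the standard deviation is the square root of the average squared distance from the mean. The population version divides by n, while the sample version divides by n - 1. The values are the ones used in the programming example in this section, and the variable names are chosen for illustration.

```python
import math

values = [4, 11, 7, 14]
n = len(values)
mean = sum(values) / n

# Sum of the squared distances from the mean.
squared_diffs = sum((x - mean) ** 2 for x in values)

population_sd = math.sqrt(squared_diffs / n)         # divide by n
sample_sd = math.sqrt(squared_diffs / (n - 1))       # divide by n - 1
print(population_sd, sample_sd)
```

The sample standard deviation is always a little larger than the population standard deviation for the same values, because the sum is divided by a smaller number.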
Calculating the Standard Deviation with Programming
The standard deviation can easily be calculated with many programming languages.
Using software and programming to calculate statistics is more common for bigger sets of data, as
calculating by hand becomes difficult.
Population Standard Deviation
Example
With Python use the NumPy library std() method to find the standard deviation of the values
4,11,7,14:
import numpy
values = [4,11,7,14]
x = numpy.std(values)
print(x)
o/p:
3.8078865529319543
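Sample Standard Deviation
Example
With Python use the NumPy library std() method with ddof=1 to find the sample standard deviation of the same values:
x = numpy.std(values, ddof=1)
print(x)
o/p: 4.396968652757639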
Inferential Statistics
Statistics - Statistical Inference
Statistical Inference
Using data analysis and statistics to make conclusions about a population is called statistical
inference.
The main types of statistical inference are:
• Estimation
• Hypothesis testing
Estimation
Statistics from a sample are used to estimate population parameters.
The most likely value is called a point estimate.
There is always uncertainty when estimating.
The uncertainty is often expressed as confidence intervals defined by a likely lowest and highest
value for the parameter.
An example could be a confidence interval for the number of bicycles a Dutch person owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."
Hypothesis Testing
Hypothesis testing is a method to check if a claim about a population is true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data.
There are different types of hypothesis testing.
The steps of the test depend on:
Type of data (categorical or numerical)
If you are looking at:
o A single group
o Comparing one group to another
o Comparing the same group before and after a change
Some examples of claims or questions that can be checked with hypothesis testing:
o 90% of Australians are left handed
o Is the average weight of dogs more than 40kg?
o Do doctors make more money than lawyers?
Statistics - Normal Distribution
The normal distribution is an important probability distribution used in statistics.
Many real world examples of data are normally distributed.
Normal Distribution
The normal distribution is described by the mean (μ) and the standard deviation (σ).
The normal distribution is often referred to as a 'bell curve' because of its shape:
Most of the values are around the center (μ)
The median and mean are equal
It has only one mode
It is symmetric, meaning it decreases the same amount on the left and the right of the center
The area under the curve of the normal distribution represents probabilities for the data.
The area under the whole curve is equal to 1, or 100%
Here is a graph of a normal distribution with probabilities between standard deviations (σ):
• Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
• Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
• Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
Note: Probabilities of the normal distribution can only be calculated for intervals (between two values).
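The percentages for 1, 2 and 3 standard deviations can be checked with software. This is a minimal sketch using the SciPy norm.cdf() function: the probability of an interval is the area to the left of the upper value minus the area to the left of the lower value. The dictionary name within is chosen for this illustration.

```python
from scipy import stats

# Probability of being within k standard deviations of the mean:
# the area under the standard normal curve between -k and +k.
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, probability in within.items():
    print(k, round(probability * 100, 2))
```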
Different Mean and Standard Deviations
The mean describes where the center of the normal distribution is.
Here is a graph showing three different normal distributions with the same standard deviation but
different means.
The standard deviation describes how spread out the normal distribution is.
Here is a graph showing three different normal distributions with the same mean but different standard deviations.
The purple curve has the biggest standard deviation and the black curve has the smallest standard
deviation.
The area under each of the curves is still 1, or 100%.
A Real Data Example of Normally Distributed Data
Real world data is often normally distributed.
Here is a histogram of the age of Nobel Prize winners when they won the prize:
The normal distribution drawn on top of the histogram is based on the population mean (μ) and standard deviation (σ) of the real data.
We can see that the histogram is close to a normal distribution.
Examples of real world variables that can be normally distributed:
• Test scores
• Height
• Birth weight
Probability Distributions
Probability distributions are functions that calculate the probabilities of the outcomes of random variables.
Typical examples of random variables are coin tosses and dice rolls.
Here is a graph showing the results of a growing number of coin tosses and the expected values of the results (heads or tails).
The expected values of the coin toss are the probability distribution of the coin toss.
Notice how the result of random coin tosses gets closer to the expected values (50%) as the number of tosses increases.
Similarly, here is a graph showing the results of a growing number of dice rolls and the expected values of the results (from 1 to 6).
Notice again how the result of random dice rolls gets closer to the expected values (1/6, or 16.666%) as the number of rolls increases.
When the random variable is a sum of dice rolls, the results and expected values take a different shape.
The different shape comes from there being more ways of getting a sum near the middle than a small or large sum.
As we keep increasing the number of dice for a sum, the shape of the results and expected values looks more and more like a normal distribution.
Many real world variables follow a similar pattern and naturally form normal distributions.
Normally distributed variables can be analyzed with well-known techniques.
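The effect of summing dice can be tried out with a small simulation. This is a sketch under assumed settings: the number of dice (10), the number of trials (100,000) and the seed are arbitrary choices for illustration. It uses the NumPy random generator to roll the dice and then checks how much of the data falls within one standard deviation of the mean.

```python
import numpy

rng = numpy.random.default_rng(seed=0)   # fixed seed so the run is repeatable

n_dice = 10          # dice added together per trial (assumed for illustration)
n_trials = 100_000   # number of simulated trials (assumed for illustration)

rolls = rng.integers(1, 7, size=(n_trials, n_dice))  # each roll is 1 to 6
sums = rolls.sum(axis=1)                              # one sum per trial

# One die has mean 3.5 and variance 35/12, so the sum of 10 dice should
# have a mean near 35 and a standard deviation near sqrt(10 * 35 / 12).
print(sums.mean(), sums.std())

# Share of the sums that lie within 1 standard deviation of the mean.
share_within_1sd = numpy.mean(numpy.abs(sums - sums.mean()) <= sums.std())
print(share_within_1sd)
```

The share within one standard deviation comes out close to the value expected for a normal distribution, although not exactly equal because dice sums are whole numbers.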
Statistics - Standard Normal Distribution
The standard normal distribution is a normal distribution where the mean is 0 and the standard
deviation is 1.
Normally distributed data can be transformed into a standard normal distribution.
Standardizing normally distributed data makes it easier to compare different sets of data.
The standard normal distribution is used for:
• Calculating confidence intervals
• Hypothesis tests
Here is a graph of the standard normal distribution with probability values (p-values) between the
standard deviations:
Standardizing makes it easier to calculate probabilities.
The functions for calculating probabilities are complex and difficult to calculate by hand.
Typically, probabilities are found by looking up tables of pre-calculated values, or by using software and programming.
The standard normal distribution is also called the 'Z-distribution' and the values are called 'Z-values' (or Z-scores).
Z-Values
Z-values express how many standard deviations from the mean a value is.
The formula for calculating a Z-value is:
Z=(x−μ)/σ
x is the value we are standardizing, μ is the mean, and σ is the standard deviation.
For example, if we know that:
The mean height of people in Germany is 170 cm (μ)
The standard deviation of the height of people in Germany is 10 cm (σ)
Bob is 200 cm tall (x)
Bob is 30 cm taller than the average person in Germany.
30 cm is 3 times 10 cm. So Bob's height is 3 standard deviations larger than the mean height in Germany.
Using the formula:
Z = (x−μ)/σ = (200−170)/10 = 3
Finding the P-value of a Z-Value
Using a Z-table or programming we can calculate how many people in Germany are shorter than Bob and how many are taller.
Example
With Python use the Scipy Stats library norm.cdf() function to find the probability of getting less than a Z-value of 3:
import scipy.stats as stats
print(stats.norm.cdf(3))
O/P: 0.9986501019683699
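Putting the pieces together, the whole calculation for Bob can be sketched in one script: compute the Z-value from the formula, then use norm.cdf() for the share of people who are shorter and 1 minus that for the share who are taller. This assumes that heights are normally distributed. The variable names are chosen for illustration.

```python
from scipy import stats

mu = 170      # mean height in cm
sigma = 10    # standard deviation in cm
x = 200       # Bob's height in cm

z = (x - mu) / sigma                 # Z-value: 3.0
share_shorter = stats.norm.cdf(z)    # area to the left of z
share_taller = 1 - share_shorter     # area to the right of z
print(z, share_shorter, share_taller)
```

Under this assumption about 99.87% of people are shorter than Bob and about 0.13% are taller.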
Statistics - Student's T Distribution
The student's t-distribution is similar to a normal distribution and used in statistical inference to
adjust for uncertainty. It is used for estimation and hypothesis testing of a population mean
(average).
The t-distribution is adjusted for the extra uncertainty of estimating the mean.
If the sample is small, the t-distribution is wider. If the sample is big, the t-distribution is narrower.
The bigger the sample size is, the closer the t-distribution gets to the standard normal
distribution.
Notice how some of the curves have bigger tails. This is due to the uncertainty from a smaller sample size.
The green curve has the smallest sample size.
For the t-distribution this is expressed as 'degrees of freedom' (df), which is calculated by subtracting 1 from the sample size (n).
For example, a sample size of 30 will make 29 degrees of freedom for the t-distribution.
The t-distribution is used to find critical t-values and p-values (probabilities) for estimation and hypothesis testing.
Note: Finding the critical t-values and p-values of the t-distribution is similar to finding z-values and p-values of the standard normal distribution. But make sure to use the correct degrees of freedom.
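How the t-distribution approaches the standard normal distribution can be seen by comparing critical values. This is a sketch with assumed sample sizes (5, 30 and 1000, picked only for illustration): for each one it finds the t-value with 97.5% of the distribution below it and compares it with the matching Z-value.

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)     # standard normal value with 97.5% below it

sample_sizes = [5, 30, 1000]       # sample sizes chosen for illustration
t_crits = []
for n in sample_sizes:
    df = n - 1                     # degrees of freedom
    t_crit = stats.t.ppf(0.975, df)
    t_crits.append(t_crit)
    print(n, df, round(t_crit, 3), round(z_crit, 3))
```

The critical t-value shrinks toward the Z-value as the sample size grows, which reflects the wider tails for small samples.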
Finding the P-Value of a T-Value
You can find the p-value of a t-value by using a t-table or with programming.
Example
With Python use the Scipy Stats library t.cdf() function to find the probability of getting less than a t-value of 2.1 with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.cdf(2.1, 29))
O/P: 0.9777290209818548
Finding the T-value of a P-Value
You can find the t-value of a p-value by using a t-table or with programming.
Example
With Python use the Scipy Stats library t.ppf() function to find the t-value separating the top 25% from the bottom 75% with 29 degrees of freedom:
import scipy.stats as stats
print(stats.t.ppf(0.75, 29))
O/P: 0.6830438592467808
Statistics - Estimation
Point estimates are the most likely value for a population parameter.
Confidence intervals express the uncertainty of an estimated population parameter.
A point estimate is calculated from a sample.
The point estimate depends on the type of data:
• Categorical data: the number of occurrences divided by the sample size.
• Numerical data: the mean (the average) of the sample.
One example could be:
The point estimate for the average height of people in Denmark is 180 cm.
Estimates are always uncertain. This uncertainty can be expressed with a confidence interval.
Confidence Intervals
The confidence interval is defined by a lower bound and an upper bound.
This gives us a range of values that the true parameter is likely to be between.
For example:
The average height of people in Denmark is between 170 cm and 190 cm.
Here, 170 cm is the lower bound, and 190 cm is the upper bound.
The lower and upper bounds of a confidence interval are based on the confidence level.
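One common way to build a confidence interval for a mean is to take the point estimate plus and minus a margin of error based on the t-distribution. The sketch below assumes a small made-up sample of heights (hypothetical numbers, not real data) and a 95% confidence level. It is an illustration of the idea rather than the only possible method.

```python
import math
from scipy import stats

# Hypothetical sample of heights in cm (made-up numbers, not real data).
sample = [172, 175, 168, 181, 177, 169, 174, 180, 176, 173]

n = len(sample)
mean = sum(sample) / n                                   # point estimate
# Sample standard deviation (divides by n - 1).
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
standard_error = s / math.sqrt(n)

confidence_level = 0.95
alpha = 1 - confidence_level
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)   # critical t-value, n - 1 degrees of freedom

margin_of_error = t_crit * standard_error
lower = mean - margin_of_error               # lower bound
upper = mean + margin_of_error               # upper bound
print(mean, lower, upper)
```

A higher confidence level gives a larger critical value and therefore a wider interval.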
Statistics - Hypothesis Testing
Hypothesis testing is a formal way of checking if a hypothesis about a population is true or not.
A hypothesis is a claim about a population parameter.
A hypothesis test is a formal procedure to check if a hypothesis is true or not.
Examples of claims that can be checked:
The average height of people in Denmark is more than 170 cm.
The share of left handed people in Australia is not 10%.
The average income of dentists is less than the average income of lawyers.
The Null and Alternative Hypothesis
Hypothesis testing is based on making two different claims about a population parameter.
The null hypothesis (H0) and the alternative hypothesis (H1) are the claims.
The two claims need to be mutually exclusive, meaning only one of them can be true.
The alternative hypothesis is typically what we are trying to prove.
For example, we want to check the following claim:
"The average height of people in Denmark is more than 170 cm."
In this case, the parameter is the average height of people in Denmark (μ).
The null and alternative hypothesis would be:
Null hypothesis: The average height of people in Denmark is 170 cm.
Alternative hypothesis: The average height of people in Denmark is more than 170 cm.
The claims are often expressed with symbols like this:
H0: μ = 170
H1: μ > 170
If the data supports the alternative hypothesis, we reject the null hypothesis and accept the
alternative hypothesis.
If the data does not support the alternative hypothesis, we keep the null hypothesis.
Note: The alternative hypothesis is also referred to as HA.
The Significance Level
The significance level (α) is the uncertainty we accept when rejecting the null hypothesis in
the hypothesis test.
The significance level is a percentage probability of accidentally making the wrong
conclusion.
Typical significance levels are:
•α=0.1 (10%)
•α=0.05 (5%)
•α=0.01 (1%)
A lower significance level means that the evidence in the data needs to be stronger to reject the null
hypothesis.
There is no "correct" significance level - it only states the uncertainty of the conclusion.
Note: A 5% significance level means that when the null hypothesis is actually true, we expect to wrongly reject it in about 5 out of 100 tests.
The Critical Value and P-Value Approach
There are two main approaches used for hypothesis tests:
The critical value approach compares the test statistic with the critical value of the significance
level.
The p-value approach compares the p-value of the test statistic with the significance level.
The Critical Value Approach
The critical value approach checks if the test statistic is in the rejection region.
The rejection region is an area of probability in the tails of the distribution.
The size of the rejection region is decided by the significance level (α).
The value that separates the rejection region from the rest is called the critical value.
Here is a graphical illustration:
If the test statistic is inside this rejection region, the null hypothesis is rejected.
For example, if the test statistic is 2.3 and the critical value is 2 for a significance level (α=0.05):
We reject the null hypothesis (H0) at a 0.05 significance level (α)
The P-Value Approach
The p-value approach checks if the p-value of the test statistic is smaller than the significance level (α).
The p-value of the test statistic is the area of probability in the tails of the distribution from the
value of the test statistic.
Here is a graphical illustration:
If the p-value is smaller than the significance level, the null hypothesis is rejected.
The p-value directly tells us the lowest significance level where we can reject the null
hypothesis.
For example, if the p-value is 0.03:
We reject the null hypothesis (H0) at a 0.05 significance level (α)
We keep the null hypothesis (H0) at a 0.01 significance level (α)
Note: The two approaches are only different in how they present the conclusion.
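That the two approaches lead to the same conclusion can be sketched in code. The example below assumes a right-tailed test based on the standard normal distribution and a test statistic of 2.3. Under that assumption the critical value for α=0.05 is about 1.645, not the rounded illustrative value of 2. The function name right_tailed_z_test is a hypothetical helper.

```python
from scipy import stats

test_statistic = 2.3    # example Z-value of the test statistic

def right_tailed_z_test(z, alpha):
    """Return (reject_by_critical_value, reject_by_p_value) for a right-tailed Z-test."""
    critical_value = stats.norm.ppf(1 - alpha)   # start of the rejection region
    p_value = 1 - stats.norm.cdf(z)              # tail area beyond the test statistic
    return bool(z > critical_value), bool(p_value < alpha)

result_05 = right_tailed_z_test(test_statistic, 0.05)
result_01 = right_tailed_z_test(test_statistic, 0.01)
print(result_05)  # (True, True)
print(result_01)  # (False, False)
```

Both approaches reject the null hypothesis at the 0.05 level and keep it at the 0.01 level, because the p-value of this test statistic is about 0.0107.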
Steps for a Hypothesis Test
The following steps are used for a hypothesis test:
1. Check the conditions
2. Define the claims
3. Decide the significance level
4. Calculate the test statistic
5. Conclusion
One condition is that the sample is randomly selected from the population.
The other conditions depend on what type of parameter you are testing the hypothesis for.
Common parameters to test hypotheses are:
o Proportions (for qualitative data)
o Mean values (for numerical data)
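The steps above can be sketched for a test of a mean value. The example assumes a made-up sample of heights (hypothetical numbers, not real data), assumes the sample was randomly selected (step 1 cannot be checked in code), and tests the claim that the average height is more than 170 cm using the t-distribution.

```python
import math
from scipy import stats

# Step 1: check the conditions. The sample is assumed to be random.
# Hypothetical sample of heights in cm (made-up numbers, not real data).
sample = [172, 175, 168, 181, 177, 169, 174, 180, 176, 173]

# Step 2: define the claims. H0: mu = 170, H1: mu > 170 (right-tailed test).
mu_0 = 170

# Step 3: decide the significance level.
alpha = 0.05

# Step 4: calculate the test statistic.
n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample standard deviation
t_statistic = (mean - mu_0) / (s / math.sqrt(n))

# Step 5: conclusion, using the p-value approach with n - 1 degrees of freedom.
p_value = 1 - stats.t.cdf(t_statistic, n - 1)
reject_h0 = bool(p_value < alpha)
print(t_statistic, p_value, reject_h0)
```

With this made-up sample the p-value is below 0.05, so the null hypothesis is rejected at that significance level.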
We are missing one important variable that affects Calorie_Burnage, which is the Duration of the training session.
Duration in combination with Average_Pulse will together explain Calorie_Burnage more precisely.
Data Science - Linear Regression
The term regression is used when you try to find the relationship between variables.
In Machine Learning and in statistical modeling, that relationship is used to predict the
outcome of events.
In this module, we will cover the following questions:
Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage?
Can we use Average_Pulse and Duration to predict Calorie_Burnage?
Least Square Method
Linear regression uses the least square method.
The concept is to draw a line through the plotted data points. The line is positioned in a way that minimizes the distance to all of the data points (more precisely, the sum of the squared distances).
The distance is called "residuals" or "errors".
The red dashed lines represent the distance from the data points to the drawn mathematical function.
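For a single explanatory variable the least square line can be computed with a short closed-form calculation. The sketch below uses made-up pulse and calorie numbers (hypothetical data, not the tutorial's data.csv) and compares the result with the SciPy linregress() function.

```python
import numpy
from scipy import stats

# Hypothetical data (made-up numbers): average pulse and calories burned.
x = numpy.array([80, 90, 100, 110, 120])
y = numpy.array([240, 260, 300, 330, 370])

# Closed-form least squares solution for the line y = slope * x + intercept.
x_mean, y_mean = x.mean(), y.mean()
slope = numpy.sum((x - x_mean) * (y - y_mean)) / numpy.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: vertical distances from the data points to the line.
residuals = y - (slope * x + intercept)
print(slope, intercept, numpy.sum(residuals ** 2))

# The same line from SciPy, for comparison.
fit = stats.linregress(x, y)
print(fit.slope, fit.intercept)
```

No other straight line gives a smaller sum of squared residuals for these points.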
Linear Regression Using One Explanatory Variable
In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear
Regression:
Example
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Load the health data; data.csv must contain the two columns used below
full_health_data = pd.read_csv("data.csv", header=0, sep=",")
x = full_health_data["Average_Pulse"]
y = full_health_data["Calorie_Burnage"]

# Fit a straight line with the least square method
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    # Predicted Calorie_Burnage for a given Average_Pulse
    return slope * x + intercept

mymodel = list(map(myfunc, x))

# Plot the observations and the fitted line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.ylim(ymin=0, ymax=2000)
plt.xlim(xmin=0, xmax=200)
plt.xlabel("Average_Pulse")
plt.ylabel("Calorie_Burnage")
plt.show()
Do you think that the line is able to predict Calorie_Burnage precisely?
We will show that the variable Average_Pulse alone is not enough to make a precise prediction of Calorie_Burnage.
Congrats ! You got your Data Science Job

Congrats ! You got your Data Science Job

  • 1. Finish this Book and get your Data Scientist Job
  • 2. Data Science Introduction: Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to analyze data and to extract knowledge and insights from it. What is Data Science? Data Science is about data gathering, analysis, and decision-making. It is about finding patterns in data through analysis and making future predictions. By using Data Science, companies are able to make:
      - Better decisions (should we choose A or B?)
      - Predictive analyses (what will happen next?)
      - Pattern discoveries (finding patterns, or maybe hidden information, in the data)
  • 3.
  • 4. Where is Data Science Needed? Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and manufacturing. Examples of where Data Science is needed:
      - For route planning: to discover the best routes to ship
      - To foresee delays for flights, ships, trains, etc. (through predictive analysis)
      - To create promotional offers
      - To find the best-suited time to deliver goods
      - To forecast next year's revenue for a company
      - To analyze the health benefit of training
      - To predict who will win elections
  • 5. Where is Data Science Needed? Data Science can be applied in nearly every part of a business where data is available. Examples are:
      - Consumer goods
      - Stock markets
      - Industry
      - Politics
      - Logistics companies
      - E-commerce
  • 6. How Does a Data Scientist Work? A Data Scientist requires expertise in several fields:
      - Machine Learning
      - Statistics
      - Programming (Python or R)
      - Mathematics
      - Databases
      A Data Scientist must find patterns within the data. Before finding the patterns, the Data Scientist must organize the data in a standard format.
  • 7. Here is how a Data Scientist works:
      - Ask the right questions: to understand the business problem.
      - Explore and collect data: from databases, web logs, customer feedback, etc.
      - Extract the data: transform the data to a standardized format.
      - Clean the data: remove erroneous values from the data.
      - Find and replace missing values: check for missing values and replace them with a suitable value (e.g. an average value).
      - Normalize data: scale the values to a practical range (e.g. 140 cm is shorter than 1.8 m, yet the number 140 is larger than 1.8, so scaling is important).
      - Analyze data, find patterns, and make future predictions.
      - Represent the result: present the result with useful insights in a way the "company" can understand.
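The "find and replace missing values" and "normalize data" steps above can be sketched with Pandas as follows. The heights, the unit column, and the column names are hypothetical, chosen only to mirror the 140 cm versus 1.8 m example; this is one possible approach, not a prescribed one.

```python
import pandas as pd

# Hypothetical measurements: heights recorded in mixed units, one value missing
raw = pd.DataFrame({
    "height": [140.0, 1.8, None, 165.0],
    "unit": ["cm", "m", "cm", "cm"],
})

# Normalize: express every height in centimetres so the numbers are comparable
factor = raw["unit"].map({"cm": 1.0, "m": 100.0})
raw["height_cm"] = raw["height"] * factor

# Find and replace missing values with the column average
mean_height = raw["height_cm"].mean()
raw["height_cm"] = raw["height_cm"].fillna(mean_height)

print(raw)
```

In this sketch the unit conversion runs before the missing value is filled, because an average only makes sense once all values share the same unit.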
  • 8. What is Data? Data is a collection of information. One purpose of Data Science is to structure data, making it interpretable and easy to work with. Data can be categorized into two groups:
      - Structured data
      - Unstructured data
  • 9. Data Types? Unstructured Data: Unstructured data is not organized. We must organize the data for analysis purposes. Structured Data: Structured data is organized and easier to work with.
  • 10. How to Structure Data? We can use an array or a database table to structure or present data. Example of an array: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]. The following example shows how to create an array in Python:
          # Example
          Array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
          print(Array)
      o/p: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
  • 11. Data Science - Database Table: What is a Database Table? A database table is a table with structured data. The following table shows a database table with health data extracted from a sports watch: This data set contains information about a typical training session, such as duration, average pulse, and calorie burnage.
  • 12. Data Science - Database Table: Database Table Structure A database table consists of column(s) and row(s): A row is a horizontal representation of data. A column is a vertical representation of data.
  • 13. Data Science - Database Table: Variables. A variable is defined as something that can be measured or counted. Examples can be characters, numbers, or time. In the example below, we can observe that each column represents a variable.
  • 14. Data Science - Database Table: Variables A variable is defined as something that can be measured or counted. There are 6 columns, meaning that there are 6 variables (Duration, Average_Pulse, Max_Pulse, Calorie_Burnage, Hours_Work, Hours_Sleep). There are 11 rows, meaning that each variable has 10 observations. But if there are 11 rows, how come there are only 10 observations? It is because the first row is the label, meaning that it is the name of the variable.
  • 15. Data Science & Python. Python: Python is a programming language widely used by Data Scientists. Python has built-in mathematical libraries and functions, making it easier to solve mathematical problems and to perform data analysis. Python Libraries: Python has libraries with large collections of mathematical functions and analytical tools. In this course, we will use the following libraries:
      - Pandas: used for structured data operations, such as importing CSV files, creating data frames, and data preparation
      - NumPy: a mathematical library with a powerful N-dimensional array object, linear algebra, Fourier transforms, etc.
      - Matplotlib: used for visualization of data
      - SciPy: provides linear algebra modules
  • 16. Data Science - Python DataFrame: Create a DataFrame with Pandas. A data frame is a structured representation of data. Let's define a data frame with 3 columns and 5 rows with fictional numbers:
      Example
          import pandas as pd
          d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
          df = pd.DataFrame(data=d)
          print(df)
      Example Explained
      - Import the Pandas library as pd
      - Define data with columns and rows in a variable named d
      - Create a data frame using the function pd.DataFrame()
      - The data frame contains 3 columns and 5 rows
      - Print the data frame output with the print() function
  • 17. Data Science - Python DataFrame: We write pd. in front of DataFrame() to let Python know that we want to call the DataFrame() function from the Pandas library. Be aware of the capital D and F in DataFrame! Interpreting the Output: We see that "col1", "col2" and "col3" are the names of the columns. Do not be confused by the vertical numbers ranging from 0 to 4. They tell us the position of each row. In Python, the numbering of rows starts with zero. Now, we can use Python to count the columns and rows. We can use df.shape[1] to find the number of columns:
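A minimal sketch of counting columns and rows, reusing the fictional data frame from the previous slide. The variable names count_column and count_row are illustrative choices, not taken from the slides.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}
df = pd.DataFrame(data=d)

# df.shape is a (rows, columns) tuple
count_column = df.shape[1]  # number of columns
count_row = df.shape[0]     # number of rows

print(count_column)  # 3
print(count_row)     # 5
```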
  • 18. Data Science Functions: The Sports Watch Data Set. The data set above consists of 6 variables, each with 10 observations:
      - Duration: How long did the training session last, in minutes?
      - Average_Pulse: What was the average pulse during the training session? This is measured in beats per minute.
      - Max_Pulse: What was the maximum pulse during the training session?
      - Calorie_Burnage: How many calories were burned during the training session?
      - Hours_Work: How many hours did we work at our job before the training session?
      - Hours_Sleep: How much did we sleep the night before the training session?
      We use an underscore (_) to separate the words in each variable name because Python identifiers cannot contain spaces.
  • 19. Data Science Functions
      The min() function: The Python min() function is used to find the lowest value in an array.
          Average_pulse_min = min(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
          print(Average_pulse_min)
      o/p: 80
      The max() function: The Python max() function is used to find the highest value in an array.
          Average_pulse_max = max(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)
          print(Average_pulse_max)
      o/p: 125
      The mean() function: The NumPy mean() function is used to find the average value of an array.
          import numpy as np
          Calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]
          Average_calorie_burnage = np.mean(Calorie_burnage)
          print(Average_calorie_burnage)
      o/p: 285.0
  • 20. Data Science - Data Preparation: Extract and Read Data With Pandas. Before analyzing data, a Data Scientist must extract the data and make it clean and valuable. Before data can be analyzed, it must be imported/extracted. In the example below, we show you how to import data using Pandas in Python. We use the read_csv() function to import a CSV file with the health data:
          import pandas as pd
          health_data = pd.read_csv("data.csv", header=0, sep=",")
          print(health_data)
  • 21. Data Science - Data Preparation: Example Explained
      - Import the Pandas library
      - Name the data frame health_data
      - header=0 means that the headers for the variable names are found in the first row (note that 0 means the first row in Python)
      - sep="," means that "," is used as the separator between the values, because we are using the file type .csv (comma-separated values)
      Tip: If you have a large CSV file, you can use the head() function to show only the top 5 rows:
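A small sketch of the head() tip. To keep it runnable without a file on disk, it reads from an in-memory string instead of data.csv; the column names follow the sports watch data set, but the row values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv so the sketch runs without a file on disk
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage,Hours_Work,Hours_Sleep
30,80,120,240,10,7
30,85,120,250,10,7
45,90,130,260,8,7
45,95,130,270,8,7
45,100,140,280,0,7
60,105,140,290,7,8
60,110,145,300,7,8
"""

health_data = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# head() returns the first 5 rows by default; head(n) returns the first n
top_rows = health_data.head()
print(top_rows)
```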
  • 22. Data Science - Data Preparation: Data Cleaning. Look at the imported data. As you can see, the data are "dirty", with wrong or unregistered values:
      - There are some blank fields
      - An average pulse of 9 000 is not possible
      - 9 000 will be treated as non-numeric because of the space separator
      - One observation of max pulse is denoted as "AF", which does not make sense
      So, we must clean the data in order to perform the analysis.
  • 23. Data Cleaning Remove Blank Rows We see that the non-numeric values (9 000 and AF) are in the same rows with missing values. Solution: We can remove the rows with missing observations to fix this problem. When we load a data set using Pandas, all blank cells are automatically converted into "NaN" values. So, removing the NaN cells gives us a clean data set that can be analyzed. We can use the dropna() function to remove the NaNs. axis=0 means that we want to remove all rows that have a NaN value: import pandas as pd health_data = pd.read_csv("data.csv", header=0, sep=",") health_data.dropna(axis=0,inplace=True) print(health_data) Data Science - Data Preparation
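A minimal sketch of the row removal described above, assuming a tiny invented file in which the odd values (9 000 and AF) sit in rows that also contain a blank field, as in the slide. The data is hypothetical and only mimics the structure of data.csv.

```python
import io
import pandas as pd

# Hypothetical "dirty" data: three of the five rows contain a blank field.
csv_text = """Duration,Average_Pulse,Max_Pulse,Calorie_Burnage
30,80,120,240
45,,130,260
60,9 000,,300
45,100,AF,
60,105,140,290
"""

raw = pd.read_csv(io.StringIO(csv_text), header=0, sep=",")

# Blank cells were read as NaN; drop every row that contains at least one NaN.
clean = raw.dropna(axis=0)

print(len(raw), "rows before,", len(clean), "rows after")
print(clean)
```

The sketch returns a new data frame instead of using inplace=True, so the raw data stays available for comparison.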
  • 25. Data Categories To analyze data, we also need to know the types of data we are dealing with. Data can be split into three main categories: o Numerical - Contains numerical values. Can be divided into two categories: Discrete: Numbers are counted as "whole". Example: You cannot have trained 2.5 sessions, it is either 2 or 3 Continuous: Numbers can be of infinite precision. For example, you can sleep for 7 hours, 30 minutes and 20 seconds, or 7.533 hours o Categorical - Contains values that cannot be measured up against each other. Example: A color or a type of training o Ordinal - Contains categorical data that can be measured up against each other. Example: School grades where A is better than B and so on o By knowing the type of your data, you will be able to know what technique to use when analyzing them. Data Science - Data Preparation
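The difference between categorical and ordinal data can be sketched in pandas. The grades, the training types and the chosen category order (C < B < A) are assumptions made for this example only.

```python
import pandas as pd

# Ordinal data: categories with an order. The order C < B < A is assumed here.
grade_type = pd.CategoricalDtype(categories=["C", "B", "A"], ordered=True)
grades = pd.Series(["B", "A", "C", "A"], dtype=grade_type)

# Ordered categories can be compared and ranked.
best_grade = grades.max()
better_than_b = (grades > "B").tolist()

# Categorical (unordered) data: values can be counted, but not ranked.
training_types = pd.Series(["run", "swim", "run"], dtype="category")
type_counts = training_types.value_counts()

print(best_grade, better_than_b)
print(type_counts)
```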
  • 26. Data Types We can use the info() function to list the data types within our data set:
        import pandas as pd
        health_data = pd.read_csv("data.csv", header=0, sep=",")
        health_data.info()
    Data Science - Data Preparation We see that this data set has two different types of data: float64 and object. We cannot use objects to calculate and perform analysis here. We must convert the type object to float64 (float64 is a number with a decimal in Python).
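One possible way to do the conversion is astype(float), sketched below on a small invented data frame whose columns were stored as text. The values are hypothetical, and this only works once non-numeric entries such as "AF" have been removed.

```python
import pandas as pd

# Hypothetical columns that were read as text instead of numbers.
df = pd.DataFrame({
    "Average_Pulse": ["80", "85", "90"],
    "Max_Pulse": ["120", "125", "130"],
})

# Convert each text column to float64 so that it can be used in calculations.
df["Average_Pulse"] = df["Average_Pulse"].astype(float)
df["Max_Pulse"] = df["Max_Pulse"].astype(float)

print(df.dtypes)
print(df["Average_Pulse"].mean())
```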
  • 27. Analyze the Data Data Science - Data Preparation import pandas as pd health_data = pd.read_csv("data.csv", header=0, sep=",") pd.set_option('display.max_columns',None) print(health_data.describe()) When we have cleaned the data set, we can start analyzing the data. We can use the describe() function in Python to summarize data: Count - Counts the number of observations Mean - The average value Std - Standard deviation (explained in the statistics chapter) Min - The lowest value 25%, 50% and 75% are percentiles (explained in the statistics chapter) Max - The highest value
  • 28. Data Science - Linear Functions DS Math Mathematical functions are important to know as a data scientist, because we want to make predictions and interpret them. Linear Functions In mathematics a function is used to relate one variable to another variable. Suppose we consider the relationship between calorie burnage and average pulse. It is reasonable to assume that, in general, the calorie burnage will change as the average pulse changes - we say that the calorie burnage depends upon the average pulse. Furthermore, it may be reasonable to assume that as the average pulse increases, so will the calorie burnage. Calorie burnage and average pulse are the two variables being considered. Because the calorie burnage depends upon the average pulse, we say that calorie burnage is the dependent variable and the average pulse is the independent variable. The relationship between a dependent and an independent variable can often be expressed mathematically using a formula (function).
  • 29. Data Science - Linear Functions DS Math A linear function has one independent variable (x) and one dependent variable (y), and has the following form: y = f(x) = ax + b This function is used to calculate a value for the dependent variable when we choose a value for the independent variable. Explanation: o f(x) = the output (the dependent variable) o x = the input (the independent variable) o a = slope = the coefficient of the independent variable. It gives the rate of change of the dependent variable o b = intercept = the value of the dependent variable when x = 0. It is also the point where the diagonal line crosses the vertical axis.
  • 30. Data Science - Linear Functions DS Math Linear Function With One Explanatory Variable A function with one explanatory variable means that we use one variable for prediction. Let us say we want to predict calorie burnage using average pulse. We have the following formula: f(x) = 2x + 80 Here, the numbers and variables mean: o f(x) = The output. This number is the predicted value of Calorie_Burnage o x = The input, which is Average_Pulse o 2 = Slope = Specifies how much Calorie_Burnage increases if Average_Pulse increases by one. It tells us how "steep" the diagonal line is o 80 = Intercept = A fixed value. It is the value of the dependent variable when x = 0
  • 31. Data Science - Linear Functions DS Math Plotting a Linear Function The term linearity means a "straight line". So, if you show a linear function graphically, the line will always be a straight line. The line can slope upwards, downwards, and in some cases may be horizontal or vertical. Here is a graphical representation of the mathematical function above: Graph Explanations: o The horizontal axis is generally called the x-axis. Here, it represents Average_Pulse. o The vertical axis is generally called the y-axis. Here, it represents Calorie_Burnage. o Calorie_Burnage is a function of Average_Pulse, because Calorie_Burnage is assumed to be dependent on Average_Pulse. o In other words, we use Average_Pulse to predict Calorie_Burnage. o The blue (diagonal) line represents the structure of the mathematical function that predicts calorie burnage.
  • 32. Data Science - Plotting Linear Functions DS Math The Sports Watch Data Set Take a look at our health data set: Plot the Existing Data in Python Now, we can first plot the values of Average_Pulse against Calorie_Burnage using the matplotlib library. The plot() function is used to draw a line plot of the points x, y:
  • 33. Data Science - Plotting Linear Functions DS Math Plot the Existing Data in Python
        import pandas as pd
        import matplotlib.pyplot as plt
        health_data = pd.read_csv("data.csv", header=0, sep=",")
        health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
        plt.ylim(bottom=0)
        plt.xlim(left=0)
        plt.show()
    Example Explained
    o Import the pyplot module of the matplotlib library
    o Plot the data from Average_Pulse against Calorie_Burnage
    o kind='line' sets the type of plot we want. Here, we want a straight line
    o plt.ylim() and plt.xlim() set the value each axis starts on. Here, we want the axes to begin from zero
    o plt.show() shows us the output
    The code above will produce the following result:
  • 34. Data Science - Plotting Linear Functions DS Math The Graph Output As we can see, there is a relationship between Average_Pulse and Calorie_Burnage. Calorie_Burnage increases linearly with Average_Pulse. It means that we can use Average_Pulse to predict Calorie_Burnage.
  • 35. Data Science - Plotting Linear Functions DS Math Why is The Line Not Fully Drawn Down to The y-axis? The reason is that we do not have observations where Average_Pulse or Calorie_Burnage are equal to zero. 80 is the first observation of Average_Pulse and 240 is the first observation of Calorie_Burnage. Look at the line. What happens to calorie burnage if average pulse increases from 80 to 90?
  • 36. Data Science - Plotting Linear Functions We can use the diagonal line to find the mathematical function to predict calorie burnage. DS Math As it turns out: o If the average pulse is 80, the calorie burnage is 240 o If the average pulse is 90, the calorie burnage is 260 o If the average pulse is 100, the calorie burnage is 280 o There is a pattern. If average pulse increases by 10, the calorie burnage increases by 20.
  • 37. Data Science - Slope and Intercept Slope and Intercept Now we will explain how we found the slope and intercept of our function: f(x) = 2x + 80 The image points to the Slope - which indicates how steep the line is, and the Intercept - which is the value of y, when x = 0 (the point where the diagonal line crosses the vertical axis). The red line is the continuation of the blue line from previous page. DS Math
  • 38. Data Science - Slope and Intercept Find The Slope The slope is defined as how much calorie burnage increases if average pulse increases by one. It tells us how "steep" the diagonal line is. We can find the slope by using the difference between two points from the graph. If the average pulse is 80, the calorie burnage is 240. If the average pulse is 90, the calorie burnage is 260. We see that if average pulse increases by 10, the calorie burnage increases by 20. Slope = 20/10 = 2 The slope is 2. DS Math
  • 39. Data Science - Slope and Intercept Find The Slope Mathematically, the slope is defined as: Slope = (f(x2) - f(x1)) / (x2 - x1)
    f(x2) = Second observation of Calorie_Burnage = 260
    f(x1) = First observation of Calorie_Burnage = 240
    x2 = Second observation of Average_Pulse = 90
    x1 = First observation of Average_Pulse = 80
    Slope = (260 - 240) / (90 - 80) = 2
    Be consistent and define the observations in the correct order! If not, the prediction will not be correct! DS Math Use Python to Find the Slope Calculate the slope with the following code:
        def slope(x1, y1, x2, y2):
            s = (y2 - y1) / (x2 - x1)
            return s
        print(slope(80, 240, 90, 260))
        o/p: 2.0
  • 40. Data Science - Slope and Intercept Find The Intercept The intercept is used to fine-tune the function's ability to predict Calorie_Burnage. The intercept is where the diagonal line crosses the y-axis, if it were fully drawn. The intercept is the value of y when x = 0. Here, we see that if average pulse (x) is zero, then the calorie burnage (y) is 80. So, the intercept is 80. Sometimes the intercept has a practical meaning, sometimes not. Does it make sense that average pulse is zero? No, you would be dead and you certainly would not burn any calories. However, we need to include the intercept in order to complete the mathematical function's ability to predict Calorie_Burnage correctly. Other examples where the intercept of a mathematical function can have a practical meaning: Predicting next year's revenue by using marketing expenditure (How much revenue will we have next year if marketing expenditure is zero?). It is reasonable to assume that a company will still have some revenue even if it does not spend money on marketing. Fuel usage with speed (How much fuel do we use if speed is equal to 0 mph?). A car that uses gasoline will still use fuel when it is idle. DS Math
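The intercept of 80 can also be derived from two observations instead of being read off the graph: once the slope a is known, b = y1 - a * x1. The sketch below uses the two observations from the slides, (80, 240) and (90, 260); the helper name intercept() is introduced here for illustration only.

```python
def slope(x1, y1, x2, y2):
    """Rate of change between two points on the line."""
    return (y2 - y1) / (x2 - x1)


def intercept(x1, y1, x2, y2):
    """Value of y at x = 0 for the line through the two points."""
    return y1 - slope(x1, y1, x2, y2) * x1


a = slope(80, 240, 90, 260)
b = intercept(80, 240, 90, 260)
print(a, b)
```

With these two points the slope is 2.0 and the intercept is 240 - 2.0 * 80 = 80.0, which matches f(x) = 2x + 80.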
  • 41. Data Science - Slope and Intercept Find the Slope and Intercept Using Python The np.polyfit() function returns the slope and intercept. With the following code, we get both the slope and the intercept from the function.
        import pandas as pd
        import numpy as np
        health_data = pd.read_csv("data.csv", header=0, sep=",")
        x = health_data["Average_Pulse"]
        y = health_data["Calorie_Burnage"]
        slope_intercept = np.polyfit(x, y, 1)
        print(slope_intercept)
        o/p: [ 2. 80.]
    DS Math Example Explained: Isolate the variables Average_Pulse (x) and Calorie_Burnage (y) from health_data. Call the np.polyfit() function. The last parameter of the function specifies the degree of the function, which in this case is "1".
  • 42. Data Science - Slope and Intercept Find the Slope and Intercept Using Python DS Math We have now calculated the slope (2) and the intercept (80) with np.polyfit(). We can write the mathematical function as follows: Predict Calorie_Burnage by using a mathematical expression: f(x) = 2x + 80
  • 43. Data Science - Slope and Intercept Task: Now, we want to predict calorie burnage if average pulse is 135. Remember that the intercept is a constant. A constant is a number that does not change. We can now substitute the input x with 135: f(135) = 2 * 135 + 80 = 350 If average pulse is 135, the calorie burnage is 350. Define the Mathematical Function in Python Here is the exact same mathematical function, but in Python. The function returns 2*x + 80, with x as the input: #Try to replace x with 140 and 150. def my_function(x): return 2*x + 80 print (my_function(135)) o/p: 350 DS Math
  • 44. Data Science - Slope and Intercept Plot a New Graph in Python Here, we plot the same graph as earlier, but with the axes formatted a little. The max value of the y-axis is now 400 and of the x-axis 150:
        import matplotlib.pyplot as plt
        health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='line')
        plt.ylim(bottom=0, top=400)
        plt.xlim(left=0, right=150)
        plt.show()
    DS Math Example Explained
    o Import the pyplot module of the matplotlib library
    o Plot the data from Average_Pulse against Calorie_Burnage
    o kind='line' sets the type of plot we want. Here, we want a straight line
    o plt.ylim() and plt.xlim() set the values each axis starts and stops on
    o plt.show() shows us the output
  • 45. Introduction to Statistics Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. When we have created a model for prediction, we must assess the prediction's reliability. Statistics is a method of interpreting, analyzing and summarizing the data. Statistics is commonly divided into two types: descriptive and inferential statistics. We analyse and interpret data based on how it is represented, for example in pie charts, bar graphs, or tables. DS- Statistics
  • 46. Descriptive Statistics import pandas as pd full_health_data = pd.read_csv("data.csv", header=0, sep=",") pd.set_option('display.max_columns',None) pd.set_option('display.max_rows',None) print (full_health_data.describe()) DS- Statistics
  • 47. Statistics Percentiles 25%, 50% and 75% - Percentiles Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than. DS- Statistics
  • 48. Statistics Percentiles Let us try to explain it by some examples, using Average_Pulse. The 25% percentile of Average_Pulse means that 25% of all of the training sessions have an average pulse of 100 beats per minute or lower. If we flip the statement, it means that 75% of all of the training sessions have an average pulse of 100 beats per minute or higher. The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average pulse of 111 beats per minute or lower. If we flip the statement, it means that 25% of all of the training sessions have an average pulse of 111 beats per minute or higher. Task: Find the 10% percentile for Max_Pulse. The following example shows how to do it in Python: DS- Statistics
  • 49. Statistics Percentiles import pandas as pd import numpy as np full_health_data = pd.read_csv("data.csv", header=0, sep=",") Max_Pulse= full_health_data["Max_Pulse"] percentile10 = np.percentile(Max_Pulse, 10) print(percentile10) o/p: 120.00 Max_Pulse = full_health_data["Max_Pulse"] - Isolate the variable Max_Pulse from the full health data set. np.percentile() is used to define that we want the 10% percentile from Max_Pulse. The 10% percentile of Max_Pulse is 120. This means that 10% of all the training sessions have a Max_Pulse of 120 or lower. DS- Statistics
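To see what "10% of the sessions are at or below the percentile" means in practice, the sketch below uses the ten Average_Pulse values from the small data set rather than the full data set, so the number differs from the 120 reported above. The variable names are chosen for this example.

```python
import numpy as np

# The ten Average_Pulse observations from the small data set.
average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

# 10% percentile, using NumPy's default linear interpolation.
percentile10 = np.percentile(average_pulse, 10)

# Share of observations that are at or below that value.
share_at_or_below = np.mean(average_pulse <= percentile10)

print(percentile10, share_at_or_below)
```

Here one of the ten observations (80) lies at or below the 10% percentile, which is a share of 0.1.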
  • 50. Statistics Standard Deviation Standard Deviation Standard deviation is a number that describes how spread out the observations are A mathematical function will have difficulties in predicting precise values, if the observations are "spread". Standard deviation is a measure of uncertainty. A low standard deviation means that most of the numbers are close to the mean (average) value. A high standard deviation means that the values are spread out over a wider range. DS- Statistics
  • 51. Statistics Standard Deviation Standard Deviation import pandas as pd import numpy as np full_health_data = pd.read_csv("data.csv", header=0, sep=",") std = np.std(full_health_data) print(std) DS- Statistics
  • 52. Statistics Standard Deviation Coefficient of Variation The coefficient of variation is used to get an idea of how large the standard deviation is. Mathematically, the coefficient of variation is defined as: Coefficient of Variation = Standard Deviation / Mean We can do this in Python with the following code:
        cv = full_health_data.std(ddof=0) / full_health_data.mean()
        print(cv)
    DS- Statistics We see that the variables Duration, Calorie_Burnage and Hours_Work have a high standard deviation compared to Max_Pulse, Average_Pulse and Hours_Sleep.
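The coefficient of variation makes spreads comparable across variables with different scales. As a small sketch, the code below applies the formula to the ten observations of the small data set (not the full data set discussed above); the helper name coefficient_of_variation() is an assumption of this example.

```python
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population formula, ddof=0)."""
    return np.std(values) / np.mean(values)


cv_pulse = coefficient_of_variation(average_pulse)
cv_calories = coefficient_of_variation(calorie_burnage)
print(round(cv_pulse, 3), round(cv_calories, 3))
```

In this small set Calorie_Burnage has the larger standard deviation in absolute terms, but relative to its mean it varies less than Average_Pulse.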
  • 53. Data Science - Statistics Variance Variance Variance is another number that indicates how spread out the values are. In fact, if you take the square root of the variance, you get the standard deviation. Or the other way around, if you multiply the standard deviation by itself, you get the variance! We will first use the data set with 10 observations to give an example of how we can calculate the variance: DS- Statistics Tip: Variance is often represented by the symbol Sigma Square: σ^2
  • 54. Data Science - Statistics Variance Variance Step 1 to Calculate the Variance: Find the Mean We want to find the variance of Average_Pulse. 1. Find the mean: (80+85+90+95+100+105+110+115+120+125) / 10 = 102.5 The mean is 102.5 DS- Statistics Step 2: For Each Value - Find the Difference From the Mean 2. Find the difference from the mean for each value: 80 - 102.5 = -22.5 85 - 102.5 = -17.5 90 - 102.5 = -12.5 95 - 102.5 = -7.5 100 - 102.5 = -2.5 105 - 102.5 = 2.5 110 - 102.5 = 7.5 115 - 102.5 = 12.5 120 - 102.5 = 17.5 125 - 102.5 = 22.5
  • 55. Step 3: For Each Difference - Find the Square Value 3. Find the square value for each difference: (-22.5)^2 = 506.25 (-17.5)^2 = 306.25 (-12.5)^2 = 156.25 (-7.5)^2 = 56.25 (-2.5)^2 = 6.25 2.5^2 = 6.25 7.5^2 = 56.25 12.5^2 = 156.25 17.5^2 = 306.25 22.5^2 = 506.25 Note: We must square the values to get the total spread. DS- Statistics Step 4: The Variance is the Average Number of These Squared Values 4. Sum the squared values and find the average: (506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25 The variance is 206.25.
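The four steps above can be followed one by one in plain Python and then checked against NumPy. This is a sketch using the same ten Average_Pulse values; the variable names are chosen for this example.

```python
import numpy as np

average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Step 1: find the mean.
mean = sum(average_pulse) / len(average_pulse)

# Step 2: find the difference from the mean for each value.
differences = [value - mean for value in average_pulse]

# Step 3: square each difference.
squared = [difference ** 2 for difference in differences]

# Step 4: the variance is the average of the squared differences.
variance = sum(squared) / len(squared)

# The standard deviation is the square root of the variance.
std = variance ** 0.5

print(mean, variance, round(std, 2))
print(np.var(average_pulse))  # NumPy uses the same population formula by default
```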
  • 56. Data Science - Statistics Variance Variance Use Python to Find the Variance of health_data We can use the var() function from Numpy to find the variance (remember that we now use the first data set with 10 observations): import numpy as np var = np.var(health_data) print(var) o/p: DS- Statistics Here we calculate the variance for each column for the full data set: import numpy as np var_full = np.var(full_health_data) print(var_full) o/p:
  • 57. Correlation Correlation measures the relationship between two variables. We mentioned that a function has a purpose to predict a value, by converting input (x) to output (f(x)). We can also say that a function uses the relationship between two variables for prediction. DS - Statistics Correlation Correlation Coefficient The correlation coefficient measures the relationship between two variables. The correlation coefficient can never be less than -1 or higher than 1. o 1 = there is a perfect positive linear relationship between the variables (like Average_Pulse against Calorie_Burnage) o 0 = there is no linear relationship between the variables o -1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)
  • 58. Example of a Perfect Linear Relationship (Correlation Coefficient = 1) We will use scatterplot to visualize the relationship between Average_Pulse and Calorie_Burnage (we have used the small data set of the sports watch with 10 observations). This time we want scatter plots, so we change kind to "scatter": DS - Statistics Correlation import matplotlib.pyplot as plt health_data.plot(x ='Average_Pulse', y='Calorie_Burnage', kind='scatter') plt.show()
  • 59. Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1) DS - Statistics Correlation We have plotted fictional data here. The x-axis represents the amount of hours worked at our job before a training session. The y-axis is Calorie_Burnage. If we work longer hours, we tend to have lower calorie burnage because we are exhausted before the training session. The correlation coefficient here is -1.
  • 60. Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1) DS - Statistics Correlation import pandas as pd import matplotlib.pyplot as plt negative_corr = {'Hours_Work_Before_Training': [10,9,8,7,6,5,4,3,2,1], 'Calorie_Burnage': [220,240,260,280,300,320,340,360,380,400]} negative_corr = pd.DataFrame(data=negative_corr) negative_corr.plot(x ='Hours_Work_Before_Training', y='Calorie_Burnage', kind='scatter') plt.show()
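The correlation coefficients for the two examples above can be checked numerically. The sketch below uses np.corrcoef(), which returns a 2x2 matrix; the coefficient between the two inputs sits at position [0, 1]. The variable names are chosen for this example.

```python
import numpy as np

# Small sports watch data set (perfect positive relationship).
average_pulse = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# Fictional data from the slide above (perfect negative relationship).
hours_work = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
calorie_burnage_fictional = [220, 240, 260, 280, 300, 320, 340, 360, 380, 400]

r_positive = np.corrcoef(average_pulse, calorie_burnage)[0, 1]
r_negative = np.corrcoef(hours_work, calorie_burnage_fictional)[0, 1]

print(round(r_positive, 2), round(r_negative, 2))
```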
  • 61. Example of No Linear Relationship (Correlation coefficient = 0) DS - Statistics Correlation Here, we have plotted Max_Pulse against Duration from the full_health_data set. As you can see, there is no linear relationship between the two variables. It means that longer training session does not lead to higher Max_Pulse. The correlation coefficient here is 0. import matplotlib.pyplot as plt full_health_data.plot(x ='Duration', y='Max_Pulse', kind='scatter') plt.show() o/p:
  • 62. Correlation Matrix A matrix is an array of numbers arranged in rows and columns. A correlation matrix is simply a table showing the correlation coefficients between variables. Here, the variables are represented in the first row, and in the first column: D S - Statistics Correlation Matrix The table here has used data from the full health data set. Observations: We observe that Duration and Calorie_Burnage are closely related, with a correlation coefficient of 0.89. This makes sense, as the longer we train, the more calories we burn. We observe that there is almost no linear relationship between Average_Pulse and Calorie_Burnage (correlation coefficient of 0.02). Can we conclude that Average_Pulse does not affect Calorie_Burnage? No. We will come back to answer this question later!
  • 63. Correlation Matrix Correlation Matrix in Python We can use the corr() function in Python to create a correlation matrix. We also use the round() function to round the output to two decimals: D S - Statistics Correlation Matrix import pandas as pd full_health_data = pd.read_csv("data.csv", header=0, sep=",") Corr_Matrix = round(full_health_data.corr(),2) print(Corr_Matrix) o/p:
  • 64. Correlation Matrix Using a Heatmap We can use a Heatmap to Visualize the Correlation Between Variables: D S - Statistics Correlation Matrix The closer the correlation coefficient is to 1, the greener the squares get. The closer the correlation coefficient is to -1, the browner the squares get.
  • 65. Correlation Matrix Use Seaborn to Create a Heatmap We can use the Seaborn library to create a correlation heat map (Seaborn is a visualization library based on matplotlib): D S - Statistics Correlation Matrix import matplotlib.pyplot as plt import seaborn as sns correlation_full_health = full_health_data.corr() axis_corr = sns.heatmap( correlation_full_health, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(50, 500, n=500), square=True ) plt.show()
  • 66. Correlation Matrix Use Seaborn to Create a Heatmap We can use the Seaborn library to create a correlation heat map (Seaborn is a visualization library based on matplotlib): D S - Statistics Correlation Matrix o Example Explained: o Import the library seaborn as sns. o Use the full_health_data set. o Use sns.heatmap() to tell Python that we want a heatmap to visualize the correlation matrix. o Use the correlation matrix. Define the maximal and minimal values of the heatmap. Define that 0 is the center. o Define the colors with sns.diverging_palette. n=500 means that we want 500 types of color in the same color palette. o square = True means that we want to see squares.
  • 67. Correlation Does Not Imply Causality Correlation measures the numerical relationship between two variables. A high correlation coefficient (close to 1) does not mean that we can conclude for sure that there is an actual relationship between two variables. A classic example: During the summer, the sale of ice cream at a beach increases. Simultaneously, drowning accidents also increase. Does this mean that the increase in ice cream sales is a direct cause of the increase in drowning accidents? The Beach Example in Python Here, we constructed a fictional data set for you to try: D S - Statistics Correlation vs. Causality
  • 68. Correlation Does Not Imply Causality The Beach Example in Python Here, we constructed a fictional data set for you to try: import pandas as pd import matplotlib.pyplot as plt Drowning_Accident = [20,40,60,80,100,120,140,160,180,200] Ice_Cream_Sale = [20,40,60,80,100,120,140,160,180,200] Drowning = {"Drowning_Accident": [20,40,60,80,100,120,140,160,180,200], "Ice_Cream_Sale": [20,40,60,80,100,120,140,160,180,200]} Drowning = pd.DataFrame(data=Drowning) Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter") plt.show() correlation_beach = Drowning.corr() print(correlation_beach) D S - Statistics Correlation vs. Causality
  • 69. Correlation Does Not Imply Causality The Beach Example in Python o/p: D S - Statistics Correlation vs. Causality
  • 70. Correlation Does Not Imply Causality The Beach Example in Python D S - Statistics Correlation vs. Causality Correlation vs Causality - The Beach Example In other words: can we use ice cream sale to predict drowning accidents? The answer is - Probably not. It is likely that these two variables are accidentally correlating with each other. What causes drowning then? Unskilled swimmers Waves Cramp Seizure disorders Lack of supervision Alcohol (mis)use etc.
  • 71. Correlation Does Not Imply Causality The Beach Example in Python D S - Statistics Correlation vs. Causality Let us reverse the argument: Does a low correlation coefficient (close to zero) mean that change in x does not affect y? Back to the question: Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation coefficient? The answer is no. There is an important difference between correlation and causality: Correlation is a number that measures how closely the data are related Causality is the conclusion that x causes y. Tip: Always critically reflect over the concept of causality when doing predictions!
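A low correlation coefficient only rules out a linear relationship. As a small sketch with invented data, the example below lets y depend completely on x (y = x squared), yet the correlation coefficient is zero because the relationship is not linear.

```python
import numpy as np

# Invented data: y is fully determined by x, but not in a linear way.
x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(round(float(r), 2))
```

So a coefficient close to zero does not show that a change in x leaves y unaffected.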
  • 73. Statistics gives us methods of gaining knowledge from data. What is Statistics Used for? Statistics is used in all kinds of science and business applications. Statistics gives us more accurate knowledge which helps us make better decisions. Statistics can focus on making predictions about what will happen in the future. It can also focus on explaining how different things are connected.
  • 74. Typical Steps of Statistical Methods  The typical steps are:  Gathering data  Describing and visualizing data  Making conclusions It is important to keep all three steps in mind for any questions we want more knowledge about. Knowing which types of data are available can tell you what kinds of questions you can answer with statistical methods. Knowing which questions you want to answer can help guide what sort of data you need. A lot of data might be available, and knowing what to focus on is important.
  • 75. How is Statistics Used? Statistics can be used to explain things in a precise way. You can use it to understand and make conclusions about the group that you want to know more about. This group is called the population. • A population could be many different kinds of groups. It could be: • All of the people in a country • All the businesses in an industry • All the customers of a business • All people that play football who are older than 45, and so on. It just depends on what you want to know about. Gathering data about the population will give you a sample. This is a part of the whole population. Statistical methods are then used on that sample. The results of the statistical methods from the sample are used to make conclusions about the population.
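The relation between a population and a sample can be sketched with simulated data. The population below (heights in centimetres with mean 170 and standard deviation 10) is entirely hypothetical, and the fixed seed only makes the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: 100,000 simulated heights in centimetres.
population = rng.normal(loc=170, scale=10, size=100_000)

# A sample is a part of the population: draw 200 members at random.
sample = rng.choice(population, size=200, replace=False)

# The sample mean is used as an estimate of the population mean.
print(round(population.mean(), 2), round(sample.mean(), 2))
```

The two means are close but not identical, which is the uncertainty that statistical inference has to account for.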
  • 76. Important Concepts in Statistics o Predictions and Explanations o Populations and Samples o Parameters and Sample Statistics o Sampling Methods o Data Types o Measurement Level o Descriptive Statistics o Random Variables o Univariate and Multivariate Statistics o Probability Calculation o Probability Distributions o Statistical Inference o Parameter Estimation o Hypothesis Testing o Correlation o Regression Analysis o Causal Inference
  • 77. Statistics and Programming: Statistical analysis is typically done with computers. Small amounts of data can be analyzed reasonably well without computers. Historically, all data analysis was performed manually. It was time-consuming and prone to errors. Nowadays, programming and software are typically used for data analysis. In this course, we will see examples of code to do statistics with the programming languages Python and R.
  • 78. Statistics - Describing Data Describing data is typically the second step of statistical analysis after gathering data. Descriptive Statistics The information (data) from your sample or population can be visualized with graphs or summarized by numbers. This will show key information in a simpler way than just looking at raw data. It can help us understand how the data is distributed. Graphs can visually show the data distribution. Examples of graphs include: o Histograms o Pie charts o Bar graphs o Box plots
  • 79. Statistics - Describing Data Some graphs have a close connection to numerical summary statistics. Calculating those gives us the basis of these graphs. For example, a box plot visually shows the quartiles of a data distribution. Quartiles are the values that split the data into four equal-size parts, or quarters. A quartile is one type of summary statistic. Summary statistics take a large amount of information and sum it up in a few key values. Numbers are calculated from the data which also describe the shape of the distributions. These are individual 'statistics'.
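The summary statistics behind a box plot can be computed directly. The sketch below uses the ten Calorie_Burnage values from the small data set and NumPy's default (linear interpolation) percentile method; other quartile conventions can give slightly different numbers.

```python
import numpy as np

calorie_burnage = [240, 250, 260, 270, 280, 290, 300, 310, 320, 330]

# The three quartiles split the sorted data into four equal-size parts.
q1, median, q3 = np.percentile(calorie_burnage, [25, 50, 75])

# A box plot draws the box from q1 to q3; its height is the interquartile range.
iqr = q3 - q1

print(min(calorie_burnage), q1, median, q3, max(calorie_burnage), iqr)
```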
  • 80. Statistics - Making Conclusions Using statistics to make conclusions about a population is called statistical inference. Statistics from the data in the sample are used to make conclusions about the whole population. Probability theory is used to calculate the certainty that those statistics also apply to the population. When using a sample, there will always be some uncertainty about what the data looks like for the population. Uncertainty is often expressed as confidence intervals.
  • 81. Statistics - Making Conclusions Confidence intervals are numerical ways of showing how likely it is that the true value of this statistic is within a certain range for the population. Hypothesis testing is another way of checking if a statement about a population is true. More precisely, it checks how likely it is that a hypothesis is true based on the sample data. Some examples of statements or questions that can be checked with hypothesis testing: Are people in the Netherlands taller than people in Denmark? Do people prefer Pepsi or Coke? Does a new medicine cure a disease?
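As a rough illustration of a confidence interval, the sketch below computes an approximate 95% interval for a mean from a small sample. The heights are made-up values, and the normal approximation is an assumption chosen to keep the example in the Python standard library; for small samples a t-distribution is the more common choice.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of heights in cm (illustrative values, not real data)
heights = [172, 168, 181, 175, 169, 178, 174, 183, 171, 177]

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    sample_mean = mean(sample)
    standard_error = stdev(sample) / n ** 0.5
    # z is about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(heights)
print(f"Approximate 95% confidence interval: {low:.1f} to {high:.1f} cm")
```

Asking for more confidence (for example 0.99) gives a wider interval, because a wider range is needed to be more certain that it contains the true value.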
  • 82. Statistics - Making Conclusions Causal Inference Causal inference is used to investigate if something causes another thing. For example: Does rain make plants grow? If we think two things are related we can investigate to see if they correlate. Statistics can be used to find out how strong this relation is. Even if things are correlated, finding out if something is caused by other things can be difficult. It can be done with good experimental design or other special statistical techniques. Note: Good experimental design is often difficult to achieve because of ethical concerns or other practical reasons.
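To make "how strong this relation is" concrete, here is a minimal sketch that calculates the Pearson correlation coefficient by hand. The rainfall and growth numbers are invented for illustration, and the function name is hypothetical.

```python
from math import sqrt

# Hypothetical measurements: weekly rainfall (mm) and plant growth (cm)
rainfall = [5, 12, 20, 28, 35, 41]
growth = [1.0, 1.8, 2.9, 3.5, 4.6, 5.1]

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the two means
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return sum_products / (spread_x * spread_y)

r = pearson_correlation(rainfall, growth)
print(f"Correlation between rainfall and growth: {r:.3f}")
```

A value close to 1 or -1 means a strong relation and a value close to 0 means a weak one. A strong correlation on its own does not show that one thing causes the other.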
  • 83. Statistics - Prediction and Explanation Some types of statistical methods are focused on predicting what will happen. Other types of statistical methods are focused on explaining how things are connected. Prediction Some statistical methods are not focused on explaining how things are connected. Only the accuracy of prediction is important. Many statistical methods are successful at predicting without giving insight into how things are connected. Some types of machine learning let computers do the hard work, but the way they predict is difficult to understand. These approaches can also be vulnerable to mistakes if the circumstances change, since how they work is less clear.
  • 84. Statistics - Prediction and Explanation Note: Predictions about future events are called forecasts. Not all predictions are about the future. Some predictions can be about something else that is unknown, even if it is not in the future.
  • 85. Statistics - Prediction and Explanation Explanation Different statistical methods are often used for explaining how things are connected. These statistical methods may not make good predictions. These statistical methods often explain only small parts of the whole situation. But, if you only want to know how a few things are connected, the rest might not matter. If these methods accurately explain how all the relevant things are connected, they will also be good at prediction. But managing to explain every detail is often challenging. Sometimes we are specifically interested in figuring out if one thing causes another. This is called causal inference. If we are looking at complicated situations, many things are connected. To figure out what causes what, we need to untangle every way these things are connected.
  • 86. Statistics - Population and Samples Population: Everything in the group that we want to learn about. Sample: A part of the population. For good statistical analysis, the sample needs to be as "similar" as possible to the population. If they are similar enough, we say that the sample is representative of the population. The sample is used to make conclusions about the whole population. If the sample is not similar enough to the whole population, the conclusions could be useless.
  • 87. Statistics - Parameters and Statistics The terms 'parameter' and (sample) 'statistic' refer to key concepts that are closely related in statistics. They are also directly connected to the concepts of populations and samples. Parameter: A number that describes something about the whole population. Sample statistic: A number that describes something about the sample. The parameters are the key things we want to learn about. The parameters are usually unknown. Sample statistics give us estimates for parameters. There will always be some uncertainty about how accurate estimates are. More certainty gives us more useful knowledge. For every parameter we want to learn about we can get a sample and calculate a sample statistic, which gives us an estimate of the parameter.
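The difference between a parameter and a sample statistic can be sketched with a simulated population. Everything here is an assumption for illustration: the population is generated with random numbers, and in practice the population mean would usually be unknown.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 people (simulated, not real data)
population = [random.randint(0, 90) for _ in range(10_000)]

# Parameter: describes the whole population (usually unknown in practice)
population_mean = mean(population)

# Sample statistic: calculated from a random sample, estimates the parameter
sample = random.sample(population, 200)
sample_mean = mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

The two numbers will be close but not identical. That gap is the uncertainty that comes with estimating a parameter from a sample.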
  • 88. Statistics - Parameters and Statistics Some Important Examples Mean, median and mode are different types of averages (typical values in a population). For example: The typical age of people in a country The typical profits of a company The typical range of an electric car Variance and standard deviation are two types of values describing how spread out the values are. A single class of students in a school would usually be about the same age. The age of the students will have low variance and standard deviation. A whole country will have people of all kinds of different ages. The variance and standard deviation of age in the whole country would then be bigger than in a single school grade.
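The school class and country comparison above can be tried out with a few made-up ages. The numbers are assumptions chosen only to show the contrast; `pvariance` and `pstdev` from the Python standard library treat the lists as whole populations.

```python
from statistics import pstdev, pvariance

# Hypothetical ages (illustrative values only)
class_ages = [15, 15, 16, 15, 16, 16, 15, 16]
country_ages = [2, 9, 17, 25, 33, 41, 56, 64, 72, 88]

for name, ages in [("School class", class_ages), ("Country", country_ages)]:
    print(f"{name}: variance = {pvariance(ages):.2f}, "
          f"standard deviation = {pstdev(ages):.2f}")
```

The class has a variance of 0.25 and a standard deviation of 0.5 years, while the made-up country ages are spread far more widely.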
  • 89. Statistics - Study Types A statistical study can be a part of the process of gathering data. There are different types of studies. Some are better than others, but they might be harder to do. Main Types of Statistical Studies The main types of statistical studies are observational and experimental studies. We are often interested in knowing if something is the cause of another thing. Experimental studies are generally better than observational studies for investigating this, but usually require more effort. An observational study is when we observe and gather data without changing anything.
  • 90. Statistics - Study Types Experimental Studies In an experimental study, the circumstances around the sample are changed. Usually, we compare two groups from a population and these two groups are treated differently. One example can be a medical study to see if a new medicine is effective. One group receives the medicine and the other does not. These are the different circumstances around those samples. We can compare the health of both groups afterwards and see if the results are different. Experimental studies can allow us to investigate causal relationships. A well designed experimental study can be useful since it can isolate the relationship we are interested in from other effects. Then we can be more confident that we are measuring the true effect.
  • 91. Statistics - Sample Types A study needs participants and there are different ways of gathering them. Some methods are better than others, but they might be more difficult. Different Types of Sampling Methods: Random Sampling A random sample is where every member of the population has an equal chance to be chosen. Random sampling is the best. But, it can be difficult, or impossible, to make sure that it is completely random. Note: Every other sampling method is compared to how close it is to a random sample - the closer, the better. Convenience Sampling A convenience sample is where the participants that are the easiest to reach are chosen. Note: Convenience sampling is the easiest to do. In many cases this sample will not be similar enough to the population, and the conclusions can potentially be useless.
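A minimal sketch of the difference between a random sample and a convenience sample is shown below. The population of member IDs is hypothetical, and "easiest to reach" is assumed here to mean the first entries on the list.

```python
import random

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 member IDs in the order they signed up
population = list(range(1, 1001))

# Random sample: every member has the same chance of being chosen
random_sample = random.sample(population, 10)

# Convenience sample: simply the 10 members that are easiest to reach,
# here assumed to be the first 10 on the list
convenience_sample = population[:10]

print("Random sample:     ", sorted(random_sample))
print("Convenience sample:", convenience_sample)
```

The convenience sample only contains the earliest members, so it may not be representative of the whole population.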
  • 92. Statistics - Sample Types Systematic Sampling A systematic sample is where the participants are chosen by some regular system. For example: The first 30 people in a queue Every third on a list The first 10 and the last 10 Stratified Sampling A stratified sample is where the population is split into smaller groups called 'strata'. The 'strata' can, for example, be based on demographics, like: Different age groups Professions Stratification of a sample is the first step. Another sampling method (like random sampling) is used for the second step of choosing participants from all of the smaller groups (strata).
  • 93. Statistics - Sample Types Clustered Sampling A clustered sample is where the population is split into smaller groups called 'clusters'. The clusters are usually natural, like different cities in a country. The clusters are chosen randomly for the sample. All members of the clusters can participate in the sample, or members can be chosen randomly from the clusters in a third step.
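The systematic, stratified and clustered sampling methods described above can be sketched as follows. The population records, the age groups and the city names are all made up for illustration, and the function names are hypothetical.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical population of (person_id, age_group, city) records
age_groups = ["18-29", "30-49", "50+"]
cities = ["Oslo", "Bergen", "Trondheim", "Stavanger"]
population = [(i, age_groups[i % 3], cities[i % 4]) for i in range(120)]

def systematic_sample(items, step):
    """Every step-th item, for example every third on a list."""
    return items[::step]

def stratified_sample(items, per_stratum):
    """Split by age group (the strata), then sample randomly within each."""
    strata = {}
    for item in items:
        strata.setdefault(item[1], []).append(item)
    chosen = []
    for members in strata.values():
        chosen.extend(random.sample(members, per_stratum))
    return chosen

def cluster_sample(items, n_clusters):
    """Pick whole cities (the clusters) at random and keep all their members."""
    clusters = sorted({item[2] for item in items})
    picked = set(random.sample(clusters, n_clusters))
    return [item for item in items if item[2] in picked]

print(len(systematic_sample(population, 3)))   # 40 people
print(len(stratified_sample(population, 5)))   # 15 people, 5 per age group
print(len(cluster_sample(population, 2)))      # 60 people from 2 cities
```

In this sketch the stratified sample guarantees that every age group is included, while the cluster sample includes everyone from the chosen cities and nobody from the others.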
  • 94. Statistics - Data Types Data can be of different types, and different types require different statistical methods to analyze. Different types of data There are two main types of data: Qualitative (or 'categorical') and quantitative (or 'numerical'). These main types also have different sub-types depending on their measurement level. Qualitative Data Information about something that can be sorted into different categories that can't be described directly by numbers. Examples: • Brands • Nationality • Professions
  • 95. Statistics - Data Types With categorical data we can calculate statistics like proportions. For example, the proportion of Indian people in the world, or the percent of people who prefer one brand to another. Quantitative Data Information about something that is described by numbers. Examples: • Income • Age • Height With numerical data we can calculate statistics like the average income in a country, or the range of heights of players in a football team.
  • 96. Statistics - Measurement Levels Different data types have different measurement levels. Measurement levels are important for what types of statistics can be calculated and how to best present the data. The main types of data are Qualitative (categories) and Quantitative (numerical). These are further split into the following measurement levels. These measurement levels are also called measurement 'scales' Nominal Level Categories (qualitative data) without any order. Examples: • Brand names • Countries • Colors
  • 97. Statistics - Measurement Levels Ordinal level Categories that can be ordered (from low to high), but the precise "distance" between each is not meaningful. Examples: • Letter grade scales from F to A • Military ranks • Level of satisfaction with a product Consider letter grades from F to A: Is the grade A precisely twice as good as a B? And, is the grade B also twice as good as C? Exactly how much distance it is between grades is not clear and precise. If the grades are based on amounts of points on a test, you can say that there is a precise "distance" on the point scale, but not the grades themselves.
  • 98. Statistics - Measurement Levels Interval Level Data that can be ordered and the distance between them is objectively meaningful. But there is no natural 0-value where the scale originates. Examples: Years in a calendar Temperature measured in Fahrenheit Note: Interval scales are usually invented by people, like degrees of temperature. 0 degrees Celsius is 32 degrees Fahrenheit. There are consistent distances between each degree (for every 1 extra degree Celsius, there are 1.8 extra degrees Fahrenheit), but the two scales do not agree on where 0 degrees is.
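The note about Celsius and Fahrenheit can be checked with a short sketch. The function name is hypothetical; the conversion itself (multiply by 1.8 and add 32) follows from the numbers in the note.

```python
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 1.8 + 32

for c in [0, 10, 20]:
    print(f"{c} C = {celsius_to_fahrenheit(c):.0f} F")

# Differences are consistent: 10 extra degrees Celsius is always 18 extra Fahrenheit
print(celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10))

# Ratios are not meaningful: 20 C looks like "twice" 10 C,
# but the same temperatures in Fahrenheit (68 and 50) are not in a 2:1 ratio
print(celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10))
```

Because the zero point is arbitrary, differences between values on an interval scale are meaningful but ratios are not.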
  • 99.
  • 100. Statistics - Descriptive Statistics Descriptive statistics gives us insight into data without having to look at all of it in detail. Key Features to Describe about Data: Getting a quick overview of how the data is distributed is an important step in statistical methods. We calculate key numerical values about the data that tell us about the distribution of the data. We also draw graphs showing visually how the data is distributed. Key Features of Data: • Where is the center of the data? (location) • How much does the data vary? (scale) • What is the shape of the data? (shape) These can be described by summary statistics (numerical values).
  • 101. Statistics - Descriptive Statistics The Center of the Data The center of the data is where most of the values are concentrated. Different kinds of averages, like mean, median and mode, are measures of the center. Note: Measures of the center are also called location parameters, because they tell us something about where data is 'located' on a number line. The Variation of the Data The variation of the data is how spread out the data are around the center. Statistics like standard deviation, range and quartiles are measures of variation. Note: Measures of variation are also called scale parameters.
  • 102. Statistics - Descriptive Statistics The Shape of the Data The shape of the data can refer to how the data are bunched up on either side of the center. Statistics like skew describe if the right or left side of the center is bigger. Skew is one type of shape parameter. Frequency Tables One typical way of presenting data is with frequency tables. A frequency table counts and orders data into a table. Typically, the data will need to be sorted into intervals. Frequency tables are often the basis for making graphs to visually present the data.
  • 103. Statistics - Descriptive Statistics Visualizing Data Different types of graphs are used for different kinds of data. For example: • Pie charts for qualitative data • Histograms for quantitative data • Scatter plots for bivariate data • Graphs often have a close connection to numerical summary statistics. For example, box plots show where the quartiles are. Quartiles also tell us where the minimum and maximum values, range, interquartile range, and median are.
  • 104. Statistics - Frequency Tables We can see that there is only one winner from ages 10 to 19. And that the highest number of winners are in their 60s.
  • 105. Statistics - Descriptive Statistics Relative Frequency Tables Relative frequency means the number of times a value appears in the data compared to the total amount. A percentage is a relative frequency. Here are the relative frequencies of ages of Nobel Prize winners. Now, all the frequencies are divided by the total (934) to give percentages.
  • 106. Statistics - Descriptive Statistics Cumulative Frequency Tables Cumulative frequency counts up to a particular value. Here are the cumulative frequencies of ages of Nobel Prize winners. Now, we can see how many winners have been younger than a certain age. Cumulative frequency tables can also be made with relative frequencies (percentages).
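Frequency, relative frequency and cumulative frequency can all be built from the same counts. The sketch below uses a small list of made-up ages, not the Nobel Prize data, and sorts them into 10-year intervals.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ages (illustrative values, not the Nobel Prize data)
ages = [34, 45, 47, 52, 55, 58, 61, 63, 64, 66, 67, 71, 73, 78, 85]

# Sort each age into a 10-year interval, labelled by its lower bound
frequency = Counter((age // 10) * 10 for age in ages)
intervals = sorted(frequency)
counts = [frequency[start] for start in intervals]

total = len(ages)
relative = [count / total for count in counts]   # share of the total
cumulative = list(accumulate(counts))            # running total of counts

print("Interval  Freq  Relative  Cumulative")
for start, count, share, running in zip(intervals, counts, relative, cumulative):
    print(f"{start}-{start + 9}     {count:>4}  {share:>7.1%}  {running:>10}")
```

The relative frequencies add up to 100%, and the last cumulative frequency equals the total number of values.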
  • 107. Statistics - Histograms A histogram visually presents quantitative data. A histogram is a widely used graph to show the distribution of quantitative (numerical) data. It shows the frequency of values in the data, usually in intervals of values. Frequency is the amount of times that value appeared in the data. Each interval is represented with a bar, placed next to the other intervals on a number line. The height of the bar represents the frequency of values in that interval. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020: This histogram uses age intervals from 10 to 19, 20 to 29, and so on. Note: Histograms are similar to bar graphs, which are used for qualitative data.
  • 108. Statistics - Histograms Bin Width The intervals of values are often called 'bins'. And the length of an interval is called 'bin width'. We can choose any width. It is best to use a bin width that shows enough detail without being confusing. Here is a histogram of the same Nobel Prize winner data, but with bin widths of 5 instead of 10: This histogram uses age intervals from 15 to 19, 20 to 24, 25 to 29, and so on. Smaller intervals give a more detailed look at the distribution of the age values in the data.
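The effect of the bin width can be seen with a simple text histogram. This is a sketch with made-up ages rather than the Nobel Prize data, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical ages (illustrative values, not the Nobel Prize data)
ages = [34, 45, 47, 52, 55, 58, 61, 63, 64, 66, 67, 71, 73, 78, 85]

def bin_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))

def print_histogram(values, bin_width):
    """Print a simple text histogram, one '#' per value in the bin."""
    for start, count in bin_counts(values, bin_width).items():
        print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")

print("Bin width 10:")
print_histogram(ages, 10)
print("Bin width 5:")
print_histogram(ages, 5)
```

Note that this sketch only prints bins that contain at least one value, so empty intervals are left out.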
  • 109. Statistics - Bar Graphs A bar graph visually presents qualitative data. Bar graphs are used to show the distribution of qualitative (categorical) data. It shows the frequency of values in the data. Frequency is the amount of times that value appeared in the data. Each category is represented with a bar. The height of the bar represents the frequency of values from that category in the data. Here is a bar graph of the number of people who have won a Nobel Prize in each category up to the year 2020: Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category. Note: Bar graphs are similar to histograms, which are used for quantitative data
  • 110. Statistics - Pie Charts A pie chart visually presents qualitative data. Pie graphs are used to show the distribution of qualitative (categorical) data. It shows the frequency or relative frequency of values in the data. Frequency is the amount of times that value appeared in the data. Relative frequency is the percentage of the total. Each category is represented with a slice in the 'pie' (circle). The size of each slice represents the frequency of values from that category in the data. Here is a pie chart of the number of people who have won a Nobel Prize in each category up to the year 2020: This pie chart shows relative frequency. So each slice is sized by the percentage for each category. Some of the categories have existed longer than others. Multiple winners are also more common in some categories. So there is a different number of winners in each category.
  • 111. Statistics - Box Plots A box plot is a graph used to show key features of quantitative data. A box plot is a good way to show many important features of quantitative (numerical) data. It shows the median of the data. This is the middle value of the data and one type of an average value. It also shows the range and the quartiles of the data. This tells us something about how spread out the data is. Note: Box plots are also called 'box and whiskers plots'. Here is a box plot of the age of all the Nobel Prize winners up to the year 2020:
  • 112. Statistics - Box Plots The median is the red line through the middle of the 'box'. We can see that this is just above the number 60 on the number line below. So the middle value of age is 60 years. The left side of the box is the 1st quartile. This is the value that separates the first quarter, or 25% of the data, from the rest. Here, this is 51 years. The right side of the box is the 3rd quartile. This is the value that separates the first three quarters, or 75% of the data, from the rest. Here, this is 69 years.
  • 113. Statistics - Box Plots The distance between the sides of the box is called the inter-quartile range (IQR). This tells us where the 'middle half' of the values are. Here, half of the winners were between 51 and 69 years. The ends of the lines from the box at the left and the right are the minimum and maximum values in the data. The distance between these is called the range. The youngest winner was 17 years old, and the oldest was 97 years old. So the range of the age of winners was 80 years.
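The numbers that a box plot draws (minimum, quartiles, median, maximum) can be calculated directly. The sketch below uses a small set of example values rather than the Nobel Prize data, and relies on `statistics.quantiles` from the Python standard library.

```python
from statistics import quantiles

# Small example data set (not the Nobel Prize data)
ages = [13, 21, 21, 40, 42, 48, 55, 72]

# method="inclusive" is used here; the default "exclusive" method
# can give different quartiles for the same data
q1, median, q3 = quantiles(ages, n=4, method="inclusive")

minimum, maximum = min(ages), max(ages)
print("Minimum:", minimum)
print("1st quartile:", q1)
print("Median:", median)
print("3rd quartile:", q3)
print("Maximum:", maximum)
print("Interquartile range:", q3 - q1)
print("Range:", maximum - minimum)
```

Different software uses slightly different rules for calculating quartiles, so results can vary a little between tools.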
  • 114. Statistics - Average An average is a measure of where most of the values in the data are located. The center of the data is where most of the values in the data are located. Averages are measures of the location of the center. There are different types of averages. The most commonly used are: o Mean o Median o Mode Note: In statistics, averages are often referred to as 'measures of central tendency'. For example, using the values: 40, 21, 55, 21, 48, 13, 72
  • 115. Statistics - Average Median The median is the 'middle value' of the data. The median is found by ordering all the values in the data and picking the middle value: 13, 21, 21, 40, 48, 55, 72 The median is less influenced by extreme values in the data than the mean. Changing the last value to 356 does not change the median: 13, 21, 21, 40, 48, 55, 356 The median is still 40. Changing the last value to 356 changes the mean a lot: (13 + 21 + 21 + 40 + 48 + 55 + 72)/7 = 38.57 (13 + 21 + 21 + 40 + 48 + 55 + 356)/7 = 79.14 Note: Extreme values are values in the data that are much smaller or larger than the average values in the data.
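The comparison above can be reproduced with the Python standard library. This is a small sketch using the same values as in the text.

```python
from statistics import mean, median

values = [13, 21, 21, 40, 48, 55, 72]
with_extreme = [13, 21, 21, 40, 48, 55, 356]  # last value replaced by 356

for label, data in [("Original", values), ("With extreme value", with_extreme)]:
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data)}")
```

The median stays at 40 in both cases, while the mean moves from 38.57 to 79.14.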
  • 116. Statistics - Average Mode The mode is the value(s) that appears most often in the data: 40, 21, 55, 21, 48, 13, 72 Here, 21 appears two times, and the other values only once. The mode of this data is 21. The mode is also used for categorical data, unlike the median and mean. Categorical data can't be described directly with numbers, like names: Alice, John, Bob, Maria, John, Julia, Carol Here, John appears two times, and the other values only once. The mode of this data is John. Note: There can be more than one mode if multiple values appear the same number of times in the data.
  • 117. Statistics - Mean The mean is a type of average value, which describes where the center of the data is located. Mean The mean is usually referred to as 'the average'. The mean is the sum of all the values in the data divided by the total number of values in the data. The mean is calculated for numerical variables. A variable is something in the data that can vary, like: Age Height Income Note: There are multiple types of mean values. The most common type of mean is the arithmetic mean. In this tutorial 'mean' refers to the arithmetic mean.
  • 118. Statistics - Mean Calculating the Mean You can calculate the mean for both the population and the sample. The formulas are the same but use different symbols to refer to the population mean (μ) and the sample mean (x̄).
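The formulas are not included in this transcript. Based on the definition of the mean as the sum of all values divided by the number of values, they can be written as follows, where the notation (x_i for each value and n for the number of values) is an assumption:

```latex
\mu = \frac{\sum x_i}{n}
\qquad
\bar{x} = \frac{\sum x_i}{n}
```

For example, for the values 4, 11, 7 and 14:

```latex
\bar{x} = \frac{4 + 11 + 7 + 14}{4} = \frac{36}{4} = 9
```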
  • 119. Statistics - Mean Calculation with Programming The mean can easily be calculated with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as calculating by hand becomes difficult. Example With Python use the NumPy library mean() method to find the mean of the values 4,11,7,14:
        import numpy
        values = [4, 11, 7, 14]
        x = numpy.mean(values)
        print(x)
    Output: 9.0
  • 120. Statistics - Mean Calculation with Programming Example Use the R mean() function to find the mean of the values 4,11,7,14:
        values <- c(4, 11, 7, 14)
        mean(values)
    Output: [1] 9
  • 121. Statistics - Median The median is a type of average value, which describes where the center of the data is located. The median is the middle value in a data set ordered from low to high. Finding the Median The median can only be calculated for numerical variables. The formula for finding the position of the middle value is: (n + 1) / 2 Where n is the total number of observations. If the total number of observations is an odd number, the formula gives a whole number and the value of this observation is the median. 13, 21, 21, 40, 48, 55, 72 Here, there are 7 total observations, so the median is the 4th value: (7 + 1) / 2 = 4 The 4th value in the ordered list is 40, so that is the median.
  • 122. Statistics - Median If the total number of observations is an even number, the formula gives a decimal number between two observations. 13, 21, 21, 40, 42, 48, 55, 72 Here, there are 8 total observations, so the median is between the 4th and 5th values: (8 + 1) / 2 = 4.5 The 4th and 5th values in the ordered list are 40 and 42, so the median is the mean of these two values. That is, the sum of those two values divided by 2: (40 + 42) / 2 = 41 Note: It is important that the numbers are ordered before you can find the median.
  • 123. Statistics - Median Finding the Median with Programming The median can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult. Example With Python use the NumPy library median() method to find the median of the values 13, 21, 21, 40, 42, 48, 55, 72:
        import numpy
        values = [13, 21, 21, 40, 42, 48, 55, 72]
        x = numpy.median(values)
        print(x)
    Output: 41.0
  • 124. Statistics - Mode The mode is a type of average value, which describes where most of the data is located. Mode The mode is the value(s) that are the most common in the data. A dataset can have multiple values that are modes. A distribution of values with only one mode is called unimodal. A distribution of values with two modes is called bimodal. In general, a distribution with more than one mode is called multimodal. Mode can be found for both categorical and numerical data.
  • 125. Statistics - Mode Finding the Mode Here is a numerical example: 4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12 Both 7 and 12 appear two times each, and the other values only once. The modes of this data are 7 and 12. Here is a categorical example with names: Alice, John, Bob, Maria, John, Julia, Carol John appears two times, and the other values only once. The mode of this data is John.
  • 126. Statistics - Mode Finding the Mode with Programming The mode can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as calculating manually becomes difficult. Example With Python use the statistics library multimode() method to find the modes of the values 4,7,3,8,11,7,10,19,6,9,12,12:
        from statistics import multimode
        values = [4, 7, 3, 8, 11, 7, 10, 19, 6, 9, 12, 12]
        x = multimode(values)
        print(x)
    Output: [7, 12]
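The same `multimode()` function also works for categorical data. Here is a short sketch using the list of names from the categorical example:

```python
from statistics import multimode

# Categorical data: names instead of numbers
names = ["Alice", "John", "Bob", "Maria", "John", "Julia", "Carol"]

modes = multimode(names)
print(modes)  # ['John']
```

The result is a list because there can be more than one mode.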
  • 127. Statistics - Variation Variation is a measure of how spread out the data is around the center of the data. Measures of variation are statistics of how far away the values in the observations (data points) are from each other. There are different measures of variation. The most commonly used are: o Range o Quartiles and Percentiles o Interquartile Range o Standard Deviation Measures of variation combined with an average (measure of center) gives a good picture of the distribution of the data. Note: These measures of variation can only be calculated for numerical data.
  • 128. Statistics - Variation Range The range is the difference between the smallest and the largest value of the data. Range is the simplest measure of variation. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the range: The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.
  • 129. Statistics - Variation Quartiles and Percentiles Quartiles and percentiles are ways of separating equal numbers of values in the data into parts. Quartiles are values that separate the data into four equal parts. Percentiles are values that separate the data into 100 equal parts. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the quartiles: The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter. Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on. o Q0 is the smallest value in the data. o Q2 is the middle value (median). o Q4 is the largest value in the data.
  • 130. Statistics - Variation Interquartile Range Interquartile range is the difference between the first and third quartiles (Q1 and Q3). The 'middle half' of the data is between the first and third quartile. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the interquartile range (IQR): Here, the middle half of the winners is between 51 and 69 years. The interquartile range for Nobel Prize winners is then 18 years.
  • 131. Statistics - Variation Standard deviation : It is the most used measure of variation. Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ). Standard deviation is important for many statistical methods. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard deviations: Note: Values within one standard deviation (σ) are considered to be typical. Values outside three standard deviations are considered to be outliers.
  • 132. Statistics - Range The range is a measure of variation, which describes how spread out the data is. Range The range is the difference between the smallest and the largest value of the data. Range is the simplest measure of variation. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the range: The youngest winner was 17 years and the oldest was 97 years. The range of ages for Nobel Prize winners is then 80 years.
  • 133. Statistics - Range The range is a measure of variation, which describes how spread out the data is. Calculating the Range The range can only be calculated for numerical data. First, find the smallest and largest values of this example: 13, 21, 21, 40, 48, 55, 72 Calculate the difference by subtracting the smallest from the largest: 72 - 13 = 59
  • 134. Statistics - Range The range is a measure of variation, which describes how spread out the data is. Calculating the Range with Programming The range can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult. Example With Python use the NumPy library ptp() method to find the range of the values 13, 21, 21, 40, 48, 55, 72:
        import numpy
        values = [13, 21, 21, 40, 48, 55, 72]
        x = numpy.ptp(values)
        print(x)
    Output: 59
  • 135. Statistics - Quartiles and Percentiles Quartiles and percentiles are measures of variation, which describe how spread out the data is. Quartiles and percentiles are both types of quantiles. Quartiles Quartiles are values that separate the data into four equal parts. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the quartiles: The quartiles (Q0,Q1,Q2,Q3,Q4) are the values that separate each quarter. Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on. Q0 is the smallest value in the data. Q1 is the value separating the first quarter from the second quarter of the data. Q2 is the middle value (median), separating the bottom from the top half. Q3 is the value separating the third quarter from the fourth quarter. Q4 is the largest value in the data.
  • 136. Statistics - Quartiles and Percentiles Quartiles and percentiles are measures of variation, which describe how spread out the data is. Calculating Quartiles with Programming Quartiles can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding them manually becomes difficult. Example With Python use the NumPy library quantile() method to find the quartiles of the values 13, 21, 21, 40, 42, 48, 55, 72:
        import numpy
        values = [13, 21, 21, 40, 42, 48, 55, 72]
        x = numpy.quantile(values, [0, 0.25, 0.5, 0.75, 1])
        print(x)
    Output: [13. 21. 41. 49.75 72. ]
  • 137. Statistics - Interquartile Range Interquartile range is a measure of variation, which describes how spread out the data is. Interquartile Range is the difference between the first and third quartiles (Q1 and Q3). The 'middle half' of the data is between the first and third quartile. The first quartile is the value in the data that separates the bottom 25% of values from the top 75%. The third quartile is the value in the data that separates the bottom 75% of the values from the top 25%. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing the interquartile range (IQR):
  • 138. Statistics - Interquartile Range Here, the middle half of the winners is between 51 and 69 years old. The interquartile range for Nobel Prize winners is then 69 - 51 = 18 years.
  • 139. Statistics - Interquartile Range Calculating the Interquartile Range with Programming The interquartile range can easily be found with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as finding it manually becomes difficult. Example With Python use the SciPy library iqr() method to find the interquartile range of the values 13, 21, 21, 40, 42, 48, 55, 72: from scipy import stats values = [13,21,21,40,42,48,55,72] x = stats.iqr(values) print(x) O/P : 28.75
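The definition above (Q3 minus Q1) can be checked against the SciPy result. This is a small sketch that assumes NumPy's default (linear interpolation) quantile method, which is also what `stats.iqr` uses by default.

```python
import numpy
from scipy import stats

values = [13, 21, 21, 40, 42, 48, 55, 72]

# First and third quartile with NumPy's default interpolation.
q1, q3 = numpy.quantile(values, [0.25, 0.75])

# Interquartile range by definition: Q3 - Q1.
manual_iqr = q3 - q1

print(q1, q3, manual_iqr)
print(stats.iqr(values))  # should agree with manual_iqr
```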
  • 140. Statistics - Standard Deviation Standard deviation is the most commonly used measure of variation, which describes how spread out the data is. Standard deviation (σ) measures how far a 'typical' observation is from the average of the data (μ). Standard deviation is important for many statistical methods. Here is a histogram of the age of all 934 Nobel Prize winners up to the year 2020, showing standard deviations: Each dotted line in the histogram shows a shift of one extra standard deviation. If the data is normally distributed: o Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ) o Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ) o Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ) Note: A normal distribution has a "bell" shape and spreads out equally on both sides.
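The three percentages can be reproduced from the standard normal distribution. The sketch below assumes SciPy is available and uses `norm.cdf()` to get the area between -k and +k standard deviations.

```python
from scipy import stats

# Share of a normal distribution within k standard deviations of the mean.
shares = {}
for k in (1, 2, 3):
    shares[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(shares[k], 4))
```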
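  • 141. Statistics - Standard Deviation Calculating the Standard Deviation You can calculate the standard deviation for both the population and the sample. The formulas are almost the same and use different symbols to refer to the population standard deviation (σ) and the sample standard deviation (s). Calculating the population standard deviation (σ) is done with this formula: σ = √( Σ(xᵢ − μ)² / n ) Calculating the sample standard deviation (s) is done with this formula: s = √( Σ(xᵢ − x̄)² / (n − 1) ) Here n is the number of observations, μ is the population mean, and x̄ is the sample mean.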
  • 144. Statistics - Standard Deviation Calculating the Standard Deviation with Programming The standard deviation can easily be calculated with many programming languages. Using software and programming to calculate statistics is more common for bigger sets of data, as calculating by hand becomes difficult. Population Standard Deviation Example With Python use the NumPy library std() method to find the standard deviation of the values 4,11,7,14: import numpy values = [4,11,7,14] x = numpy.std(values) print(x) o/p: 3.8078865529319543
  • 145. Statistics - Standard Deviation Calculating the Standard Deviation with Programming Sample Standard Deviation Example With Python use the NumPy library std() method with ddof=1 to find the sample standard deviation of the values 4,11,7,14: import numpy values = [4,11,7,14] x = numpy.std(values, ddof=1) print(x) o/p: 4.396968652757639
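To see what `ddof` changes, here is a sketch that computes both versions by hand using only the standard library. The population version divides the summed squared deviations by n, the sample version by n - 1.

```python
import math

values = [4, 11, 7, 14]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((v - mean) ** 2 for v in values)

# Population standard deviation: divide by n.
population_sd = math.sqrt(squared_deviations / n)

# Sample standard deviation: divide by n - 1 (same as ddof=1 in NumPy).
sample_sd = math.sqrt(squared_deviations / (n - 1))

print(population_sd, sample_sd)
```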
  • 147. Statistics - Statistical Inference Statistical Inference Using data analysis and statistics to make conclusions about a population is called statistical inference. The main types of statistical inference are: • Estimation • Hypothesis testing Estimation Statistics from a sample are used to estimate population parameters. The most likely value is called a point estimate. There is always uncertainty when estimating. The uncertainty is often expressed as confidence intervals defined by a likely lowest and highest value for the parameter. An example could be a confidence interval for the number of bicycles a Dutch person owns: "The average number of bikes a Dutch person owns is between 3.5 and 6."
  • 148. Statistics - Statistical Inference Hypothesis Testing Hypothesis testing is a method to check if a claim about a population is true. More precisely, it checks how likely it is that a hypothesis is true, based on the sample data. There are different types of hypothesis testing. The steps of the test depend on: Type of data (categorical or numerical) If you are looking at: o A single group o Comparing one group to another o Comparing the same group before and after a change Some examples of claims or questions that can be checked with hypothesis testing: o 90% of Australians are left-handed o Is the average weight of dogs more than 40kg? o Do doctors make more money than lawyers?
  • 149. Statistics - Normal Distribution The normal distribution is an important probability distribution used in statistics. Many real world examples of data are normally distributed. Normal Distribution The normal distribution is described by the mean (μ) and the standard deviation (σ). The normal distribution is often referred to as a 'bell curve' because of its shape: Most of the values are around the center (μ) The median and mean are equal It has only one mode It is symmetric, meaning it decreases the same amount on the left and the right of the center The area under the curve of the normal distribution represents probabilities for the data. The area under the whole curve is equal to 1, or 100% Here is a graph of a normal distribution with probabilities between standard deviations (σ):
  • 150. Statistics - Normal Distribution • Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ) • Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ) • Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ) Note: Probabilities of the normal distribution can only be calculated for intervals (between two values).
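Because probabilities are only defined for intervals, a typical calculation subtracts two cumulative probabilities. The sketch below assumes illustrative values (a mean of 170 and a standard deviation of 10) and asks for the probability of a value between 160 and 190.

```python
from scipy import stats

mu, sigma = 170, 10        # assumed, purely illustrative parameters
lower, upper = 160, 190    # interval of interest

# P(lower < X < upper) = cdf(upper) - cdf(lower)
p = (stats.norm.cdf(upper, loc=mu, scale=sigma)
     - stats.norm.cdf(lower, loc=mu, scale=sigma))

print(p)
```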
  • 151. Statistics - Normal Distribution Different Mean and Standard Deviations The mean describes where the center of the normal distribution is. Here is a graph showing three different normal distributions with the same standard deviation but different means. The standard deviation describes how spread out the normal distribution is. Here is a graph showing three different normal distributions with the same mean but different standard deviations.
  • 152. Statistics - Normal Distribution Different Mean and Standard Deviations The mean describes where the center of the normal distribution is. The purple curve has the biggest standard deviation and the black curve has the smallest standard deviation. The area under each of the curves is still 1, or 100%.
  • 153. Statistics - Normal Distribution A Real Data Example of Normally Distributed Data Real world data is often normally distributed. Here is a histogram of the age of Nobel Prize winners when they won the prize: The normal distribution drawn on top of the histogram is based on the population mean (μ) and standard deviation (σ) of the real data. We can see that the histogram is close to a normal distribution. Examples of real world variables that can be normally distributed: • Test scores • Height • Birth weight
  • 154. Statistics - Normal Distribution Probability Distributions Probability distributions are functions that calculate the probabilities of the outcomes of random variables. Typical examples of random variables are coin tosses and dice rolls. Here is a graph showing the results of a growing number of coin tosses and the expected values of the results (heads or tails). The expected values of the coin toss are the probability distribution of the coin toss. Notice how the result of random coin tosses gets closer to the expected values (50%) as the number of tosses increases. Similarly, here is a graph showing the results of a growing number of dice rolls and the expected values of the results (from 1 to 6).
  • 155. Statistics - Normal Distribution Notice again how the result of random dice rolls gets closer to the expected values (1/6, or 16.666%) as the number of rolls increases. When the random variable is a sum of dice rolls the results and expected values take a different shape. The different shape comes from there being more ways of getting a sum of near the middle, than a small or large sum.
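The convergence toward 1/6 can be illustrated with a small simulation. This is only a sketch: the seed and the numbers of rolls are arbitrary choices, and the exact shares will differ with other seeds.

```python
import numpy

rng = numpy.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
expected = 1 / 6

max_gap = {}
for n in (60, 600, 6000, 600000):
    rolls = rng.integers(1, 7, size=n)               # fair die: faces 1..6
    counts = numpy.bincount(rolls, minlength=7)[1:]  # occurrences of faces 1..6
    shares = counts / n
    # Largest difference between an observed share and the expected 1/6.
    max_gap[n] = float(numpy.abs(shares - expected).max())
    print(n, numpy.round(shares, 3), round(max_gap[n], 4))
```

With more rolls, the largest gap between the observed shares and 1/6 tends to shrink.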
  • 157. Statistics - Normal Distribution As we keep increasing the number of dice for a sum the shape of the results and expected values look more and more like a normal distribution. Many real world variables follow a similar pattern and naturally form normal distributions. Normally distributed variables can be analyzed with well-known techniques.
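The exact distribution of a sum of dice can be sketched by repeatedly convolving the distribution of a single fair die with itself. The helper name `sum_distribution` is hypothetical and only used for illustration.

```python
import numpy

die = numpy.full(6, 1 / 6)  # probabilities of faces 1..6 for one fair die

def sum_distribution(num_dice):
    """Return exact probabilities for each possible sum of num_dice fair dice."""
    dist = die
    for _ in range(num_dice - 1):
        dist = numpy.convolve(dist, die)
    return dist  # index 0 corresponds to the smallest possible sum (num_dice)

for num_dice in (2, 4, 6):
    dist = sum_distribution(num_dice)
    most_likely_sum = num_dice + int(numpy.argmax(dist))
    print(num_dice, most_likely_sum, round(float(dist.max()), 4))
```

Sums near the middle get the highest probability because more combinations of faces produce them, which is what gives the bell-like shape as the number of dice grows.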
  • 158. Statistics - Standard Normal Distribution The standard normal distribution is a normal distribution where the mean is 0 and the standard deviation is 1. Normally distributed data can be transformed into a standard normal distribution. Standardizing normally distributed data makes it easier to compare different sets of data. The standard normal distribution is used for: • Calculating confidence intervals • Hypothesis tests Here is a graph of the standard normal distribution with probability values (p-values) between the standard deviations:
  • 159. Statistics - Standard Normal Distribution Standardizing makes it easier to calculate probabilities. The functions for calculating probabilities are complex and difficult to calculate by hand. Typically, probabilities are found by looking up tables of pre-calculated values, or by using software and programming. The standard normal distribution is also called the 'Z-distribution' and the values are called 'Z-values' (or Z-scores).
  • 160. Statistics - Standard Normal Distribution Z-Values Z-values express how many standard deviations from the mean a value is. The formula for calculating a Z-value is: Z=(x−μ)/σ x is the value we are standardizing, μ is the mean, and σ is the standard deviation. For example, if we know that: The mean height of people in Germany is 170 cm (μ) The standard deviation of the height of people in Germany is 10 cm (σ) Bob is 200 cm tall (x) Bob is 30 cm taller than the average person in Germany. 30 cm is 3 times 10 cm. So Bob's height is 3 standard deviations larger than the mean height in Germany. Using the formula: Z=(x−μ)/σ=(200−170)/10=30/10=3
  • 161. Statistics - Standard Normal Distribution Finding the P-value of a Z-Value Using a Z-table or programming we can calculate how many people in Germany are shorter than Bob and how many are taller. Example With Python use the Scipy Stats library norm.cdf() function to find the probability of getting less than a Z-value of 3: import scipy.stats as stats print(stats.norm.cdf(3)) o/p: 0.9986501019683699
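The same calculation can be sketched end to end, starting from Bob's height and also giving the share of people who are taller. It uses `norm.sf()`, SciPy's survival function (1 - cdf), and the example figures for Germany from the text.

```python
from scipy import stats

mu, sigma = 170, 10  # mean and standard deviation of height from the example
x = 200              # Bob's height in cm

# Standardize: how many standard deviations is x away from the mean?
z = (x - mu) / sigma

share_shorter = stats.norm.cdf(z)  # probability of a value below z
share_taller = stats.norm.sf(z)    # probability of a value above z

print(z, share_shorter, share_taller)
```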
  • 162. Statistics - Student's T Distribution The student's t-distribution is similar to a normal distribution and used in statistical inference to adjust for uncertainty. It is used for estimation and hypothesis testing of a population mean (average). The t-distribution is adjusted for the extra uncertainty of estimating the mean. If the sample is small, the t-distribution is wider. If the sample is big, the t-distribution is narrower. The bigger the sample size is, the closer the t-distribution gets to the standard normal distribution.
  • 163. Statistics - Student's T Distribution Notice how some of the curves have bigger tails. This is due to the uncertainty from a smaller sample size. The green curve has the smallest sample size. For the t-distribution this is expressed as 'degrees of freedom' (df), which is calculated by subtracting 1 from the sample size (n). For example a sample size of 30 will make 29 degrees of freedom for the t-distribution. The t-distribution is used to find critical t-values and p-values (probabilities) for estimation and hypothesis testing. Note: Finding the critical t-values and p-values of the t-distribution is similar to finding the z-values and p-values of the standard normal distribution. But make sure to use the correct degrees of freedom.
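As a sketch of how the degrees of freedom matter, the following compares the t-value that cuts off the top 2.5% for a few sample sizes with the corresponding z-value. The sample sizes are arbitrary illustrative choices.

```python
from scipy import stats

# Value that separates the top 2.5% from the bottom 97.5%.
z_crit = stats.norm.ppf(0.975)

t_crits = {}
for n in (5, 30, 1000):
    df = n - 1                       # degrees of freedom = sample size - 1
    t_crits[n] = stats.t.ppf(0.975, df)
    print(n, df, round(t_crits[n], 4))

print("z:", round(z_crit, 4))
```

The smaller the sample, the larger the critical t-value, and for large samples it approaches the z-value.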
  • 164. Statistics - Student's T Distribution Finding the P-Value of a T-Value You can find the p-value of a t-value by using a t-table or with programming. Example With Python use the Scipy Stats library t.cdf() function to find the probability of getting less than a t-value of 2.1 with 29 degrees of freedom: import scipy.stats as stats print(stats.t.cdf(2.1, 29)) o/p: 0.9777290209818548 Finding the T-value of a P-Value You can find the t-value of a p-value by using a t-table or with programming. Example With Python use the Scipy Stats library t.ppf() function to find the t-value separating the top 25% from the bottom 75% with 29 degrees of freedom: import scipy.stats as stats print(stats.t.ppf(0.75, 29)) o/p: 0.6830438592467808
  • 165. Statistics - Estimation Point estimates are the most likely value for a population parameter. Confidence intervals express the uncertainty of an estimated population parameter. A point estimate is calculated from a sample. The point estimate depends on the type of data: • Categorical data: the number of occurrences divided by the sample size. • Numerical data: the mean (the average) of the sample. One example could be: The point estimate for the average height of people in Denmark is 180 cm. Estimates are always uncertain. This uncertainty can be expressed with a confidence interval.
  • 166. Statistics - Estimation Confidence Intervals The confidence interval is defined by a lower bound and an upper bound. This gives us a range of values that the true parameter is likely to be between. For example: The average height of people in Denmark is between 170 cm and 190 cm. Here, 170 cm is the lower bound, and 190 cm is the upper bound. The lower and upper bounds of a confidence interval are based on the confidence level.
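A minimal sketch of how the bounds could be computed for a mean, assuming a small made-up sample of heights and a 95% confidence level. It uses the t-distribution with n - 1 degrees of freedom; the data are hypothetical and not taken from any real survey.

```python
import math
import statistics
from scipy import stats

sample = [172, 181, 169, 190, 178, 184, 175, 180]  # hypothetical heights in cm
confidence_level = 0.95                            # assumed confidence level

n = len(sample)
mean = statistics.mean(sample)   # point estimate
s = statistics.stdev(sample)     # sample standard deviation

# Critical t-value for a two-sided interval with n - 1 degrees of freedom.
t_crit = stats.t.ppf(1 - (1 - confidence_level) / 2, n - 1)

# Margin of error = critical value * standard error of the mean.
margin = t_crit * s / math.sqrt(n)
lower, upper = mean - margin, mean + margin

print(mean, lower, upper)
```

A higher confidence level gives a larger critical value and therefore a wider interval.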
  • 167. Statistics - Hypothesis Testing Hypothesis testing is a formal way of checking if a hypothesis about a population is true or not. A hypothesis is a claim about a population parameter. A hypothesis test is a formal procedure to check if a hypothesis is true or not. Examples of claims that can be checked: The average height of people in Denmark is more than 170 cm. The share of left-handed people in Australia is not 10%. The average income of dentists is less than the average income of lawyers.
  • 168. Statistics - Hypothesis Testing The Null and Alternative Hypothesis Hypothesis testing is based on making two different claims about a population parameter. The null hypothesis (H0) and the alternative hypothesis (H1) are the claims. The two claims need to be mutually exclusive, meaning only one of them can be true. The alternative hypothesis is typically what we are trying to prove. For example, we want to check the following claim: "The average height of people in Denmark is more than 170 cm." In this case, the parameter is the average height of people in Denmark (μ). The null and alternative hypothesis would be: Null hypothesis: The average height of people in Denmark is 170 cm. Alternative hypothesis: The average height of people in Denmark is more than 170 cm.
  • 169. Statistics - Hypothesis Testing The claims are often expressed with symbols like this: H0: μ = 170 H1: μ > 170 If the data supports the alternative hypothesis, we reject the null hypothesis and accept the alternative hypothesis. If the data does not support the alternative hypothesis, we keep the null hypothesis. Note: The alternative hypothesis is also referred to as HA.
  • 170. Statistics - Hypothesis Testing The Significance Level The significance level (α) is the uncertainty we accept when rejecting the null hypothesis in the hypothesis test. The significance level is a percentage probability of accidentally making the wrong conclusion. Typical significance levels are: •α=0.1 (10%) •α=0.05 (5%) •α=0.01 (1%) A lower significance level means that the evidence in the data needs to be stronger to reject the null hypothesis. There is no "correct" significance level - it only states the uncertainty of the conclusion. Note: A 5% significance level means that when we reject a null hypothesis: We expect to reject a true null hypothesis 5 out of 100 times.
  • 171. Statistics - Hypothesis Testing The Critical Value and P-Value Approach There are two main approaches used for hypothesis tests: The critical value approach compares the test statistic with the critical value of the significance level. The p-value approach compares the p-value of the test statistic with the significance level. The Critical Value Approach The critical value approach checks if the test statistic is in the rejection region. The rejection region is an area of probability in the tails of the distribution. The size of the rejection region is decided by the significance level (α). The value that separates the rejection region from the rest is called the critical value.
  • 172. Statistics - Hypothesis Testing Here is a graphical illustration: If the test statistic is inside this rejection region, the null hypothesis is rejected. For example, if the test statistic is 2.3 and the critical value is 2 for a significance level (α=0.05): We reject the null hypothesis (H0) at 0.05 significance level (α)
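The comparison can be sketched in code. This example assumes a right-tailed test based on the standard normal distribution, so the critical value it computes differs from the rounded value of 2 used in the example above; the test statistic of 2.3 is taken from that example.

```python
from scipy import stats

alpha = 0.05          # significance level
test_statistic = 2.3  # test statistic from the example

# Assumption: right-tailed test using the standard normal distribution.
# The critical value cuts off an area of alpha in the right tail.
critical_value = stats.norm.ppf(1 - alpha)

# Reject H0 if the test statistic falls inside the rejection region.
reject_h0 = bool(test_statistic > critical_value)

print(critical_value, reject_h0)
```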
  • 173. Statistics - Hypothesis Testing The P-Value Approach The p-value approach checks if the p-value of the test statistic is smaller than the significance level (α). The p-value of the test statistic is the area of probability in the tails of the distribution from the value of the test statistic. Here is a graphical illustration:
  • 174. Statistics - Hypothesis Testing If the p-value is smaller than the significance level, the null hypothesis is rejected. The p-value directly tells us the lowest significance level where we can reject the null hypothesis. For example, if the p-value is 0.03: We reject the null hypothesis (H0) at a 0.05 significance level (α) We keep the null hypothesis (H0) at a 0.01 significance level (α) Note: The two approaches are only different in how they present the conclusion.
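The decision rule for the p-value of 0.03 from the example can be written as a short sketch. The labels "reject H0" and "keep H0" are just illustrative strings.

```python
p_value = 0.03  # p-value from the example

decisions = {}
for alpha in (0.05, 0.01):
    # Reject the null hypothesis only if the p-value is below alpha.
    decisions[alpha] = "reject H0" if p_value < alpha else "keep H0"
    print(alpha, decisions[alpha])
```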
  • 175. Statistics - Hypothesis Testing Steps for a Hypothesis Test The following steps are used for a hypothesis test: 1. Check the conditions 2. Define the claims 3. Decide the significance level 4. Calculate the test statistic 5. Conclusion One condition is that the sample is randomly selected from the population. The other conditions depend on what type of parameter you are testing the hypothesis for. Common parameters to test hypotheses for are: o Proportions (for qualitative data) o Mean values (for numerical data)
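The five steps can be sketched for a mean, using the earlier claim that the average height is more than 170 cm. The sample below is made up for illustration, and the test statistic is the usual one-sample t-statistic computed by hand.

```python
import math
import statistics
from scipy import stats

# 1. Check the conditions: assume the sample was randomly selected.
sample = [172, 181, 169, 190, 178, 184, 175, 180]  # hypothetical heights in cm

# 2. Define the claims: H0: mu = 170, H1: mu > 170 (right-tailed test).
mu0 = 170

# 3. Decide the significance level.
alpha = 0.05

# 4. Calculate the test statistic and its p-value.
n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)
t_stat = (mean - mu0) / (s / math.sqrt(n))
p_value = stats.t.sf(t_stat, n - 1)  # area in the right tail, df = n - 1

# 5. Conclusion.
conclusion = "reject H0" if p_value < alpha else "keep H0"

print(t_stat, p_value, conclusion)
```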
  • 176. We are missing one important variable that affects Calorie_Burnage, which is the Duration of the training session. Duration in combination with Average_Pulse will together explain Calorie_Burnage more precisely.
  • 177. Data Science - Linear Regression The term regression is used when you try to find the relationship between variables. In Machine Learning and in statistical modeling, that relationship is used to predict the outcome of events. In this module, we will cover the following questions: Can we conclude that Average_Pulse and Duration are related to Calorie_Burnage? Can we use Average_Pulse and Duration to predict Calorie_Burnage?
  • 178. Data Science - Linear Regression Least Square Method Linear regression uses the least square method. The concept is to draw a line through all the plotted data points. The line is positioned in a way that it minimizes the sum of the squared distances to all of the data points. The distances are called "residuals" or "errors". The red dashed lines represent the distances from the data points to the drawn mathematical function.
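As a sketch of what the least square method computes, the slope and intercept can be obtained from the closed-form formulas and compared with `stats.linregress()`. The six data points are hypothetical and only stand in for Average_Pulse and Calorie_Burnage.

```python
import numpy
from scipy import stats

# Hypothetical data, for illustration only.
x = numpy.array([80, 90, 100, 110, 120, 130])    # stands in for Average_Pulse
y = numpy.array([240, 250, 300, 310, 360, 370])  # stands in for Calorie_Burnage

# Closed-form least squares estimates for a straight line.
slope = (numpy.sum((x - x.mean()) * (y - y.mean()))
         / numpy.sum((x - x.mean()) ** 2))
intercept = y.mean() - slope * x.mean()

# Residuals: vertical distances from the data points to the line.
residuals = y - (slope * x + intercept)

result = stats.linregress(x, y)
print(slope, intercept)
print(result.slope, result.intercept)  # should match the values above
```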
  • 179. Data Science - Linear Regression Least Square Method Linear Regression Using One Explanatory Variable In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear Regression: Example import pandas as pd import matplotlib.pyplot as plt from scipy import stats full_health_data = pd.read_csv("data.csv", header=0, sep=",") x = full_health_data["Average_Pulse"] y = full_health_data["Calorie_Burnage"] slope, intercept, r, p, std_err = stats.linregress(x, y) def myfunc(x): return slope * x + intercept mymodel = list(map(myfunc, x)) plt.scatter(x, y) plt.plot(x, mymodel) plt.ylim(ymin=0, ymax=2000) plt.xlim(xmin=0, xmax=200) plt.xlabel("Average_Pulse") plt.ylabel("Calorie_Burnage") plt.show()
  • 180. Data Science - Linear Regression Least Square Method Linear Regression Using One Explanatory Variable In this example, we will try to predict Calorie_Burnage with Average_Pulse using Linear Regression: Do you think that the line is able to predict Calorie_Burnage precisely? We will show that the variable Average_Pulse alone is not enough to make a precise prediction of Calorie_Burnage.