SlideShare uma empresa Scribd logo
1 de 87
Baixar para ler offline
DATA MINING AND MACHINE LEARNING
                                                      IN A NUTSHELL



               AN INTRODUCTION TO                         DATA MINING
                                            Mohammad-Ali Abbasi
                                                 http://www.public.asu.edu/~mabbasi2/

                               SCHOOL OF COMPUTING, INFORMATICS, AND DECISION SYSTEMS ENGINEERING
                                                   ARIZONA STATE UNIVERSITY

                                                     http://dmml.asu.edu/
Data Mining and Machine Learning in a nutshell                                             An Introduction to Data Mining   1
INTRODUCTION

      • Data production rate has been increased
        dramatically (Big Data) and we are able store
        much more data than before
             – E.g., purchase data, social media data, mobile
               phone data
      • Businesses and customers need useful or
        actionable knowledge and gain insight from
        raw data for various purposes
             – It’s not just searching data or databases

        Data mining helps us to extract new information and uncover
        hidden patterns out of the stored and streaming data
Data Mining and Machine Learning in a nutshell       An Introduction to Data Mining   2
DATA MINING

          The process of discovering hidden patterns in large data sets
           It utilizes methods at the intersection of artificial intelligence, machine learning,
                                    statistics, and database systems


      • Extracting or “mining” knowledge from large
        amounts of data, or big data
      • Data-driven discovery and modeling of hidden
        patterns in big data
      • Extracting implicit, previously unknown,
        unexpected, and potentially useful
        information/knowledge from data

Data Mining and Machine Learning in a nutshell                              An Introduction to Data Mining   3
DATA MINING STORIES

      • “My bank called and said that they saw that I bought
        two surfboards at Laguna Beach, California.” - credit
        card fraud detection

      • The NSA is using data mining to analyze telephone
        call data to track al’Qaeda activities

      • Walmart uses data mining to control product
        distribution based on typical customer buying
        patterns at individual stores




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   4
DATA MINING VS. DATABASES

      • Data mining is the process of extracting
        hidden and actionable patterns from data
      • Database systems store and manage data
             – Queries return part of stored data
             – Queries do not extract hidden patterns
      • Examples of querying databases
             – Find all employees with income more than $250K
             – Find top spending customers in last month
             – Find all students from engineering college with
               GPA more than average
Data Mining and Machine Learning in a nutshell    An Introduction to Data Mining   5
EXAMPLES OF DATA MINING APPLICATIONS

      • Identifying fraudulent transactions of a credit card
        or spam emails
             – You are given a user’s purchase history and a new
               transaction, identify whether the transaction is fraud
               or not;
             – Determine whether a given email is spam or not
      • Extracting purchase patterns from existing records
             – beer ⇒ dippers (80%)
      • Forecasting future sales and needs according to
        some given samples
      • Extracting groups of like-minded people in a given
        network
Data Mining and Machine Learning in a nutshell         An Introduction to Data Mining   6
BASIC DATA MINING TASKS

      • Classification
             – Assign data into predefined classes
                    • Spam Detection, fraudulent credit card detection
      • Regression
             – Predict a real value for a given data instance
                    • Predict the price for a given house
      • Clustering
             – Group similar items together into some clusters
                    • Detect communities in a given social network



Data Mining and Machine Learning in a nutshell              An Introduction to Data Mining   7
DATA




Data Mining and Machine Learning in a nutshell          An Introduction to Data Mining   8
DATA INSTANCES

      • A collection of properties and features related
        to an object or person
             – A patient’s medical record
             – A user’s profile
             – A gene’s information
      • Instances are also called examples, records,
        data points, or observations
 Data Instance:



                                            Features or Attributes                     Class Label
Data Mining and Machine Learning in a nutshell                       An Introduction to Data Mining   9
DATA TYPES

      • Nominal (categorical)
             – No comparison is defined
             – E.g., {male, female}
      • Ordinal
             – Comparable but the difference is not defined
             – E.g., {Low, medium, high}
      • Interval
             – Deduction and addition is defined but not division
             – E.g., 3:08 PM, calendar dates
      • Ratio
             – E.g., Height, weight, money quantities

Data Mining and Machine Learning in a nutshell          An Introduction to Data Mining   10
SAMPLE DATASET
                 outlook                     temperature      humidity      windy            play
                  sunny                          85             85          FALSE             no
                  sunny                          80             90          TRUE              no
                 overcast                        83             86          FALSE            yes
                   rainy                         70             96          FALSE            yes
                   rainy                         68             80          FALSE            yes
                   rainy                         65             70          TRUE              no
                 overcast                        64             65          TRUE             yes
                  sunny                          72             95          FALSE             no
                  sunny                          69             70          FALSE            yes
                   rainy                         75             80          FALSE            yes
                  sunny                          75             70          TRUE             yes
                 overcast                        72             90          TRUE             yes
                 overcast                        81             75          FALSE            yes
                   rainy                         71             91          TRUE              no

                                                  Interval   Ordinal            Nominal




Data Mining and Machine Learning in a nutshell                           An Introduction to Data Mining   11
DATA QUALITY

   When making data ready for data mining
   algorithms, data quality need to be assured
   • Noise
          – Noise is the distortion of the data
   • Outliers
          – Outliers are data points that are considerably different
            from other data points in the dataset
   • Missing Values
          – Missing feature values in data instances
   • Duplicate data

Data Mining and Machine Learning in a nutshell     An Introduction to Data Mining   12
DATA PREPROCESSING

      • Aggregation
             – when multiple attributes need to be combined into a
               single attribute or when the scale of the attributes change
      • Discretization
             – From continues values to discrete values
      • Feature Selection
             – Choose relevant features
      • Feature Extraction
             – Creating a mapping of new features from original features
      • Sampling
             – Random Sampling
             – Sampling with or without replacement
             – Stratified Sampling

Data Mining and Machine Learning in a nutshell            An Introduction to Data Mining   13
CLASSIFICATION




Data Mining and Machine Learning in a nutshell                An Introduction to Data Mining   14
CLASSIFICATION

      Learning patterns from labeled data and classify
      new data with labels (categories)
             – For example, we want to classify an e-mail as
               "legitimate" or "spam"
                                                 Classifier




Data Mining and Machine Learning in a nutshell                An Introduction to Data Mining   15
CLASSIFICATION: THE PROCESS

    • In classification, we are given a set of labeled
      examples
    • These examples are records/instances in the
      format (x, y) where x is a vector and y is the
      class attribute, commonly a scalar
    • The classification task is to build model that
      maps x to y
    • Our task is to find a mapping f such that f(x) = y


Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   16
CLASSIFICATION: THE PROCESS




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   17
CLASSIFICATION: AN EMAIL EXAMPLE

      • A set of emails is given where
        users have manually identified
        spam versus non-spam
      • Our task is to use a set of
        features such as words in the
        email (x) to identify spam/non-
        spam status of the email (y)
      • In this case, classes are
             y = {spam, non-spam}
      • What would it be dealt with in
        a social setting?
Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   18
CLASSIFICATION ALGORITHMS


      • Decision tree learning

      • Naive Bayes learning

      • K-nearest neighbor classifier




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   19
DECISION TREE

      • A decision tree is learned from the dataset
        (training data with known classes) and later
        applied to predict the class attribute value of
        new data (test data with unknown classes)
        where only the feature values are known




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   20
DECISION TREE INDUCTION




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   21
ID3, A DECISION TREE ALGORITHM

      Use information gain (entropy) to determine
      how well an attribute separates the training
      data according to the class attribute value

             – p+ is the proportion of positive examples in D
             – p- is the proportion of negative examples in D

    In a dataset containing ten examples, 7 have a positive class
    attribute value and 3 have a negative class attribute value [7+, 3-]:


    If the numbers of positive and negative examples in the set are equal, then the entropy is 1
Data Mining and Machine Learning in a nutshell                                 An Introduction to Data Mining   22
DECISION TREE: EXAMPLE 1
 outlook    temperature     humidity    windy    play
 sunny           85            85       FALSE    no
 sunny           80            90       TRUE     no
overcast         83            86       FALSE    yes
  rainy          70            96       FALSE    yes
  rainy          68            80       FALSE    yes
  rainy          65            70       TRUE     no
overcast         64            65       TRUE     yes
 sunny           72            95       FALSE    no
 sunny           69            70       FALSE    yes
  rainy          75            80       FALSE    yes
 sunny           75            70       TRUE     yes
overcast         72            90       TRUE     yes
overcast         81            75       FALSE    yes
  rainy          71            91       TRUE     no




Data Mining and Machine Learning in a nutshell          An Introduction to Data Mining   23
DECISION TREE: EXAMPLE 2




                                                 Class Labels




          Learned Decision Tree 1                               Learned Decision Tree 2
Data Mining and Machine Learning in a nutshell                        An Introduction to Data Mining   24
NAIVE BAYES CLASSIFIER

      For two random variables X and Y, Bayes
      theorem states that,


                                            class variable   the instance features

   Then class attribute value for instance X



    Assuming that variables
    are independent




Data Mining and Machine Learning in a nutshell                                  An Introduction to Data Mining   25
NBC: AN EXAMPLE




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   26
NEAREST NEIGHBOR CLASSIFIER

      • k-nearest neighbor employs the neighbors of a
        data point to perform classification
      • The instance being classified is assigned the
        label that the majority of k neighbors’ labels
      • When k = 1, the closest neighbor’s label is
        used as the predicted label for the instance
        being classified
      • For determining the neighbors, distance is
        computed based on some distance metric,
        e.g., Euclidean distance
Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   27
K-NN: ALGORITHM

      1. The dataset, number of neighbors (k), and
         the instance i is given
      2. Compute the distance between i and all
         other data points in the dataset
      3. Pick k closest neighbors
      4. The class label for the data point i is the one
         that the majority holds (if there are more
         than one class, select one of them randomly)


Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   28
K-NEAREST NEIGHBOR: EXAMPLE



     k = 10
     Class label = ?                                                               k=5


                                                                                   k=3




  • Depending on the k, different labels can be predicted for the green circle
  • In our example k = 3 and k = 5 generate different labels for the instance
  • K= 10 we can choose either triangle or rectangle
Data Mining and Machine Learning in a nutshell                 An Introduction to Data Mining   30
K-NEAREST NEIGHBOR: EXAMPLE




  Similarity between row 8 and other data instances;
  (Similarity = 1 if attributes have the same value, otherwise similarity = 0)
  Data instance Outlook Temperature Humidity Similarity Label                      K       Prediction
        2          1        1          1         3        N                        1           N
        1          1        0          1         2        N                        2           N
        4          0        1          1         2        Y                        3           N
        3          0        0          1         1        Y                        4           ?
        5          1        0          0         1        Y                        5           Y
        6          0        0          0         0        N                        6           ?
        7          0        0          0         0        Y                        7           Y
Data Mining and Machine Learning in a nutshell                           An Introduction to Data Mining   31
EVALUATING CLASSIFICATION PERFORMANCE

      • As the class labels are discrete, we can measure the
        accuracy by dividing number of correctly predicted
        labels (C) by the total number of instances (N)
             • Accuracy = C/N
             • Error rate = 1 - Accuracy
      • More sophisticated approaches of evaluation will be
        discussed later




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   32
REGRESSION




Data Mining and Machine Learning in a nutshell                An Introduction to Data Mining   33
REGRESSION

      Regression analysis includes techniques of
      modeling and analyzing the relationship
      between a dependent variable and one or more
      independent variables
      • Regression analysis is widely used for
        prediction and forecasting
      • It can be used to infer
        relationships between
        the independent and
        dependent variables
Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   34
REGRESSION
     In regression, we deal with real numbers as class
     values (Recall that in classification, class values
     or labels are categories)
                            y ≈ f(X)



                        Dependent variable       Regressors
                              y R                x1, x2, …, xm


    Our task is to find the relation between y and the vector
    (x1, x2, …, xm)
Data Mining and Machine Learning in a nutshell               An Introduction to Data Mining   35
LINEAR REGRESSION

      In linear regression, we assume the relation
      between the class attribute y and feature set x
      to be linear

      where w represents the vector of regression
      coefficients
      • The problem of regression can be solved by
        estimating w and using the provided dataset
        and the labels y
             – The least squares is often used to solve the
               problem
Data Mining and Machine Learning in a nutshell      An Introduction to Data Mining   36
SOLVING LINEAR REGRESSION PROBLEMS

      • The problem of regression can be solved by
        estimating w and using the dataset provided
        and the labels y
             – “Least squares” is a popular method to solve
               regression problems




Data Mining and Machine Learning in a nutshell    An Introduction to Data Mining   37
LEAST SQUARES

   Find W such that minimizing ǁY - XWǁ2 for
   regressors X and labels Y




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   38
LEAST SQUARES




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   39
REGRESSION COEFFICIENTS

      • When there is only one independent variable:
             y = w0 + w1x                                              n
                                                         x = å xi
                                                            1
                                                            n i




      • Two independent variables
             y = w0 + w1x1 + w2x2



Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   40
LINEAR REGRESSION: EXAMPLE

   Years of
                 Salary ($K)
  experience

       3             30
       8             57
       9             64
      13             72
       3             36
       6             43
      11             59
      21             90
       1             20
      16             83

Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   41
EVALUATING REGRESSION PERFORMANCE

      • The labels cannot be predicted precisely
      • It is needed to set a margin to accept or reject
        the predictions
             – For example, when the observed temperature is
               71 any prediction in the range of 71±0.5 can be
               considered as correct prediction




Data Mining and Machine Learning in a nutshell    An Introduction to Data Mining   43
CLUSTERING




Data Mining and Machine Learning in a nutshell                An Introduction to Data Mining   44
CLUSTERING

      Grouping together items that are similar in
      some way – according to some criteria

      • Clustering is a form of unsupervised learning
             – The clustering algorithms do not have examples
               showing how the samples should be group
               together
      • The clustering algorithms look for patterns or
        structures in the data that are of interest
      • Clustering algorithms group together similar
        items
Data Mining and Machine Learning in a nutshell    An Introduction to Data Mining   45
CLUSTERING: EXAMPLE




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   46
MEASURING SIMILARITY IN CLUSTERING ALGORITHMS

      • The goal is to group together similar items
      • Different similarity measures can be used to
        find similar items
      • Usually similarity measures are critical to
        clustering algorithms

      The most popular (dis)similarity measure for
      continuous features are Euclidean Distance and
      Pearson Linear Correlation

Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   47
EUCLIDEAN DISTANCE – A DISSIMILAR MEASURE




      • Here n is the number of dimensions in the
        data vector

Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   48
PEARSON LINEAR CORRELATION
                                          n

                                         å (x       i   - x )(yi - y )
           r (x, y) =               n
                                         i=1
                                                              n

                                  å (x        i   - x )2     å (y  i   - y )2
                                   i=1                       i=1

              1 n
           x = å xi
              n i
              1 n
           y = å yi
              n i



   • We’re shifting the expression profiles down (subtracting the means) and scaling
     by the standard deviations (i.e., making the data have mean = 0 and std = 1)
   • Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)

Data Mining and Machine Learning in a nutshell                                  An Introduction to Data Mining   49
SIMILARITY MEASURES: MORE DEFINITIONS




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   50
CLUSTERING


      • Distance-based algorithms

             – K-Means

      • Hierarchical algorithms




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   51
K-MEANS

      k-means clustering aims to partition n
      observations into k clusters in which each
      observation belongs to the cluster with the
      nearest mean
      • Finding the global optimal of k partitions is
        computationally expensive (NP-hard).
        However, there are efficient heuristic
        algorithms that are commonly employed and
        converge quickly to an optimum that might
        not be global.
Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   52
K-MEANS

      • Given a set of observations (x1, x2, …, xn),
        where each observation is a d-dimensional
        real vector, k-means clustering aims to
        partition the n observations into k sets (k ≤ n)
        S = {S1, S2, …, Sk} so as to minimize the within-
        cluster sum of squares:



            where μi is the mean of points in Si



Data Mining and Machine Learning in a nutshell     An Introduction to Data Mining   53
K-MEANS: ALGORITHM

      Given data points xi and an initial set
      of k centroids m1(1),…,mk(1), the algorithm proceeds as
      follows:
      • Assignment step: Assign each data point to the
         cluster Si with the closest centroid each data
         point goes into exactly one cluster)




      • Update step: Calculate the new means to be
        the centroid of the data points in the cluster

Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   54
K-MEANS: AN EXAMPLE

      Data                                                      Cluster 1                   Cluster 2
                   X          Y
      point
                                                 Step   Data point     Centroid     Data point       Centroid
        1         1           1
        2         2           1                   1         1          (1.0, 1.0)       2            (2.0, 1.0)
        3         1           2
        4         2           2
        5         4           4
        6         4           5
        7         5           4
        8         5           5




Data Mining and Machine Learning in a nutshell                                      An Introduction to Data Mining   56
RUNNING K-MEANS ON IRIS DATASET




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   58
HIERARCHICAL CLUSTERING

      Hierarchical clustering is a method of cluster
      analysis which seeks to build a hierarchy of clusters.
      • Strategies for hierarchical clustering generally fall
        into two types:
             – Agglomerative: This is a "bottom up" approach: each
               observation starts in its own cluster, and pairs of
               clusters are merged as one moves up the hierarchy.
             – Divisive: This is a "top down" approach: all
               observations start in one cluster, and splits are
               performed recursively as one moves down the
               hierarchy.

Data Mining and Machine Learning in a nutshell      An Introduction to Data Mining   60
HIERARCHICAL ALGORITHMS

      • Initially n data points are considered as either
        1 or n clusters in hierarchical clustering
      • These clusters are gradually split or merged
        (divisive or agglomerative hierarchical
        clustering algorithms), depending on the type
        of an algorithm
      • Until the desired number of clusters are
        reached



Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   61
HIERARCHICAL AGGLOMERATIVE CLUSTERING

      • Start with each data point as a cluster
      • Keep merging the most similar pairs of data
        points/clusters until only one big cluster left
      • This is called a bottom-up or agglomerative
        method

      This produces a binary tree or dendrogram
             – The final cluster is the root and each data point is
               a leaf
             – The height of the bars indicate how close the
               points are
Data Mining and Machine Learning in a nutshell       An Introduction to Data Mining   62
HIERARCHICAL CLUSTERING: AN EXAMPLE




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   63
MERGING THE DATA POINTS IN HIERARCHICAL CLUSTERING

      • Average Linkage
             – Each cluster ci is associated with a mean vector i
               which is the mean of all the data items in the
               cluster
             – The distance between two clusters ci and cj is then
               just d( i , j )
      • Single Linkage
             – The minimum of all pairwise distances between
               points in the two clusters
      • Complete Linkage
             – The maximum of all pairwise distances between
               points in the two clusters
Data Mining and Machine Learning in a nutshell      An Introduction to Data Mining   64
LINKAGE IN HIERARCHICAL CLUSTERING: EXAMPLE


                  Single Linkage                 Average Linkage




                                                           Complete Linkage




Data Mining and Machine Learning in a nutshell                                An Introduction to Data Mining   65
EVALUATING THE CLUSTERINGS

     When we are given objects of two different
     kinds, the perfect clustering would be that
     objects of the same type are clustered together.




      • Evaluation with ground truth
      • Evaluation without ground truth
Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   66
EVALUATION WITH GROUND TRUTH

      When ground truth is available, the evaluator
      has prior knowledge of what a clustering should
      be
             – That is, we know the correct clustering
               assignments.


      • Measures
             – Precision and Recall, or F-Measure
             – Purity
             – Normalized Mutual Information (NMI)

Data Mining and Machine Learning in a nutshell     An Introduction to Data Mining   67
PRECISION AND RECALL




      • True Positive (TP) :                         • False Negative (FN) :
             – when similar points are assigned to       – when similar points are assigned to
               the same clusters                           different clusters
             – This is considered a correct              – This is considered an incorrect
               decision.                                   decision
      • True Negative (TN) :                         • False Positive (FP) :
             – when dissimilar points are                – when dissimilar points are
               assigned to different clusters              assigned to the same clusters
             – This is considered a correct              – This is considered an incorrect
               decision                                    decision

Data Mining and Machine Learning in a nutshell                           An Introduction to Data Mining   68
PRECISION AND RECALL: EXAMPLE 1




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   69
F-MEASURE

      • To consolidate precision and recall into one
        measure, we can use the harmonic mean of
        precision of recall




      Computed for the same example, we get F = 0.54


Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   71
PURITY
  • In purity, we assume the majority of a cluster
    represents the cluster
  • Hence, we use the label of the majority
    against the label of each member to evaluate
  • the algorithm easily tampered; consider points
     Purity can be
  • The purity is then defined assize 1) or very large
     being singleton clusters (of the fraction of
    instances that have labels equal to the
     clusters.
    cluster’s majority label
  • In both cases, purity does not make much sense.


        where Lj defines label j (ground truth) and
        Mi defines the majority label for cluster i

Data Mining and Machine Learning in a nutshell        An Introduction to Data Mining   72
MUTUAL INFORMATION

      The mutual information of two random
      variables is a quantity that measures the mutual
      dependence of the two random variables




    • p(x,y) is the joint probability distribution function of X and Y,
    • p(x) and p(y) are the marginal probability distribution
      functions of X and Y respectively

Data Mining and Machine Learning in a nutshell           An Introduction to Data Mining   73
NORMALIZED MUTUAL INFORMATION

      Normalized Mutual Information has been
      derived from information theory where the
      Mutual Information (MI) between the
      clusterings found and the labels is normalized by
      the upper bound of (MI) which is a mean of the
      entropies (H) of labels and clusterings found




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   74
NORMALIZED MUTUAL INFORMATION




        •   where l and h are labels and found clusterings,
        •   nh and nl are the number of data points in the clusters h and l, respectively,
        •   nh,l is the number of points in clusters h and labeled l,
        •   n is the size of the dataset


      • NMI values close to one indicate high similarity
        between clusterings found and labels
      • Values close to zero indicate high dissimilarity
        between them
Data Mining and Machine Learning in a nutshell                             An Introduction to Data Mining   75
NORMALIZED MUTUAL INFORMATION: EXAMPLE

      Partition a: [1,1,1,1,1,1,1, 2,2,2,2,2,2,2]
      Partition b: [1,1,1,1,1,2,2, 1,2,2,2,2,2,2]



                                                 nh         nl   nh,l   l=1      l=2
             n = 14                  h=1         6    l=1   7    h=1      5        1
                                     h=2         8    l=2   7    h=2      2        6




Data Mining and Machine Learning in a nutshell                          An Introduction to Data Mining   76
EVALUATION WITHOUT GROUND TRUTH

      • Use domain experts

      • Use quality measures such as SSE
             – SSE: the sum of the squared error for all clusters

      • Use more than two clustering algorithms and
            compare the results and pick the algorithm
            with better quality measure


Data Mining and Machine Learning in a nutshell      An Introduction to Data Mining   77
TEXT MINING




Data Mining and Machine Learning in a nutshell                 An Introduction to Data Mining   78
TEXT MINING

      • In social media, most of the data that is
        available online is in text format
      • In general, the way to perform data mining is
        to convert text data into tabular format and
        then perform data mining on this data

      • The process of converting text data into
        tabular data is called vectorization


Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   79
TEXT MINING PROCESS




      A set of linguistic, statistical, and machine
      learning techniques that model and structure
      the information content of textual sources for
      business intelligence, exploratory data analysis,
      research, or investigation

Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   80
TEXT PREPROCESSING

      Text preprocessing aims to make the input
      documents more consistent to facilitate text
      representation, which is necessary for most text
      analytics tasks
      • Methods:
             – Stop word removal
                    • Stop word removal eliminates words using a stop word list,
                      in which the words are considered more general and
                      meaningless
                           – e.g. the, a, is, at, which
             – Stemming
                    • Stemming reduces inflected (or sometimes derived) words
                      to their stem, base or root form
                           – For example, “watch”, “watching”, “watched” are represented as
                             “watch”

Data Mining and Machine Learning in a nutshell                          An Introduction to Data Mining   81
TEXT REPRESENTATION

      • The most common way to model documents
        is to transform them into sparse numeric
        vectors and then deal with them with linear
        algebraic operations
      • This representation is called “Bag of Words”

      • Methods:
             – Vector space model
             – tf-idf

Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   82
VECTOR SPACE MODEL

      • In the vector space model, we start with a set
        of documents, D
      • Each document is a set of words
      • The goal is to convert these textual
        documents to vectors


                    • di : document i, wj,i : the weight for word j in document i

        The weight can be set to 1 when the word exist in the document and 0 when
        it does not. Or we can set this weight to the number of times the word is
        observed in the document
Data Mining and Machine Learning in a nutshell                         An Introduction to Data Mining   83
VECTOR SPACE MODEL: AN EXAMPLE

      • Documents:
             – d1: data mining and social media mining
             – d2: social network analysis
             – d3: data mining
      • Reference vector:
             – (social, media, mining, network, analysis, data)
      • Vector representation:
                            analysis         data   media   mining   network    social
                    d1         0              1      1        1         0         1
                    d2         1              0      0        0         1         1
                    d3         0              1      0        1         0         0

Data Mining and Machine Learning in a nutshell                             An Introduction to Data Mining   84
TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)

    tf-idf of term t, document d, and document corpus D is
    calculated as follows:
                  tf-idf(t, d, D) = tf (t, d) * idf (t, D)




                                                 The total number of documents in
                                                 the corpus


                                                 The number of documents where
                                                 the term t appears


Data Mining and Machine Learning in a nutshell            An Introduction to Data Mining   85
TF-IDF: AN EXAMPLE

      Consider words “apple” and “the” that appear
      10 and 20 times in document 1 (d1), which
      contains 100 words.
      Consider |D| = 20 and word “apple” only
      appearing in d1 and word “the” appearing in all
      20 documents




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   86
TF-IDF: AN EXAMPLE

      • Documents:
             – d1: data mining and social media mining
             – d2: social network analysis
             – d3: data mining
      • tf-idf representation:

                                 analysis        data   media   mining   network       social
          df(w)                     1              2      1        2        1             2
          log(N/df(w))            0.48           0.18   0.48     0.18     0.48          0.18
          d1, tf                    0              1      1        2        0             1
          d2, tf                    1              0      0        0        1             1
          d3, tf                    0              1      0        1        0             0
          d1, tf-idf              0.00           0.18   0.48     0.35     0.00          0.18
          d2, tf-idf              0.48           0.00   0.00     0.00     0.48          0.18
          d3, tf-idf              0.00           0.18   0.00     0.18     0.00          0.00


Data Mining and Machine Learning in a nutshell                             An Introduction to Data Mining   87
SENTIMENT ANALYSIS

      • Sentiment analysis or opinion mining refers to
        the application of natural language
        processing, computational linguistics, and text
        analytics to identify and extract subjective
        information in source materials
      • It aims to determine the attitude of a speaker
        or a writer with respect to some topic or the
        overall contextual polarity of a document.



Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   88
POLARITY ANALYSIS

      • The basic task in opinion mining is classifying
        the polarity of a given document or text
             – The polarity could be positive, negative, or neutral
      • Methods:
             – Naïve Bayes
             – Pointwise Mutual Information (PMI)




Data Mining and Machine Learning in a nutshell      An Introduction to Data Mining   89
MEASURING POLARITY, NAÏVE BAYES

      • Bayes’ rule:



      • If we consider that the occurrence of features
        (words) in the document are independent




Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   90
MEASURING POLARITY, MAXIMUM ENTROPY




      • Z(d) is the normalization factor
      • is feature-weight parameter and shows the
        importance of each feature
      • Fi,c is defined as a feature/class function for
        feature fi and class c


Data Mining and Machine Learning in a nutshell   An Introduction to Data Mining   91
MEASURING POLARITY, POINTWISE MUTUAL INFORMATION




         • P(word) is the number of results returned by search engine in response to search
           for term word
         • P(word1 word2) is the number of results for mutual search of word1 and word2
           together


Data Mining and Machine Learning in a nutshell                       An Introduction to Data Mining   92
Mohammad-Ali Abbasi (Ali),
                                Ali, is a Ph.D. student at Data Mining
                                and Machine Learning Lab, Arizona
                                State University.
                                His research interests include Data
                                Mining, Machine Learning, Social
                                Computing, and Social Media Behavior
                                Analysis.

                                http://www.public.asu.edu/~mabbasi2/

Data Mining and Machine Learning in a nutshell                         An Introduction to Data Mining   93

Mais conteúdo relacionado

Mais procurados

Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Salah Amean
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Big data visualization
Big data visualizationBig data visualization
Big data visualizationAnurag Gupta
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Seerat Malik
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Data mining
Data mining Data mining
Data mining AthiraR23
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & ApplicationsFazle Rabbi Ador
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 

Mais procurados (20)

Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Data mining
Data miningData mining
Data mining
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Data mining
Data mining Data mining
Data mining
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Science
Data ScienceData Science
Data Science
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Big data visualization
Big data visualizationBig data visualization
Big data visualization
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Data mining
Data mining Data mining
Data mining
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & Applications
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 

Destaque

Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataPier Luca Lanzi
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining AreaMahamudHasanCSE
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...Ryan Rosario
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data miningSnehali Chake
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAmdocs
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsMotaz Saad
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 

Destaque (18)

Machine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web DataMachine Learning and Data Mining: 19 Mining Text And Web Data
Machine Learning and Data Mining: 19 Mining Text And Web Data
 
Data mining and_big_data_web
Data mining and_big_data_webData mining and_big_data_web
Data mining and_big_data_web
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Text mining presentation in Data mining Area
Text mining presentation in Data mining AreaText mining presentation in Data mining Area
Text mining presentation in Data mining Area
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
NumPy and SciPy for Data Mining and Data Analysis Including iPython, SciKits,...
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Data warehousing and data mining
Data warehousing and data miningData warehousing and data mining
Data warehousing and data mining
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data Mining and Business Intelligence Tools
Data Mining and Business Intelligence ToolsData Mining and Business Intelligence Tools
Data Mining and Business Intelligence Tools
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Data mining
Data miningData mining
Data mining
 

Semelhante a Data Mining: an Introduction

Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data miningDhilsath Fathima
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furcShani729
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptAravindReddy565690
 
lec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxlec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxAmjadAlDgour
 
Data Mining in Operating System
Data Mining in Operating SystemData Mining in Operating System
Data Mining in Operating SystemITz_1
 
Session7part1
Session7part1Session7part1
Session7part1abiraaman
 
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)dradilkhan87
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxshumPanwar
 
Data mining nouman javed
Data mining   nouman javedData mining   nouman javed
Data mining nouman javednouman javed
 
OLAP IN DATA MINING
OLAP IN DATA MININGOLAP IN DATA MINING
OLAP IN DATA MININGwilifred
 
Intro to Data warehousing lecture 16
Intro to Data warehousing   lecture 16Intro to Data warehousing   lecture 16
Intro to Data warehousing lecture 16AnwarrChaudary
 
`Data mining
`Data mining`Data mining
`Data miningJebin R
 
Data mining
Data miningData mining
Data miningsagar dl
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1malathieswaran29
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfssuserb933d8
 

Semelhante a Data Mining: an Introduction (20)

Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Data mining applications
Data mining applicationsData mining applications
Data mining applications
 
Data mining
Data miningData mining
Data mining
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Data warehousing and mining furc
Data warehousing and mining furcData warehousing and mining furc
Data warehousing and mining furc
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).ppt
 
lec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptxlec01-IntroductionToDataMining.pptx
lec01-IntroductionToDataMining.pptx
 
Data Mining in Operating System
Data Mining in Operating SystemData Mining in Operating System
Data Mining in Operating System
 
Session7part1
Session7part1Session7part1
Session7part1
 
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptx
 
Data mining nouman javed
Data mining   nouman javedData mining   nouman javed
Data mining nouman javed
 
OLAP IN DATA MINING
OLAP IN DATA MININGOLAP IN DATA MINING
OLAP IN DATA MINING
 
Lecture - Data Mining
Lecture - Data MiningLecture - Data Mining
Lecture - Data Mining
 
Intro to Data warehousing lecture 16
Intro to Data warehousing   lecture 16Intro to Data warehousing   lecture 16
Intro to Data warehousing lecture 16
 
`Data mining
`Data mining`Data mining
`Data mining
 
Data mining
Data miningData mining
Data mining
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdf
 
Lecture2 (1).ppt
Lecture2 (1).pptLecture2 (1).ppt
Lecture2 (1).ppt
 

Mais de Ali Abbasi

Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An IntroductionAli Abbasi
 
Active learning
Active learningActive learning
Active learningAli Abbasi
 
Disaster Relief Using Social Media Data
Disaster Relief Using Social Media DataDisaster Relief Using Social Media Data
Disaster Relief Using Social Media DataAli Abbasi
 
Real-World Behavior Analysis through a Social Media Lens
Real-World Behavior Analysis through a Social Media LensReal-World Behavior Analysis through a Social Media Lens
Real-World Behavior Analysis through a Social Media LensAli Abbasi
 
Collective Intelligence, part II
Collective Intelligence, part IICollective Intelligence, part II
Collective Intelligence, part IIAli Abbasi
 
Collective Inteligence Part I
Collective Inteligence Part ICollective Inteligence Part I
Collective Inteligence Part IAli Abbasi
 
Learning To Recognize Reliable Users And Content In Social Media With Coupled...
Learning To Recognize Reliable Users And Content In Social Media With Coupled...Learning To Recognize Reliable Users And Content In Social Media With Coupled...
Learning To Recognize Reliable Users And Content In Social Media With Coupled...Ali Abbasi
 
Evolutionary Game Theory
Evolutionary Game TheoryEvolutionary Game Theory
Evolutionary Game TheoryAli Abbasi
 
Game Theory: an Introduction
Game Theory: an IntroductionGame Theory: an Introduction
Game Theory: an IntroductionAli Abbasi
 

Mais de Ali Abbasi (9)

Social Media Mining: An Introduction
Social Media Mining: An IntroductionSocial Media Mining: An Introduction
Social Media Mining: An Introduction
 
Active learning
Active learningActive learning
Active learning
 
Disaster Relief Using Social Media Data
Disaster Relief Using Social Media DataDisaster Relief Using Social Media Data
Disaster Relief Using Social Media Data
 
Real-World Behavior Analysis through a Social Media Lens
Real-World Behavior Analysis through a Social Media LensReal-World Behavior Analysis through a Social Media Lens
Real-World Behavior Analysis through a Social Media Lens
 
Collective Intelligence, part II
Collective Intelligence, part IICollective Intelligence, part II
Collective Intelligence, part II
 
Collective Inteligence Part I
Collective Inteligence Part ICollective Inteligence Part I
Collective Inteligence Part I
 
Learning To Recognize Reliable Users And Content In Social Media With Coupled...
Learning To Recognize Reliable Users And Content In Social Media With Coupled...Learning To Recognize Reliable Users And Content In Social Media With Coupled...
Learning To Recognize Reliable Users And Content In Social Media With Coupled...
 
Evolutionary Game Theory
Evolutionary Game TheoryEvolutionary Game Theory
Evolutionary Game Theory
 
Game Theory: an Introduction
Game Theory: an IntroductionGame Theory: an Introduction
Game Theory: an Introduction
 

Último

Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...boychatmate1
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 

Último (20)

Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
Introduction to Mongo DB-open-­‐source, high-­‐performance, document-­‐orient...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 

Data Mining: an Introduction

  • 1. DATA MINING AND MACHINE LEARNING IN A NUTSHELL AN INTRODUCTION TO DATA MINING Mohammad-Ali Abbasi http://www.public.asu.edu/~mabbasi2/ SCHOOL OF COMPUTING, INFORMATICS, AND DECISION SYSTEMS ENGINEERING ARIZONA STATE UNIVERSITY http://dmml.asu.edu/ Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 1
  • 2. INTRODUCTION • Data production rate has been increased dramatically (Big Data) and we are able store much more data than before – E.g., purchase data, social media data, mobile phone data • Businesses and customers need useful or actionable knowledge and gain insight from raw data for various purposes – It’s not just searching data or databases Data mining helps us to extract new information and uncover hidden patterns out of the stored and streaming data Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 2
  • 3. DATA MINING The process of discovering hidden patterns in large data sets It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems • Extracting or “mining” knowledge from large amounts of data, or big data • Data-driven discovery and modeling of hidden patterns in big data • Extracting implicit, previously unknown, unexpected, and potentially useful information/knowledge from data Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 3
  • 4. DATA MINING STORIES • “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection • The NSA is using data mining to analyze telephone call data to track al’Qaeda activities • Walmart uses data mining to control product distribution based on typical customer buying patterns at individual stores Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 4
  • 5. DATA MINING VS. DATABASES • Data mining is the process of extracting hidden and actionable patterns from data • Database systems store and manage data – Queries return part of stored data – Queries do not extract hidden patterns • Examples of querying databases – Find all employees with income more than $250K – Find top spending customers in last month – Find all students from engineering college with GPA more than average Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 5
  • 6. EXAMPLES OF DATA MINING APPLICATIONS • Identifying fraudulent transactions of a credit card or spam emails – You are given a user’s purchase history and a new transaction, identify whether the transaction is fraud or not; – Determine whether a given email is spam or not • Extracting purchase patterns from existing records – beer ⇒ dippers (80%) • Forecasting future sales and needs according to some given samples • Extracting groups of like-minded people in a given network Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 6
  • 7. BASIC DATA MINING TASKS • Classification – Assign data into predefined classes • Spam Detection, fraudulent credit card detection • Regression – Predict a real value for a given data instance • Predict the price for a given house • Clustering – Group similar items together into some clusters • Detect communities in a given social network Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 7
  • 8. DATA Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 8
  • 9. DATA INSTANCES • A collection of properties and features related to an object or person – A patient’s medical record – A user’s profile – A gene’s information • Instances are also called examples, records, data points, or observations Data Instance: Features or Attributes Class Label Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 9
  • 10. DATA TYPES • Nominal (categorical) – No comparison is defined – E.g., {male, female} • Ordinal – Comparable but the difference is not defined – E.g., {Low, medium, high} • Interval – Deduction and addition is defined but not division – E.g., 3:08 PM, calendar dates • Ratio – E.g., Height, weight, money quantities Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 10
  • 11. SAMPLE DATASET outlook temperature humidity windy play sunny 85 85 FALSE no sunny 80 90 TRUE no overcast 83 86 FALSE yes rainy 70 96 FALSE yes rainy 68 80 FALSE yes rainy 65 70 TRUE no overcast 64 65 TRUE yes sunny 72 95 FALSE no sunny 69 70 FALSE yes rainy 75 80 FALSE yes sunny 75 70 TRUE yes overcast 72 90 TRUE yes overcast 81 75 FALSE yes rainy 71 91 TRUE no Interval Ordinal Nominal Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 11
  • 12. DATA QUALITY When making data ready for data mining algorithms, data quality need to be assured • Noise – Noise is the distortion of the data • Outliers – Outliers are data points that are considerably different from other data points in the dataset • Missing Values – Missing feature values in data instances • Duplicate data Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 12
  • 13. DATA PREPROCESSING • Aggregation – when multiple attributes need to be combined into a single attribute or when the scale of the attributes change • Discretization – From continues values to discrete values • Feature Selection – Choose relevant features • Feature Extraction – Creating a mapping of new features from original features • Sampling – Random Sampling – Sampling with or without replacement – Stratified Sampling Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 13
  • 14. CLASSIFICATION Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 14
  • 15. CLASSIFICATION Learning patterns from labeled data and classify new data with labels (categories) – For example, we want to classify an e-mail as "legitimate" or "spam" Classifier Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 15
  • 16. CLASSIFICATION: THE PROCESS • In classification, we are given a set of labeled examples • These examples are records/instances in the format (x, y) where x is a vector and y is the class attribute, commonly a scalar • The classification task is to build model that maps x to y • Our task is to find a mapping f such that f(x) = y Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 16
  • 17. CLASSIFICATION: THE PROCESS Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 17
  • 18. CLASSIFICATION: AN EMAIL EXAMPLE • A set of emails is given where users have manually identified spam versus non-spam • Our task is to use a set of features such as words in the email (x) to identify spam/non- spam status of the email (y) • In this case, classes are y = {spam, non-spam} • What would it be dealt with in a social setting? Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 18
  • 19. CLASSIFICATION ALGORITHMS • Decision tree learning • Naive Bayes learning • K-nearest neighbor classifier Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 19
  • 20. DECISION TREE • A decision tree is learned from the dataset (training data with known classes) and later applied to predict the class attribute value of new data (test data with unknown classes) where only the feature values are known Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 20
  • 21. DECISION TREE INDUCTION Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 21
  • 22. ID3, A DECISION TREE ALGORITHM Use information gain (entropy) to determine how well an attribute separates the training data according to the class attribute value – p+ is the proportion of positive examples in D – p- is the proportion of negative examples in D In a dataset containing ten examples, 7 have a positive class attribute value and 3 have a negative class attribute value [7+, 3-]: If the numbers of positive and negative examples in the set are equal, then the entropy is 1 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 22
  • 23. DECISION TREE: EXAMPLE 1 outlook temperature humidity windy play sunny 85 85 FALSE no sunny 80 90 TRUE no overcast 83 86 FALSE yes rainy 70 96 FALSE yes rainy 68 80 FALSE yes rainy 65 70 TRUE no overcast 64 65 TRUE yes sunny 72 95 FALSE no sunny 69 70 FALSE yes rainy 75 80 FALSE yes sunny 75 70 TRUE yes overcast 72 90 TRUE yes overcast 81 75 FALSE yes rainy 71 91 TRUE no Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 23
  • 24. DECISION TREE: EXAMPLE 2 Class Labels Learned Decision Tree 1 Learned Decision Tree 2 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 24
  • 25. NAIVE BAYES CLASSIFIER For two random variables X and Y, Bayes theorem states that, class variable the instance features Then class attribute value for instance X Assuming that variables are independent Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 25
  • 26. NBC: AN EXAMPLE Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 26
  • 27. NEAREST NEIGHBOR CLASSIFIER • k-nearest neighbor employs the neighbors of a data point to perform classification • The instance being classified is assigned the label that the majority of k neighbors’ labels • When k = 1, the closest neighbor’s label is used as the predicted label for the instance being classified • For determining the neighbors, distance is computed based on some distance metric, e.g., Euclidean distance Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 27
  • 28. K-NN: ALGORITHM 1. The dataset, number of neighbors (k), and the instance i is given 2. Compute the distance between i and all other data points in the dataset 3. Pick k closest neighbors 4. The class label for the data point i is the one that the majority holds (if there are more than one class, select one of them randomly) Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 28
  • 29. K-NEAREST NEIGHBOR: EXAMPLE k = 10 Class label = ? k=5 k=3 • Depending on the k, different labels can be predicted for the green circle • In our example k = 3 and k = 5 generate different labels for the instance • K= 10 we can choose either triangle or rectangle Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 30
  • 30. K-NEAREST NEIGHBOR: EXAMPLE Similarity between row 8 and other data instances; (Similarity = 1 if attributes have the same value, otherwise similarity = 0) Data instance Outlook Temperature Humidity Similarity Label K Prediction 2 1 1 1 3 N 1 N 1 1 0 1 2 N 2 N 4 0 1 1 2 Y 3 N 3 0 0 1 1 Y 4 ? 5 1 0 0 1 Y 5 Y 6 0 0 0 0 N 6 ? 7 0 0 0 0 Y 7 Y Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 31
  • 31. EVALUATING CLASSIFICATION PERFORMANCE • As the class labels are discrete, we can measure the accuracy by dividing number of correctly predicted labels (C) by the total number of instances (N) • Accuracy = C/N • Error rate = 1 - Accuracy • More sophisticated approaches of evaluation will be discussed later Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 32
  • 32. REGRESSION Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 33
  • 33. REGRESSION Regression analysis includes techniques of modeling and analyzing the relationship between a dependent variable and one or more independent variables • Regression analysis is widely used for prediction and forecasting • It can be used to infer relationships between the independent and dependent variables Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 34
  • 34. REGRESSION In regression, we deal with real numbers as class values (Recall that in classification, class values or labels are categories) y ≈ f(X) Dependent variable Regressors y R x1, x2, …, xm Our task is to find the relation between y and the vector (x1, x2, …, xm) Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 35
  • 35. LINEAR REGRESSION In linear regression, we assume the relation between the class attribute y and feature set x to be linear where w represents the vector of regression coefficients • The problem of regression can be solved by estimating w and using the provided dataset and the labels y – The least squares is often used to solve the problem Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 36
  • 36. SOLVING LINEAR REGRESSION PROBLEMS • The problem of regression can be solved by estimating w and using the dataset provided and the labels y – “Least squares” is a popular method to solve regression problems Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 37
  • 37. LEAST SQUARES Find W such that minimizing ǁY - XWǁ2 for regressors X and labels Y Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 38
  • 38. LEAST SQUARES Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 39
  • 39. REGRESSION COEFFICIENTS • When there is only one independent variable: y = w0 + w1x n x = å xi 1 n i • Two independent variables y = w0 + w1x1 + w2x2 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 40
  • 40. LINEAR REGRESSION: EXAMPLE Years of Salary ($K) experience 3 30 8 57 9 64 13 72 3 36 6 43 11 59 21 90 1 20 16 83 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 41
  • 41. EVALUATING REGRESSION PERFORMANCE • The labels cannot be predicted precisely • It is needed to set a margin to accept or reject the predictions – For example, when the observed temperature is 71 any prediction in the range of 71±0.5 can be considered as correct prediction Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 43
  • 42. CLUSTERING Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 44
  • 43. CLUSTERING Grouping together items that are similar in some way – according to some criteria • Clustering is a form of unsupervised learning – The clustering algorithms do not have examples showing how the samples should be group together • The clustering algorithms look for patterns or structures in the data that are of interest • Clustering algorithms group together similar items Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 45
  • 44. CLUSTERING: EXAMPLE Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 46
  • 45. MEASURING SIMILARITY IN CLUSTERING ALGORITHMS • The goal is to group together similar items • Different similarity measures can be used to find similar items • Usually similarity measures are critical to clustering algorithms The most popular (dis)similarity measure for continuous features are Euclidean Distance and Pearson Linear Correlation Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 47
  • 46. EUCLIDEAN DISTANCE – A DISSIMILAR MEASURE • Here n is the number of dimensions in the data vector Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 48
  • 47. PEARSON LINEAR CORRELATION n å (x i - x )(yi - y ) r (x, y) = n i=1 n å (x i - x )2 å (y i - y )2 i=1 i=1 1 n x = å xi n i 1 n y = å yi n i • We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1) • Always between –1 and +1 (perfectly anti-correlated and perfectly correlated) Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 49
  • 48. SIMILARITY MEASURES: MORE DEFINITIONS Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 50
  • 49. CLUSTERING • Distance-based algorithms – K-Means • Hierarchical algorithms Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 51
  • 50. K-MEANS k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean • Finding the global optimal of k partitions is computationally expensive (NP-hard). However, there are efficient heuristic algorithms that are commonly employed and converge quickly to an optimum that might not be global. Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 52
  • 51. K-MEANS • Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within- cluster sum of squares: where μi is the mean of points in Si Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 53
  • 52. K-MEANS: ALGORITHM Given data points xi and an initial set of k centroids m1(1),…,mk(1), the algorithm proceeds as follows: • Assignment step: Assign each data point to the cluster Si with the closest centroid each data point goes into exactly one cluster) • Update step: Calculate the new means to be the centroid of the data points in the cluster Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 54
  • 53. K-MEANS: AN EXAMPLE Data Cluster 1 Cluster 2 X Y point Step Data point Centroid Data point Centroid 1 1 1 2 2 1 1 1 (1.0, 1.0) 2 (2.0, 1.0) 3 1 2 4 2 2 5 4 4 6 4 5 7 5 4 8 5 5 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 56
  • 54. RUNNING K-MEANS ON IRIS DATASET Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 58
  • 55. HIERARCHICAL CLUSTERING Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. • Strategies for hierarchical clustering generally fall into two types: – Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. – Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 60
  • 56. HIERARCHICAL ALGORITHMS • Initially n data points are considered as either 1 or n clusters in hierarchical clustering • These clusters are gradually split or merged (divisive or agglomerative hierarchical clustering algorithms), depending on the type of an algorithm • Until the desired number of clusters are reached Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 61
  • 57. HIERARCHICAL AGGLOMERATIVE CLUSTERING • Start with each data point as a cluster • Keep merging the most similar pairs of data points/clusters until only one big cluster left • This is called a bottom-up or agglomerative method This produces a binary tree or dendrogram – The final cluster is the root and each data point is a leaf – The height of the bars indicate how close the points are Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 62
  • 58. HIERARCHICAL CLUSTERING: AN EXAMPLE Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 63
  • 59. MERGING THE DATA POINTS IN HIERARCHICAL CLUSTERING • Average Linkage – Each cluster ci is associated with a mean vector i which is the mean of all the data items in the cluster – The distance between two clusters ci and cj is then just d( i , j ) • Single Linkage – The minimum of all pairwise distances between points in the two clusters • Complete Linkage – The maximum of all pairwise distances between points in the two clusters Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 64
  • 60. LINKAGE IN HIERARCHICAL CLUSTERING: EXAMPLE Single Linkage Average Linkage Complete Linkage Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 65
  • 61. EVALUATING THE CLUSTERINGS When we are given objects of two different kinds, the perfect clustering would be that objects of the same type are clustered together. • Evaluation with ground truth • Evaluation without ground truth Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 66
  • 62. EVALUATION WITH GROUND TRUTH When ground truth is available, the evaluator has prior knowledge of what a clustering should be – That is, we know the correct clustering assignments. • Measures – Precision and Recall, or F-Measure – Purity – Normalized Mutual Information (NMI) Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 67
  • 63. PRECISION AND RECALL • True Positive (TP) : • False Negative (FN) : – when similar points are assigned to – when similar points are assigned to the same clusters different clusters – This is considered a correct – This is considered an incorrect decision. decision • True Negative (TN) : • False Positive (FP) : – when dissimilar points are – when dissimilar points are assigned to different clusters assigned to the same clusters – This is considered a correct – This is considered an incorrect decision decision Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 68
  • 64. PRECISION AND RECALL: EXAMPLE 1 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 69
  • 65. F-MEASURE • To consolidate precision and recall into one measure, we can use the harmonic mean of precision of recall Computed for the same example, we get F = 0.54 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 71
  • 66. PURITY • In purity, we assume the majority of a cluster represents the cluster • Hence, we use the label of the majority against the label of each member to evaluate • the algorithm easily tampered; consider points Purity can be • The purity is then defined assize 1) or very large being singleton clusters (of the fraction of instances that have labels equal to the clusters. cluster’s majority label • In both cases, purity does not make much sense. where Lj defines label j (ground truth) and Mi defines the majority label for cluster i Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 72
  • 67. MUTUAL INFORMATION The mutual information of two random variables is a quantity that measures the mutual dependence of the two random variables • p(x,y) is the joint probability distribution function of X and Y, • p(x) and p(y) are the marginal probability distribution functions of X and Y respectively Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 73
  • 68. NORMALIZED MUTUAL INFORMATION Normalized Mutual Information has been derived from information theory where the Mutual Information (MI) between the clusterings found and the labels is normalized by the upper bound of (MI) which is a mean of the entropies (H) of labels and clusterings found Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 74
  • 69. NORMALIZED MUTUAL INFORMATION • where l and h are labels and found clusterings, • nh and nl are the number of data points in the clusters h and l, respectively, • nh,l is the number of points in clusters h and labeled l, • n is the size of the dataset • NMI values close to one indicate high similarity between clusterings found and labels • Values close to zero indicate high dissimilarity between them Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 75
  • 70. NORMALIZED MUTUAL INFORMATION: EXAMPLE Partition a: [1,1,1,1,1,1,1, 2,2,2,2,2,2,2] Partition b: [1,1,1,1,1,2,2, 1,2,2,2,2,2,2] nh nl nh,l l=1 l=2 n = 14 h=1 6 l=1 7 h=1 5 1 h=2 8 l=2 7 h=2 2 6 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 76
  • 71. EVALUATION WITHOUT GROUND TRUTH • Use domain experts • Use quality measures such as SSE – SSE: the sum of the squared error for all clusters • Use more than two clustering algorithms and compare the results and pick the algorithm with better quality measure Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 77
  • 72. TEXT MINING Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 78
  • 73. TEXT MINING • In social media, most of the data that is available online is in text format • In general, the way to perform data mining is to convert text data into tabular format and then perform data mining on this data • The process of converting text data into tabular data is called vectorization Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 79
  • 74. TEXT MINING PROCESS A set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 80
  • 75. TEXT PREPROCESSING Text preprocessing aims to make the input documents more consistent to facilitate text representation, which is necessary for most text analytics tasks • Methods: – Stop word removal • Stop word removal eliminates words using a stop word list, in which the words are considered more general and meaningless – e.g. the, a, is, at, which – Stemming • Stemming reduces inflected (or sometimes derived) words to their stem, base or root form – For example, “watch”, “watching”, “watched” are represented as “watch” Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 81
  • 76. TEXT REPRESENTATION • The most common way to model documents is to transform them into sparse numeric vectors and then deal with them with linear algebraic operations • This representation is called “Bag of Words” • Methods: – Vector space model – tf-idf Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 82
  • 77. VECTOR SPACE MODEL • In the vector space model, we start with a set of documents, D • Each document is a set of words • The goal is to convert these textual documents to vectors • di : document i, wj,i : the weight for word j in document i The weight can be set to 1 when the word exist in the document and 0 when it does not. Or we can set this weight to the number of times the word is observed in the document Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 83
  • 78. VECTOR SPACE MODEL: AN EXAMPLE • Documents: – d1: data mining and social media mining – d2: social network analysis – d3: data mining • Reference vector: – (social, media, mining, network, analysis, data) • Vector representation: analysis data media mining network social d1 0 1 1 1 0 1 d2 1 0 0 0 1 1 d3 0 1 0 1 0 0 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 84
  • 79. TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY) tf-idf of term t, document d, and document corpus D is calculated as follows: tf-idf(t, d, D) = tf (t, d) * idf (t, D) The total number of documents in the corpus The number of documents where the term t appears Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 85
  • 80. TF-IDF: AN EXAMPLE Consider words “apple” and “the” that appear 10 and 20 times in document 1 (d1), which contains 100 words. Consider |D| = 20 and word “apple” only appearing in d1 and word “the” appearing in all 20 documents Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 86
  • 81. TF-IDF: AN EXAMPLE • Documents: – d1: data mining and social media mining – d2: social network analysis – d3: data mining • tf-idf representation: analysis data media mining network social df(w) 1 2 1 2 1 2 log(N/df(w)) 0.48 0.18 0.48 0.18 0.48 0.18 d1, tf 0 1 1 2 0 1 d2, tf 1 0 0 0 1 1 d3, tf 0 1 0 1 0 0 d1, tf-idf 0.00 0.18 0.48 0.35 0.00 0.18 d2, tf-idf 0.48 0.00 0.00 0.00 0.48 0.18 d3, tf-idf 0.00 0.18 0.00 0.18 0.00 0.00 Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 87
  • 82. SENTIMENT ANALYSIS • Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials • It aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 88
  • 83. POLARITY ANALYSIS • The basic task in opinion mining is classifying the polarity of a given document or text – The polarity could be positive, negative, or neutral • Methods: – Naïve Bayes – Pointwise Mutual Information (PMI) Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 89
  • 84. MEASURING POLARITY, NAÏVE BAYES • Bayes’ rule: • If we consider that the occurrence of features (words) in the document are independent Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 90
  • 85. MEASURING POLARITY, MAXIMUM ENTROPY • Z(d) is the normalization factor • is feature-weight parameter and shows the importance of each feature • Fi,c is defined as a feature/class function for feature fi and class c Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 91
  • 86. MEASURING POLARITY, POINTWISE MUTUAL INFORMATION • P(word) is the number of results returned by search engine in response to search for term word • P(word1 word2) is the number of results for mutual search of word1 and word2 together Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 92
  • 87. Mohammad-Ali Abbasi (Ali), Ali, is a Ph.D. student at Data Mining and Machine Learning Lab, Arizona State University. His research interests include Data Mining, Machine Learning, Social Computing, and Social Media Behavior Analysis. http://www.public.asu.edu/~mabbasi2/ Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 93

Notas do Editor

  1. Weather dataset
  2. Garbage in and garbage out
  3. Source?
  4. Using the inverse of similarity as a distance measure
  5. \\frac{\\partial}{\\partial W} {(W^T X^T Y )} = X^T Y\\frac{\\partial}{\\partial W} { (W^T X^T X W) } = 2 X^T X W
  6. SVD form of matrix X: X = U \\Sigma V^T U and V are real valued, they each are an orthogonal matrixU^T = U^{-1}V is orthogonal so: V^T^{-1} = V(V \\Sigma^2 V^T)^{-1} = V^T^{-1} \\Sigma^{-2} V^T = V \\Sigma^{-2} V^T
  7. Solve w_0 first
  8. Source?
  9. TP – the intersection of the two oval shapes; TN – the rectangle minus the two oval shapes; FP – the circle minus the blue part; FN – the blue oval minus the circle. Accuracy = (TP+TN)/(TP+FP+FN+FN); Precision = TP/(TP+FP); Recall = TP/(TP+FN)
  10. TP+FP = C(6,2) + C(8,2) = 15+28 = 43; FP counts the wrong pairs within each cluster; FN counts the similar pairs but wrongly put into different clusters; TN counts dissimilar pairs in different clusters
  11. N – the total number of data points.
  12. number of documents where the term t appears 
  13. df(w) is document frequency for word w. tf is the term frequency in a document.
  14. SO – Sentiment Orientation“Thumbs Up, Thumbs Down …” Peter Turney