A Data Mining Approach to Construct
Graduates Employability Model in Malaysia
Myzatul Akmam Sapaat, Aida Mustapha, Johanna Ahmad, Khadijah Chamili,
Rahamirzam Muhamad
Faculty of Computer Science and Information Technology, Universiti Putra Malaysia,
43400 UPM Serdang, Selangor, Malaysia
Abstract. This study constructs a Graduates
Employability Model using the classification
task in data mining. We use data sourced from
the Tracer Study, a web-based survey system
from the Ministry of Higher Education,
Malaysia (MOHE), for the year 2009. The
classification experiment uses various Bayes
algorithms to determine whether a graduate
has been employed, remains unemployed or is
in an undetermined situation. The performance
of the Bayes algorithms is also compared
against a number of tree-based algorithms.
Information Gain is used to rank the
attributes, and the results show that the top
three attributes with a direct impact on
employability are the job sector, job status
and reason for not working. J48Graft, a
grafted variant of the J48 (C4.5) decision-tree
algorithm, achieved the highest accuracy at
92.3%, compared with 91.3% for the best Bayes
algorithm (WAODE). This leads to the
conclusion that a tree-based classifier is
more suitable for the tracer data due to the
information gain strategy.
Keywords: Classification, Bayes Methods,
Decision Tree, Employability
The Tracer Study is a web-based survey
system developed by the Ministry of
Higher Education, Malaysia (MOHE). All
students graduating from polytechnics and
public or private institutions must
complete it before their convocation,
whatever the level of degree awarded. The
sole purpose of the survey is to guide
future planning and to improve various
aspects of the local higher education
administrative system. The survey also
serves as a tool to gauge the adequacy of
higher education in Malaysia in supplying
manpower needs in all areas across
technical, managerial and social science
fields. Data sourced from the Tracer Study
is invaluable because it relates graduate
qualifications and skills to employment
status.
Graduate employability remains a national
issue due to the increasing number of
graduates produced by higher education
institutions each year. According to
statistics generated from the Tracer Study,
the total number of graduates produced by
higher institutions in 2008 was 139,278. In
2009, the number increased to 155,278
graduates. Of the 2009 graduates, 50% are
bachelor's degree holders from public and
private universities, and only 49.20%
(38,191) of them were employed within the
first six months after finishing their
studies. Previous research on graduate
employability covers a wide range of domains
such as education, engineering, and social
science. While that research is mainly based
on surveys or interviews, little has been
done using data mining techniques.
International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(4): 1086-1098
The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)
Bayes’ theorem is among the earliest
statistical methods used to identify
patterns in data. As datasets have grown in
size and complexity, data mining has emerged
as a technology that applies methods such as
neural networks, genetic algorithms,
decision trees, and support vector machines
to uncover hidden patterns [1]. Today, data
mining technologies deal with huge amounts
of data from various sources, for example
relational or transactional databases, data
warehouses, images, flat files or the World
Wide Web.
Classification is the task of
generalizing from observations in the
training data, each of which is labelled
with a known class. The objective of this
paper is to predict whether a graduate has
been employed, remains unemployed or is in
an undetermined situation within the first
six months after graduation. This is
achieved through a classification
experiment that classifies a graduate
profile as employed, unemployed or others.
The main contribution of this paper is the
comparison of classification accuracy
between various algorithms from the two most
commonly used data mining techniques in the
education domain, namely Bayes methods and
decision trees.
The remainder of this paper is
organized as follows. Section 2 presents
the related works on graduate
employability and reviews recent
techniques employed in data mining.
Section 3 introduces the dataset and the
experimental setting. Section 4 discusses
the results. Finally, Section 5 concludes
the paper with some directions for future
work.
A number of works have sought to identify
the factors that influence graduate
employability in Malaysia. They are an
initial step towards aligning higher
education with industry, two sectors that
unquestionably affect each other.
Nonetheless, most of the previous works
were carried out outside the data mining
domain. In addition, their data were
collected and assembled through surveys of
sample populations.
Research in [2] identifies three major
requirements that concern employers when
hiring employees: basic academic skills,
higher order thinking skills, and personal
qualities. The work is restricted to the
education domain, specifically analyzing
the effectiveness of one subject, English
for Occupational Purposes (EOP), in
enhancing employability skills. Similar to
[2], the work in [3] proposes to restructure
the curriculum and methods of instruction
to prepare future graduates for the
forthcoming challenges, based on the model
of the T-shaped professional and the newly
developed field
of Service Science, Management and
Engineering (SSME).
More recently, [4] proposes a new
Malaysian Engineering Employability
Skills Framework (MEES), which is
constructed from the requirements of
accrediting and professional bodies and
from existing research findings in
employability skills, as a guideline for
training packages and qualifications in
Malaysia. Nonetheless, graduate
employability is rarely studied within the
scope of data mining, mainly because few
authentic data sources are available.
Employability issues have also been
taken into consideration in other
countries. Research by The Higher
Education Academy with the Council for
Industry and Higher Education (CIHE) in the
United Kingdom concluded that there are six
competencies that employers look for in
individuals who can transform their
organizations and add value in their careers
[5]. The six competencies are cognitive
skills or brainpower, generic competencies,
personal capabilities, technical ability,
business or organization awareness, and
practical elements. Furthermore,
employability covers a set of achievements
comprising skills, understandings and
personal attributes that make graduates more
likely to gain employment and be successful
in their chosen occupations, which benefits
the graduates, the community and also the
However, data mining techniques
have been employed in the education domain,
for instance in the prediction and
classification of student academic
performance using Artificial Neural
Networks [6, 7] and a combination of
clustering and decision tree classification
techniques [6]. The experiments in [8]
classify students to predict their final
grade using six common classifiers
(Quadratic Bayesian classifier, 1-nearest
neighbour (1-NN), k-nearest neighbour
(k-NN), Parzen-window, multilayer
perceptron (MLP), and Decision Tree).
With regard to student performance, [9]
discovers individual student
characteristics that are associated with
success, measured by grade point average
(GPA), using the Microsoft Decision Trees
(MDT) classification technique. [10] shows
some applications of data mining in
educational institutions that extract useful
information from huge data sets.
Analytical data mining tools let users
view and use current information in
decision-making processes such as
organizing the syllabus, predicting student
registration in an educational program,
predicting student performance, detecting
cheating in online examinations, and
identifying abnormal or erroneous values.
Among the related work, we found
that the work in [11] is most closely
related to this research. It mines
historical data of students' academic
results using different classifiers (Bayes,
trees, function) to rank the factors that
contribute to predicting student academic
performance.
The main objective of this paper is to
classify a graduate profile as employed,
unemployed or undetermined using data
sourced from the Tracer Study database
for the year of 2009. The dataset consists
of 12,830 instances and 20 attributes
related to graduate profiles from 19
public universities and 138 private
universities. Table 1 shows the complete
attributes for the Tracer Study dataset.
To construct the classifiers, we use
the Waikato Environment for Knowledge
Analysis (WEKA), an open-source data mining
tool [12] developed at the University of
Waikato, New Zealand. It provides various
learning algorithms that can easily be
applied to the dataset. WEKA reads datasets
in its native Attribute-Relation File Format
(ARFF). Therefore, once the data preparation
was done, we transformed the dataset into an
ARFF file with the extension .arff.
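As an illustration of the ARFF layout, the sketch below renders a few nominal attributes as ARFF text in plain Python. It is not the conversion procedure used in this study: the helper name to_arff, the relation name and the two sample records are hypothetical, and only the attribute names and values are taken from Table 1.

```python
def to_arff(relation, attributes, rows):
    """Render nominal data as ARFF text.

    attributes: list of (name, [allowed values]) pairs, in column order.
    rows: list of tuples holding one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        # A nominal attribute lists its allowed values inside braces.
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical excerpt: attribute names and values follow Table 1,
# the two records are invented for illustration.
attributes = [
    ("sex", ["male", "female"]),
    ("univ", ["public_univ", "private_univ"]),
    ("emp_status", ["employed", "unemployed", "others"]),
]
rows = [
    ("female", "public_univ", "employed"),
    ("male", "private_univ", "unemployed"),
]
arff_text = to_arff("tracer_study_sample", attributes, rows)
print(arff_text)
```

The header declares each attribute and its permitted values once, and the @data section then holds one comma-separated record per line.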
Table 1. Attributes from the Tracer Study dataset after the pre-processing is performed.

No.  Attribute          Values                                      Description
1    sex                {male, female}                              Gender of the graduate
2    age                {20-25, 25-30, 30-40, 40-50, >50}           Age of the graduate
3    univ               {public_univ, private_univ}                 University/institution of current
                                                                    qualification
4    level              {certificate, diploma, advanced_diploma,    Level of study for current
                        first_degree, postGraduate_diploma,         qualification
                        masters_thesis,
                        masters_courseWork&Thesis,
                        masters_courseWork, phd_thesis,
                        phd_courseWork&Thesis, professional}
5    field              {technical, ict, education, science,        Field of study for current
                        art&soc_science}                            qualification
6    cgpa               {2.00-2.49, 2.50-2.99, 3.00-3.66,           CGPA for current qualification
                        3.67-4.00, failed, 4.01-6.17}
7    emp_status         {employed, unemployed, others}              Current employment status
8    general_IT skills  {satisfied, extremely_satisfied, average,   Level of IT skills, Malay and English
                        strongly_not_satisfied, not_satisfied,      language proficiency, general
                                                                    knowledge, interpersonal
                                                                    communication, creative and critical
                                                                    thinking, analytical skills, problem
                                                                    solving, inculcation of positive
                                                                    values, and teamwork acquired from
                                                                    the programme of study
9    Malay_lang         (as for No. 8)                              (as for No. 8)
10   English_lang       (as for No. 8)                              (as for No. 8)
11   gen_knowledge      (as for No. 8)                              (as for No. 8)
12   interpersonal_     (as for No. 8)                              (as for No. 8)
13   cc_thinking        (as for No. 8)                              (as for No. 8)
14   analytical         (as for No. 8)                              (as for No. 8)
15   prob_solving       (as for No. 8)                              (as for No. 8)
16   positive_value     (as for No. 8)                              (as for No. 8)
17   teamwork           (as for No. 8)                              (as for No. 8)
18   job_status         {permanent, contract, temp,                 Job status of employed graduates
                        self_employed, family_business}
19   job_sector         {local_private_company,                     Job sector of employed graduates
                        multinational_company, own_company,
                        government, NGO, GLC, statutory_body,
                        others}
20   reason_not_        {job_hunting, waiting_for_posting,          Reason for not working for
                        further_study,                              unemployed graduates
                        participating_skills_program,
                        waiting_posting_of_study,
                        unsuitable_job, resting, others,
                        family_responsibilities, medical_issues,
                        not_ lack_of_confidence, chambering}
3.1 Data-Preprocessing
The raw data retrieved from the Tracer
Study database required pre-processing
to prepare the dataset for the
classification task. First, cleaning
activities involved eliminating data with
missing values in critical attributes,
identifying outliers, correcting
inconsistent data, and removing duplicate
data. Of the 89,290 instances in the raw
data, the data cleaning process left 12,830
instances that were ready to be mined. For
the remaining missing values (i.e., in the
age attribute), we replaced them with the
mean value of the attribute.
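The mean replacement described above can be sketched as follows. This is only an illustration with made-up ages; the function name impute_mean and the use of None to mark a missing value are assumptions, not details given in the study.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical ages; None marks a missing value.
ages = [23, None, 25, 30, None]
filled = impute_mean(ages)
print(filled)
```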
Second, data discretization is
required because most attributes from the
Tracer Study are stored as continuous or
numerically coded values. We therefore
discretized the values into intervals, or
mapped the codes to labels, so that every
attribute is categorical (nominal), as
below.
• cgpa, previously a continuous number,
is transformed into a grade range
• sex, previously coded as 1 and 2, is
transformed into nominal
• age, previously a continuous number,
is transformed into an age range
• field of study, previously a numerical
code 1-4, is transformed into nominal
• skill information (i.e., language
proficiency, general knowledge,
interpersonal communication, etc.),
previously a numerical code 1-9, is
transformed into nominal
• employment status, previously a
numerical code 1-3, is transformed
into nominal
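The transformations listed above can be sketched in a few lines of Python. The interval labels come from Table 1, but several details are assumptions made only for this example: how boundary ages such as 25 are assigned, that "failed" means a CGPA below 2.00, and which numeric code stands for which sex.

```python
import bisect

# Labels follow Table 1. A boundary age such as 25 is placed in the
# lower range, which is an assumption (the ranges in Table 1 overlap).
AGE_EDGES = [25, 30, 40, 50]
AGE_LABELS = ["20-25", "25-30", "30-40", "40-50", ">50"]

# Assumption: the source only says sex was coded as 1 and 2.
SEX_CODES = {1: "male", 2: "female"}


def discretize_age(age):
    """Map a numeric age to one of the age ranges in Table 1."""
    return AGE_LABELS[bisect.bisect_left(AGE_EDGES, age)]


def discretize_cgpa(cgpa):
    """Map a numeric CGPA to one of the grade ranges in Table 1."""
    if cgpa < 2.00:  # assumption: "failed" means below 2.00
        return "failed"
    if cgpa < 2.50:
        return "2.00-2.49"
    if cgpa < 3.00:
        return "2.50-2.99"
    if cgpa < 3.67:
        return "3.00-3.66"
    if cgpa <= 4.00:
        return "3.67-4.00"
    return "4.01-6.17"


# Hypothetical raw record.
record = {"sex": 2, "age": 24, "cgpa": 3.41}
nominal = {
    "sex": SEX_CODES[record["sex"]],
    "age": discretize_age(record["age"]),
    "cgpa": discretize_cgpa(record["cgpa"]),
}
print(nominal)
```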
3.2 Classification Task
The classification task at hand is to
predict the employment status
(employed, unemployed, others) for
graduate profiles in the Tracer Study.
The task is performed in two stages,
training and testing. Once the classifier
is constructed, the testing dataset is used
to estimate the predictive accuracy of the
classifier.
WEKA offers four test options: use
training set, supplied test set,
cross-validation and percentage split. If
we use the training set as the test option,
the test data are drawn from the same data
used for training, so the result is not a
reliable estimate of the true error rate. A
supplied test set lets us specify test data
that have been prepared separately from the
training data. Cross-validation is suitable
for limited datasets, and the number of
folds can be set by the user. 10-fold
cross-validation is widely used to get the
best estimate of error, a choice supported
by extensive tests on numerous datasets with
different learning techniques [13]. Because
our dataset is large, and to avoid
overfitting, we employed the hold-out
validation method with a 70-30 percentage
split, whereby 70% of the 12,830 instances
are used for training while the remaining
instances are used for testing. Various
algorithms from both the Bayes and decision
tree families are used to predict the
employment status.
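A hold-out split of this kind can be sketched as below. This is a generic illustration, not WEKA's own implementation: the function name percentage_split, the fixed random seed and the 1,000 placeholder instances are all assumptions chosen for the example.

```python
import random


def percentage_split(instances, train_percent=70, seed=1):
    """Shuffle the instances and split them into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 1,000 instance identifiers.
train, test = percentage_split(range(1000))
print(len(train), len(test))
```

Shuffling before the cut keeps the test set from inheriting any ordering present in the source file, and fixing the seed makes the split repeatable.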
Information Gain. Information Gain is
the attribute selection measure used in
ID3. If node N represents the tuples of
partition D, the attribute with the highest
information gain is chosen as the splitting
attribute for node N. This choice minimizes
the expected number of tests needed to
classify a given tuple and guarantees that
a simple (but not necessarily the simplest)
tree is found. The expected information
needed to classify a tuple in D is given by
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
where p_i is the probability that a tuple in
D belongs to class C_i and m is the number
of classes.
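To make the measure concrete, the sketch below computes Info(D) and ranks attributes by information gain. It assumes the standard ID3 definition Gain(A) = Info(D) - \sum_j (|D_j|/|D|) Info(D_j), where D_j is the partition of D with the j-th value of attribute A. The four records are invented for illustration and are not drawn from the Tracer Study.

```python
import math
from collections import Counter


def entropy(labels):
    """Info(D): expected information needed to classify a tuple."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Gain(A) = Info(D) minus the weighted entropy after splitting on A."""
    base = entropy([r[target] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return base - remainder


# Hypothetical records: univ separates the classes perfectly, sex not at all.
rows = [
    {"sex": "male", "univ": "public_univ", "emp_status": "employed"},
    {"sex": "female", "univ": "public_univ", "emp_status": "employed"},
    {"sex": "male", "univ": "private_univ", "emp_status": "unemployed"},
    {"sex": "female", "univ": "private_univ", "emp_status": "unemployed"},
]
gains = {a: information_gain(rows, a, "emp_status") for a in ("sex", "univ")}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

An attribute whose values split the data into pure partitions receives the full entropy of D as its gain, which is why it is ranked first.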
Bayes Methods. In Bayes methods, the
classification task consists of predicting
a class variable given a set of attribute
variables. It is a type of statistical
inference in which the prior distribution is
estimated from the data before any new data
are observed, hence every parameter is
assigned a prior probability distribution
[14]. A Bayesian classifier learns from the
samples over both class and attribute
variables.
The naïve Bayesian classifier works
as follows: Let D be a training set of
tuples and their associated class labels.
As usual, each tuple is represented by an
n-dimensional attribute vector, X = (x1,
x2, …, xn), depicting n measurements
made on the tuple from n attributes,
respectively, A1, A2, … , An.
Suppose that there are m classes, C1,
C2, …, Cm. Given a tuple, X, the
classifier will predict that X belongs to
the class having the highest posterior
probability, conditioned on X. That is,
the naïve Bayesian classifier predicts
that tuple X belongs to the class Ci if and
only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m; j ≠ i
Thus, we maximize P(Ci|X). The class Ci
for which P(Ci|X) is maximized is called
the maximum posteriori hypothesis.
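The decision rule above can be sketched for nominal attributes as follows. By Bayes' theorem P(Ci|X) is proportional to P(X|Ci)P(Ci), and the naive assumption lets P(X|Ci) be written as the product of the per-attribute probabilities P(xk|Ci). The add-one (Laplace) smoothing used here is an assumption of this example, as are the five invented training records; it is not a description of WEKA's implementation.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count classes, attribute values per class, and attribute domains."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    domains = defaultdict(set)           # attribute -> set of seen values
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                value_counts[(r[target], attr)][value] += 1
                domains[attr].add(value)
    return class_counts, value_counts, domains


def predict(model, x):
    """Return the class maximizing P(Ci) * prod_k P(xk|Ci), and all scores."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(Ci)
        for attr, value in x.items():
            # Add-one smoothing (assumption) avoids zero probabilities.
            count = value_counts[(c, attr)][value]
            score *= (count + 1) / (n_c + len(domains[attr]))
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical training records using attribute names from Table 1.
rows = [
    {"univ": "public_univ", "field": "ict", "emp_status": "employed"},
    {"univ": "public_univ", "field": "technical", "emp_status": "employed"},
    {"univ": "private_univ", "field": "ict", "emp_status": "employed"},
    {"univ": "private_univ", "field": "education", "emp_status": "unemployed"},
    {"univ": "private_univ", "field": "technical", "emp_status": "unemployed"},
]
model = train_naive_bayes(rows, "emp_status")
label, scores = predict(model, {"univ": "public_univ", "field": "ict"})
print(label)
```

The scores are unnormalized, which is sufficient because only their ordering decides the predicted class.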
Under the Bayes method in WEKA, we
performed the experiment with eight
algorithms: Averaged One-Dependence
Estimators (AODE), AODEsr, WAODE, Bayes
Network, HNB, Naïve Bayesian, Naïve
Bayesian Simple and Naïve Bayesian
Updateable. AODE, HNB and Naïve Bayesian
were also used in [11], and the remaining
algorithms were chosen to further compare
the results of the Bayes algorithm
experiment on the same dataset.
AODE achieves highly accurate
classification by averaging over all of a
small space of alternative naive-Bayes-like
models that have weaker, and hence less
detrimental, independence assumptions than
naive Bayes. The resulting algorithm is
computationally efficient while delivering
highly accurate classification on many
learning tasks. AODEsr and WAODE are
extended from AODE. AODEsr complements AODE
with Subsumption Resolution, which detects
specializations between two attribute values
at classification time and deletes the
generalization attribute value. Meanwhile,
WAODE constructs a model called Weightily
Averaged One-Dependence Estimators by
assigning a weight to each one-dependence
estimator. Bayes Network learns a Bayesian
network using various search algorithms and
quality measures. HNB constructs a Hidden
Naive Bayes classification model with high
classification accuracy and AUC. In Naive
Bayes, numeric estimator precision values
are chosen based on analysis of the training
data. The Naïve Bayes Updateable classifier
uses a default precision of 0.1 for numeric
attributes when the classifier is built with
zero training instances. Naive Bayes Simple
models numeric attributes by a normal
distribution.
Tree Methods. Tree-based methods
classify instances by sorting them down the
tree from the root to some leaf node, which
provides the classification of a particular
instance. Each node in the tree specifies a
test of some attribute of the instance, and
each branch descending from that node
corresponds to one of the possible values
for this attribute [15]. Figure 1 shows the
model produced by decision trees, which is
represented in the form of a tree structure.
Under the tree method in WEKA, we
performed the classification experiment
with nine algorithms: ID3, J48, REPTree,
J48graft, Random Tree, Decision Stump,
LADTree, Random Forest and Simple Cart. J48
and REPTree were also used in [11], but we
were unable to use NBTree and BFTree because
the dataset was too large for the memory
allocated to WEKA. The FT, User Classifier
and LMT algorithms ran into the same problem
as NBTree and BFTree. In addition, we
employed ID3, J48graft, Random Tree,
Decision Stump, LADTree, Random Forest and
Simple Cart to experiment with other
decision tree algorithms.
Figure 1. In a tree structure, each node denotes a
test on an attribute value, each branch represents
an outcome of the test, and tree leaves represent
classes or class distributions. A leaf node
indicates the class of the examples. The instances
are classified by sorting them down the tree from
the root node to some leaf node.
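The traversal described in the caption can be sketched with a nested dictionary standing in for the tree. The tree below is hand-written and hypothetical, chosen only to show the mechanics; it is not a model learned from the Tracer Study data.

```python
def classify(node, instance):
    """Sort an instance down the tree until a leaf (a class label) is reached."""
    while isinstance(node, dict):
        # Internal node: test the attribute, then follow the matching branch.
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


# Hypothetical tree: internal nodes are dicts, leaves are class labels.
tree = {
    "attribute": "univ",
    "branches": {
        "public_univ": "employed",
        "private_univ": {
            "attribute": "field",
            "branches": {"ict": "employed", "technical": "unemployed"},
        },
    },
}
prediction = classify(tree, {"univ": "private_univ", "field": "technical"})
print(prediction)
```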
ID3 is a class for constructing an
unpruned decision tree based on the ID3
algorithm, which only deals with nominal
attributes. J48 is a class for generating a
pruned or unpruned C4.5 decision tree, while
J48graft generates a grafted (pruned or
unpruned) C4.5 decision tree. REPTree is a
fast decision tree learner that builds a
decision/regression tree using information
gain/variance and prunes it using
reduced-error pruning (with backfitting).
Decision Stump is usually used in
conjunction with a boosting algorithm.
LADTree generates a multi-class alternating
decision tree using the LogitBoost strategy.
Random Forest constructs a forest of random
trees, whereas Random Tree constructs a tree
that considers K randomly chosen attributes
at each node without pruning. SimpleCart
implements minimal cost-complexity pruning.
We divided the experimental results into
three parts. The first is the result of
ranking the attributes in the Tracer Study
dataset using Information Gain. The second
and third parts present the predictive
accuracy of various algorithms from the
Bayes and decision tree families,
respectively.
In this study, we employed Information
Gain to rank the attributes in
determining the target values as well as
to reduce the size of prediction. Decision
tree algorithms adopt a mutual-information
criterion to choose the attribute to branch
on that gains the most information. This is
inherently a simple preference bias that
explicitly searches for a simple hypothesis.
Ranking attributes also increases the speed
and accuracy of prediction. Based on
attribute selection using Information Gain,
the job sector attribute was found to be the
most important factor in discriminating the
graduate profiles to predict a graduate’s
employment status. This is shown in
Figure 2.
Figure 2. Job sector is ranked highest by attribute selection based on Information Gain. This is largely
because the attribute has a small set of values, thus one instance is more easily distinguishable than the remaining
4.2 Bayes Methods
Table 2 shows the classification
accuracies of the various algorithms under
the Bayes method. In addition, the table
provides comparative results for the kappa
statistic, mean absolute error, root mean
squared error, relative absolute error, and
root relative squared error over the total
of 3,840 testing instances.
The Weightily Averaged One-Dependence
Estimators (WAODE) algorithm achieved the
highest accuracy among the Bayes
algorithms. Rather than treating each tree
augmented naive Bayes equally, as AODE
does, [16] extended AODE by assigning a
different weight to each tree augmented
naive Bayes, since the attributes do not
all play the same role in classification.
Table 2. Classification accuracy using various algorithms under Bayes method in WEKA.

Algorithm               Accuracy  Error     Kappa       Mean      Root Mean  Relative   Root Relative
                        (%)       Rate (%)  Statistics  Absolute  Squared    Absolute   Squared
                                                        Error     Error      Error (%)  Error (%)
WAODE                   91.3      8.7       0.834       0.073     0.203      20.8       48.4
AODE                    91.1      8.9       0.827       0.069     0.208      19.5       49.6
Naïve Bayes             90.9      9.1       0.825       0.072     0.214      20.5       51.3
Naïve Bayes Simple      90.9      9.1       0.825       0.072     0.214      20.5       51.3
BayesNet                90.9      9.1       0.824       0.072     0.215      20.5       51.4
AODEsr                  90.9      9.1       0.824       0.071     0.210      20.1       50.2
Naïve Bayes Updateable  90.9      9.1       0.825       0.072     0.214      20.5       51.3
HNB                     90.3      9.7       0.816       0.091     0.214      25.7       51.1
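For readers unfamiliar with the kappa statistic reported alongside accuracy, the sketch below computes both from a confusion matrix using Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The 3x3 matrix is invented for illustration and does not reproduce any result in the tables.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total**2
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts for the classes employed, unemployed and others.
confusion = [
    [50, 3, 2],
    [4, 25, 1],
    [3, 2, 10],
]
accuracy, kappa = accuracy_and_kappa(confusion)
print(round(accuracy, 3), round(kappa, 3))
```

Kappa discounts the agreement that a classifier would reach by guessing according to the class frequencies, so it is lower than raw accuracy whenever one class dominates.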
4.3 Tree Methods
Table 3 shows the classification
accuracies of the various algorithms under
the tree method. In addition, the table
provides comparative results for the kappa
statistic, mean absolute error, root mean
squared error, relative absolute error, and
root relative squared error over the total
of 3,840 testing instances.
Table 3. Classification accuracy using various algorithms under Tree method in WEKA.

Algorithm       Accuracy  Error     Kappa       Mean      Root Mean  Relative   Root Relative
                (%)       Rate (%)  Statistics  Absolute  Squared    Absolute   Squared
                                                Error     Error      Error (%)  Error (%)
J48Graft        92.3      7.7       0.849       0.078     0.204      22.1       48.7
J48             92.2      7.8       0.848       0.078     0.204      22.2       48.8
Simple Cart     92.0      8.0       0.844       0.079     0.199      22.3       47.5
Random Forest   91.4      8.6       0.832       0.083     0.205      23.4       49.1
LAD Tree        91.3      8.7       0.830       0.077     0.197      22.0       47.0
REPTree         91.0      9.0       0.825       0.080     0.213      22.8       50.9
Decision Stump  91.0      9.0       0.821       0.108     0.232      30.6       55.3
RandomTree      88.9      11.1      0.787       0.081     0.269      23.0       64.4
ID3             86.7      13.3      0.795       0.072     0.268      21.1       65.2
The J48Graft algorithm achieved the
highest accuracy among the tree algorithms.
J48Graft generates a grafted C4.5 decision
tree, whether pruned or unpruned. Grafting
is an inductive process that adds nodes to
the inferred decision tree. Unlike pruning,
which uses only local information as the
tree grows, grafting uses non-local
information to provide better predictive
accuracy. Figure 3 shows the difference in
tree structure between a J48 tree and the
grafted J48 tree.
Figure 3. The top figure is the tree structure for
J48 and the bottom figure is the tree structure for
grafted J48. Grafting adds nodes to the decision
trees to increase the predictive accuracy. In the
grafted J48, new branches are added in the place
of a single leaf or graft within leaves.
Comparing the performance of the Bayes
and tree-based methods, the J48Graft
algorithm achieved the highest accuracy of
92.3% on the Tracer Study dataset. The
second highest accuracy also belongs to the
tree method, with J48 at 92.2%. The best
Bayes result, WAODE with a prediction
accuracy of 91.3%, ranks below both, and
also below Simple Cart (92.0%) and Random
Forest (91.4%). Nonetheless, we found that
the two classification approaches are
complementary, because the Bayes methods
provide a better view of the associations or
dependencies among the attributes while the
results from the tree method are easier to
interpret.
Figure 4 shows the mapping of root
mean squared error values that resulted
from the classification experiment. This
knowledge could be used in getting
insights on the employment trend of
graduates from local higher institutions.
Figure 4. A radial display of the root mean squared error across all algorithms under both Bayes and tree-
based methods relative to accuracy. The smaller the mean squared error, the better is the forecast. Based on
this figure, three out of five tree-based algorithms indicate better forecast as compared to the corresponding
algorithms under the Bayes methods.
As the education sector grows every
year, graduates face stiff competition to
secure employment in industry. The sole
purpose of the Tracer Study system is to aid
higher educational institutions in preparing
their graduates with sufficient skills to
enter the job market. This paper focused on
identifying attributes that influence
graduates’ employability based on actual
data collected from the graduates themselves
six months after graduation. Nonetheless,
assembling the dataset was difficult because
only 90% of the attributes made their way to
the classification task. Owing to
confidentiality and sensitivity issues, the
remaining 10% of the attributes are not
permitted by the data
This paper attempts to predict
whether a graduate has been employed,
remains unemployed or is in an undetermined
situation within the first six months after
graduation. The prediction was performed
through a series of classification
experiments using various algorithms under
the Bayes and decision tree methods to
classify a graduate profile as employed,
unemployed or others. Results showed that
J48Graft, a grafted variant of the J48
(C4.5) decision-tree algorithm, yielded the
highest accuracy at 92.3%, compared with
91.3% for the best Bayes algorithm (WAODE).
As for future work, we hope to expand
the dataset from the Tracer Study with more
attributes and to annotate the attributes
with information such as the correlation
factor between the current employer and the
previous employer. We are also looking at
integrating datasets from different sources,
for instance graduate profiles from the
alumni organizations of the respective
educational institutions. With this in
place, we next plan to introduce clustering
as part of pre-processing to cluster the
attributes before attribute ranking is
performed. Finally, other data mining
techniques such as anomaly detection or
classification-based association may be
implemented in order to gain more knowledge
on graduate employability in Malaysia.
Acknowledgments. Special thanks to
Prof. Dr. Md Yusof Abu Bakar and Puan
Salwati Badaroddin from Ministry of
Higher Education Malaysia (MOHE) for
their help with data gathering as well as
expert opinion.
1. Han, J., Kamber, M.: Data Mining: Concepts
and Techniques. Morgan Kaufmann (2006)
2. Shafie, L.A, Nayan, S.: Employability
Awareness among Malaysian
Undergraduates. International Journal of
Business and Management, 5(8):119--123
3. Mukhtar, M., Yahya, Y., Abdullah, S.,
Hamdan, A.R., Jailani, N., Abdullah, Z.:
Employability and Service Science: Facing
the Challenges via Curriculum Design and
Restructuring. In: International Conference
on Electrical Engineering and Informatics,
pp. 357--361 (2009)
4. Zaharim, A., Omar, M.Z., Yusoff, Y.M.,
Muhamad, N., Mohamed, A., Mustapha, R.:
Practical Framework of Employability Skills
for Engineering Graduate in Malaysia. In:
IEEE EDUCON Education Engineering
2010: The Future Of Global Learning
Engineering Education, pp. 921--927 (2010)
5. Rees, C., Forbes, P., Kubler, B.: Student
Employability Profiles: A Guide for Higher
Education Practitioners (2006)
6. Wook, M., Yahaya, Y.H., Wahab, N., Isa,
M.R.M.: Predicting NDUM Student’s
Academic Performance using Data Mining
Techniques. In: Second International
Conference on Computer and Electrical
Engineering, pp. 357--361 (2009)
7. Ogor, E.N.: Student Academic Performance
Monitoring and Evaluation Using Data
Mining Techniques. In: Fourth Congress of
Electronics, Robotics and Automotive
Mechanics, pp. 354--359 (2007)
8. Minaei-Bidgoli, B., Kashy, D.A.,
Kortemeyer, G., Punch, W.F.: Predicting
Student Performance: An Application of Data
Mining Methods with an Educational Web-
based System. In: 33rd Frontiers in Education
Conference, pp. 13--18 (2003)
9. Guruler, H., Istanbullu, A., Karahasan, M.: A
New Student Performance Analysing System
using Knowledge Discovery in Higher
Educational Databases. Computers &
Education. 55(1), pp 247--254 (2010)
10. Kumar, V., Chadha, A.: An Empirical Study
of the Applications of Data Mining
Techniques in Higher Education,
International Journal of Advanced Computer
Science and Applications, Vol. 2, No.3,
March 2011, pp 80-84 (2011)
11. Affendey, L.S., Paris, I.H.M., Mustapha, N.,
Sulaiman, M.N., Muda, Z.: Ranking of
Influencing Factors in Predicting Student
Academic Performance. Information
Technology Journal. 9(4):832--837 (2010)
12. Hall, M., Frank, E., Holmes, G., Pfahringer,
B., Reutemann, P., Witten, I.H.: The WEKA
Data Mining Software: An Update. SIGKDD
Explorations, Volume 11, Issue 1 (2009)
13. Witten, I.H., Frank, E.: Data Mining:
Practical Machine Learning Tools and
Techniques. Morgan Kaufmann (2005)
14. Jaynes, E.T.: Probability Theory: The Logic
of Science. Cambridge University Press
15. Mitchell, T.: Machine Learning. McGraw
Hill, New York (1997)
16. Jiang, L., Zhang, H.: Weightily Averaged One-
Dependence Estimators. In: Proceedings of
the 9th Biennial Pacific Rim International
Conference on Artificial Intelligence,
PRICAI 2006, pp. 970--974 (2006)
International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(4): 1086-1098
The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)

A Data Mining Approach to Construct Graduates Employability Model in Malaysia

Myzatul Akmam Sapaat, Aida Mustapha, Johanna Ahmad, Khadijah Chamili, Rahamirzam Muhamad
Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia

ABSTRACT
This study constructs a Graduates Employability Model using the classification task in data mining. To achieve this, we use data for the year 2009 sourced from the Tracer Study, a web-based survey system from the Ministry of Higher Education, Malaysia (MOHE). The classification experiment is performed using various Bayes algorithms to determine whether a graduate has been employed, remains unemployed or is in an undetermined situation. The performance of the Bayes algorithms is also compared against a number of tree-based algorithms. Information Gain is used to rank the attributes, and the results show that the top three attributes with a direct impact on employability are the job sector, job status and reason for not working. The results also show that J48, a variant of the decision-tree algorithm, achieved the highest accuracy, 92.3%, compared with an average of 91.3% for the Bayes algorithms. This leads to the conclusion that a tree-based classifier is more suitable for the tracer data due to the information gain strategy.

KEYWORDS
Classification, Bayes Methods, Decision Tree, Employability

1 INTRODUCTION
Tracer Study is a web-based survey system developed by the Ministry of Higher Education, Malaysia (MOHE). All students graduating from polytechnics and public or private institutions must complete it before their convocation, for any level of degree awarded. The sole purpose of the survey is to guide future planning and to improve various aspects of the local higher education administrative system.
The survey also serves as a tool to gauge the adequacy of higher education in Malaysia in supplying manpower needs in all areas across technical, managerial or social science. Data sourced from the Tracer Study is invaluable because it relates graduate qualifications and skills to employment status.

Graduate employability remains a national issue due to the increasing number of graduates produced by higher education institutions each year. According to statistics generated from the Tracer Study, the total number of graduates produced by higher institutions in 2008 was 139,278. In 2009, the number increased to 155,278 graduates. Of the 2009 graduates, 50%
are bachelor's degree holders from public and private universities. Only 49.20% of them, or 38,191 graduates, were employed within the first six months after finishing their studies.

Previous research on graduate employability covers a wide range of domains such as education, engineering, and social science. While that research is mainly based on surveys or interviews, little has been done using data mining techniques. Bayes' theorem is among the earliest statistical methods used to identify patterns in data. As datasets have grown in size and complexity, data mining has emerged as a technology that applies methods such as neural networks, genetic algorithms, decision trees, and support vector machines to uncover hidden patterns [1]. Today, data mining technologies deal with huge amounts of data from various sources, for example relational or transactional databases, data warehouses, images, flat files or the World Wide Web.

Classification is the task of generalizing from observations in the training data, each of which is labelled with its class. The objective of this paper is to predict whether a graduate has been employed, remains unemployed or is in an undetermined situation within the first six months after graduation. This is achieved through a classification experiment that classifies a graduate profile as employed, unemployed or others. The main contribution of this paper is the comparison of classification accuracy between various algorithms from the two most commonly used data mining techniques in the education domain, namely Bayes methods and decision trees.

The remainder of this paper is organized as follows. Section 2 presents related work on graduate employability and reviews recent techniques employed in data mining. Section 3 introduces the dataset and the experimental setting. Section 4 discusses the results. Finally, Section 5 concludes the paper with some directions for future work.
2 RELATED WORK
A number of studies have sought to identify the factors that influence graduate employability in Malaysia. They are an initial step towards aligning higher education with industry, two sectors that unquestionably affect each other. Nonetheless, most of the previous work was carried out outside the data mining domain, and its data were collected through surveys of sample populations.

Research in [2] identifies three major requirements that concern employers when hiring employees: basic academic skills, higher-order thinking skills, and personal qualities. The work is restricted to the education domain, specifically analyzing the effectiveness of one subject, English for Occupational Purposes (EOP), in enhancing employability skills. Similar to [2], the work in [3] proposes restructuring the curriculum and methods of instruction to prepare future graduates for forthcoming challenges, based on the model of the T-shaped professional and the newly developed field
of Service Science, Management and Engineering (SSME). More recently, [4] proposes a new Malaysian Engineering Employability Skills Framework (MEES), constructed from the requirements of accrediting and professional bodies and from existing research findings on employability skills, as a guideline for training packages and qualifications in Malaysia. Nonetheless, graduate employability is rarely studied within the scope of data mining, mainly because authentic data sources are limited.

Employability issues have also been studied in other countries. Research by The Higher Education Academy with the Council for Industry and Higher Education (CIHE) in the United Kingdom concluded that there are six competencies that employers look for in individuals who can transform organizations and add value in their careers [5]: cognitive skills or brainpower, generic competencies, personal capabilities, technical ability, business or organization awareness, and practical elements. Furthermore, it covers a set of achievements, comprising skills, understandings and personal attributes, that make graduates more likely to gain employment and succeed in their chosen occupations, which benefits the graduates, the community and the economy.

Data mining techniques have, however, been employed in the education domain, for instance in the prediction and classification of student academic performance using Artificial Neural Networks [6, 7] and a combination of clustering and decision tree classification techniques [6]. Experiments in [8] classify students to predict their final grade using six common classifiers: Quadratic Bayesian classifier, 1-nearest neighbour (1-NN), k-nearest neighbour (k-NN), Parzen-window, multilayer perceptron (MLP), and Decision Tree.
With regard to student performance, [9] discovers individual student characteristics associated with success, measured by grade point average (GPA), using the Microsoft Decision Trees (MDT) classification technique. [10] shows some applications of data mining in educational institutions that extract useful information from huge data sets. Data mining through analytical tools lets users view and use current information in decision-making processes such as organizing the syllabus, predicting the registration of students in an educational program, predicting student performance, detecting cheating in online examinations, and identifying abnormal or erroneous values. Among the related work, [11] is the most closely related to this research: it mines historical data of students' academic results using different classifiers (Bayes, trees, function) to rank the influencing factors that contribute to predicting student academic performance.

3 MATERIALS AND METHODS
The main objective of this paper is to classify a graduate profile as employed, unemployed or undetermined using data sourced from the Tracer Study database for the year 2009. The dataset consists
of 12,830 instances and 20 attributes related to graduate profiles from 19 public universities and 138 private universities. Table 1 shows the complete list of attributes for the Tracer Study dataset.

To construct the classifiers, we use the Waikato Environment for Knowledge Analysis (WEKA), an open-source data mining tool [12] developed at the University of Waikato, New Zealand. It provides various learning algorithms that can easily be applied to the dataset. WEKA's native input format is the Attribute-Relation File Format (ARFF). Therefore, once data preparation was done, we transformed the dataset into an ARFF file with the extension .arff.
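To make the ARFF conversion step concrete, the following is a minimal sketch of how nominal records could be rendered as ARFF text. It is not the authors' conversion procedure: the helper `to_arff`, the relation name and the two sample records are hypothetical, and only three of the dataset's nominal attributes (sex, univ, emp_status) are used.

```python
def to_arff(relation, attributes, rows):
    """Render nominal data as ARFF text (hypothetical helper, for illustration only).

    attributes: list of (name, [allowed values]); rows: list of value tuples.
    Names and values are assumed to need no quoting (no spaces or special characters).
    """
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # Reject values that were not declared in the attribute header.
        for (name, values), value in zip(attributes, row):
            if value not in values:
                raise ValueError(f"{value!r} is not a declared value of {name}")
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


ATTRIBUTES = [
    ("sex", ["male", "female"]),
    ("univ", ["public_univ", "private_univ"]),
    ("emp_status", ["employed", "unemployed", "others"]),
]
ROWS = [  # invented illustrative records, not Tracer Study data
    ("female", "public_univ", "employed"),
    ("male", "private_univ", "unemployed"),
]

arff_text = to_arff("tracer_study_sample", ATTRIBUTES, ROWS)
print(arff_text)
```

The header declares each nominal attribute with its set of allowed values, and each line after `@data` is one instance with values in the same order as the declarations.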
Table 1. Attributes from the Tracer Study dataset after pre-processing.

No. | Attribute | Values | Description
1 | sex | {male, female} | Gender of the graduate
2 | age | {20-25, 25-30, 30-40, 40-50, >50} | Age of the graduate
3 | univ | {public_univ, private_univ} | University/institution of current qualification
4 | level | {certificate, diploma, advanced_diploma, first_degree, postGraduate_diploma, masters_thesis, masters_courseWork&Thesis, masters_courseWork, phd_thesis, phd_courseWork&Thesis, professional} | Level of study for current qualification
5 | field | {technical, ict, education, science, art&soc_science} | Field of study for current qualification
6 | cgpa | {2.00-2.49, 2.50-2.99, 3.00-3.66, 3.67-4.00, failed, 4.01-6.17} | CGPA for current qualification
7 | emp_status | {employed, unemployed, others} | Current employment status
8-17 | general_IT skills, Malay_lang, English_lang, gen_knowledge, interpersonal_comm, cc_thinking, analytical, prob_solving, positive_value, teamwork | {satisfied, extremely_satisfied, average, strongly_not_satisfied, not_satisfied, not_applicable} | Level of IT skills, Malay and English language proficiency, general knowledge, interpersonal communication, creative and critical thinking, analytical skills, problem solving, inculcation of positive values, and teamwork acquired from the programme of study
18 | job_status | {permanent, contract, temp, self_employed, family_business} | Job status of employed graduates
19 | job_sector | {local_private_company, multinational_company, own_company, government, NGO, GLC, statutory_body, others} | Job sector of employed graduates
20 | reason_not_working | {job_hunting, waiting_for_posting, further_study, participating_skills_program, waiting_posting_of_study, unsuitable_job, resting, others, family_responsibilities, medical_issues, not_interested_to_work, not_going_to_work, lack_of_confidence, chambering} | Reason for not working for unemployed graduates

3.1 Data-Preprocessing
The raw data retrieved from
the Tracer Study database required pre-processing to prepare the dataset for the classification task. First, cleaning activities involved eliminating records with missing values in critical attributes, identifying outliers, correcting inconsistent data, and removing duplicate data. From the total of 89,290 instances in the raw data, the data cleaning process left 12,830 instances ready to be mined. For
missing values (i.e., age attribute), we replaced them with the mean value of the attribute. Second, data discretization is required because most of the attributes from the Tracer Study are continuous. We therefore discretized the values into intervals to turn them into categorical or nominal attributes, as follows:
• cgpa, previously a continuous number, is transformed into a grade range
• sex, previously coded as 1 and 2, is transformed into nominal
• age, previously a continuous number, is transformed into an age range
• field of study, previously a numerical code 1-4, is transformed into nominal
• skill information (i.e., language proficiency, general knowledge, interpersonal communication, etc.), previously numerical 1-9, is transformed into nominal
• employment status, previously a numerical code 1-3, is transformed into nominal

3.2 Classification Task
The classification task at hand is to predict the employment status (employed, unemployed, others) for graduate profiles in the Tracer Study. The task is performed in two stages, training and testing. Once the classifier is constructed, the testing dataset is used to estimate the predictive accuracy of the classifier. There are four testing options in WEKA: using the training set, a supplied test set, cross-validation and percentage split. If we use the training set as the test option, the test data are the same data the classifier was trained on, so the resulting estimate of the true error rate is unreliable. A supplied test set lets us provide test data prepared separately from the training data. Cross-validation is suitable for limited datasets, and the number of folds can be set by the user. 10-fold cross-validation is widely used to get the best estimate of error, as shown by extensive tests on numerous datasets with different learning techniques [13].
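The testing options described above can be sketched in a few lines. The code below is a plain standard-library illustration of a percentage split and of k-fold cross-validation index generation, not WEKA's implementation; the function names, the random seed and the 70% training fraction are assumptions chosen for the example.

```python
import random


def percentage_split(n_instances, train_fraction, seed=0):
    """Shuffle instance indices and cut them into (train, test) lists."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example: a 70/30 hold-out split over 12,830 instances.
train_idx, test_idx = percentage_split(12830, 0.70)
print(len(train_idx), len(test_idx))
```

With a hold-out split each instance is used either for training or for testing, whereas in k-fold cross-validation every instance serves as test data in exactly one fold and as training data in the other k-1 folds.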
Because our dataset is large, and to avoid overfitting, we employed the hold-out validation method with a 70-30 percentage split, whereby 70% of the 12,830 instances are used for training while the remaining instances are used for testing. Various algorithms from both the Bayes and decision tree families are used to predict the employment status.

Information Gain. Information Gain is the attribute selection measure used in ID3. If node N represents the tuples of partition D, the attribute with the highest information gain is chosen as the splitting attribute for node N. This minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (though not necessarily the simplest) tree is found. The expected information needed to classify a tuple in D is given by

\[ \mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]

where m is the number of classes and p_i is the proportion of tuples in D that belong to class C_i.
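As a worked illustration of the measure, the sketch below computes Info(D) and the information gain of an attribute using the standard ID3 definition, Gain(A) = Info(D) - sum over values v of (|D_v|/|D|) * Info(D_v). The function names and the four toy records are invented for the example and are not Tracer Study data.

```python
from collections import Counter
from math import log2


def info(labels):
    """Info(D): expected information (entropy, in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(rows, attribute, target):
    """Standard ID3 gain: Info(D) minus the weighted Info of each partition D_v."""
    labels = [row[target] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(labels) - remainder


# Invented toy records: here univ separates the classes perfectly, sex not at all.
ROWS = [
    {"sex": "male", "univ": "public_univ", "emp_status": "employed"},
    {"sex": "female", "univ": "public_univ", "emp_status": "employed"},
    {"sex": "male", "univ": "private_univ", "emp_status": "unemployed"},
    {"sex": "female", "univ": "private_univ", "emp_status": "unemployed"},
]

ranking = sorted(
    ["sex", "univ"],
    key=lambda a: info_gain(ROWS, a, "emp_status"),
    reverse=True,
)
print(ranking)
```

Sorting attributes by their gain, as in the last lines, is the same idea used to rank attributes by Information Gain: the attribute whose values best separate the classes comes first.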
Bayes Methods. In Bayes methods, the classification task consists of predicting a class variable given a set of attribute variables. It is a type of statistical inference in which the prior distribution is estimated from the data before any new data are observed, hence every parameter is assigned a prior probability distribution [14]. A Bayesian classifier learns from the samples over both class and attribute variables. The naïve Bayesian classifier works as follows. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, $X = (x_1, x_2, \ldots, x_n)$, depicting n measurements made on the tuple from n attributes, respectively, $A_1, A_2, \ldots, A_n$. Suppose that there are m classes, $C_1, C_2, \ldots, C_m$. Given a tuple X, the classifier predicts that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class $C_i$ if and only if

\[ P(C_i \mid X) > P(C_j \mid X) \quad \text{for } 1 \le j \le m,\; j \ne i \]

Thus, we maximize $P(C_i \mid X)$. The class $C_i$ for which $P(C_i \mid X)$ is maximized is called the maximum posteriori hypothesis.

Under the Bayes methods in WEKA, we performed the experiment with eight algorithms: Averaged One-Dependence Estimators (AODE), AODEsr, WAODE, Bayes Network, HNB, Naïve Bayesian, Naïve Bayesian Simple and Naïve Bayesian Updateable. AODE, HNB and Naïve Bayesian were also used in [11]; the remaining algorithms were chosen to further compare the results of the Bayes algorithms on the same dataset. AODE achieves high accuracy by averaging over all of a small space of alternative naive-Bayes-like models that have weaker, and hence less detrimental, independence assumptions than naive Bayes. The resulting algorithm is computationally efficient while delivering highly accurate classification on many learning tasks. AODEsr and WAODE are extensions of AODE.
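The naïve Bayesian decision rule described above can be illustrated with a short sketch for nominal attributes. It is not WEKA's implementation: the class name, the Laplace smoothing constant `alpha` and the six toy records are assumptions made for the example. The score is the logarithm of a quantity proportional to P(C_i | X), computed under the usual naïve assumption that attributes are conditionally independent given the class.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Minimal naive Bayes for nominal attributes (illustrative sketch only)."""

    def fit(self, rows, labels, alpha=1.0):
        self.alpha = alpha                         # Laplace smoothing (an assumption)
        self.class_counts = Counter(labels)
        self.n = len(labels)
        self.value_counts = defaultdict(Counter)   # (class, attr index) -> value counts
        self.values = defaultdict(set)             # attr index -> distinct values seen
        for x, c in zip(rows, labels):
            for k, v in enumerate(x):
                self.value_counts[(c, k)][v] += 1
                self.values[k].add(v)
        return self

    def log_score(self, x, c):
        """log of P(c) * prod_k P(x_k | c), which is proportional to P(c | x)."""
        score = log(self.class_counts[c] / self.n)
        for k, v in enumerate(x):
            num = self.value_counts[(c, k)][v] + self.alpha
            den = self.class_counts[c] + self.alpha * len(self.values[k])
            score += log(num / den)
        return score

    def predict(self, x):
        # Choose the class with the highest (log) posterior score.
        return max(self.class_counts, key=lambda c: self.log_score(x, c))


# Invented toy records: (English_lang, field) -> emp_status.
ROWS = [
    ("satisfied", "ict"), ("satisfied", "technical"), ("average", "ict"),
    ("not_satisfied", "education"), ("not_satisfied", "ict"), ("average", "education"),
]
LABELS = ["employed", "employed", "employed", "unemployed", "unemployed", "unemployed"]

model = NaiveBayes().fit(ROWS, LABELS)
print(model.predict(("satisfied", "ict")))
```

Working in log space avoids numerical underflow when many attribute probabilities are multiplied, and the smoothing term keeps an unseen attribute value from forcing a class probability to zero.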
AODEsr complements AODE with Subsumption Resolution, which detects specializations between two attribute values at classification time and deletes the generalization attribute value. Meanwhile, WAODE constructs a model called Weightily Averaged One-Dependence Estimators by assigning a weight to each one-dependence estimator. Bayes Network performs Bayes network learning using various search algorithms and quality measures. HNB constructs a Hidden Naive Bayes classification model with high classification accuracy and AUC. In Naive Bayes, numeric estimator precision values are chosen based on analysis of the training data. The Naïve Bayes Updateable classifier uses a default precision of 0.1 for numeric attributes when the classifier is built with zero training instances. Naive Bayes Simple models numeric attributes with a normal distribution.

Tree Methods. Tree-based methods classify instances by sorting the instances down the tree from the root to some leaf node, which provides the classification of a particular instance. Each node in the tree specifies a test of
some attribute of the instance and each branch descending from that node corresponds to one of the possible values for this attribute [15]. Figure 1 shows the model produced by decision trees, which is represented in the form of a tree structure. Under the tree method in WEKA, we performed the classification experiment with nine algorithms, which are ID3, J48, REPTree, J48graft, Random Tree, Decision Stump, LADTree, Random Forest and Simple Cart. J48 and REPTree were also used in [11], but we did not manage to use NBTree and BFTree because the experiment worked on a large dataset that exceeded the memory allocation in WEKA. The FT, User Classifier and LMT algorithms experienced the same problem as NBTree and BFTree. In addition, we employed ID3, J48graft, Random Tree, Decision Stump, LADTree, Random Forest and Simple Cart to experiment with other alternative decision tree algorithms.

Figure 1. In a tree structure, each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. A leaf node indicates the class of the examples. The instances are classified by sorting them down the tree from the root node to some leaf node.

ID3 is a class for constructing an unpruned decision tree based on the ID3 algorithm, which only deals with nominal attributes. J48 is a class for generating a pruned or unpruned C4.5 decision tree, while J48graft generates a grafted (pruned or unpruned) C4.5 decision tree. REPTree is a fast decision tree learner which builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). Decision Stump is usually used in conjunction with a boosting algorithm. LADTree generates a multi-class alternating decision tree using the LogitBoost strategy.
Random Forest constructs a forest of random trees, whereas Random Tree constructs a tree that considers K randomly chosen attributes at each node without pruning. SimpleCart implements minimal cost-complexity pruning.

4 RESULTS AND DISCUSSIONS

We segregated the experimental results into three parts. The first is the result from ranking attributes in the Tracer Study dataset using Information Gain. The second and third parts present the predictive accuracy results of various algorithms from the Bayes and decision tree families, respectively.

4.1 Information Gain

In this study, we employed Information Gain to rank the attributes by their importance in determining the target values, as well as to reduce the number of attributes used in prediction. Decision
tree algorithms adopt a mutual-information criterion to choose the particular attribute to branch on that gains the most information. This is inherently a simple preference bias that explicitly searches for a simple hypothesis. Ranking attributes also increases the speed and accuracy of prediction. Based on the attribute selection using Information Gain, the job sector attribute was found to be the most important factor in discriminating the graduate profiles to predict the graduate's employment status. This is shown in Figure 2.

Figure 2. Job sector is ranked the highest by attribute selection based on Information Gain. This is largely because the attribute has a small set of values, thus one instance is easily distinguishable from the remaining instances.

4.2 Bayes Methods

Table 2 shows the classification accuracies for various algorithms under the Bayes method. In addition, the table provides comparative results for the kappa statistic, mean absolute error, root mean squared error, relative absolute error, and root relative squared error from the total of 3,840 testing instances.
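To make the first three of these measures concrete, the sketch below derives accuracy, error rate and the kappa statistic (Cohen's kappa) from a confusion matrix. The matrix is hypothetical and is not taken from the experiments reported here; the function name is likewise an assumption. The remaining error measures are not covered.

```python
def summarize(confusion):
    """Accuracy, error rate and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i
    that were predicted as class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, 1 - observed, kappa


# Hypothetical counts for three classes (employed, unemployed, others).
matrix = [
    [50, 5, 5],
    [4, 20, 6],
    [3, 2, 5],
]
accuracy, error_rate, kappa = summarize(matrix)
print(f"accuracy={accuracy:.3f} error={error_rate:.3f} kappa={kappa:.3f}")
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why it is reported alongside raw accuracy when the classes are unevenly distributed.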
The Weightily Averaged One-Dependence Estimators (WAODE) algorithm achieved the highest accuracy percentage as compared to other algorithms. Instead of treating each tree-augmented naive Bayes equally, as AODE does, [16] extended AODE by assigning a different weight to each tree-augmented naive Bayes, since the attributes do not all play the same role in classification.

Table 2. Classification accuracy using various algorithms under Bayes method in WEKA. (MAE = Mean Absolute Error; RMSE = Root Mean Squared Error; RAE = Relative Absolute Error; RRSE = Root Relative Squared Error.)

Algorithm               Accuracy (%)  Error Rate (%)  Kappa Statistic  MAE    RMSE   RAE (%)  RRSE (%)
WAODE                   91.3          8.7             0.834            0.073  0.203  20.8     48.4
AODE                    91.1          8.9             0.827            0.069  0.208  19.5     49.6
Naïve Bayesian          90.9          9.1             0.825            0.072  0.214  20.5     51.3
Naïve Bayes Simple      90.9          9.1             0.825            0.072  0.214  20.5     51.3
BayesNet                90.9          9.1             0.824            0.072  0.215  20.5     51.4
AODEsr                  90.9          9.1             0.824            0.071  0.210  20.1     50.2
Naïve Bayes Updateable  90.9          9.1             0.825            0.072  0.214  20.5     51.3
HNB                     90.3          9.7             0.816            0.091  0.214  25.7     51.1

4.3 Tree Methods

Table 3 shows the classification accuracies for various algorithms under the tree method. In addition, the table provides comparative results for the kappa statistic, mean absolute error, root mean squared error, relative absolute error, and root relative squared error from the total of 3,840 testing instances.

Table 3. Classification accuracy using various algorithms under Tree method in WEKA.
Algorithm       Accuracy (%)  Error Rate (%)  Kappa Statistic  Mean Absolute Error  Root Mean Squared Error  Relative Absolute Error (%)  Root Relative Squared Error (%)
J48Graft        92.3          7.7             0.849            0.078                0.204                    22.1                         48.7
J48             92.2          7.8             0.848            0.078                0.204                    22.2                         48.8
Simple Cart     92.0          8.0             0.844            0.079                0.199                    22.3                         47.5
Random Forest   91.4          8.6             0.832            0.083                0.205                    23.4                         49.1
LAD Tree        91.3          8.7             0.830            0.077                0.197                    22.0                         47.0
REPTree         91.0          9.0             0.825            0.080                0.213                    22.8                         50.9
Decision Stump  91.0          9.0             0.821            0.108                0.232                    30.6                         55.3
RandomTree      88.9          11.1            0.787            0.081                0.269                    23.0                         64.4
ID3             86.7          13.3            0.795            0.072                0.268                    21.1                         65.2

The J48Graft algorithm achieved the highest accuracy percentage as compared to other algorithms. J48Graft generates a grafted C4.5 decision tree, whether pruned or unpruned. Grafting is an inductive process that adds nodes to the inferred decision tree. Unlike pruning, which uses only local information as the tree grows, grafting uses non-local information to provide better predictive accuracy. Figure 3 shows the difference in tree structure between a J48 tree and the grafted J48 tree.

Figure 3. The top figure is the tree structure for J48 and the bottom figure is the tree structure for grafted J48. Grafting adds nodes to the decision trees to increase the predictive accuracy. In the grafted J48, new branches are added in the place of a single leaf or graft within leaves.

Comparing the performance of both Bayes and tree-based methods, the J48Graft algorithm achieved the highest accuracy of 92.3% using the Tracer Study dataset. The second highest accuracy is also under the tree method, namely the J48 algorithm with an accuracy of 92.2%. The Bayes methods rank lower: the best of them, WAODE, reaches a prediction accuracy of 91.3%, below the four best tree-based algorithms in Table 3. Nonetheless, we found that both classification approaches were complementary because the Bayes methods provide a better view of associations or dependencies among the attributes, while the results from the tree method are easier to interpret. Figure 4 shows the mapping of root mean squared error values that resulted from the classification experiment. This knowledge could be used to gain insights into the employment trend of graduates from local higher institutions.
[Figure 4: radial chart, scale 0 to 0.3. Series: Bayes Methods, Tree-based Methods. Axis labels: AODE vs. J48Graft; Naïve Bayesian; Naïve Bayes Simple vs. REPTree; BayesNet vs. RandomTr; HNB vs. ID3.]

Figure 4. A radial display of the root mean squared error across all algorithms under both Bayes and tree-based methods relative to accuracy. The smaller the root mean squared error, the better the forecast. Based on this figure, three out of five tree-based algorithms indicate a better forecast as compared to the corresponding algorithms under the Bayes methods.

6 CONCLUSIONS

As the education sector blooms every year, graduates are facing stiff competition to ensure their employability in the industry. The sole purpose of the Tracer Study system is to aid the higher educational institutions in preparing their graduates with sufficient skills to enter the job market. This paper focused on identifying attributes that influenced graduates' employability based on actual data from the graduates themselves six months after graduation. Nonetheless, assembling the dataset was difficult because only 90% of the attributes made their way to the classification task. This is due to confidentiality and sensitivity issues: the data owner did not permit the use of the remaining 10% of the attributes.

This paper attempts to predict whether a graduate has been employed, remains unemployed or is in an undetermined situation within the first six months after graduation. The prediction has been performed through a series of classification experiments using various algorithms under the Bayes and decision tree methods to classify a graduate profile as employed, unemployed or others. Results showed that J48Graft, a grafted variant of the J48 decision-tree algorithm, yielded the highest accuracy of 92.3%, compared with 91.3% for the best Bayes algorithm, WAODE.
As for future work, we hope to expand the dataset from the Tracer Study with more attributes and to annotate the attributes with information such as the correlation factor between the current employer and the previous employer. We are also looking at integrating datasets from different sources of data,
for instance graduate profiles from the alumni organizations in the respective educational institutions. With this in place, we next plan to introduce clustering as part of pre-processing to cluster the attributes before attribute ranking is performed. Finally, other data mining techniques such as anomaly detection or classification-based association may be implemented in order to gain more knowledge on graduate employability in Malaysia.

Acknowledgments. Special thanks to Prof. Dr. Md Yusof Abu Bakar and Puan Salwati Badaroddin from the Ministry of Higher Education Malaysia (MOHE) for their help with data gathering as well as expert opinion.

7 REFERENCES

1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2006)
2. Shafie, L.A., Nayan, S.: Employability Awareness among Malaysian Undergraduates. International Journal of Business and Management. 5(8):119--123 (2010)
3. Mukhtar, M., Yahya, Y., Abdullah, S., Hamdan, A.R., Jailani, N., Abdullah, Z.: Employability and Service Science: Facing the Challenges via Curriculum Design and Restructuring. In: International Conference on Electrical Engineering and Informatics, pp. 357--361 (2009)
4. Zaharim, A., Omar, M.Z., Yusoff, Y.M., Muhamad, N., Mohamed, A., Mustapha, R.: Practical Framework of Employability Skills for Engineering Graduate in Malaysia. In: IEEE EDUCON Education Engineering 2010: The Future Of Global Learning Engineering Education, pp. 921--927 (2010)
5. Rees, C., Forbes, P., Kubler, B.: Student Employability Profiles: A Guide for Higher Education Practitioners (2006)
6. Wook, M., Yahaya, Y.H., Wahab, N., Isa, M.R.M.: Predicting NDUM Student's Academic Performance using Data Mining Techniques. In: Second International Conference on Computer and Electrical Engineering, pp. 357--361 (2009)
7. Ogor, E.N.: Student Academic Performance Monitoring and Evaluation Using Data Mining Techniques. In: Fourth Congress of Electronics, Robotics and Automotive Mechanics, pp. 354--359 (2007)
8. Minaei-Bidgoli, B., Kashy, D.A., Kortemeyer, G., Punch, W.F.: Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-based System. In: 33rd Frontiers in Education Conference, pp. 13--18 (2003)
9. Guruler, H., Istanbullu, A., Karahasan, M.: A New Student Performance Analysing System using Knowledge Discovery in Higher Educational Databases. Computers & Education. 55(1), pp. 247--254 (2010)
10. Kumar, V., Chadha, A.: An Empirical Study of the Applications of Data Mining Techniques in Higher Education. International Journal of Advanced Computer Science and Applications. 2(3), pp. 80--84 (2011)
11. Affendey, L.S., Paris, I.H.M., Mustapha, N., Sulaiman, M.N., Muda, Z.: Ranking of Influencing Factors in Predicting Student Academic Performance. Information Technology Journal. 9(4):832--837 (2010)
12. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explorations. 11(1) (2009)
13. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)
14. Jaynes, E.T.: Probability Theory: The Logic of Science. Cambridge University Press (2003)
15. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997)
16. Jiang, L., Zhang, H.: Weightily Averaged One-Dependence Estimators. In: Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence, PRICAI 2006, pp. 970--974 (2006)

International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(4): 1086-1098 The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)