Text Analysis: Discovering Insights for the Healthcare Industry.
Learn how Machine Learning helps discover insights for the Healthcare industry by analyzing text. The lecturer is Tomáš Kliegr, Associate Professor at the Department of Information and Knowledge Engineering at Prague University of Economics and Business (VSE).
Machine Learning School for Business Schools 2021: Virtual Conference.
Analyzing Citation Patterns in COVID-19 Literature
1. #BigMLSchool Twitter: @kliegr
Web: kliegr.eu
Citation patterns in COVID-19
related biomedical literature
Analyzing Text: Discovering Insights for
the Healthcare Industry
Why was this cited? Generating semantic explanations for the
CORD-19 corpus
Tomas Kliegr
Prague University of Economics and Business
Czech Republic
tomas.kliegr@vse.cz
Prague University of Economics and Business
• The largest university in the field of economics, business and information technology in Czechia, 15,000 students (BSc, MSc, MBA, PhD)
• English Master programmes
– Information Systems
Management
– Economic and Data Analysis
Data Science Curriculum
• Programming is a mandatory part of our Applied Informatics BSc.
• However, we use a visual, non-programming approach in our introductory data science course
• Up to 200 students/semester
[Figure: course timeline; labels: 2012, 2014, 2021, credit risk case study, clustering]
The problem
Imagine you manage a research team.
You need to pair the expertise of your staff
with the needs of the wider research
community, which will build upon your
results.
How do you find out whether research has made an impact?
One of the main KPIs in science is the
number of citations.
What made past research successful –
highly cited?
We will show how to leverage existing
freely available research articles to get the
answers.
Part of the CORD-19 dataset, which contains more than 400,000 articles related to SARS-CoV-2. Source: VOSviewer
Example
Pratelli, A. (2008). Canine coronavirus inactivation with physical and chemical
agents. The Veterinary Journal, 177(1), 71-79.
[Figure legend: increases citation probability; decreases citation probability]
• Our original approach was implemented in Python and involved many trial-and-error iterations and hundreds of hours of computation time.
Lucie Beranová, Marcin Joachimiak, Tomáš Kliegr, Gollam Rabby, Vilém Sklenák.
Why was this cited? Generating semantic explanations for the CORD-19 corpus.
Under preparation.
• In this tutorial, we show how the modelling part can be recreated without
programming skills using cloud-based machine learning.
Data preprocessing
[Figure: data reduction steps with counts > 300,000 articles, 36,000 and 3,000; labels: open citations available, CORD-on-FHIR subset]
Text processing options
"structure of coronavirus main proteinase
reveals"
• Tokenization and Unigrams
– structure;of;coronavirus;main;proteinase;reveals
• Bigrams:
– structure of; of coronavirus; coronavirus main; main proteinase; proteinase reveals
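The two options above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the tokenizer used in the tutorial; the helper names `tokenize` and `ngrams` are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (a simplifying assumption)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title = "structure of coronavirus main proteinase reveals"
unigrams = tokenize(title)     # one token per word
bigrams = ngrams(unigrams, 2)  # pairs of neighbouring words
```

A title with k tokens yields k unigrams and k - 1 bigrams, so adding bigrams roughly doubles the number of candidate features per title.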
Text processing options
"Structure of coronavirus main proteinase
reveals"
• Stemming: proteinase => protein
• Stop-word removal:
– structure;coronavirus;main;proteinase;reveals ("of" removed)
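Both options can be imitated with a toy example. The stop-word list and the suffix list below are hypothetical and deliberately tiny; real stemmers apply much richer rule sets and may not map "proteinase" to "protein" the way the slide does, so treat this as a sketch of the idea only.

```python
# Hypothetical, deliberately small lists chosen for this illustration.
STOP_WORDS = {"of", "the", "a", "an", "in", "and"}
SUFFIXES = ("ase", "ing", "ed")


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = "structure of coronavirus main proteinase reveals".split()
filtered = remove_stop_words(tokens)         # "of" is dropped
stemmed = [toy_stem(t) for t in filtered]    # "proteinase" becomes "protein"
```

Stemming merges related word forms into one feature, while stop-word removal discards very frequent words that carry little topical information.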
Adding a new column
• Since we aim for a classification task, we will add a new column indicating whether a paper is above or below the median citation count.
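For readers who prefer code, the new column could be derived as follows. The field names (`citations`, `above_median`) and the citation counts are invented for this sketch, and papers exactly at the median are assigned to the lower class, which is one possible convention rather than the one necessarily used in the tutorial.

```python
from statistics import median

# Hypothetical papers with made-up citation counts.
papers = [
    {"paper_id": "p1", "citations": 0},
    {"paper_id": "p2", "citations": 2},
    {"paper_id": "p3", "citations": 5},
    {"paper_id": "p4", "citations": 12},
    {"paper_id": "p5", "citations": 40},
]


def add_median_label(rows, count_key="citations", label_key="above_median"):
    """Return copies of the rows with a boolean column: count > median count."""
    cutoff = median(r[count_key] for r in rows)
    return [dict(r, **{label_key: r[count_key] > cutoff}) for r in rows]


labelled = add_median_label(papers)
labels = [row["above_median"] for row in labelled]
```

Splitting at the median keeps the two classes roughly balanced, which makes accuracy easier to interpret later on.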
Creating train/test split
• To evaluate our model, we need to create a train/test split
Setting up the split
• Using a random seed will ensure we get the
same split each time
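The effect of the seed can be reproduced outside the tool with the standard library. The 80/20 ratio, the seed value and the function name `train_test_split` are assumptions made for this sketch; the slides do not state the settings used.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=1):
    """Shuffle a copy of the rows with a fixed seed and cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # private generator, reproducible order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-ins for paper records
train, test = train_test_split(rows)
train_again, test_again = train_test_split(rows)
```

Because the generator is created from the same seed on every call, both calls return identical train and test sets.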
Creating a logistic regression model
Setting up the model
• Some fields are suggested for removal based on automatic feature importance assessment
• Manually remove the paper id
• Check that the target is set correctly
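To show what such a model does with text features, here is a minimal logistic regression on a bag of words, trained with plain stochastic gradient descent. It is not the implementation behind the cloud tool, and the four rows, the learning rate and the epoch count are all invented for illustration. The paper id is carried along but never used as a feature, mirroring the manual removal above.

```python
import math

# Hypothetical rows: (paper_id, title tokens, label); label 1 = above median.
ROWS = [
    ("p1", ["structure", "coronavirus", "proteinase"], 1),
    ("p2", ["canine", "coronavirus", "inactivation"], 0),
    ("p3", ["bat", "coronavirus", "structure"], 1),
    ("p4", ["feline", "coronavirus", "vaccine"], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, epochs=500, learning_rate=0.5):
    """Fit one weight per token plus a bias; the paper id is ignored."""
    weights = {tok: 0.0 for _, toks, _ in rows for tok in toks}
    bias = 0.0
    for _ in range(epochs):
        for _paper_id, toks, label in rows:
            present = set(toks)  # binary features: token present or not
            error = sigmoid(bias + sum(weights[t] for t in present)) - label
            bias -= learning_rate * error
            for t in present:
                weights[t] -= learning_rate * error
    return weights, bias


def predict_proba(weights, bias, toks):
    """Probability of the positive class; unseen tokens contribute nothing."""
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in set(toks)))


weights, bias = fit_logistic(ROWS)
p_structure = predict_proba(weights, bias, ["structure"])
p_canine = predict_proba(weights, bias, ["canine"])
```

The sign of each weight is what makes the model readable: a positive weight means the token raises the predicted probability of being highly cited, a negative weight lowers it.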
Evaluating the ensemble
Disadvantages of ensembles:
• Lower interpretability (accuracy-interpretability trade-off)
• Longer learning time
• Longer time required to apply the model
Other insights from the full analysis
Lucie Beranová, Marcin Joachimiak, Tomáš Kliegr*, Gollam Rabby, Vilém Sklenák.
Why was this cited? Generating semantic explanations for the CORD-19 corpus.
Under preparation. Preprints are shared on request from the author marked *.
• Citation "biases"
– Articles whose authors have Western-sounding names are cited more often
– Phylogenetic distance from the human virus: feline (FIPV, FCOV) and canine (CCOV) coronaviruses are cited less often, possibly because these viruses are more distant from the human virus than camel and bat viruses
• Accuracy-interpretability trade-off
– Embeddings-based language models, TF-IDF
– Random Forests, rule learning, neural networks
– Directly explainable models (rules) vs. explanation algorithms such as LIME or Shapley values
Credits
• Colleagues and Ph.D. students at VSE:
– Ing. Lucie Beranová (help with mining in Python)
– Gollam Rabby, MSc. (citation counts, proof of concept)
– prof. Vilém Sklenák – bibliometric expert
• Dr. Marcin Joachimiak - Computational Biosciences, LBL
Berkeley
– Interpretation of results
Thanks for your
attention!
Tomáš Kliegr, UEP
tomas.kliegr@vse.cz
Open Doors Day on
YouTube March 3, 2021
https://fis.vse.cz/english/