5. I. Collaboration with Statistician?
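Academia Sinica Vice President 賴明昭 promotes cross-disciplinary collaboration (November 10, 2004)
Collaboration is not necessarily data analysis
Collaborating with a statistician, though, most likely does call for data analysis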
6. Professor Chen, my colleague used Factor Analysis and got into the same journal in only one third of the time
Luxury Research vs. Necessity Research
Senior Researcher vs. Young Investigator
(Established) vs. (Struggling)
[Figure: correlation maps with color scales for Corr. (-1 to 1, 0 to 1, -.1 to .1, and -0.4 to 0.4)]
7. Chun-houh, can you create some powerful
statistical/bioinformatics methods so we can
get our experiments published in Nature/
Science?
Sir, can you conduct some meaningful
biological/medical experiments so we can
get our methods published in Nature/
Science?
9. II. Matrix Visualization in Exploratory Data Analysis
Matrix Visualization:
Approaching Statistics and Statistical Approach
10. Lab 309 (???) for Information Visualization
Dr. 田銀錦
Postdoc. Fellow
張勝傑
張文宗
陳柏旭
鐘雅齡
黃建勳
林香誼
劉勝宗
曾聖澧
葉紫君
吳怡真
林倩如
歐陽智聞
. . .
Mr. 高君豪
Ph.D. student
Prof. 吳漢銘
Dept. Math.
Tamkang U.
Prof. 須上英
Dept. Stat.
Nat’l Taipei U.
Ms. 石佳鑫
Research Assistant
Dr. 何孟如
Postdoc. Fellow
11. Data analysis
A process of
• inspecting data
• cleaning data
• transforming data
• modeling data
With the goal of
• discovering useful information
• suggesting conclusions
• supporting decision making
Understanding the data: Exploratory Data Analysis (EDA) and data visualization
12. Exploratory Data Analysis
EDA, John Tukey (1977)
It is important to understand what you
CAN DO before you learn to measure
how WELL you seem to have DONE it.
(John Tukey, 1915~2000)
allow the data to speak for themselves
before standard assumptions or formal modeling
The greatest value of a picture is when it
forces us to notice what we never expected to see.
Matrix Visualization as an EDA tool for
assisting formal mathematical modeling
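13. John W. Tukey states at the very start of his book Exploratory Data Analysis (EDA):
It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.
Learning what you can do helps you get twice the result with half the effort during data analysis. The role of EDA is to obtain the message the data convey by "looking" at the data; its emphasis is on simple arithmetic and on graphs and tables that are easy to construct. Through EDA, one first recognizes and describes the patterns revealed in the graphs and tables, then uses the human mind to analyze and judge the received information as a whole, in order to explore the information hidden in the data. The emphasis is on exploratory analysis rather than rigorous model confirmation.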
15. Graphics/Visualization for high dimensional data?
p = 5, p = 10, p = 100, p = 10000
[Figure: plots of the Iris data by species name (I. Setosa, I. Versicolor, I. Virginica): petal and sepal measurements, a series plot of the data, and normal scores of petal width]
16. Recent Review Articles for MV
The History of the Cluster Heat Map
Leland WILKINSON and Michael FRIENDLY
The American Statistician,
May 2009, Vol. 63, No. 2 179
REVIEW
Seriation and Matrix Reordering Methods: An
Historical Overview by Innar Liiv
Statistical Analysis and Data Mining
3: 70–91, 2010
Figure 2. Shaded matrix display from Loua
(1873), available online at http://books.google.com/books/.
This was designed
as a summary of 40 separate maps of Paris,
showing the characteristics (e.g., national
origin, professions, age, social classes) of 20
districts, using a color scale
ranging from white (low) through yellow and
blue to red (high).
Figure 3. Sorted shaded display from
Brinton (1914). The data are
ranks of U.S. states on each of 10
educational features assessed in
1910. The matrix has been sorted by
the row-marginal ranks.
Figure 5. Sorted shaded display
from Czekanowski (1909),
reproduced in Hage and Harary
(1995).
Figure 9. Cluster heat map from
Wilkinson (1994). The data are
social
statistics (i.e., urbanization,
literacy, life expectancy for
females, GDP, health
expenditures, educational
expenditures, military
expenditures, death rate, infant
mortality, birth rate, and ratio of
birth to death rate) from a
United Nations survey of world
countries. The variables were
standardized before the
hierarchical clustering was
performed.
Matrix Visualization (MV):
reorderable matrix, heatmap,
color histogram, data image
19. Some essential elements in a GAP MV procedure
1. Data Matrix (n * p, with color coding): Continuous, Ordinal, Binary, Nominal
2. Proximity Matrix for Subjects (n * n): Continuous, Ordinal, Binary, Nominal
3. Proximity Matrix for Variables (p * p): Continuous, Ordinal, Binary, Nominal
4. Permutation (variable) and Permutation (subject)
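The four elements listed on this slide can be sketched with NumPy on a small synthetic continuous data matrix. This is only an illustration under stated assumptions: the data are random, Euclidean distance and Pearson correlation are one possible choice of proximity, and the sort-by-mean permutations are placeholders rather than GAP's own seriation algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Data matrix: n subjects by p variables (synthetic, continuous).
n, p = 6, 4
X = rng.normal(size=(n, p))

# 2. Proximity matrix for subjects (n * n): Euclidean distance between rows.
diff = X[:, None, :] - X[None, :, :]
D_subjects = np.sqrt((diff ** 2).sum(axis=2))

# 3. Proximity matrix for variables (p * p): Pearson correlation between columns.
R_variables = np.corrcoef(X, rowvar=False)

# 4. Permutations. Placeholder ordering rules, NOT GAP's seriation algorithms:
#    subjects sorted by row mean, variables sorted by column mean.
row_perm = np.argsort(X.mean(axis=1))
col_perm = np.argsort(X.mean(axis=0))

# Apply the same permutations to the data matrix and both proximity matrices,
# so the three displays stay aligned with each other.
X_sorted = X[np.ix_(row_perm, col_perm)]
D_sorted = D_subjects[np.ix_(row_perm, row_perm)]
R_sorted = R_variables[np.ix_(col_perm, col_perm)]
```

The point of the sketch is the bookkeeping: one row permutation and one column permutation drive all three matrices, which is what lets the sorted data matrix be read side by side with its two proximity matrices.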
20. Statistical Approach
Identify Global Trend: Singular Value Decomposition
SVD (SVD1, SVD2): Alter O. et al. 2000, PNAS
Rank 2 Elliptical (R2E): Chen 2002, Statistica Sinica
[Figure: panels (a) Expression (color scale -8, 1:1, +8), (b) Correlation and (c) Correlation (color scale -1, 0, +1), and (d)]
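A global trend can be picked up by sorting rows along the first singular vector. The sketch below is a minimal illustration of that idea, not the procedure of either cited paper: the column centering, the toy rank-1 data, and the function name `svd_order` are all assumptions made for the example.

```python
import numpy as np

def svd_order(X):
    """Return a row ordering sorted by the first left singular vector."""
    Xc = X - X.mean(axis=0)  # column centering (an assumption of this sketch)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.argsort(U[:, 0])

rng = np.random.default_rng(1)

# Toy data with one global trend: each row is a multiple of the same profile.
trend = np.linspace(-1.0, 1.0, 8)
loadings = np.array([1.0, 0.5, -0.5, 2.0])
X = np.outer(trend, loadings)

# Hide the trend by shuffling the rows, then recover it with the SVD ordering.
shuffle = rng.permutation(8)
X_shuffled = X[shuffle]
order = svd_order(X_shuffled)
recovered = X_shuffled[order, 0]  # first column after reordering
```

Because the sign of a singular vector is arbitrary, the recovered trend may run from low to high or from high to low; either direction displays the same structure.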
21. Statistical Approach: Identify Local Clusters
Eisen et al. (1998)
Tree seriation: flipping of intermediate nodes
[Figure: panels (a), (b), (c) showing one tree over leaves A, B, C, D, E under different leaf orders, such as A B C D E and C E D B A, after 1 flip, 3 flips, 5 flips, and many flips]
$2^{n-1} = 2^{5-1} = 16$
Different Seriations (Ordering of Terminal Nodes or Leaves) Generated from Identical Tree Structure
ideal model: external and internal references for guiding flipping mechanism
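The count of 16 can be checked by brute force: a binary tree with n leaves has n-1 intermediate nodes, and each can be flipped independently. The sketch below enumerates every leaf order for a hypothetical five-leaf tree; the tree shape is an assumption, since the slide's own tree is not recoverable from the text.

```python
from itertools import product

def leaf_orders(tree):
    """List every leaf order reachable by flipping intermediate nodes.

    A tree is either a leaf label (str) or a pair (left, right).
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right = tree
    orders = []
    for l, r in product(leaf_orders(left), leaf_orders(right)):
        orders.append(l + r)  # node kept as is
        orders.append(r + l)  # node flipped
    return orders

# Hypothetical tree over the slide's five leaves A to E.
tree = (("A", "B"), ("C", ("D", "E")))
orders = leaf_orders(tree)
distinct = {tuple(o) for o in orders}
print(len(distinct))  # 16, that is 2 ** (5 - 1)
```

All of these orders are consistent with the same dendrogram, which is why an external or internal reference is needed to choose among them.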
22. Approaching Statistics Statistical Approach
HCT + R2E = HCTR2E
[Figure: three displays, each with an (a) Expression panel (color scale -8, 1:1, +8) and Correlation panels (color scale -1, 0, +1): Hierarchical Tree Seriation; GAP Elliptical (R2E) Seriation; Tree guided by (R2E)]
23. GAP for Heritable (Genetic) Disease: Schizophrenia (National Taiwan University)
Psychiatry Research (1998) Lin, Chen et al. Psychopathological Dimensions in Schizophrenia: A Correlational Approach to Items of the SANS and SAPS
Schizophrenia Research (2002) Hwu et al. Symptom Patterns and Subgrouping of Schizophrenic Patients: Significance of Negative Symptoms Assessed on Admission
J. of the Formosan Med. Ass. (2008) Yeh et al. Factors Related to Perceived Needs of Chief Caregivers of Patients with Schizophrenia
Genes, Brain and Behavior (2009) Lin et al. Clustering by neurocognition for fine-mapping of the schizophrenia susceptibility loci on chromosome 6p
PLoS ONE (2011) Lai et al. MicroRNA expression aberration as potential peripheral blood biomarkers for schizophrenia
J. of the Formosan Med. Ass. (2012) Liu et al. Medium-term course and outcome of schizophrenia depicted by the sixth-month subtype after an acute episode
Schizophrenia Research (2013) Liu et al. Development of a brief self-report questionnaire for screening putative pre-psychotic states.
[Figure: correlation maps of symptom items (G1 to G16, N1 to N7, P1 to P7, S1 to S3) at Admission and at 6 month, with color scales for Corr. (-1 to 1), Absolute Random Error Coefficient (0 to 1), Correlation Coefficient (-1 to 1), and PANSS Score (1 to 7), and dendrograms by Average Correlation and Average Euclidean Distance. Symptom clusters: Negative Symptoms, Disorganized Thought, Hostility / Excitement, Delusion / Hallucination, Anxiety Symptoms. Patient groups: RMG (n=61), PDHG1 (n=14), MBG (n=50), PDHG2 (n=38); further labels GONEG and GWNEG.]
[Figure: perceived needs of caregivers in four need clusters (assistance to patient care, access to relevant information, societal support, burden release). Items: comforting the aggravating patient; assistant to the aggravating patient; transport of the aggravating patient to service setting; financial aid; general psychological/practical support; coping with medical team; understanding diagnosis and treatment; identifying early signs of relapse; understanding mental health laws; general social acceptance; occupational therapy; sheltered working facilities; advice on intimate relationship for patient; lifelong custodial care for patient.]
24. GAP for Comparative Metabolome: Chinese Herbal Medicine
Drs. Ning-Sun Yang, Lie-Fen Shyur, Wen-Chin Yang
Agricultural Biotechnology Research Center (ABRC) of Academia Sinica
BMC Genomics 9 (2008) Genomics and proteomics of immune modulatory effects of a butanol fraction of Echinacea purpurea in human dendritic cells. Wang et al.
Phytochemistry 70 (2009) Anti-diabetic properties of three common Bidens pilosa variants in Taiwan. Chien et al.
Journal of Nutritional Biochemistry 21 (2010) Comparative metabolomics approach coupled with cell- and gene-based assays for species classification and anti-inflammatory bioactivity validation of Echinacea plants. Hou et al.
BMC Complementary and Alternative Medicine 13 (2013) Morus alba and active compound oxyresveratrol exert anti-inflammatory activity via inhibition of leukocyte migration involving MEK/ERK signaling. Chen et al.
Plants studied: 紫錐菊 (Echinacea), 咸豐草 (Bidens pilosa), 白桑 (Morus alba, white mulberry)
25. GAP for Cancer Study: Non–Small Cell Lung Cancer (National Taiwan University)
Journal of Clinical Oncology 23 (2005) Tumor-Associated Macrophages in Cancer Progression. Chen J. J. et al.
The New England Journal of Medicine 356 (2007) A Five-Gene Signature and Clinical Outcome in Non–Small-Cell Lung Cancer. Chen H. Y. et al.
Cancer Research 66 (2006) Non–Small Cell Lung Cancer with Tumor Cell Invasiveness. Sher Y. P. et al.
Open Access Scientific Reports 1 (2006) In silico Therapeutic Drug Screening for Reversing the Lung Adenocarcinoma Overexpressed Gene Signatures. Kuo Y. L. et al. (Nat’l Yang-Ming Univ.)
GAP for Infectious Disease: SARS
BMC Genomics 6 (2005) Molecular signature of clinical severity in recovering patients with (SARS-CoV). Lee Y. S. et al. (Chang Gung Hospital)
GAP for Protein-Protein Interaction (C-Y F. Huang, Nat’l Yang-Ming Univ.)
Molecular and Cellular Proteomics 12 (2013) An analysis of protein-protein interactions in cross-talk pathways reveals CRKL as a novel prognostic marker in hepatocellular carcinoma. Liu et al.
[Figure: panel a, PPI to Pathway; panel b, Simple Match Between Pathways; panel c, Simple Match Between PPIs. Labels include protein-protein interaction pairs (F13A1,HSPB1 through CCNA2,CCNB1), pathway names (Signalling to RAS through E2F mediated regulation of DNA replication), and groups M1, M2, B1, B2, H1, H2, A1, A2, P1 to P5. Color legend for panel a: Not on the pathway, Both Positive, Mahlavu Only, Huh7 Only; panels b and c use a 0 to 1 scale.]
27. Matrix Visualization for Binary Data
Essential elements in a GAP MV procedure?
Continuous: 1. Data Matrix, 2. Subject Proximity, 3. Variable Proximity
Proximity measures for continuous data: Correlation, Covariance, polychoric Correlation, . . . ; Euclidean Distance, Manhattan Distance, Correlation, …
Binary: 1. Data Matrix, 2. Subject Proximity, 3. Variable Proximity
Proximity measures for binary data: ?
28. Commonly used similarity
coefficients for binary data
Tzeng et al. (BMEI 2009)
(IEEE Xplore Digital Library)
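Two of the coefficients named in these slides, Jaccard (a/(a+b+c)) and simple matching, can be written from the usual 2x2 counts of two binary vectors. The sketch below is illustrative only: the helper names are hypothetical, the simple matching formula (a+d)/(a+b+c+d) is the standard textbook form rather than something quoted from the cited paper, and returning 0.0 for an all-zero pair is an assumed convention.

```python
import numpy as np

def binary_counts(x, y):
    """Return (a, b, c, d) for two binary vectors.

    a: both 1, b: x only, c: y only, d: both 0.
    """
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))
    b = int(np.sum(x & ~y))
    c = int(np.sum(~x & y))
    d = int(np.sum(~x & ~y))
    return a, b, c, d

def jaccard(x, y):
    """Jaccard similarity a / (a + b + c); joint absences are ignored."""
    a, b, c, _ = binary_counts(x, y)
    total = a + b + c
    return a / total if total > 0 else 0.0  # assumed convention for empty pairs

def simple_matching(x, y):
    """Simple matching (a + d) / (a + b + c + d); joint absences count as agreement."""
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
print(binary_counts(x, y))   # (2, 1, 1, 2)
print(jaccard(x, y))         # 0.5
print(simple_matching(x, y))
```

The two coefficients differ only in how they treat d, the joint absences, which matters for sparse data where most entries are 0.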
29. Binary GAP Example
CGMIM Online: http://www.bccrc.ca/ccr/CGMIM/
CGMIM performs automated text-mining of OMIM to identify genetically-related cancers
Online Mendelian Inheritance in Man (OMIM) is a computerized database of information about genes and heritable traits in human populations
OMIM is maintained on the Internet by the National Center for Biotechnology Information at the US National Institutes of Health
CGMIM considers 21 anatomic sites based on the major cancers identified by the National Cancer Institute of Canada
CGMIM compares each OMIM entry name and alternative name with a list of gene names assigned by HUGO (HUman Genome Organization).
CGMIM produces the number of genes for which an OMIM entry mentions each pair of cancers, as well as a ratio of the observed and expected number of genes for the combination
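The pairwise counts and the observed/expected ratio described above can be sketched from a binary gene-by-site table. The slide does not say how CGMIM defines the expected number, so the independence-based formula used here (site total i times site total j, divided by the number of genes) is an assumption, and the toy table is made up for illustration.

```python
import numpy as np

# Hypothetical toy table: rows are genes, columns are cancer sites,
# and 1 means the gene's entry mentions that site.
G = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])
n_genes = G.shape[0]

# Observed: number of genes mentioning both sites of each pair.
observed = G.T @ G

# Expected under independence of the two sites (an assumption of this sketch).
site_totals = np.diag(observed)
expected = np.outer(site_totals, site_totals) / n_genes

# Ratio above 1: the pair co-occurs more often than independence suggests.
ratio = observed / expected
```

Only the off-diagonal entries of `observed` and `ratio` describe pairs of sites; the diagonal of `observed` simply holds each site's own gene count.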
30. CGMIM: All Data (1948 related genes * 21 cancer sites)
Original Order
Jaccard: a/(a+b+c)
31. CGMIM: All Data (1948 related genes * 21 cancer sites)
Single_Tree_GrandPa_Guide
Jaccard: a/(a+b+c)
32. CGMIM: 768 related genes at least at 2 Sites (768 genes * 21 cancer sites)
Original Order
Jaccard: a/(a+b+c)
33. CGMIM: 768 related genes at least at 2 Sites (768 genes * 21 cancer sites)
GAP_Elliptical_Order
Jaccard: a/(a+b+c)
34. Approaching Statistics Statistical Approach
Matrix visualization of nominal data (GAP approach)
Example: Classification of Animals Data
Shizuhiko Nishisato 2006
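45. [Figure: matrix visualization of the animal classification data. Variables appear in the permuted order 11, 6, 7, 15, 14, 12, 13, 9, 4, 3, 5, 8, 10, 2, 1. Animals appear in the order Ostrich, Turkey, Chicken, Pigeon, Hawk, Sparrow, Duck, Crow, Crane, Lizard, Frog, Tortoise, Snake, Alligator, Hippopotamus, Bear, Rhinoceros, Elephant, Lion, Tiger, Leopard, Cow, Fox, Raccoon, Cheetah, Cat, Dog, Pig, Rabbit, Horse, Goat, Camel, Giraffe, Monkey, Chimpanzee, grouped as Aves, Reptilia, Mammalia, Primates.]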
60. Approaching Statistics Statistical Approach
Matrix Visualization with cartography links
CIA Data: membership patterns in 160 international organizations (variables) for 230 countries/regions (subjects)
Coding: 0. non-member □, 1. member ■, 2. observer, 3. associate member, 4. guest, 5. dialogue partner
CIA Political Map of the World
http://www.faqs.org/docs/factbook/index.html
62. Cartography Coloring Scheme with Categorical GAP (CartoGAP) - 2
Data: Ranks of 5 Candidates (扁宋連許李) on 360 Townships
2000 presidential election data
Is it possible to visualize information structure for all 5 candidates in a single MAP?
[Figure: maps A, B, C, D, E with a rank color legend (1, 2, 3, 4, 4.5, 5, #)]
64. Cartography Coloring Scheme with GAP (CartoGAP)-2
(B). CateGAP Color Map for Each Individual Variable
[Figure: maps A, B, C, D, E for the five candidates (扁, 宋, 連, 許, 李)]
(C). Final Single CateGAP Cartography Color MAP for Complete Information Visualization
[Figure: single combined map for 扁宋連許李]
65. From physical maps to conceptual maps
Chromosome Map
Macro Biodiversity
Micro Biodiversity
Semiconductor Wafer Quality Control
67. 1.1 Symbolic Data Analysis (SDA) and 1.2 Matrix Visualization (MV)
Fig. 1. Diagram for related conventional data matrix and symbolic (interval type) data table with their corresponding proximity matrices for samples/concepts and variables.
68. Example: Japan Minryoku 2010 Data (with Junji Nakano, ISM)
Level 1: Region (10)
Level 2: Area (151)
Level 3: District (821)
Level 4: City (1899)
Continuous data: 1899 Level 4 cities (市區町村, municipalities) by 58 variables
↓
Rank data (1~1899)
↓ merged
Interval-of-ranks data: 151 Level 2 areas (地域) by 58 interval variables
Covariate: 10 Level 1 regions
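The path from continuous data to rank data to merged intervals of ranks can be sketched for a single variable. The six values and the area labels below are hypothetical, ties are ignored, and summarizing each interval by min, max, mid, and length is an illustrative choice rather than a statement of how the Minryoku analysis was coded.

```python
import numpy as np

# One variable measured on six hypothetical cities, each belonging to an area.
values = np.array([5.0, 1.0, 3.0, 9.0, 7.0, 2.0])
area = np.array([0, 0, 1, 1, 2, 2])

# Continuous data -> rank data (1 = smallest; ties are ignored in this sketch).
ranks = np.empty(len(values), dtype=int)
ranks[np.argsort(values)] = np.arange(1, len(values) + 1)

# Rank data -> one interval of ranks per area (the merged symbolic data).
intervals = {}
for a in np.unique(area):
    r = ranks[area == a]
    lo, hi = int(r.min()), int(r.max())
    intervals[int(a)] = {
        "min": lo,
        "max": hi,
        "mid": (lo + hi) / 2,
        "length": hi - lo,
    }

print(intervals[0])  # {'min': 1, 'max': 4, 'mid': 2.5, 'length': 3}
```

Repeating this for every variable gives an areas-by-variables table whose cells are intervals instead of single numbers.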
70. 12 displaying modes for MV of interval data
58 interval variables, 151 regions (concepts)
Interval summaries: Min, Mid, Max, Length
Sufficient, Sediment
Row Condition, Col Condition
[Figure: MV displays of the interval data under the different modes]
71. Statisticians, Data Analysts, Bioinformaticists
A statistician is someone who wants to get exactly the right
answer, even if it’s the answer to the wrong question.
A data analyst is someone who is willing to settle for an
approximate answer, as long as it’s the answer to the right
question.
A bioinformaticist is someone who is willing to settle for
answers of unknown accuracy, to questions that have not
been clearly articulated, as long as the results can be
graphed in color.
David B. Allison, Ph.D.
Department of Biostatistics
University of Alabama at Birmingham
72. Approaching Statistics Statistical Approach
12. MV for Color Blind people
Types of color blindness:
Monochromacy
Dichromacy: protanopia and deuteranopia; hereditary tritanopia
Anomalous trichromacy
http://www.vischeck.com/examples/
Two options: act passively, by avoiding color systems that are difficult for color blind people to understand; or work actively, by helping people with visual impairments get a better visualization of data/information.
“I believe there are more mathematics/statistics blind people than color blind people”