4. DATA & INFORMATION
DATA
Data is raw, unorganized facts that need to
be processed.
Example:- Each student's test score is one
piece of data.
INFORMATION
When data is processed, organized,
structured or presented in a given context
so as to make it useful, it is called
information.
Example:- The average score of a class
or of the entire school is information that
can be derived from the given data.
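The distinction can be sketched in a few lines of Python. The scores below are invented values used only for illustration; they do not come from any real class.

```python
from statistics import mean

# Data: raw, unorganized facts (hypothetical test scores, one per student).
scores = [72, 85, 90, 64, 79]

# Information: the same data processed into something useful in context.
class_average = mean(scores)
highest = max(scores)

print(f"Class average: {class_average}")
print(f"Highest score: {highest}")
```

Each number in the list is a piece of data; the average and the maximum are information derived from it.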
5. COMPARISON BETWEEN DATA & INFORMATION
Aspect | Data | Information
Definition (Oxford Dictionaries) | Facts and statistics collected together for reference or analysis | Facts provided or learned about something or someone; data as processed, stored, or transmitted by a computer
Refers to | Raw data | Analyzed data
Description | Qualitative or quantitative variables that can be used to make ideas or conclusions | A group of data which carries news and meaning
In the form of | Numbers, letters, or a set of characters | Ideas and inferences
Collected via | Measurements, experiments, etc. | Linking data and making inferences
Represented in | A structure, such as tabular data, a data tree, a data graph, etc. | Language, ideas, and thoughts based on the data
Interrelation | Information that is collected | Data that has been processed
6. TYPES OF DATA & INFORMATION
Databases are categorized based on the data type. A few examples are
listed below:-
S. No. | Type of data | Example(s) | Weblinks
1. | Sequence of biomolecules viz., DNA, RNA, proteins | GenBank, EMBL, DDBJ, Swiss-Prot, PIR | (i) www.ncbi.nlm.nih.gov/genbank/ (ii) https://www.ebi.ac.uk/embl/ (iii) www.ddbj.nig.ac.jp/ (iv) http://web.expasy.org/docs/swiss-prot_guideline.html (v) http://pir.georgetown.edu/
2. | Bio-molecular structures | PDB | http://www.rcsb.org/pdb/home/home.do
3. | Bibliography/scientific literature ** | PubMed, Scopus (Search engine) | (i) www.ncbi.nlm.nih.gov/pubmed (ii) www.scopus.com
4. | Patent databases | USPTO | www.uspto.gov/
5. | Metabolic pathways / molecular interactions | KEGG | http://www.genome.jp/kegg/pathway.htm
7. DATABASE???
A database is an
organized collection
of data that can be
accessed in various
ways.
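As a minimal sketch of this idea, the example below uses Python's built-in sqlite3 module. The table name, column names, and records are hypothetical, chosen only to show the same organized data being accessed in two different ways.

```python
import sqlite3

# An in-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Asha", "Biology", 88), ("Ravi", "Biology", 74), ("Asha", "Chemistry", 91)],
)

# Access path 1: look up every record for one student.
asha_rows = conn.execute(
    "SELECT subject, score FROM scores WHERE student = ? ORDER BY subject",
    ("Asha",),
).fetchall()

# Access path 2: summarize the same data by subject.
subject_avg = conn.execute(
    "SELECT subject, AVG(score) FROM scores GROUP BY subject ORDER BY subject"
).fetchall()

print(asha_rows)
print(subject_avg)
conn.close()
```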
9. Biological Databases serve a critical purpose in the collation
and organization of data related to biological systems.
They provide computational support and a user-friendly
interface that help a researcher analyse biological data
meaningfully.
11. PRIMARY DATABASES
Contain bio-molecular data in its original form.
Experimental results are submitted directly into the
database by researchers, and the data are essentially
archival in nature.
Once given a database accession number, the data in
primary databases are never changed.
Examples :- GenBank, EMBL and DDBJ for DNA/RNA
sequences, SWISS-PROT and PIR for protein sequences
and PDB for molecular structures.
12. GenBank
Database from NCBI, includes sequences from publicly
available resources.
http://www.ncbi.nlm.nih.gov/genbank/
13. EMBL
European Molecular Biology Laboratory
Nucleic acid database from EBI (European
Bioinformatics Institute)
Produced in collaboration with DDBJ and GenBank
Search engine – SRS (Sequence Retrieval System)
http://www.ebi.ac.uk/
14. DDBJ
DNA Data Bank of Japan
Started in 1986 in collaboration with GenBank
Produced and maintained at NIG (National Institute
of Genetics)
http://www.ddbj.nig.ac.jp/
15. SWISS PROT
Annotated sequence database established in 1986
Consists of sequence entries with different line formats
Similar format to EMBL
http://us.expasy.org/sprot/sprot-top.html
http://www.ebi.ac.uk/uniprot/
16. PIR
Protein Information Resource
A division of the National Biomedical Research
Foundation (NBRF) in the U.S.
One can search for entries or do sequence similarity
search at PIR site.
http://pir.georgetown.edu/
17. TrEMBL
Translated European Molecular Biology Laboratory
Computer-annotated supplement of SWISS PROT.
Contains all the translations of EMBL nucleotide
sequence entries not yet integrated in SWISS PROT.
http://www.ebi.ac.uk/trembl/
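To illustrate what "translation of a nucleotide sequence entry" means, here is a minimal sketch using the standard genetic code. It is not TrEMBL's actual pipeline, which works from annotated coding regions; the function name and the example sequence are invented for illustration, and only reading frame 1 is translated.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate coding DNA in frame 1, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical coding sequence, made up for this example.
print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```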
18. COMPOSITE DATABASES
Collection of various primary database sequences
Renders sequence searching highly efficient as it
searches multiple resources
Examples :- NRDB (Non Redundant Database), OWL,
MIPSX, SWISS PROT + TrEMBL
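The idea of combining several primary sources into one non-redundant collection can be sketched as follows. This is only an illustration: it assumes that "redundant" means an exact sequence match, whereas real composite databases such as NRDB and OWL apply their own criteria. All accessions and sequences below are invented.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, {accession: sequence}) pairs into one collection.

    Each distinct sequence is kept once. The result maps
    sequence -> [(source_name, accession), ...] so provenance is preserved.
    """
    merged = {}
    for source_name, records in sources:
        for accession, sequence in records.items():
            merged.setdefault(sequence.upper(), []).append((source_name, accession))
    return merged

# Hypothetical toy records; accessions and sequences are invented.
swiss_prot = {"SP0001": "MKTAYIAK", "SP0002": "MVLSPADK"}
pir = {"PIR001": "MKTAYIAK", "PIR002": "MGSSHHHH"}

nr = merge_non_redundant(("SWISS-PROT", swiss_prot), ("PIR", pir))
for sequence, origins in nr.items():
    print(sequence, origins)
```

A search then needs to scan each distinct sequence only once, while the stored origins still point back to every source entry.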
20. SECONDARY DATABASES
Contain data derived from the results of analysing
primary data
Manually created or automatically generated
Contain more relevant and useful information
structured to specific requirements
Examples :- PROSITE, PRINTS, BLOCKS, Pfam
22. PROSITE
Families of proteins
Can search using regular expressions
Similar to Unix commands using wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as:
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
Families exhibit these patterns
So we can search over families
http://ca.expasy.org/prosite/
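The pattern notation above maps closely onto ordinary regular expressions. The sketch below converts a PROSITE-style pattern into a Python regex. It handles only a subset of the syntax (x, [..], {..}, repeat counts such as x(4) or x(2,4), and the < and > anchors); the function name and test sequence are hypothetical.

```python
import re

# One pattern element: optional '<', a core, an optional repeat, optional '>'.
ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Z]|x)(?:\((\d+(?:,\d+)?)\))?(>?)"
)

def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern such as [AC]-x-V-x(4)-{ED} to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"Unsupported pattern element: {element!r}")
        start, core, repeat, end = match.groups()
        if core == "x":
            core = "."                      # x: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {ED}: anything except E or D
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if start else "") + piece + ("$" if end else ""))
    return "".join(parts)

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
hit = re.search(regex, "MKAGVLLLLKP")  # hypothetical sequence
print(regex, hit.group() if hit else None)
```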
24. PRINTS
Most protein families are characterized not by one,
but by several conserved motifs
Fingerprints are groups of conserved motifs excised
from sequence alignments
Taken together, they provide diagnostic family
signatures. They are the basis of the PRINTS
database, and are stored in the form of aligned
motifs.
Information about protein families is entered manually
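The fingerprint idea, several motifs that must all be found to diagnose a family, can be sketched as below. This is a deliberately crude illustration and not the scoring used by PRINTS: a window is scored by the fraction of columns whose residue was seen in that column of the aligned motif, motifs are required in order, and the threshold, motifs, and sequence are all invented.

```python
def best_window(sequence, aligned_motif, start=0):
    """Return (score, position) of the best-scoring window for one motif.

    The score is the fraction of columns whose residue appears in that
    column of the motif alignment.
    """
    width = len(aligned_motif[0])
    columns = [{row[i] for row in aligned_motif} for i in range(width)]
    best = (0.0, -1)
    for pos in range(start, len(sequence) - width + 1):
        window = sequence[pos:pos + width]
        score = sum(res in col for res, col in zip(window, columns)) / width
        if score > best[0]:
            best = (score, pos)
    return best

def match_fingerprint(sequence, fingerprint, threshold=0.8):
    """Return motif positions if every motif matches in order, else None."""
    positions, start = [], 0
    for aligned_motif in fingerprint:
        score, pos = best_window(sequence, aligned_motif, start)
        if score < threshold:
            return None
        positions.append(pos)
        start = pos + len(aligned_motif[0])
    return positions

# Hypothetical fingerprint: two motifs, each stored as aligned segments.
fingerprint = [
    ["GDSGG", "GDSGS", "GDAGG"],
    ["CWVE", "CWIE", "CYVE"],
]
print(match_fingerprint("MKTGDSGSPLLAACWIEQR", fingerprint))
```

A sequence that contains only one of the two motifs returns None, which reflects why several motifs taken together are more diagnostic than any single one.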
25. Pfam
Maintained by the Sanger Centre (Cambridge)
Protein families aligned using HMMs (Hidden Markov Models)
Given a new sequence
Find families which the sequence might fit into
Sequence Coverage
11912 families
Split into Pfam-A (high quality) and Pfam-B (low quality)
http://pfam.sanger.ac.uk/
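The step "given a new sequence, find families which the sequence might fit into" can be illustrated with a much simpler stand-in for a profile HMM: a position-specific scoring matrix built from each family's alignment. This sketch has match columns only, with no insert or delete states, so it is not how Pfam itself scores sequences. The family names, alignments, pseudocount, and uniform background are all assumptions made for the example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background."""
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log((column.count(aa) + pseudocount) / total * len(ALPHABET))
            for aa in ALPHABET
        })
    return profile

def score(sequence, profile):
    """Best ungapped log-odds score over all windows of the sequence."""
    width = len(profile)
    if len(sequence) < width:
        return float("-inf")
    return max(
        sum(col.get(res, 0.0) for res, col in zip(sequence[i:i + width], profile))
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical families with tiny invented alignments.
families = {
    "family_A": ["HKLGE", "HRLGE", "HKIGE"],
    "family_B": ["WCPDY", "WCPEY", "WSPDY"],
}
profiles = {name: build_profile(aln) for name, aln in families.items()}

query = "MMHKLGEAA"
scores = {name: score(query, prof) for name, prof in profiles.items()}
best_family = max(scores, key=scores.get)
print(best_family, scores)
```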