1. Revolutionizing the Journal through Big Data Computational Research
Amye Kenall
Journal Development Manager, Open Data
DataCite Annual Conference
Inist-CNRS
Vandoeuvre-lès-Nancy, France
26 August 2014
2. Who are we?
• Founded in 2000 (bought by Springer in 2008)
• Publish over 260 open access journals
• ~25,000 peer reviewed research articles published annually
• Genomics and computational biology make up a significant fraction, e.g. Genome Biology, BMC Genomics, BMC Bioinformatics
• Other key fields include
• Public Health / Global Health / Infectious Disease
• Cancer
• All research articles are CC-BY licensed for reuse
• Since mid 2013, all data is covered by a CC0 rights waiver
3. Data reuse @BioMedCentral
• Authors of all journals are strongly encouraged to provide underlying datasets; this is required for a select number of journals (e.g. Genome Biology, Genome Medicine, GigaScience)
• CC0 + CC-BY 4.0 by default
In the works…
• Interactive tabular data
• DOIs for all additional files
• Searchability of additional files
• Data citations clearly tagged in XML to aid harvesting, e.g. by the Data Citation Index
• Availability of Data section and Data Citation
• Encourage use of ISA-TAB (especially in GigaScience and BMC Research Notes)
14. Lessons Learned?
• With enough work, results can be replicated with a push of a button.
• But a lot of work costs a lot of money! No one would pay an APC that reflects that cost.
• The process teaches you a huge amount about the study and surfaces a lot of information not present in the paper.
• Needs to happen before publication.
15. Reproducibility of computational research
• Computational research should in principle be easier to replicate/reproduce than bench studies
• However, practical issues get in the way
• Even if source code is shared, reproducing the entire technical setup, porting software, gathering appropriate input data, and rerunning the analysis is a significant effort
• This means readers and even reviewers don’t bother
• We would like to reduce this ‘activation energy’
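One small way to lower that activation energy is to record the technical setup alongside every result. The following is a minimal sketch, not a method described in the talk; the function name `capture_environment` and the choice of fields are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata


def capture_environment(packages):
    """Describe the current software setup as a JSON-serialisable dict.

    `packages` is a list of distribution names whose installed versions
    should be recorded (hypothetical choice; list whatever the analysis uses).
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Serialise the record so it can be saved next to the analysis output.
environment_record = json.dumps(capture_environment(["numpy", "matplotlib"]), indent=2)
```

Saving `environment_record` with the results gives a reader or reviewer a starting point for rebuilding the setup; it does not by itself capture system libraries or input data.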
25. Complementary roles of publishers, academia, and cloud providers
• Publishers have a role in enforcing community standards
• Public/academic databases can provide credible long-term archiving for key data, with a focus on curation and metadata standards
• Academic grid computing infrastructure can give researchers access to large-scale computing resources
• Commercial cloud providers universalize/democratize access to large-scale computing. Even if you are not at an institution with its own facilities, you can carry out high-end computations. No bureaucracy or politics: simply pay per CPU-hour.
26. Specific challenges with respect to data
• To what extent can/should datasets be included in the VM/suite or pulled in externally?
• How can we avoid the cost of moving data around as it gets bigger and bigger?
• To what extent are cross-domain standards for referring to and pulling in underlying datasets feasible? Dataset DOIs typically point to metadata rather than to the data itself.
• Multiple versions of datasets. To what extent is it practical, when dealing with evolving datasets/databases, to make them available as reproducible snapshots?
• Culture of data sharing. How do we get authors to share their data?
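The snapshot question above can be made concrete with a checksum manifest: a record of every file in a dataset and its hash, so that a later copy can be checked against the version an analysis used. This is a minimal sketch under assumptions, not something proposed in the talk; the helper names `file_sha256`, `snapshot_manifest` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_manifest(root):
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old, new):
    """List files that were added, removed, or modified between two manifests."""
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```

Publishing the manifest with a paper lets a reader confirm they hold the same snapshot without the publisher having to move or store the data itself; it identifies a version but does not preserve it.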
27. Conclusions
• With big data and computational tools, research is becoming more “reproducible/reusable”
• The infrastructure is out there; we need to do a better job of using it
• What authors need to communicate their research is also changing, and as publishers we must respond
• Clearly, publishers have a role, with other organisations, in setting some community standards
• It took a few hundred years, but publishing is now getting exciting
28. Questions?
“One reason that the worldwide web worked was because people reused each
other’s content in ways never imagined or achieved by those who created it.
The same will be true of open data.”
– Tim Berners-Lee and Nigel Shadbolt, The Times, New Year’s Eve 2011
Amye Kenall
Journal Development Manager (Open Data), BioMed Central
@AmyeKenall (also @OpenDataBMC)
amye.kenall@biomedcentral.com
Editor's Notes
More detail of infrastructure.
Linking.
Just as OA reduces activation energy to look at a paper
Tools: IPython / IPython Notebook, Galaxy, Taverna, R/Shiny, rOpenSci, MATLAB, SCaViS, VMs, matplotlib, Plotly, deployment technologies