White Paper
Abstract
This white paper explains how the Renaissance Computing
Institute (RENCI) of the University of North Carolina uses
EMC Isilon scale-out NAS storage, Intel processor and system
technology, and iRODS-based data management to tackle Big
Data processing, Hadoop-based analytics, security and privacy
challenges in research and clinical genomics.
July 2013
LIFE SCIENCES AT RENCI
Big Data IT to manage, decipher, and inform
Copyright © 2013 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.
The information in this publication is provided “as is.” EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and specifically
disclaims implied warranties of merchantability or fitness for a
particular purpose.
Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.
For the most up-to-date listing of EMC product names, see
EMC Corporation Trademarks on EMC.com.
EMC², EMC, the EMC logo, Isilon, and OneFS are registered
trademarks or trademarks of EMC Corporation in the United
States and other countries.
All other trademarks used herein are the property of their
respective owners.
Part Number H11692.1
2Life sciences at RENCI: Big Data IT to manage, decipher, and inform
Table of Contents
Life sciences at RENCI: Big Data IT to manage, decipher, and inform
Tackling clinical and research genomics
Data analysis—Hadoop assists in variant calling
Data management—iRODS proving its value
Data security—overcoming UNIX limitations with iRODS
Protected insight into Big Data: the Secure Medical Workspace
Big Data’s persistent challenges
What IGS will deliver
EMC Isilon OneFS tames Big Data
Intel’s HPC leadership empowers life sciences
For more information
Life sciences at RENCI:
Big Data IT to manage, decipher, and inform
Turning Big Data into insight in the lab and therapy in the clinic is perhaps the
preeminent challenge of modern life sciences. Not only must massive datasets be
managed and analyzed, but the insights gleaned must also be delivered to healthcare
professionals and patients in a way they can understand and use. Kirk Wilhelmsen,
M.D./Ph.D., Charles Schmitt, Ph.D., and their colleagues at the Renaissance
Computing Institute (RENCI) of the University of North Carolina (UNC) are at the
forefront of efforts to create the necessary IT infrastructure and tools to advance
this ambitious goal.
RENCI’s Health & Bioscience initiatives span basic research, advanced genomics,
translational medicine, and clinical decision support. Some are tightly focused, such
as two “knowledge-based medicine” programs that are developing decision support
tools to enhance the way physicians treat epilepsy and prostate cancer. Another,
Secure Medical Workspace (SMW), is creating a platform for providing controlled
access to confidential medical records stored in the Carolina Data Warehouse for
Health (CDW-H). A fourth initiative, Informatics for Genetic Sequencing (IGS), is
the epitome of the Big Data challenge; it is developing the end-to-end IT
infrastructure necessary to support advanced DNA sequencing, genomics research,
and the delivery of genomics-informed healthcare.
Taken as a whole, RENCI’s bioscience efforts have a distinct “translational” bent,
which fits naturally into the Institute’s broad mission to develop
technologies that boost North Carolina competitiveness. “Initially we looked at
traditional bioinformatics and systems biology, but genomics was really starting
to make the transition to medicine and there was a big gap in translational
capabilities. It was a natural place to focus,” said Schmitt, RENCI director
of data sciences and informatics.
The IGS project is an instructive use case in coping with Big Data. On the order of
30 human genomes are sequenced weekly for RENCI’s projects. Just one genome,
depending upon the type of sequencing and the coverage, can generate 100 GB
of data to manage. Capturing, analyzing, storing, and presenting the accumulating
data requires a hybrid HPC (high-performance computing) infrastructure that blends
traditional cluster computing with emerging tools such as iRODS (Integrated Rule-
Oriented Data System) and Hadoop. Unsurprisingly, the HPC infrastructure is always
a work in progress, noted Schmitt.
RENCI/UNC computing resources[i] are already significant. They include large internal
clusters, links to the Open Science Grid, more than 2 PB of spinning disk storage, and
roughly 3 PB of tape storage. The IGS pipeline/analysis uses a substantial piece of the
overall computing power: Dell blade-based Linux clusters at RENCI with more
than 1,400 cores, and UNC’s Dell and HP blade-based Linux clusters with nearly 1,000
nodes. Primary storage is handled by a 909 TB EMC® Isilon® system at UNC, a 1.7 PB
Lustre scratch space at RENCI, and PB-scale tape storage systems at UNC and RENCI.
Intel is another important contributor to RENCI’s computing power, supplying
processor, development, and systems technology used throughout the RENCI/UNC
HPC infrastructure. “Intel is doing much more than just processors in HPC. We bring
domain experts as well as hardware, platforms, software, and HPC leadership to life
sciences and healthcare,” says Ketan Paranjape, Global Director, Healthcare and
Life Sciences (see “Intel’s HPC leadership empowers life sciences”). The RENCI
Big Data infrastructure is shown in Figure 1 below.
Figure 1. RENCI Genomics Big Data infrastructure
Significant investments have also been made in the wet lab. UNC acquired 12 next-
generation high-throughput sequencers (NGS) from Illumina, Pacific Biosciences, and
Life Technologies both to support the clinical-care mission of the UNC healthcare
system and to further basic genomic and biology research.
Tackling clinical and research genomics
“There are two primary projects we are working on now,” said Schmitt. One is NCGENES
(North Carolina Clinical Genomic Evaluation for NextGen Exome Sequencing). Its official
description is, “a multidisciplinary effort to create a bioinformatics infrastructure and a
systematic process for using whole-exome sequencing (WES) as a tool in diagnosing
disease, revealing genetic markers for disease, and helping people understand the
relationship between their genotype and diseases they have or are at risk of developing.”
Much of that infrastructure has been built and is in production use for NCGENES.
In whole-exome sequencing, only those regions of the genome coding for expressed
proteins—roughly 1.5 percent of the human genome—are sequenced. Patients of the
UNC health system are the subjects. The direct goal here is to identify known mutations
in those sequences that are associated with disease risk or health and provide that
information to clinicians and patients. More broadly, it’s also intended to explore ethical
and psychological issues of explaining risk—sometimes when there is no treatment—
to patients. NCGENES is a good example of efforts to deliver translational medicine.
The second project is for the National Institute on Drug Abuse (NIDA) and involves
whole-genome sequencing. Its purpose is to investigate the genetics of drug addiction.
It takes about 10–15 days to sequence a full genome and costs $5–$10K per genome—
roughly 10 times the cost to sequence a whole exome. In terms of sequencing coverage[ii],
the NCGENES program is considered moderate at 50X whereas the NIDA project at
~10X is considered low coverage, but, of course, it is sequencing the entire 3-billion
base-pair human genome. The NIDA work seeks to discover low-frequency, novel
variants and relies heavily on statistical imputation.
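The coverage figures above can be turned into rough sequencing volumes. The sketch below assumes the usual definition of coverage (average number of reads covering each base of the target) and uses only the numbers quoted in the text; it ignores off-target reads and other real-world losses, so the results are order-of-magnitude estimates only.

```python
# Rough sequencing volume implied by the coverage figures in the text.
# Assumption: bases read = target size x average coverage (no off-target reads).

GENOME_BP = 3_000_000_000   # "3-billion base-pair human genome"
EXOME_FRACTION = 0.015      # "roughly 1.5 percent of the human genome"


def sequenced_bases(target_bp: float, coverage: float) -> float:
    """Bases that must be read to cover target_bp at the given average depth."""
    return target_bp * coverage


exome_bases = sequenced_bases(GENOME_BP * EXOME_FRACTION, 50)  # NCGENES, 50X
genome_bases = sequenced_bases(GENOME_BP, 10)                  # NIDA, ~10X

print(f"exome at 50X:  {exome_bases / 1e9:.2f} billion bases")
print(f"genome at 10X: {genome_bases / 1e9:.2f} billion bases")
```

Under these assumptions the low-coverage whole genome still requires reading roughly an order of magnitude more bases than the moderate-coverage exome, which is consistent with the larger per-sample data sizes reported for NIDA.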
“Over a period of a year we’ve sequenced ~1,000 whole genomes for NIDA and are
now processing another round of 2,500 whole genomes to be completed by end of
2013,” noted Schmitt. “NCGENES has about 250 people in process right now, and
we’ll probably do another 750 over the grant period.” The size of the data per sample
(person) varies considerably between the projects. A typical whole genome sequenced
NIDA sample averages 100 GB, versus 15 GB[1] per sample for NCGENES’ exome
sequenced samples. Currently, RENCI has on the order of 400 TB of genomics data
stored on the EMC Isilon system and projects growing to 600 TB by the end of 2013.
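As a back-of-the-envelope check on these volumes, the per-sample sizes can be multiplied by the sample counts quoted by Schmitt. The sketch below treats 100 GB and 15 GB as fixed averages, which is an assumption made for illustration; the function name is hypothetical.

```python
# Approximate project data volumes from the sample counts and sizes in the text.
# Assumption: per-sample sizes are constant averages; decimal units (1 TB = 1000 GB).

GB_PER_TB = 1000


def project_volume_tb(samples: int, gb_per_sample: float) -> float:
    """Return the approximate data volume of a project in terabytes."""
    return samples * gb_per_sample / GB_PER_TB


# NIDA: ~1,000 whole genomes sequenced plus 2,500 in progress, ~100 GB each.
nida_tb = project_volume_tb(1_000 + 2_500, 100)

# NCGENES: ~250 exomes in process plus ~750 planned, ~15 GB each.
ncgenes_tb = project_volume_tb(250 + 750, 15)

print(f"NIDA:    {nida_tb:.0f} TB")
print(f"NCGENES: {ncgenes_tb:.0f} TB")
```

These totals are gross figures. The amount actually held on disk at any time can differ, since intermediate files are moved to tape and the stored footprint per sample has been reduced over time.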
Here’s a snapshot of the three-stage analysis pipeline RENCI has developed:
• DNA sequencing. DNA extracted from tissue samples is run through the high-
throughput NGS instruments. These modern sequencers generate hundreds of
millions of short DNA sequences for each patient, which must then be “assembled”
into proper order to determine the genome. Researchers use parallelized computational
workflows to assemble the genome and perform quality control on the reassembly—
fixing errors in the reassembly.
• Variant calling. DNA variations (SNPs, haplotypes, indels, etc.) for an individual
are detected, often using large patient populations to help resolve ambiguities in
the individual’s sequence data. Data is organized into a hybrid solution that uses
a relational database to store canonical variations, high-performance file systems
to hold data, and a Hadoop-based approach for specialized data-intensive analysis.
Links to public and private databases help researchers identify the impact of
variations including, for example, whether variants have known associations
with clinically relevant conditions.
• Clinical binning. The final step in the NCGENES project is the report to the
physicians. Key to this stage is a process termed “clinical binning,” which is
performed using custom UNC-developed software. It assigns a clinical relevancy
to each variant, shown in Figure 2, allowing clinicians and patients to determine
which variants they care about. Once variants are “binned,” a website delivers the
information to physicians and patients (via the Secure Medical Workspace). The
overall process, from blood-draw to analysis to reporting, including several stages
that provide independent validation of the identified variants, is managed through
a custom workflow solution developed by RENCI.
[1] These are FASTQ, BAM, and VCF files with ancillary log and metric files.
Figure 2. Clinical Binning: assigning a clinical relevancy to each variant
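Bin 1
• Criteria: Loci with clinical utility
• Description: Genes which, when mutated, result in high risk of a clinically actionable condition
• Examples: BRCA1/2; MLH1, MSH2; FBN1; NF1; loci with proven PGx clinical utility
• Estimated number of genes/loci: Dozen(s)

Bin 2A
• Criteria: Loci with clinical validity
• Description: Low risk incidental information
• Examples: PGx variants and common risk SNPs with no proven clinical utility
• Estimated number of genes/loci: ~20 (eventually 100s-1000s)

Bin 2B
• Criteria: Loci with clinical validity
• Description: Medium risk incidental information
• Examples: APOE, genes associated with Mendelian disease for which clinical recommendations exist
• Estimated number of genes/loci: 100s

Bin 2C
• Criteria: Loci with clinical validity
• Description: High risk incidental information
• Examples: Huntington’s disease; Prion diseases; SCA, PS1, PS2, APP
• Estimated number of genes/loci: Dozen(s)

Bin 3
• Criteria: Loci with unknown clinical implications
• Description: All other loci
• Estimated number of genes/loci: >20,000

Bin R
• Criteria: Loci with important reproductive implications
• Description: Carrier status for severe AR disease
• Examples: Tay Sachs, Familial Dysautonomia, CF, etc.
• Estimated number of genes/loci: Hundreds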
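The idea behind clinical binning can be illustrated with a toy lookup. The sketch below is not the UNC-developed software, which the text does not describe; the gene-to-bin table uses a few example genes from Figure 2, and the function names and the "everything else goes to Bin 3" default are assumptions made for illustration.

```python
# Toy gene-to-bin lookup based on example genes listed in Figure 2.
# Hypothetical: the real binning rules are richer than a per-gene dictionary.

BIN_BY_GENE = {
    "BRCA1": "Bin 1", "BRCA2": "Bin 1", "MLH1": "Bin 1",
    "MSH2": "Bin 1", "FBN1": "Bin 1", "NF1": "Bin 1",
    "APP": "Bin 2C",
}


def assign_bin(gene: str) -> str:
    """Return the clinical bin for a gene; unlisted loci fall into Bin 3."""
    return BIN_BY_GENE.get(gene.upper(), "Bin 3")


def bin_variants(variants):
    """Group (gene, variant_id) pairs by clinical bin."""
    binned = {}
    for gene, variant_id in variants:
        binned.setdefault(assign_bin(gene), []).append(variant_id)
    return binned
```

A reporting layer could then decide, per bin, which variants are shown to clinicians and patients.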
“Most of what we do is traditional HPC,” said Schmitt. “There’s the analytical pipeline
most people associate with genomics sequencing, which is stitching (assembly) the
genome back together up to the point of starting to call variations. This can be handled
by the type of HPC clusters we have in place. In terms of disk space, more is always
better and our usage will grow several hundred terabytes this year. At the same time,
our usage per sample has dropped as we focus what we store more precisely on the
needs of downstream analysis and leverage tape for archiving.”
Data analysis—Hadoop assists in variant calling
Calling variations is relatively straightforward for NCGENES because of the manageable
size of exome datasets, the ready availability of software analysis tools, and well-
characterized reference genomes. “For NCGENES, we call variants in a very traditional
way using the GATK[iii] software package from Broad Institute,” said Schmitt. Variant
calling is done in batches of 50 or 100, something easily handled by HPC clusters.
“It takes a week or less, depending upon the batch size.”
For the NIDA project, identifying meaningful variation is far more challenging. The
much larger datasets, the lower coverage, the search for novel variants across the
entire genome, and the need to characterize variations against a pool of genomes—
not just against a single reference—all combine to make variant calling for NIDA a
memory-intensive, computationally demanding task.
“NIDA is actually investigating new approaches to calling variants and finding haplotypes.
It’s doing something called imputing genotypes[iv], and the calculations can take up to
a month,” said Schmitt. “Of course, you don’t have to run it very often. You can run a
batch once every six months, basically keeping up with the flow of data. We are
looking at how to speed that up because clearly that’s not a very scalable solution.”
Schmitt said RENCI stays abreast of most computationally difficult genomics problems.
For example, RENCI is interested in de-novo sequencing once there are approaches
that can compete with or augment reference-based alignments. He added, “Developing
techniques to detect rare variations, as well as combinations of variations, are of high
interest to our group and we are doing research in this area. We currently aren’t
doing trio sequencing.”
One increasingly popular approach to accelerating data-intensive computing is Hadoop.
Essentially, Hadoop pairs a distributed file system with a processing framework
(MapReduce). Large datasets are broken into chunks that are stored across the nodes of
a cluster; a Map step then processes each chunk in parallel, and a Reduce step gathers
and combines the results. Hadoop’s distinguishing feature is that it schedules
computation, wherever possible, on the same nodes that already hold the relevant
chunks of data. This strategy of co-locating data and processing power (proximity
computing) can significantly accelerate performance.
It also turns out that Hadoop architecture is a good choice for many life sciences
applications. This is largely because so much of life sciences data is semi- or
unstructured file-based data and ideally suited for “embarrassingly parallel”
computation. Moreover, the use of commodity hardware (e.g., Linux cluster)
keeps cost down, and little or no hardware modification is required.
“We’ve used a few Hadoop-specific applications. The main one is to process VCF files
(variant call format) when determining allele frequency on NIDA sequences. We
developed a set of tools called Hadoop VCF that lets us put a number of VCF files into
Hadoop and perform MapReduce jobs across VCF files,” said Schmitt. There are several
challenges in processing NIDA sequences, not the least of which is the size of the
databases against which NIDA sequences are compared—e.g., the 1000 Genomes,
plus other sources. “In one case we had 6,000 or so genomes,” said Schmitt. “Hadoop
was a convenient, existing technology to do those kinds of parallel calculations.”
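The allele-frequency calculation described above fits the MapReduce pattern naturally. The following single-process sketch is not RENCI’s Hadoop VCF tool and does not use Hadoop; it only mimics the map and reduce steps in plain Python. It assumes biallelic sites, tab-separated VCF lines with sample columns starting at the tenth field, and the genotype (GT) as the first sub-field of each sample column. All function names are hypothetical.

```python
from collections import defaultdict


def map_vcf_line(line):
    """Map step: emit ((chrom, pos, ref, alt), (alt_alleles, called_alleles))."""
    if line.startswith("#"):
        return []  # header lines carry no genotypes
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    alt_count = total = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[0]
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing call
            total += 1
            alt_count += allele == "1"
    return [((chrom, int(pos), ref, alt), (alt_count, total))]


def reduce_counts(pairs):
    """Reduce step: sum counts per variant and return alt-allele frequencies."""
    sums = defaultdict(lambda: [0, 0])
    for key, (alt_count, total) in pairs:
        sums[key][0] += alt_count
        sums[key][1] += total
    return {key: alt / total for key, (alt, total) in sums.items() if total}


def allele_frequencies(vcf_files):
    """Apply the map step to every line of every file, then reduce."""
    pairs = [kv for lines in vcf_files for line in lines for kv in map_vcf_line(line)]
    return reduce_counts(pairs)
```

In a real Hadoop job the map calls would run in parallel on the nodes holding each file chunk, and the framework would group the emitted keys before the reduce step; here both happen in one process for clarity.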
Native support of HDFS (Hadoop Distributed File System) is provided by the EMC
Isilon system, where HDFS is implemented as a lightweight protocol layer between the
Isilon OneFS® file system and HDFS clients. “This makes it simple for organizations to
utilize protocols like NFS, REST, FTP, HTTP, etc., to ingest data for their Hadoop
workflows,” says Sanjay Joshi, CTO–Life Sciences, EMC Isilon Storage Division. “If the
data is already stored on the EMC Isilon scale-out NAS, then an organization simply
points its Hadoop compute farm at OneFS without having to perform a time- and
resource-intensive load operation of the Hadoop workflow” (see “EMC Isilon OneFS
tames Big Data”). RENCI hopes to adopt this type of innovation from EMC Isilon in
order to leverage its investment in Hadoop and high-performance storage systems.
Nevertheless, Hadoop is only part of the answer. “We’ve looked at a number of uses
for Hadoop. We tried some BAM processing, developing our own file formats for some
of the sequencing data, but haven’t found it to be more valuable than using traditional
tools,” said Schmitt. “We’ve been able to get by so far in batch mode processing, doing
embarrassingly parallel calculations, but we don’t see that scaling as we move into
tens of thousands of sequences. Past that, we’re pretty sure we are going to have
to switch to a more data-intensive paradigm.”
Schmitt cites two concerns with Hadoop: 1) RENCI is increasingly emphasizing
algorithms that are either graph-based or Markov Model-oriented and, according
to Schmitt, “Hadoop isn’t necessarily the best way to scale those algorithms.”
2) The other big issue is that Hadoop does not work well in a shared HPC cluster
environment. “This keeps us from using Hadoop more. We just can’t take over a
shared cluster periodically and allocate it for Hadoop,” said Schmitt.
Data management—iRODS proving its value
Data management for RENCI’s health and biosciences initiatives is fairly complicated.
“Briefly, what happens is the sequencing facility puts out data (on disk), and all that
gets tracked through a laboratory information management system (LIMS). We pick
it up at that point,” said Schmitt. “We run all of our analysis pipelines on a single
HPC cluster and the large EMC Isilon system. We keep all the intermediate and
analyzed data products on the Isilon system, and our pipelines register the data
products associated with each pipeline stage into the LIMS.” Intermediate processed
sequencing data—FASTQ files—are moved to tape as part of the sequencing process.
UNC runs a LIMS called BSPLims that handles the processing of blood samples. RENCI
has developed a related LIMS called libLims that handles its sequencing workflows for
NIDA—libLims interacts with BSPLims, but is customized for the more specialized
NIDA workflow.
All of the canonical variant data are stored in a large database, VarDB[v], that also holds
reference genomic data: “Most importantly, in this regard, is that it holds several versions
of the NCBI reference genome and manages translating genomic locations between the
different versions,” said Schmitt. VarDB also holds variants from public data sources,
such as dbSNP and The 1000 Genomes Project, variants from UNC sequencing efforts,
as well as variants from HGMD, the database of human gene mutation data. Finally, it
holds annotations on data from public databases, such as OMIM and RefSeq, as well as
annotations derived from tools like PolyPhen. Altogether, VarDB currently stores its
data on the EMC Isilon system, and this volume will steadily grow.
To help cope with its Big Data management challenge—storage, access, archiving, data
security, etc.—RENCI is making growing use of iRODS. In fact, RENCI is spearheading
an E-iRODS development effort in which Schmitt is the leader.
Broadly speaking, iRODS (the integrated Rule-Oriented Data System) is a data grid
technology that essentially puts a unified namespace on data files, regardless of where
those files are physically located. You may have files in four or five different storage
systems, but to the user it appears as one directory tree. iRODS also allows setting
enforcement rules on any access to the data or submission of data. For example, if
someone entered data into the system, that might trigger a rule to replicate the data
to another system and compress it at the same time. Access protection rules based on
metadata about a file can be set.
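The two ideas in this paragraph, a unified namespace and rules triggered by data events, can be sketched with a small in-memory model. This is a toy illustration only: it does not use the iRODS API or its rule language, and the class, function, and storage names are all hypothetical.

```python
import zlib


class ToyDataGrid:
    """A toy unified namespace: each logical path maps to replicas on named stores."""

    def __init__(self):
        self.replicas = {}      # logical path -> {storage name: stored bytes}
        self.ingest_rules = []  # callables run after every put()

    def put(self, logical_path, data, storage):
        """Store data on one storage system, then fire the ingest rules."""
        self.replicas.setdefault(logical_path, {})[storage] = data
        for rule in self.ingest_rules:
            rule(self, logical_path, data)

    def listing(self):
        """Users see one directory tree, not the individual storage systems."""
        return sorted(self.replicas)


def replicate_compressed(grid, logical_path, data):
    """Example rule: keep a compressed replica on a second (hypothetical) store."""
    grid.replicas[logical_path]["archive"] = zlib.compress(data)
```

Registering `replicate_compressed` in `ingest_rules` means every later `put()` leaves two replicas behind, while `listing()` still shows a single logical path per file.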
RENCI is already using iRODS with the analytical pipelines. “When our analytical
pipelines are processing the data, they also register that data into iRODS,” Schmitt
says. At the end of the pipeline, the data exists on disks and is registered into iRODS.
Anyone wanting to use the data must come in through iRODS to get the data; this
allows RENCI to set policies on access and data use.
“We originally did this as a way to let the clinical system access the raw research
data,” Schmitt continued. “Within the clinical system there is the ability for a clinician
looking at a patient to click on a button and download the BAM file, and we wanted a
way to separate that clinical system from where we store the BAM file.”
Here’s how it works. The clinical system takes the ID of the patient and sends it to
iRODS, which does a look-up and gives back the BAM file. At the same time, iRODS
compresses the file and pulls up just the section of data in that BAM file that the clinician
actually wants. The Integrative Genomics Viewer (IGV) from the Broad Institute is
then launched to allow the clinician to view the sequence reads associated with the
variation of interest in context with the reference genome and other relevant data
(e.g., locations of exons and regulatory regions). In that way, data can be moved
elsewhere, maybe even to tape, and iRODS manages hiding all of that from the
clinical side.
The IGS team is now investigating the use of iRODS to automate replication of the
raw data produced at UNC to storage at RENCI. “It’s not really a backup, just a
redundant store. We’re looking into the process of selectively copying some of the
data to put it onto tape,” noted Schmitt.
To some extent, RENCI/UNC’s archival strategy is still evolving. “FASTQs are all
archived to tape and that’s put on a copy at UNC and a copy off-site. That’s our
primary safety net. We are also starting to copy the BAM files to tape at RENCI.
That’s a little less secure than the FASTQs, but sufficient in that we can regenerate
those in the case of disaster. Those are the two main ones that we archive. The
phenotypic and demographic data are all stored in databases and those are
independently backed up and archived,” Schmitt said.
Data security—overcoming UNIX limitations with iRODS
Because RENCI works on multiple, shared systems in different data centers,
implementing security is complex. Basic security is provided through IT groups
at UNC and RENCI that provide aspects such as anti-virus, network filtering, single
sign-on, and system-level logging. “Standard user ID/password is used on the
research side of our work for access to resources such as file systems or databases.
The number of people with such access is very limited and governed through UNC’s
IRB,” explained Schmitt.
“On the clinical side there are more people accessing the data, so access is through
websites that users have to authenticate against and are secured in standard ways
(e.g., SSL, database server/Web servers running on VMs behind locked doors). iRODS
is used to automate standard procedures, including archiving, replication, and access
to raw data from users on the clinical side—this allows us to use iRODS logging and
sign-on for security. We are moving to project-level access control as we bring iRODS
further into our overall solution,” said Schmitt.
One problem is that UNIX directories can only go so far in managing the project
orientation of data. “That becomes a real headache,” said Schmitt. “With iRODS,
we can assign protection based on metadata for that file. That’s important because
we have many different graduate students, medical students, and rotating
bioinformaticians coming in; otherwise we would have to devote whole directory
trees to them.”
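The contrast with directory-based permissions can be made concrete with a small sketch. It assumes each file carries a "project" metadata attribute and each user a set of project memberships; these names are illustrative and do not reflect iRODS syntax or RENCI’s actual policy.

```python
def can_read(user_projects, file_metadata):
    """Allow access only if the file's project is one the user belongs to."""
    return file_metadata.get("project") in user_projects


def readable_paths(user_projects, catalog):
    """Return the paths a user may read, wherever the files sit in the tree."""
    return sorted(path for path, meta in catalog.items()
                  if can_read(user_projects, meta))
```

Because the decision depends on metadata rather than location, a rotating student can be given access to one project’s files without a dedicated directory tree being created for that student.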
Indeed, the use of iRODS is a growing trend in life sciences, according to Joshi.
“Isilon customers are turning to iRODS for its rule-based data management capabilities
to complement the OneFS system administration features. By leveraging both OneFS
capabilities and iRODS, storage administrators not only can implement data policies
for disaster recovery, archive, and replication, but can also empower research teams
with capabilities to manage data throughout the study (project) lifecycle. With iRODS,
investigators can take advantage of tools that allow them to automate annotation of
data sets with project information, move data based on the project lifecycle, and find
the data based on study attributes when they need it.”
Protected insight into Big Data: the Secure Medical Workspace
At the end of the day, the goal is to be able to deliver important genomics information
to both clinicians and researchers. To accomplish this part of its broad genomics
infrastructure mission, RENCI, in collaboration with UNC TraCS, the School of Information
and Library Science (SILS), and UNC Hospitals, has developed the Secure Medical
Workspace (SMW) system to enable the CDW-H to provide researchers and healthcare
professionals secure access to patient records.
The SMW, shown in Figure 3, combines a secure centralized infrastructure with
virtualization and data leakage protection technologies to allow researchers to analyze
their research data, while ensuring sensitive patient information remains within the
SMW environment. “It’s a front-end to get to the data,” said Schmitt. “So for those
people who need direct access to sensitive data containing PHI, we’re using this secure
workspace as a way to give them access to data files.” Authorized researchers connect
to SMW from their local computing devices over a secure network connection to a
dedicated virtual workspace.
Figure 3. The Secure Medical Workspace
“It’s a virtualization solution where we can give a researcher a virtual server, and
once on that server the researcher can get access to data, either directly attached to
that server or remote somewhere else. But we include data leakage protection on the
server, which gives us protection and screens against any data being pulled outside of
the system,” explained Schmitt. “Yet, researchers can freely bring their own data and
tools onto the server.” There are commercial solutions that allow you to set policies
for who can take data out and what happens when someone tries to take data out.
“The way that we have favored doing this,” he continued, “is if someone tries to copy
data out, we allow it but throw up a warning screen saying you have to abide by your
data usage agreement. That agreement and the data removed from the server are
then stored for compliance audits.”
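The "allow, warn, and record" policy Schmitt describes can be sketched in a few lines. This is an illustration of the policy only, not of any commercial data leakage protection product; the function name, record fields, and warning text are hypothetical.

```python
from datetime import datetime, timezone

WARNING = "You must abide by your data usage agreement."


def export_from_workspace(user, file_name, agreement_id, audit_log):
    """Permit the export, but warn the user and record it for later audits."""
    audit_log.append({
        "user": user,
        "file": file_name,
        "agreement": agreement_id,
        "time": datetime.now(timezone.utc).isoformat(),
    })
    return WARNING  # shown to the user before the copy proceeds
```

The design choice is to keep researchers productive while creating an audit trail, rather than blocking exports outright.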
Big Data’s persistent challenges
Amid the substantial progress in developing an infrastructure to handle life sciences’
Big Data challenge, many thorny challenges persist, noted Schmitt. Consider that a
database of sequenced and variant data associated with 10,000 patients would have
roughly a petabyte of data. Working with such a massive data repository complicates
basically everything—storage, replication, ongoing analysis, traditional ETL database
functions, etc.
Collaboration, for example, remains problematic, with data transmission the biggest
issue. RENCI’s current collaboration with UC San Diego and the Scripps Institute,
explained Schmitt, “has been done by sending BAM files in batches. The first batch
took a month to send. Then, talking back and forth by phone about issues regarding
the data takes more time. It’s not a great process,” he says.
Schmitt continued: “We are looking at some of the advanced networking coming
out of NSF to get the bandwidth we want to move data. Of course that’s all kind
of experimental right now. We are exploring using some of the OpenStack and
Open Science Cloud offerings as a way to help collaborate.”
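Both the petabyte estimate and the transfer problem can be sanity-checked with simple arithmetic. In the sketch below, the 100 GB per patient matches the per-sample NIDA figure quoted earlier; the batch size and link speed passed to `transfer_days` are purely hypothetical inputs, since the text gives neither.

```python
def repository_size_pb(patients, gb_per_patient=100):
    """Approximate repository size in petabytes (1 PB = 1,000,000 GB)."""
    return patients * gb_per_patient / 1_000_000


def transfer_days(size_tb, megabits_per_second):
    """Days needed to move size_tb terabytes at a sustained link speed."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86_400


print(f"10,000 patients: {repository_size_pb(10_000):.1f} PB")
print(f"10 TB at 100 Mbit/s sustained: {transfer_days(10, 100):.1f} days")
```

Even under these idealized assumptions, moving tens of terabytes over an ordinary shared link takes days to weeks, which is why higher-bandwidth research networks matter for collaboration.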
Large-scale computation on Big Data, particularly some of the so-called n-square[vi]
problems, remains challenging. “We continue to explore Hadoop as one answer down
the road, but we are looking at other approaches, including data flow solutions and
systems for computing over large-scale graphs,” said Schmitt.
Archiving is another bottleneck. “Our goal for UNC and the UNC healthcare system
is to be able to manage storing a genome for every individual patient and using that
for research, but to get to that level cost-wise is going to be very difficult in terms of
data storage,” Schmitt continued. “We need a better idea of what data we can throw
away and when we can throw away data, and how to represent data at various levels
of hierarchy.”
Nevertheless, RENCI’s progress on all fronts has been substantial. UNC healthcare
professionals are able to look at patient genomic data for clinical care through the
NCGENES project—the last stage in RENCI’s analysis and data delivery pipeline. The
NIDA project is longer-term, and still in data collection and analysis mode, but many
of the kinks in collecting and processing the larger NIDA sample datasets have been
worked out. RENCI is poised to play a growing and important role in developing
the HPC infrastructure and necessary analysis pipelines to support life sciences
and healthcare.
What IGS will deliver
In addition to handling the basic processing of next-generation DNA sequencer (NGS)
output, the RENCI-built Informatics for Genetic Sequencing (IGS) infrastructure continues
to be enhanced in order to support:
• Improved population-oriented queries: Given a variant, the system will find
the frequency of that variant and related haplotypes in a large population to help
determine whether the variant is potentially deleterious.
• Automated annotation: The system will extract data from multiple different
source databases, extract annotation, and incorporate it back into the variant
database for use by researchers, thus providing an increasingly diverse range
of annotation sources.
• Reference rationalization: Data in the system could be used to redefine
the “reference” genome, the template used to compare genomes from
different individuals.
• Improved variant analysis: Enhanced data processing will help researchers
identify additional information about genetic variation between individuals
beyond that which is possible with current technologies.
• Visualization: Data visualization will help enable new insights and inspire new
research questions.
• Metadata grid: The system will enable automated generation and propagation of
metadata to enhance analysis and data management and to guide computational
and data workflows.
EMC Isilon OneFS tames Big Data
EMC Isilon OneFS 7.0 is designed to address the convergence of Big Data and enterprise
IT, and extend the benefits of Isilon scale-out NAS architecture to a wider range of
enterprise storage needs.
OneFS combines the three layers of traditional storage architectures—the file system,
volume manager, and RAID—into one unified software layer, creating a single intelligent
distributed file system that runs on one storage cluster. The advantages of OneFS for
NGS are many:
• Scalable: Scale out as needs grow. Linear scale with increasing capacity: from
18 TB to 20 PB in a single file system and a single global namespace.
• Predictable: Dynamic content balancing is performed as nodes are added,
upgraded, or as capacity changes. No added management time is required,
because this process is simple.
• Available: OneFS is “self-healing.” It protects your data from power loss, node or
disk failures, and loss of quorum and storage rebuild by distributing data, metadata,
and parity across all nodes.
• Efficient: Compared to the average 50 percent efficiency of traditional RAID
systems, OneFS provides over 80 percent efficiency, independent of CPU compute
or cache. This efficiency is achieved by tiering storage into three node types
(shown in the accompanying figure) and by the pools within these node types.
• Enterprise-ready: Administration of the storage clusters is via an intuitive
Web-based UI. Connectivity to your process is through standard protocols: CIFS,
SMB, NFS, FTP/HTTP, Object, and HDFS. Standardized authentication and access
control is available at scale: AD, LDAP, and NIS.
Isilon is the only scale-out NAS offering that provides enterprise capabilities
at scale to manage rapidly growing unstructured data assets more effectively.
Isilon OneFS provides data protection through snapshots across the whole cluster,
and is the only scale-out NAS solution compliant with SEC 17a-4 standards. Isilon
is the world's fastest NAS platform, delivering over 100 GB/s system throughput,
and remains the world-record holder for scale-out NAS performance with 1.6 million
SPECsfs2008 CIFS operations per second. With OneFS 7.0, Isilon storage systems
now provide dramatically improved caching capability to reduce average latency by
60 percent for I/O-intensive applications.
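The efficiency comparison above (about 50 percent for traditional RAID versus over 80 percent for OneFS) can be turned into a quick capacity estimate. The sketch below is simple arithmetic rather than an Isilon sizing tool. The 1,000 TB raw capacity and the 400 TB usable target are hypothetical figures, and 0.80 is used as a lower bound for OneFS.

```python
def usable_capacity_tb(raw_tb: float, efficiency: float) -> float:
    """Usable capacity for a given raw capacity and storage efficiency (0 to 1)."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    return raw_tb * efficiency


def raw_needed_tb(usable_tb: float, efficiency: float) -> float:
    """Raw capacity required to provide a target usable capacity."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    return usable_tb / efficiency


RAW_TB = 1000.0          # hypothetical raw capacity
TARGET_USABLE_TB = 400.0  # hypothetical usable capacity target

TRADITIONAL_RAID = 0.50  # average efficiency quoted in the text
ONEFS_MINIMUM = 0.80     # "over 80 percent", treated here as a lower bound

print(f"usable from {RAW_TB:.0f} TB raw, traditional RAID: "
      f"{usable_capacity_tb(RAW_TB, TRADITIONAL_RAID):.0f} TB")
print(f"usable from {RAW_TB:.0f} TB raw, OneFS (at least): "
      f"{usable_capacity_tb(RAW_TB, ONEFS_MINIMUM):.0f} TB")
print(f"raw needed for {TARGET_USABLE_TB:.0f} TB usable, traditional RAID: "
      f"{raw_needed_tb(TARGET_USABLE_TB, TRADITIONAL_RAID):.0f} TB")
print(f"raw needed for {TARGET_USABLE_TB:.0f} TB usable, OneFS (at most): "
      f"{raw_needed_tb(TARGET_USABLE_TB, ONEFS_MINIMUM):.0f} TB")
```

Under these assumptions, the same raw capacity yields at least 60 percent more usable space at 80 percent efficiency than at 50 percent, which matters when sequencing output grows by terabytes per run.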
Intel’s HPC leadership empowers life sciences
Intel technology is used throughout HPC and is particularly prevalent in life sciences,
where Big Data challenges are now the norm. For example, Intel Xeon processors,
both the E5 and Phi lines, are accelerating parallel computing and bringing greater
accuracy to genomics analysis. Similarly, Intel software, such as the Intel Distribution
for Hadoop and Intel Manager for Hadoop, helps administrators simplify configuring
hardware and tuning Hadoop performance.
In all aspects of HPC, Intel technology and products are at the forefront. Nowhere is
this leadership more important than in life sciences and throughout the RENCI/UNC HPC
infrastructure, where Intel products are widely embedded and help researchers
and clinicians manage and interpret the genomics data deluge.
Here’s a brief overview of just a few Intel enabling technologies:
• Xeon/E5. The E5 processor, a solid foundation for HPC, delivers 80 percent
greater performance, 70 percent more energy-efficiency, and 30 percent less
network latency than earlier Xeon processors. Servers based on the E5 family
provide an optimum combination of performance, built-in capabilities, and cost-
effectiveness. From virtualization and cloud computing solutions to design
automation or real-time financial transactions, the E5 provides needed power.
• Xeon/Phi. Intel’s new line of Xeon Phi coprocessors is optimized for performance
and programmability for highly parallel workloads. The 5110P, the first member of the
line, has 60 cores at 1.053 GHz and handles 240 threads. Importantly, Intel Xeon
processors and Phi coprocessors support the same code, reducing the complexity
of development. The same techniques—such as scaling applications to many cores
and threads—can be used on both.
• Intel software. This extensive portfolio includes, for example, Intel Cluster
Studio XE, which features high performance, standards-driven compilers, libraries,
analysis tools, OpenMP, and MPI. Intel Distribution for Hadoop and Intel Manager
for Hadoop are important products for life sciences. Other offerings include Intel
Data Center Manager (DCM) and Intel Node Manager (NM) for resource/power
management, and Intel Expressway Service Gateway for cloud usage models.
• Intel fabric. HPC workloads today are too large to be managed by unspecialized
tools. Intel offers several tools designed specifically for large and complex workloads.
Among them are Intel True Scale Fabric, designed from the ground up for HPC,
and QDR-40 and QDR-80, which deliver performance that scales. These tools are
optimized to support Xeon E5 and Xeon Phi processors.
• Intel storage. Intel storage technologies are used throughout industry at every
level (enterprise, small and medium business, home). Here are a few: Intel Xeon
processors and platforms enabled with storage optimizations; solid-state drives
(SSDs) and other NVM technologies that improve storage performance; Intel Cache
Acceleration Software (CAS); and Intel’s open source Lustre file-system
support/development and Chroma management/provisioning tools.
For more information
For more information about the exciting work done at the Renaissance Computing
Institute (RENCI), visit http://www.renci.org.
To learn more about how EMC products, services, and solutions help solve your life
sciences IT challenges, contact your local representative or authorized reseller—or
visit us at www.EMC.com/isilon.
To learn more about Intel technology, visit
http://www.intel.com/content/www/us/en/healthcare-it/big-data-in-healthcare.html.
To learn how E-iRODS can solve your enterprise data management needs, visit
http://e-irods.org/.
i http://www.renci.org/resources/computing
ii “…coverage is a measure of the number of times that a specific genomic site is sequenced during a
sequencing.” – JP Sulzberger Columbia Genome Center, http://genomecenter.columbia.edu/?q=node/77.
iii The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze
next-generation resequencing data. The Toolkit offers a wide variety of tools, with a primary focus on
variant discovery and genotyping, as well as strong emphasis on data quality assurance.
http://www.broadinstitute.org/gatk/.
iv University of Oxford backgrounder on imputing,
http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#home
v VarDB is a PostgreSQL relational database. “VarDB doesn’t directly integrate with RENCI usage of
Hadoop other than occasionally store results from certain Hadoop calculations, such as allele frequencies
from VCF files, in VarDB.” – Charles Schmitt.
vi N-squared is shorthand for problems that are actually O(n^2) or similar to n-squared, such as
O(n^2.8). These are different from true NP-hard problems.
15Life sciences at RENCI: Big Data IT to manage, decipher, and inform

Mais conteúdo relacionado

Mais procurados

Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Robert Grossman
 
National Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasNational Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasPanasas
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Robert Grossman
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryIntel IT Center
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformaticsc.titus.brown
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at YorkMing Li
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangementGuy Coates
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsLarry Smarr
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckPistoia Alliance
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use casesGuy Coates
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERcscpconf
 
The Rise of Machine Intelligence
The Rise of Machine IntelligenceThe Rise of Machine Intelligence
The Rise of Machine IntelligenceLarry Smarr
 
A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017Manish K Patel
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomicsGuy Coates
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 

Mais procurados (20)

Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
National Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasNational Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with Panasas
 
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
Whither Small Data?
Whither Small Data?Whither Small Data?
Whither Small Data?
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at York
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare Diagnostics
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deck
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
 
The Rise of Machine Intelligence
The Rise of Machine IntelligenceThe Rise of Machine Intelligence
The Rise of Machine Intelligence
 
A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 

Destaque

NAGARA: SRB and iRODS
NAGARA: SRB and iRODSNAGARA: SRB and iRODS
NAGARA: SRB and iRODSMark Conrad
 
Green Shoots: Research Data Management Pilot at Imperial College London
Green Shoots:Research Data Management Pilot at Imperial College LondonGreen Shoots:Research Data Management Pilot at Imperial College London
Green Shoots: Research Data Management Pilot at Imperial College LondonTorsten Reimer
 
iRODS/Dataverse Project by Jonathan Crabtree
iRODS/Dataverse Project by Jonathan CrabtreeiRODS/Dataverse Project by Jonathan Crabtree
iRODS/Dataverse Project by Jonathan Crabtreedatascienceiqss
 
Data Management for Grown Ups
Data Management for Grown UpsData Management for Grown Ups
Data Management for Grown UpsAll Things Open
 
Research Data Management en bibliotheken
Research Data Management en bibliothekenResearch Data Management en bibliotheken
Research Data Management en bibliothekenSaskia Scheltjens
 
iRODS User Group Meeting 2016 - MUMC+
iRODS User Group Meeting 2016 - MUMC+iRODS User Group Meeting 2016 - MUMC+
iRODS User Group Meeting 2016 - MUMC+Maarten Coonen
 
iRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat SheetiRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat SheetSamuel Lampa
 
Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...
Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...
Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...The HDF-EOS Tools and Information Center
 
Private Cloud Architecture
Private Cloud ArchitecturePrivate Cloud Architecture
Private Cloud ArchitectureDerek Keats
 
File management ppt
File management pptFile management ppt
File management pptmarotti
 
I rods분석(20170313,01,김선태)
I rods분석(20170313,01,김선태)I rods분석(20170313,01,김선태)
I rods분석(20170313,01,김선태)Suntae Kim
 
旅行カバンとNFC
旅行カバンとNFC旅行カバンとNFC
旅行カバンとNFCHirokuma Ueno
 
RSA Monthly Online Fraud Report -- February 2013
RSA Monthly Online Fraud Report -- February 2013RSA Monthly Online Fraud Report -- February 2013
RSA Monthly Online Fraud Report -- February 2013EMC
 
Pozo requena fotonovel·la
Pozo requena fotonovel·laPozo requena fotonovel·la
Pozo requena fotonovel·lamgonellgomez
 

Destaque (20)

NAGARA: SRB and iRODS
NAGARA: SRB and iRODSNAGARA: SRB and iRODS
NAGARA: SRB and iRODS
 
Green Shoots: Research Data Management Pilot at Imperial College London
Green Shoots:Research Data Management Pilot at Imperial College LondonGreen Shoots:Research Data Management Pilot at Imperial College London
Green Shoots: Research Data Management Pilot at Imperial College London
 
iRODS/Dataverse Project by Jonathan Crabtree
iRODS/Dataverse Project by Jonathan CrabtreeiRODS/Dataverse Project by Jonathan Crabtree
iRODS/Dataverse Project by Jonathan Crabtree
 
Data Management for Grown Ups
Data Management for Grown UpsData Management for Grown Ups
Data Management for Grown Ups
 
Research Data Management en bibliotheken
Research Data Management en bibliothekenResearch Data Management en bibliotheken
Research Data Management en bibliotheken
 
iRODS User Group Meeting 2016 - MUMC+
iRODS User Group Meeting 2016 - MUMC+iRODS User Group Meeting 2016 - MUMC+
iRODS User Group Meeting 2016 - MUMC+
 
UDT
UDTUDT
UDT
 
ODSC and iRODS
ODSC and iRODSODSC and iRODS
ODSC and iRODS
 
iRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat SheetiRODS Rule Language Cheat Sheet
iRODS Rule Language Cheat Sheet
 
HDF5 iRODS
HDF5 iRODSHDF5 iRODS
HDF5 iRODS
 
Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...
Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...
Access HDF-EOS data with OGC Web Coverage Service - Earth Observation Applica...
 
iRODS: Interoperability in Data Management
iRODS: Interoperability in Data ManagementiRODS: Interoperability in Data Management
iRODS: Interoperability in Data Management
 
Private Cloud Architecture
Private Cloud ArchitecturePrivate Cloud Architecture
Private Cloud Architecture
 
File management ppt
File management pptFile management ppt
File management ppt
 
I rods분석(20170313,01,김선태)
I rods분석(20170313,01,김선태)I rods분석(20170313,01,김선태)
I rods분석(20170313,01,김선태)
 
旅行カバンとNFC
旅行カバンとNFC旅行カバンとNFC
旅行カバンとNFC
 
RSA Monthly Online Fraud Report -- February 2013
RSA Monthly Online Fraud Report -- February 2013RSA Monthly Online Fraud Report -- February 2013
RSA Monthly Online Fraud Report -- February 2013
 
Explorer letters
Explorer lettersExplorer letters
Explorer letters
 
Awesome powerpoint
Awesome powerpointAwesome powerpoint
Awesome powerpoint
 
Pozo requena fotonovel·la
Pozo requena fotonovel·laPozo requena fotonovel·la
Pozo requena fotonovel·la
 

Semelhante a RENCI uses Big Data IT to advance life sciences research

White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...EMC
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...EMC
 
UCSF07 - Research and HPC Infrastructure_Award_2007
UCSF07 - Research and HPC Infrastructure_Award_2007UCSF07 - Research and HPC Infrastructure_Award_2007
UCSF07 - Research and HPC Infrastructure_Award_2007Michael Williams
 
Cao report 2007-2012
Cao report 2007-2012Cao report 2007-2012
Cao report 2007-2012Elif Ceylan
 
Sequencing Genomics: The New Big Data Driver
Sequencing Genomics:The New Big Data DriverSequencing Genomics:The New Big Data Driver
Sequencing Genomics: The New Big Data DriverLarry Smarr
 
Deep learning for biomedicine
Deep learning for biomedicineDeep learning for biomedicine
Deep learning for biomedicineDeakin University
 
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...Artificial Intelligence Institute at UofSC
 
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson'sHPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson'sinside-BigData.com
 
OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...
OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...
OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...IAEME Publication
 
Where Technology Meets Medicine: SickKids High Performance Computing Data Centre
Where Technology Meets Medicine: SickKids High Performance Computing Data CentreWhere Technology Meets Medicine: SickKids High Performance Computing Data Centre
Where Technology Meets Medicine: SickKids High Performance Computing Data CentreScalar Decisions
 
Caris Life Sciences
Caris Life SciencesCaris Life Sciences
Caris Life SciencesGorman K
 
Caris Life Sciences
Caris Life SciencesCaris Life Sciences
Caris Life SciencesKim Kozlik
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Robert Grossman
 
The pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleThe pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleEnis Afgan
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 
Cancer genome repository_berkeley
Cancer genome repository_berkeleyCancer genome repository_berkeley
Cancer genome repository_berkeleyShyam Sarkar
 
Intelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosisIntelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosisIRJET Journal
 
Data supporting precision oncology fda wakibbe
Data supporting precision oncology fda wakibbeData supporting precision oncology fda wakibbe
Data supporting precision oncology fda wakibbeWarren Kibbe
 

Semelhante a RENCI uses Big Data IT to advance life sciences research (20)

White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
UCSF07 - Research and HPC Infrastructure_Award_2007
UCSF07 - Research and HPC Infrastructure_Award_2007UCSF07 - Research and HPC Infrastructure_Award_2007
UCSF07 - Research and HPC Infrastructure_Award_2007
 
Informatics
Informatics Informatics
Informatics
 
Cao report 2007-2012
Cao report 2007-2012Cao report 2007-2012
Cao report 2007-2012
 
Sequencing Genomics: The New Big Data Driver
Sequencing Genomics:The New Big Data DriverSequencing Genomics:The New Big Data Driver
Sequencing Genomics: The New Big Data Driver
 
Deep learning for biomedicine
Deep learning for biomedicineDeep learning for biomedicine
Deep learning for biomedicine
 
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
Inauguration Function - Ohio Center of Excellence in Knowledge-Enabled Comput...
 
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson'sHPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
HPC and Precision Medicine: A New Framework for Alzheimer's and Parkinson's
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...
OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...
OPTIMIZED PREDICTION IN MEDICAL DIAGNOSIS USING DNA SEQUENCES AND STRUCTURE I...
 
Where Technology Meets Medicine: SickKids High Performance Computing Data Centre
Where Technology Meets Medicine: SickKids High Performance Computing Data CentreWhere Technology Meets Medicine: SickKids High Performance Computing Data Centre
Where Technology Meets Medicine: SickKids High Performance Computing Data Centre
 
Caris Life Sciences
Caris Life SciencesCaris Life Sciences
Caris Life Sciences
 
Caris Life Sciences
Caris Life SciencesCaris Life Sciences
Caris Life Sciences
 
Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)Bionimbus Cambridge Workshop (3-28-11, v7)
Bionimbus Cambridge Workshop (3-28-11, v7)
 
The pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an exampleThe pulse of cloud computing with bioinformatics as an example
The pulse of cloud computing with bioinformatics as an example
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Cancer genome repository_berkeley
Cancer genome repository_berkeleyCancer genome repository_berkeley
Cancer genome repository_berkeley
 
Intelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosisIntelligent data analysis for medicinal diagnosis
Intelligent data analysis for medicinal diagnosis
 
Data supporting precision oncology fda wakibbe
Data supporting precision oncology fda wakibbeData supporting precision oncology fda wakibbe
Data supporting precision oncology fda wakibbe
 

Mais de EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

RENCI uses Big Data IT to advance life sciences research

Life sciences at RENCI: Big Data IT to manage, decipher, and inform

Turning Big Data into insight in the lab and therapy in the clinic is perhaps the preeminent challenge of modern life sciences. Not only must massive datasets be managed and analyzed, but the insights gleaned must also be delivered to healthcare professionals and patients in a way they can understand and use. Kirk Wilhelmsen, M.D./Ph.D., Charles Schmitt, Ph.D., and their colleagues at the Renaissance Computing Institute (RENCI) of the University of North Carolina (UNC) are at the forefront of efforts to create the IT infrastructure and tools needed to advance this ambitious goal.

RENCI’s Health & Bioscience initiatives span basic research, advanced genomics, translational medicine, and clinical decision support. Some are tightly focused, such as two “knowledge-based medicine” programs that are developing decision support tools to enhance the way physicians treat epilepsy and prostate cancer. Another, Secure Medical Workspace (SMW), is creating a platform for providing controlled access to confidential medical records stored in the Carolina Data Warehouse for Health (CDW-H). A fourth initiative, Informatics for Genetic Sequencing (IGS), is the epitome of the Big Data challenge: it is developing the end-to-end IT infrastructure needed to support advanced DNA sequencing, genomics research, and the delivery of genomics-informed healthcare.

Taken as a whole, RENCI’s bioscience efforts have a distinctly “translational” bent, which fits naturally into the Institute’s broad mission to develop technologies that boost North Carolina’s competitiveness. “Initially we looked at traditional bioinformatics and systems biology, but genomics was really starting to make the transition to medicine and there was a big gap in translational capabilities. It was a natural place to focus,” said Schmitt, RENCI director of data sciences and informatics.

The IGS project is an instructive use case in coping with Big Data. On the order of 30 human genomes are sequenced weekly for RENCI’s projects. Just one genome, depending upon the type of sequencing and the coverage, can generate 100 GB of data to manage. Capturing, analyzing, storing, and presenting the accumulating data requires a hybrid HPC (high-performance computing) infrastructure that blends traditional cluster computing with emerging tools such as iRODS (Integrated Rule-Oriented Data System) and Hadoop. Unsurprisingly, the HPC infrastructure is always a work in progress, noted Schmitt.

RENCI/UNC computing resources [i] are already significant. They include large internal clusters, links to the Open Science Grid, more than 2 PB of spinning disk storage, and roughly 3 PB of tape storage. The IGS pipeline and analysis use a substantial share of the overall computing power: RENCI’s Dell blade-based Linux clusters with more than 1,400 cores and UNC’s Dell and HP blade-based Linux clusters with nearly 1,000 nodes. Primary storage is handled by a 909 TB EMC® Isilon® system at UNC, a 1.7 PB Lustre scratch space at RENCI, and PB-scale tape storage systems at UNC and RENCI.

Intel is another important contributor to RENCI’s computing power, supplying processor, development, and systems technology used throughout the RENCI/UNC HPC infrastructure. “Intel is doing much more than just processors in HPC. We bring domain experts as well as hardware, platforms, software, and HPC leadership to life sciences and healthcare,” says Ketan Paranjape, Global Director, Healthcare and Life Sciences (see “Intel’s HPC leadership empowers life sciences”).

The RENCI Big Data infrastructure is shown in Figure 1.

Figure 1. RENCI Genomics Big Data infrastructure

Significant investments have also been made in the wet lab. UNC acquired 12 next-generation high-throughput sequencers (NGS) from Illumina, Pacific Biosciences, and Life Technologies, both to support the clinical-care mission of the UNC healthcare system and to further basic genomic and biology research.

Tackling clinical and research genomics

“There are two primary projects we are working on now,” said Schmitt. One is NCGENES (North Carolina Clinical Genomic Evaluation for NextGen Exome Sequencing). Its official description is, “a multidisciplinary effort to create a bioinformatics infrastructure and a systematic process for using whole-exome sequencing (WES) as a tool in diagnosing disease, revealing genetic markers for disease, and helping people understand the relationship between their genotype and diseases they have or are at risk of developing.”

Much of that infrastructure has been built and is in production use for NCGENES. In whole-exome sequencing, only those regions of the genome coding for expressed proteins (roughly 1.5 percent of the human genome) are sequenced. Patients of the UNC health system are the subjects. The direct goal is to identify known mutations in those sequences that are associated with disease risk or health and to provide that information to clinicians and patients. More broadly, the project is also intended to explore the ethical and psychological issues of explaining risk to patients, sometimes when there is no treatment. NCGENES is a good example of efforts to deliver translational medicine.

The second project is for the National Institute on Drug Abuse (NIDA) and involves whole-genome sequencing. Its purpose is to investigate the genetics of drug addiction. Sequencing a full genome takes about 10–15 days and costs $5K–$10K, roughly 10 times the cost of sequencing a whole exome. In terms of sequencing coverage [ii], the NCGENES program is considered moderate at 50X, whereas the NIDA project at ~10X is considered low coverage; it is, of course, sequencing the entire 3-billion base-pair human genome. The NIDA work seeks to discover low-frequency, novel variants and relies heavily on statistical imputation.

“Over a period of a year we’ve sequenced ~1,000 whole genomes for NIDA and are now processing another round of 2,500 whole genomes to be completed by end of 2013,” noted Schmitt. “NCGENES has about 250 people in process right now, and we’ll probably do another 750 over the grant period.”

The size of the data per sample (person) varies considerably between the projects. A typical whole-genome-sequenced NIDA sample averages 100 GB, versus 15 GB [1] per sample for NCGENES’ exome-sequenced samples. Currently, RENCI has on the order of 400 TB of genomics data stored on the EMC Isilon system and projects growth to 600 TB by the end of 2013.

Here’s a snapshot of the three-stage analysis pipeline RENCI has developed:

• DNA sequencing. DNA extracted from tissue samples is run through the high-throughput NGS instruments. These modern sequencers generate hundreds of millions of short DNA sequences for each patient, which must then be “assembled” into proper order to determine the genome.
  Researchers use parallelized computational workflows to assemble the genome and perform quality control on the reassembly, fixing errors in it.
• Variant calling. DNA variations (SNPs, haplotypes, indels, etc.) for an individual are detected, often using large patient populations to help resolve ambiguities in the individual’s sequence data. Data is organized into a hybrid solution that uses a relational database to store canonical variations, high-performance file systems to hold data, and a Hadoop-based approach for specialized data-intensive analysis. Links to public and private databases help researchers identify the impact of variations including, for example, whether variants have known associations with clinically relevant conditions.
• Clinical binning. The final step in the NCGENES project is the report to the physicians. Key to this stage is a process termed “clinical binning,” which is performed using custom UNC-developed software. It assigns a clinical relevancy to each variant, shown in Figure 2, allowing clinicians and patients to determine which variants they care about. Once variants are “binned,” a website delivers the information to physicians and patients (via the Secure Medical Workspace).

The overall process, from blood draw to analysis to reporting, including several stages that provide independent validation of the identified variants, is managed through a custom workflow solution developed by RENCI.

[1] These are FASTQ, BAM, and VCF files with ancillary log and metric files.
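
As a rough cross-check of the per-sample sizes, cohort counts, and coverage figures quoted for the two projects, the arithmetic can be sketched in a few lines of Python. The numbers are the ones stated in the text; the variable names, the decimal TB conversion, and the assumption that every sample matches the quoted average are illustrative only. The results are raw totals, not a statement of what RENCI actually keeps on disk.

```python
# Back-of-the-envelope estimate from figures quoted in the white paper.
# Assumptions (not from the source): decimal units, every sample at the
# quoted average size, and all names below.

GB_PER_TB = 1000

NIDA_GB_PER_GENOME = 100    # typical whole-genome sample
NCGENES_GB_PER_EXOME = 15   # typical whole-exome sample
GENOME_BP = 3_000_000_000   # ~3-billion base-pair human genome


def cohort_tb(samples: int, gb_per_sample: int) -> float:
    """Raw data volume of a cohort, in terabytes."""
    return samples * gb_per_sample / GB_PER_TB


nida_done_tb = cohort_tb(1000, NIDA_GB_PER_GENOME)        # first ~1,000 genomes
nida_next_tb = cohort_tb(2500, NIDA_GB_PER_GENOME)        # next round of 2,500
ncgenes_tb = cohort_tb(250 + 750, NCGENES_GB_PER_EXOME)   # in process + planned

# Bases read per sample = coverage depth x size of the sequenced target.
nida_bases = 10 * GENOME_BP              # ~10X over the whole genome
exome_bp = GENOME_BP * 15 // 1000        # ~1.5 percent of the genome
exome_bases = 50 * exome_bp              # 50X over the exome

print(f"NIDA, first round:  {nida_done_tb:.0f} TB")
print(f"NIDA, second round: {nida_next_tb:.0f} TB")
print(f"NCGENES, all 1,000: {ncgenes_tb:.0f} TB")
print(f"Bases per NIDA genome:    {nida_bases:,}")
print(f"Bases per NCGENES exome:  {exome_bases:,}")
```

Under these assumptions a low-coverage whole genome still involves more than ten times as many sequenced bases as a moderate-coverage exome, which is in line with the larger per-sample footprint quoted for NIDA.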

Figure 2. Clinical binning: assigning a clinical relevancy to each variant

• Bin 1 (loci with clinical utility): genes which, when mutated, result in a high risk of a clinically actionable condition. Estimated number of genes/loci: dozen(s).
• Bin 2A (loci with clinical validity): low-risk incidental information. Estimated number of genes/loci: ~20 (eventually 100s–1000s).
• Bin 2B (loci with clinical validity): medium-risk incidental information. Estimated number of genes/loci: 100s.
• Bin 2C (loci with clinical validity): high-risk incidental information. Estimated number of genes/loci: dozen(s).
• Bin 3 (loci with unknown clinical implications): all other loci. Estimated number of genes/loci: >20,000.
• Bin R (loci with important reproductive implications): carrier status for severe AR disease. Estimated number of genes/loci: hundreds.

Examples listed in the figure: BRCA1/2, MLH1, MSH2, FBN1, NF1; loci with proven PGx clinical utility; PGx variants and common risk SNPs with no proven clinical utility; APOE; genes associated with Mendelian disease for which clinical recommendations exist; Huntington’s disease; prion diseases; SCA; PS1, PS2, APP; Tay-Sachs; familial dysautonomia; CF; etc.

“Most of what we do is traditional HPC,” said Schmitt. “There’s the analytical pipeline most people associate with genomics sequencing, which is stitching (assembly) the genome back together up to the point of starting to call variations. This can be handled by the type of HPC clusters we have in place. In terms of disk space, more is always better and our usage will grow several hundred terabytes this year. At the same time, our usage per sample has dropped as we focus what we store more precisely on the needs of downstream analysis and leverage tape for archiving.”

Data analysis: Hadoop assists in variant calling

Calling variations is relatively straightforward for NCGENES because of the manageable size of exome datasets, the ready availability of software analysis tools, and well-characterized reference genomes. “For NCGENES, we call variants in a very traditional way using the GATK [iii] software package from Broad Institute,” said Schmitt. Variant calling is done in batches of 50 or 100, something easily handled by HPC clusters.
“It takes a week or less, depending upon the batch size.”

For the NIDA project, identifying meaningful variation is far more challenging. The much larger datasets, the lower coverage, the search for novel variants across the entire genome, and the need to characterize variations against a pool of genomes rather than a single reference all combine to make variant calling for NIDA a memory-intensive, computationally demanding task.

“NIDA is actually investigating new approaches to calling variants and finding haplotypes. It’s doing something called imputing genotypes [iv], and the calculations can take up to a month,” said Schmitt. “Of course, you don’t have to run it very often. You can run a batch once every six months, basically keeping up with the flow of data. We are looking at how to speed that up because clearly that’s not a very scalable solution.”

Schmitt said RENCI stays abreast of the most computationally difficult genomics problems. For example, RENCI is interested in de novo sequencing once there are approaches that can compete with or augment reference-based alignments. He added, “Developing techniques to detect rare variations, as well as combinations of variations, are of high interest to our group and we are doing research in this area. We currently aren’t doing trio sequencing.”

One increasingly popular approach to accelerating data-intensive computing is Hadoop. Essentially, Hadoop pairs a distributed file system (HDFS) with a processing framework (MapReduce). Large datasets are broken into chunks that are stored across the nodes of a cluster; a Map step processes each chunk in parallel, and a Reduce step gathers and combines the results. Hadoop’s distinguishing feature is that it schedules computation on the same nodes that hold the chunks of data to be processed. This strategy of co-locating data and processing power (proximity computing) significantly accelerates performance.

It also turns out that the Hadoop architecture is a good choice for many life sciences applications. This is largely because so much life sciences data is semi-structured or unstructured file-based data that is ideally suited to “embarrassingly parallel” computation. Moreover, the use of commodity hardware (e.g., a Linux cluster) keeps cost down, and little or no hardware modification is required.

“We’ve used a few Hadoop-specific applications. The main one is to process VCF files (variant call format) when determining allele frequency on NIDA sequences. We developed a set of tools called Hadoop VCF that lets us put a number of VCF files into Hadoop and perform MapReduce jobs across VCF files,” said Schmitt.
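
To make the Map and Reduce roles concrete, here is a minimal, single-process sketch of computing allele frequency across several VCF files. It is not RENCI’s Hadoop VCF tool set and does not use the Hadoop API; the function names, the toy input, and the restriction to bi-allelic sites with a `GT` field are all assumptions made for illustration. In a real Hadoop job the framework would run the map step on many nodes and group the emitted pairs by key before the reduce step.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Key = Tuple[str, int, str, str]   # (chrom, pos, ref, alt)
Counts = Tuple[int, int]          # (alt alleles seen, alleles called)


def map_vcf_line(line: str) -> Iterator[Tuple[Key, Counts]]:
    """Map step: emit allele counts for one VCF data line (bi-allelic only)."""
    if line.startswith("#") or not line.strip():
        return  # skip header and blank lines
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    gt_index = fields[8].split(":").index("GT")
    alt_count = called = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index].replace("|", "/")
        for allele in genotype.split("/"):
            if allele == ".":
                continue  # missing call
            called += 1
            if allele == "1":
                alt_count += 1
    yield (chrom, int(pos), ref, alt), (alt_count, called)


def reduce_counts(pairs: Iterable[Tuple[Key, Counts]]) -> Dict[Key, float]:
    """Reduce step: sum counts per variant and convert to a frequency."""
    totals = defaultdict(lambda: [0, 0])
    for key, (alt_count, called) in pairs:
        totals[key][0] += alt_count
        totals[key][1] += called
    return {key: alt / n for key, (alt, n) in totals.items() if n}


# Two toy "files" describing the same site in different samples.
file_a = [
    "##fileformat=VCFv4.1",
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t1/1",
]
file_b = [
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/0\t./.",
]

emitted = [pair for f in (file_a, file_b) for line in f for pair in map_vcf_line(line)]
freqs = reduce_counts(emitted)
print(freqs)
```

Because each line is mapped independently and the reduce step only adds counts, the work splits cleanly across files and nodes, which is the property the text describes as embarrassingly parallel.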

There are several challenges in processing NIDA sequences, not the least of which is the size of the databases against which NIDA sequences are compared, e.g., the 1000 Genomes plus other sources. “In one case we had 6,000 or so genomes,” said Schmitt. “Hadoop was a convenient, existing technology to do those kinds of parallel calculations.”

Native support of HDFS (Hadoop Distributed File System) is provided by the EMC Isilon system, where HDFS is implemented as a lightweight protocol layer between the Isilon OneFS® file system and HDFS clients. “This makes it simple for organizations to utilize protocols like NFS, REST, FTP, HTTP, etc., to ingest data for their Hadoop workflows,” says Sanjay Joshi, CTO–Life Sciences, EMC Isilon Storage Division. “If the data is already stored on the EMC Isilon scale-out NAS, then an organization simply points its Hadoop compute farm at OneFS without having to perform a time- and resource-intensive load operation of the Hadoop workflow” (see “EMC Isilon OneFS tames Big Data”). RENCI hopes to adopt this capability in order to leverage its investment in Hadoop and high-performance storage systems.

Nevertheless, Hadoop is only part of the answer. “We’ve looked at a number of uses for Hadoop. We tried some BAM processing, developing our own file formats for some of the sequencing data, but haven’t found it to be more valuable than using traditional tools,” said Schmitt. “We’ve been able to get by so far in batch mode processing, doing embarrassingly parallel calculations, but we don’t see that scaling as we move into tens of thousands of sequences. Past that, we’re pretty sure we are going to have to switch to a more data-intensive paradigm.”

Schmitt cites two concerns with Hadoop:

1) RENCI is increasingly emphasizing algorithms that are either graph-based or Markov Model-oriented and, according to Schmitt, “Hadoop isn’t necessarily the best way to scale those algorithms.”
2) Hadoop does not work well in a shared HPC cluster environment. “This keeps us from using Hadoop more. We just can’t take over a shared cluster periodically and allocate it for Hadoop,” said Schmitt.

Data management: iRODS proving its value

Data management for RENCI’s health and biosciences initiatives is fairly complicated. “Briefly, what happens is the sequencing facility puts out data (on disk), and all that gets tracked through a laboratory information management system (LIMS). We pick it up at that point,” said Schmitt. “We run all of our analysis pipelines on a single HPC cluster and the large EMC Isilon system. We keep all the intermediate and analyzed data products on the Isilon system, and our pipelines register the data products associated with each pipeline stage into the LIMS.” Intermediate processed sequencing data (FASTQ files) are moved to tape as part of the sequencing process.

UNC runs a LIMS called BSPLims that handles the processing of blood samples. RENCI has developed a related LIMS called libLims that handles its sequencing workflows for NIDA; libLims interacts with BSPLims but is customized for the more specialized NIDA workflow.

All of the canonical variant data are stored in a large database, VarDB [v], that also holds reference genomic data: “Most importantly, in this regard, is that it holds several versions of the NCBI reference genome and manages translating genomic locations between the different versions,” said Schmitt. VarDB also holds variants from public data sources, such as dbSNP and The 1000 Genomes Project, variants from UNC sequencing efforts, and variants from HGMD, the Human Gene Mutation Database.
Finally, it holds annotations on data from public databases, such as OMIM and RefSeq, as well as annotations derived from tools like Polyphen. All together, VarDB currently stores the data on the EMC Isilon system and this will steadily grow. To help cope with its Big Data management challenge—storage, access, archiving, data security, etc.—RENCI is making growing use of iRODS. In fact, RENCI is spearheading an E-iRODS development effort in which Schmitt is the leader. Broadly speaking, iRODS (the integrated Rule-Oriented Data System) is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set. RENCI is already using iRODS with the analytical pipelines. “When our analytical pipelines are processing the data, they also register that data into iRODS,” Schmitt says. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use. 9Life sciences at RENCI: Big Data IT to manage, decipher, and inform
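The ideas described above (one logical namespace over several stores, a rule fired on ingest, and access decided by file metadata) can be sketched in a few lines of plain Python. This is a toy model for illustration only: it does not use the iRODS API or rule language, and every name in it (`ToyDataGrid`, `DataObject`, the store labels `"isilon"` and `"archive"`, the `project` metadata key, the sample path) is a hypothetical stand-in.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str                              # what the user sees
    metadata: dict                                 # e.g. {"project": "NCGENES"}
    replicas: dict = field(default_factory=dict)   # store label -> stored bytes


class ToyDataGrid:
    """Toy stand-in for a rule-oriented data grid (not the iRODS API)."""

    def __init__(self, primary, replica_store):
        self.primary = primary
        self.replica_store = replica_store
        self.catalog = {}      # unified namespace: logical path -> DataObject
        self.audit_log = []    # (logical path, access allowed?) per request

    def put(self, logical_path, payload, metadata):
        obj = DataObject(logical_path, dict(metadata))
        obj.replicas[self.primary] = payload
        self.catalog[logical_path] = obj
        self._on_ingest(obj)   # rule triggered by data submission
        return obj

    def _on_ingest(self, obj):
        # Ingest rule: replicate to a second store, compressing on the way.
        original = obj.replicas[self.primary]
        obj.replicas[self.replica_store] = zlib.compress(original)

    def get(self, user_projects, logical_path):
        # Access rule: decided by metadata on the file, not by its directory.
        obj = self.catalog[logical_path]
        allowed = obj.metadata.get("project") in user_projects
        self.audit_log.append((logical_path, allowed))
        if not allowed:
            raise PermissionError(f"no project access to {logical_path}")
        return obj.replicas[self.primary]


grid = ToyDataGrid(primary="isilon", replica_store="archive")
path = "/renci/ncgenes/sample1.bam"
payload = b"ACGT" * 100
stored = grid.put(path, payload, {"project": "NCGENES"})
fetched = grid.get({"NCGENES"}, path)
```

The caller only ever names the logical path, so the physical copy can move between stores without the caller noticing, and every request passes through one place where policy and logging can be applied.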
“We originally did this as a way to let the clinical system access the raw research data,” Schmitt continued. “Within the clinical system there is the ability for a clinician looking at a patient to click on a button and download the BAM file, and we wanted a way to separate that clinical system from where we store the BAM file.”
Here’s how it works. The clinical system takes the ID of the patient and sends it to iRODS, which does a look-up and gives back the BAM file. In the process, iRODS compresses the data and pulls out just the section of the BAM file that the clinician actually wants. The Integrative Genomics Viewer (IGV) from the Broad Institute is then launched to allow the clinician to view the sequence reads associated with the variation of interest in context with the reference genome and other relevant data (e.g., locations of exons and regulatory regions). In that way, data can be moved elsewhere, maybe even to tape, and iRODS hides all of that from the clinical side.
The IGS team is now investigating the use of iRODS to automate replication of the raw data produced at UNC to storage at RENCI. “It’s not really a backup, just a redundant store. We’re looking into the process of selectively copying some of the data to put it onto tape,” noted Schmitt.
To some extent, RENCI/UNC’s archival strategy is still evolving. “FASTQs are all archived to tape and that’s put on a copy at UNC and a copy off-site. That’s our primary safety net. We are also starting to copy the BAM files to tape at RENCI. That’s a little less secure than the FASTQs, but sufficient in that we can regenerate those in the case of disaster. Those are the two main ones that we archive. The phenotypic and demographic data are all stored in databases and those are independently backed up and archived,” Schmitt said.
Data security—overcoming UNIX limitations with iRODS
Because RENCI works on multiple, shared systems in different data centers, implementing security is complex.
Basic security is provided through IT groups at UNC and RENCI that provide aspects such as anti-virus, network filtering, single sign-on, and system-level logging. “Standard user ID/password is used on the research side of our work for access to resources such as file systems or databases. The number of people with such access is very limited and governed through UNC’s IRB,” explained Schmitt. “On the clinical side there are more people accessing the data, so access is through websites that users have to authenticate against and are secured in standard ways (e.g., SSL, database server/Web servers running on VMs behind locked doors). iRODS is used to automate standard procedures, including archiving, replication, and access to raw data from users on the clinical side; this allows us to use iRODS logging and sign-on for security. We are moving to project-level access control as we bring iRODS further into our overall solution,” said Schmitt.
One problem is that UNIX directories can only go so far in managing the project orientation of data. “That becomes a real headache,” said Schmitt. “With iRODS, we can assign protection based on metadata for that file. That’s important because we have many different graduate students, medical students, and rotating bioinformaticians coming in; otherwise we would have to devote whole directory trees to them.”
Indeed, the use of iRODS is a growing trend in life sciences, according to Joshi. “Isilon customers are turning to iRODS for its rule-based data management capabilities to complement the OneFS system administration features. By leveraging both OneFS capabilities and iRODS, storage administrators not only can implement data policies for disaster recovery, archive, and replication, but can also empower research teams with capabilities to manage data throughout the study (project) lifecycle. With iRODS, investigators can take advantage of tools that allow them to automate annotation of data sets with project information, move data based on the project lifecycle, and find the data based on study attributes when they need it.”
Protected insight into Big Data: the Secure Medical Workspace
At the end of the day, the goal is to be able to deliver important genomics information to both clinicians and researchers. To accomplish this part of its broad genomics infrastructure mission, RENCI, in collaboration with UNC TraCS, the School of Information and Library Science (SILS), and UNC Hospitals, has developed the Secure Medical Workspace (SMW) system to enable the CDW-H to provide researchers and healthcare professionals secure access to patient records. The SMW shown in Figure 3 combines a secure centralized infrastructure with virtualization and data leakage protection technologies to allow researchers to analyze their research data, while ensuring sensitive patient information remains within the SMW environment.
“It’s a front-end to get to the data,” said Schmitt. “So for those people who need direct access to sensitive data containing PHI, we’re using this secure workspace as a way to give them access to data files.” Authorized researchers connect to SMW from their local computing devices over a secure network connection to a dedicated virtual workspace.
Figure 3. The Secure Medical Workspace
“It’s a virtualization solution where we can give a researcher a virtual server, and once on that server the researcher can get access to data, either directly attached to that server or remote somewhere else. But we include data leakage protection on the server, which gives us protection and screens against any data being pulled outside of the system,” explained Schmitt. “Yet, researchers can freely bring their own data and tools onto the server.”
There are commercial solutions that allow you to set policies for who can take data out and what happens when someone tries to take data out. “The way that we have favored doing this,” he continued, “is if someone tries to copy data out, we allow it but throw up a warning screen saying you have to abide by your data usage agreement. That agreement and the data removed from the server are then stored for compliance audits.”
Big Data’s persistent challenges
Amid the substantial progress in developing an infrastructure to handle life sciences’ Big Data challenge, many thorny challenges persist, noted Schmitt. Consider that a database of sequenced and variant data associated with 10,000 patients would have roughly a petabyte of data. Working with such a massive data repository complicates basically everything: storage, replication, ongoing analysis, traditional ETL database functions, etc.
Collaboration, for example, remains problematic, with data transmission the biggest issue. RENCI’s current collaboration with UC San Diego and the Scripps Institute, explained Schmitt, “has been done by sending BAM files in batches. The first batch took a month to send. Then, talking back and forth by phone about issues regarding the data takes more time. It’s not a great process,” he says. Schmitt continued: “We are looking at some of the advanced networking coming out of NSF to get the bandwidth we want to move data. Of course that’s all kind of experimental right now.
We are exploring using some of the OpenStack and Open Science Cloud offerings as a way to help collaborate.”
Large-scale computation on Big Data, particularly some of the so-called n-squared problems (endnote vi), remains challenging. “We continue to explore Hadoop as one answer down the road, but we are looking at other approaches, including data flow solutions and systems for computing over large-scale graphs,” said Schmitt.
Archiving is another bottleneck. “Our goal for UNC and the UNC healthcare system is to be able to manage storing a genome for every individual patient and using that for research, but to get to that level cost-wise is going to be very difficult in terms of data storage,” Schmitt continued. “We need a better idea of what data we can throw away and when we can throw away data, and how to represent data at various levels of hierarchy.”
Nevertheless, RENCI’s progress on all fronts has been substantial. UNC healthcare professionals are able to look at patient genomic data for clinical care through the NCGENES project, the last stage in RENCI’s analysis and data delivery pipeline. The NIDA project is longer-term, and still in data collection and analysis mode, but many of the kinks in collecting and processing the larger NIDA sample datasets have been worked out. RENCI is poised to play a growing and important role developing the HPC infrastructure and necessary analysis pipelines to support life sciences and healthcare.
What IGS will deliver
In addition to handling the basic processing of next-generation DNA sequencer (NGS) output, the RENCI-built Informatics for Genetic Sequencing (IGS) infrastructure continues to be enhanced in order to support:
• Improved population-oriented queries: Given a variant, the system will find the frequency of that variant and related haplotypes in a large population to help determine whether the variant is potentially deleterious.
• Automated annotation: The system will extract data from multiple different source databases, extract annotation, and incorporate it back into the variant database for use by researchers, thus providing an increasingly diverse range of annotation sources.
• Reference rationalization: Data in the system could be used to redefine the “reference” genome, the template used to compare genomes from different individuals.
• Improved variant analysis: Enhanced data processing will help researchers identify additional information about genetic variation between individuals beyond that which is possible with current technologies.
• Visualization: Data visualization will help enable new insights and inspire new research questions.
• Metadata grid: The system will enable automated generation and propagation of metadata to enhance analysis and data management and to guide computational and data workflows.
EMC Isilon OneFS tames Big Data
EMC Isilon OneFS 7.0 is designed to address the convergence of Big Data and enterprise IT, and extend the benefits of Isilon scale-out NAS architecture to a wider range of enterprise storage needs.
OneFS combines the three layers of traditional storage architectures (the file system, volume manager, and RAID) into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster.
The advantages of OneFS for NGS are many:
• Scalable: Scale out as needs grow. Linear scale with increasing capacity: from 18 TB to 20 PB in a single file system and a single global namespace.
• Predictable: Dynamic content balancing is performed as nodes are added, upgraded, or as capacity changes. No added management time is required, because this process is simple.
• Available: OneFS is “self-healing.” It protects your data from power loss, node or disk failures, and loss of quorum and storage rebuild by distributing data, metadata, and parity across all nodes.
• Efficient: Compared to the average 50 percent efficiency of traditional RAID systems, OneFS provides over 80 percent efficiency, independent of CPU compute or cache. This efficiency is achieved by tiering the process into three node types, as shown in the figure alongside, and by the pools within these node types.
• Enterprise-ready: Administration of the storage clusters is via an intuitive Web-based UI. Connectivity to your process is through standard protocols: CIFS, SMB, NFS, FTP/HTTP, Object, and HDFS. Standardized authentication and access control is available at scale: AD, LDAP, and NIS.
Isilon is the only scale-out NAS offering that provides enterprise capabilities at scale to manage rapidly growing unstructured data assets more effectively. Isilon OneFS provides data protection through snapshots across the whole cluster, and is the only scale-out NAS solution compliant with SEC 17a-4 standards. Isilon is the world's fastest NAS platform, delivering over 100 GB/s system throughput, and remains the world-record holder for scale-out NAS performance with 1.6 million SpecSFS2008 CIFS operations per second. With OneFS 7.0, Isilon storage systems now provide dramatically improved caching capability to reduce average latency by 60 percent for I/O-intensive applications.
Intel’s HPC leadership empowers life sciences
Intel technology is used throughout HPC and is particularly prevalent in life sciences, where Big Data challenges are now the norm. For example, Intel Xeon processors, both the E5 and Phi lines, are accelerating parallel computing and bringing greater accuracy to genomics analysis. Similarly, Intel software, such as the Intel Distribution for Hadoop and Intel Manager for Hadoop, helps administrators simplify configuring hardware and tuning Hadoop performance. In all aspects of HPC, Intel technology and products are at the forefront.
Nowhere is this leadership more important than life sciences and throughout the RENCI/UNC HPC infrastructure, where Intel products are widely embedded and helping researchers and clinicians manage and interpret the genomics data deluge. Here’s a brief overview of just a few Intel enabling technologies:
• Xeon/E5. The E5 processor, a solid foundation for HPC, delivers 80 percent greater performance, 70 percent more energy-efficiency, and 30 percent less network latency than earlier Xeon processors. Servers based on the E5 family provide an optimum combination of performance, built-in capabilities, and cost-effectiveness. From virtualization and cloud computing solutions to design automation or real-time financial transactions, the E5 provides needed power.
• Xeon/Phi. Intel’s new line of Xeon Phi coprocessors is optimized for performance and programmability for highly parallel workloads. The 5110P, first member of the line, has 60 cores at 1.053 GHz and handles 240 threads. Importantly, Intel Xeon processors and Phi coprocessors support the same code, reducing the complexity of development. The same techniques, such as scaling applications to many cores and threads, can be used on both.
• Intel software. This extensive portfolio includes, for example, Intel Cluster Studio XE, which features high performance, standards-driven compilers, libraries, analysis tools, OpenMP, and MPI. Intel Distribution for Hadoop and Intel Manager for Hadoop are important products for life sciences. Other offerings include Intel Data Center Manager (DCM) and Intel Node Manager (NM) for resource/power management, and Intel Expressway Service Gateway for cloud usage models.
• Intel fabric. HPC workloads today are too large to be managed by unspecialized tools. Intel has several specifically designed for large and complex workloads. Among them are Intel True Scale Fabric, designed from the ground up for HPC, and QDR-40 and QDR-80, which deliver performance that scales. These tools are optimized to support Xeon E5 and Xeon Phi processors.
• Intel storage. Intel storage technologies are used throughout industry at every level (enterprise, small and medium business, home). Here are a few: Intel Xeon processors and platforms enabled with beneficial storage optimizations; solid-state drives (SSDs) and other NVM technologies that improve storage performance; Intel Cache Acceleration Software (CAS); and Intel’s open source Lustre file-system support/development and Chroma management/provisioning tools.
For more information
For more information about the exciting work done at the Renaissance Computing Institute (RENCI), visit http://www.renci.org.
To learn more about how EMC products, services, and solutions help solve your life sciences IT challenges, contact your local representative or authorized reseller, or visit us at www.EMC.com/isilon.
To learn more about Intel technology, visit http://www.intel.com/content/www/us/en/healthcare-it/big-data-in-healthcare.html.
To learn how E-iRODS can solve your enterprise data management needs, visit http://e-irods.org/.
i http://www.renci.org/resources/computing
ii “…coverage is a measure of the number of times that a specific genomic site is sequenced during a sequencing.” – JP Sulzberger Columbia Genome Center, http://genomecenter.columbia.edu/?q=node/77.
iii The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze next-generation resequencing data. The Toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping, as well as strong emphasis on data quality assurance. http://www.broadinstitute.org/gatk/.
iv University of Oxford backgrounder on imputing, http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#home
v VarDB is a PostgreSQL relational database. “VarDB doesn’t directly integrate with RENCI usage of Hadoop other than occasionally store results from certain Hadoop calculations, such as allele frequencies from VCF files, in VarDB.” – Charles Schmitt.
vi N-squared is shorthand for problems that are actually O(n^2) or similar to n-squared, such as O(n^2.8). That would be different from true NP-hard problems.