The Data Lake: Taking Big Data Beyond the Cloud

by
Mark Herman, Executive Vice President, Booz Allen Hamilton
Michael Delurey, Principal, Booz Allen Hamilton

©2013 Booz Allen Hamilton Inc. All rights reserved. No part of this document may be reproduced without prior written permission of Booz Allen Hamilton.
The bigger that big data gets, the
more it seems to elude our grasp. While
it holds great potential for creating new
opportunities in every field, big data is
growing so fast that it is now outpacing
the ability of our current tools to take
full advantage of it.
Much of the problem lies in the need
to extensively prepare the data before it
can be analyzed. Data must be converted
into recognizable formats—a laborious,
time-consuming process that becomes
increasingly impractical as data collections
grow larger. Although organizations are
amassing impressive amounts of data, they
simply do not have the time or resources
to prepare it all in the traditional manner.
This is particularly an issue with
“unstructured” data that does not easily
lend itself to formatting, such as photo-
graphs, doctors’ examination notes,
police accident reports, and posts on
social media sites. Unstructured data
accounts for much of the explosion in big
data today, and is widely seen as holding
the most promise for creating new areas
of business growth and government
efficiency. But because unstructured data
is so difficult to prepare, its enormous
value remains largely untapped.
With such constraints, organizations
are now reaching the limits of what they
can do with big data. They are going as
far as the current tools will take them, but
no further. And as big data grows larger,
organizations will be increasingly
inundated with information that they
have only a narrow ability to use. It is like
the line, “Water, water, everywhere…”
What is needed is an entirely new
approach to this overwhelming flood of
data, one that can manage it and make it
useful, no matter how big it grows. That is
the concept behind Booz Allen Hamilton’s
“data lake,” a groundbreaking invention
that scales to an organization’s growing
data, and makes it easily accessible.
With the data lake, an organization’s
repository of information—including
structured and unstructured data—is
consolidated in a single, large table. Every
inquiry can make use of the entire body
of information stored in the data lake—
and it is all available at once.
The data lake completely eliminates
the current cumbersome data-preparation
process. All types of data, including
unstructured data, are smoothly and rapidly
“ingested” into the data lake. There is no
longer any need for the rigid, regimented
data structures—essentially data silos—
that currently house most data. Such silos
are difficult to connect, which has long
hampered the ability of organizations to
integrate and analyze their data. The data
lake solves this problem by eliminating the
silos altogether.
With the data lake, it now becomes
practical—in terms of time, cost, and
analytic ability—to turn big data into
opportunity. We can now ask more far-reaching and
complex questions, and find the often-hidden patterns
and relationships that can lead to game-changing
knowledge and insight.
More than the Cloud
With the advent of cloud computing, business and
government organizations are now storing and analyzing
far larger amounts of data than ever before. But simply
bringing a great deal of data together in the cloud is
not the same as creating a data lake. Organizations may
have embraced the cloud, but if they continue to use
conventional tools, they still must laboriously prepare
the data and place it in its designated location (i.e., the
silo). Despite its promise to revolutionize data analysis,
the cloud does not truly integrate data—it simply makes
the data silos taller and fatter.
While the data lake relies on cloud computing,
it represents a new and different mindset. Big data
requires organizations to stop thinking in terms of
data mining and data warehouses—the equivalent of
industrial-era processes—and to begin considering how
data can be more fluid and expansive, like in a data lake.
Because the conventional approach makes it difficult to
integrate data, even in the cloud, we tend to use the
cloud mostly for storage and pull out portions of the data for
analysis. But no matter how powerful our analytics are,
because we are applying them only to discrete datasets at
any time, we never see the full picture. With the data lake,
however, all of our data remains in the cloud, consoli-
dated and connected. We can now apply our analytics to
the whole of the data, and get far deeper insights.
Organizations may be concerned that by consoli-
dating their data, they might be making it more
vulnerable. Just the opposite is true. The data lake incor-
porates a granular level of data security and privacy not
available in conventional cloud computing.[1]
The data lake was initially created to achieve a
high-stakes goal. The US government needed a way
to integrate many sources and types of intelligence
data, in a secure manner, to search for terrorists and
other threats. Booz Allen assisted the government in
developing the data lake to achieve that goal, as part
of a larger computing framework known as the Cloud
Analytics Reference Architecture.
The data lake and Cloud Analytics Reference Architecture
are now being adapted to the larger business
and government communities, bringing with them a
range of features that have been successfully tested in
the most demanding situations.

[1] See Booz Allen Viewpoint, "Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud," http://www.boozallen.com/media/file/Enabling_Cloud_Analytics_with_Data-Level_Security.pdf
Building the Data Lake
One of the biggest limitations of the conventional
approach to data analysis is that analysts often need
to spend the bulk of their time just readying the data
for use. With each new line of inquiry, a specific data
structure and analytic is custom-built. All information
entered into the data structure must first be converted
into a recognizable format, often a slow, painstaking
task. For example, an analyst might be faced with
merging several different data sources that each use
different fields. The analyst must decide which fields to
use and whether entirely new ones need to be created.
The more complex the query, the more data sources
typically must be homogenized. At some organizations,
analysts may spend as much as 80 percent of their time
preparing the data, leaving just 20 percent for conducting
actual analysis. Formatting also carries the risk of data-
entry errors. With the data lake, there are no individual
data structures—and so there is no need for formal
data formatting. Data from a wide range of sources is
smoothly and easily ingested into the data lake.
One metaphor for the data lake might be a giant
collection grid, like a spreadsheet—one with billions
of rows and billions of columns available to hold
data. Each cell of the grid contains a piece of data—a
document, perhaps, or maybe a paragraph or even a
single word from the document. Cells might contain
names, photographs, incident reports, or Twitter
feeds—anything and everything. It does not matter
where in the grid each bit of information is located. It
also makes no difference where the data comes from,
whether it is formatted, or how it might relate to any
other piece of information in the data lake. The data
simply takes its place in the cell, and after minimal
preparation is ready for use.
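
To make the grid metaphor concrete, here is a minimal sketch, in Python, of a sparse grid keyed by (row, column) coordinates. The class name, fields, and sources are invented for illustration; this is not a description of Booz Allen's actual implementation.

```python
# A minimal, hypothetical sketch of the "giant grid" idea: a sparse
# dictionary keyed by (row, column) stands in for a table with billions
# of rows and columns. Only occupied cells consume any space.

class DataLakeGrid:
    def __init__(self):
        self.cells = {}        # (row, column) -> one piece of data
        self.next_row = 0      # next open row for incoming data

    def ingest(self, source, pieces):
        """Place each piece of data from any source into open cells,
        with no formatting or schema required up front."""
        row = self.next_row
        for col, piece in enumerate(pieces):
            self.cells[(row, col)] = {"source": source, "value": piece}
        self.next_row += 1
        return row

lake = DataLakeGrid()
lake.ingest("police_reports", ["Accident at 5th and Main", "icy road, no injuries"])
lake.ingest("twitter", ["Major pileup on I-95 this morning"])
lake.ingest("photos", [b"\x89PNG..."])   # unstructured binary data fits too

print(len(lake.cells), "occupied cells")   # -> 4 occupied cells
```

Because the mapping is sparse, cells that are never filled cost nothing, which is the property the storage discussion later in the paper relies on.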
The image of the grid helps describe the difference
between data mining and the data lake. If we want to
mine precious metals, we have to find where they are,
then dig deep to retrieve them. But imagine if, when the
Earth was formed, nuggets of precious metals were laid
out in a big grid on top of the ground. We could just
walk along, picking up what we wanted. The data lake
makes information just as readily available.
The process of placing the data in open cells as it
comes in gives the ingest process remarkable speed.
Large amounts of data that might take 3 weeks to
prepare using conventional cloud computing can be
placed into the data lake in as little as 3 hours. This
enables organizations to achieve substantial savings in
IT resources and manpower. Just as important, it frees
analysts for the more important task of finding connec-
tions and value in the data. Many organizations today
are trying to “do more with less.” That is difficult with
the conventional approach, but becomes possible, for
the first time, with the data lake.
Opening Up the Data
The ingest process of the data lake also removes
another disadvantage of the conventional approach—
the need to pre-define our questions. With conventional
computing techniques, we have to know in advance
what kinds of answers we are looking for and where
in the existing data the computer needs to look to
answer the inquiry. Analysts do not really ask questions
of the data—they form hypotheses well in advance of
the actual analysis, and then create data structures and
analytics that will enable them to test those hypotheses.
The only results that come back are the ones that the
custom-made databases and analytics happen to provide.
What makes this exercise even more constraining is
that the data supporting an analysis typically contains
only a portion of the potentially available information.
Because the process of formatting and structuring the
data is so time-intensive, analysts have no choice but
to cull the data by some method. One of the most
prevalent techniques is to discount (and even ignore)
unstructured data. This simplifies the data ingest, but it
severely reduces the value of the data for analysis.
Hampered by these severe limitations, analysts can
pose only narrow questions of the data. And there
is a risk that the data structures will become closed-
loop systems—echo chambers that merely validate the
original hypotheses. When we ask the system what is
important, it points to the data that we happened to put
in. The fact that a particular piece of data is included
in a database tends to make it de facto significant—it is
important only because the hypothesis sees it that way.
With the data lake, data is ingested with a wide-open
view as to the queries that may come later. Because
there are no structures, we can get all of the data in—
all 100 variables, or 500, or any other number, so that
the data in its totality becomes available. Organizations
may have a great deal of data stored in the cloud, but
without the data lake they cannot easily connect it all,
and discover the often-hidden relationships in the world
around us. It is in those relationships that knowledge
and insight—and opportunity—reside.
Tagging the Data
The data lake also radically differs from conven-
tional cloud computing in the way the data itself is
managed. When a piece of data is ingested, certain
details, called metadata (or “data about the data”), are
added so that the basic information can be quickly
located and identified. For example, an investor’s
portfolio balance (the data) might be stored with the
name of the investor, the account number, the location
of the account, the types of investments, the country
the investor lives in, and so on. These metadata “tags”
serve the same purpose as old-style card catalogues,
which allow readers to find a book by searching the
author, title, or subject. As with the card catalogues, tags
enable us to find particular information from a number
of different starting points—but with today’s tagging
abilities, we can characterize data in nearly limitless
ways. The more tags, the more complex and rich the
analytics can become.
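
As a rough illustration of how such tags might work in practice, the sketch below attaches metadata to each ingested record and builds a simple inverted index so the same record can be found from any of its tags. The field names and the index structure are assumptions made for the example.

```python
# Hypothetical sketch: each ingested piece of data carries metadata tags,
# and a simple inverted index lets us find it from any tag, much like a
# card catalogue searchable by author, title, or subject.
from collections import defaultdict

records = []                       # the data itself
tag_index = defaultdict(set)       # (tag, value) -> ids of matching records

def ingest(data, **tags):
    rid = len(records)
    records.append({"data": data, "tags": tags})
    for tag, value in tags.items():
        tag_index[(tag, value)].add(rid)
    return rid

ingest(250_000.00, investor="J. Smith", account="12-3456",
       country="US", investment_type="bonds")
ingest(410_500.00, investor="A. Jones", account="98-7654",
       country="UK", investment_type="equities")

# Find every record tagged with a given country, regardless of source.
for rid in tag_index[("country", "US")]:
    print(records[rid]["data"], records[rid]["tags"]["investor"])
```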
With the tags, we can look not only for connec-
tions and patterns in the data, but in the tags as well.
To consider how this technology might be applied,
imagine if a pharmaceutical company were able to
fully integrate a wide range of public data to identify
drug compounds with few adverse reactions, and a
high likelihood of clinical and commercial success.
Those sources might include social media and market
data—to help determine the need—and clinical test
data, chemical structure, disease analysis, even infor-
mation about patents—to find where gaps might exist.
In a sense, the pharmaceutical company is looking for a
needle in a haystack, a prohibitively expensive and time-
consuming task with conventional cloud computing.
However, if the structured and unstructured data
is appropriately tagged and placed in the data lake, it
becomes cost-effective to find the essential connections
in all that data, and make the needle stand out brightly.
The data lake allows us to ask questions and
search for patterns using either the data itself, the tags
themselves, or a combination of both. We can begin
our search with any piece of data or tag—for example,
a market analysis or the existing patents on a type of
drug—and pivot off of it in any direction to look for
connections.
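
A minimal, hypothetical sketch of that pivoting idea follows: starting from one tag value, it collects the records that carry the tag and then fans out to every other tag those records hold. The record fields follow the pharmaceutical example and are invented for illustration.

```python
# Hypothetical sketch of pivoting: start from any tag, gather the records
# it touches, then fan out to every other tag those records carry.
from collections import defaultdict

records = [
    {"data": "patent US-123", "tags": {"compound": "X-17", "area": "oncology"}},
    {"data": "trial report",  "tags": {"compound": "X-17", "outcome": "phase II pass"}},
    {"data": "market study",  "tags": {"area": "oncology", "demand": "high"}},
]

def pivot(tag, value):
    related = defaultdict(set)
    for rec in records:
        if rec["tags"].get(tag) == value:
            for t, v in rec["tags"].items():
                if (t, v) != (tag, value):
                    related[t].add(v)
    return dict(related)

# Start from a drug compound and see what connects to it.
print(pivot("compound", "X-17"))
# -> {'area': {'oncology'}, 'outcome': {'phase II pass'}}
```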
While the process of tagging information is not new,
the data lake uses it in a unique way—as the primary
method of locating and managing the data. With the
tags, the rigid data structures that so limit the conven-
tional approach are no longer needed.
Along with the streamlined ingest process, tags help
give the data lake its speed. When organizations need
to update or search the data in new ways, they do not
have to tear down and rebuild data structures, as in the
conventional method. They can simply update the tags
already in place.
Tagging all of the data, and at a much more
granular level than is possible in the conventional cloud
approach, greatly expands the value that big data can
provide. Information in the data lake is not random and
chaotic, but rather is purposeful. The tags help make
the data lake like a viscous medium that holds the data
in place, and at the same time fosters connections.
The tags also provide a strong new layer of security.
We can tag each piece of data, down to the image or
paragraph in a document, with the relevant restrictions,
authorities, and security and privacy levels. Organizations
can establish rules regarding which information can be
shared, with whom, and under what circumstances.
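
The sketch below illustrates one way such rules might be expressed: each piece of data carries a set of required authorizations, and a query returns only the cells a user is cleared to see. The label scheme is a simplification, loosely modeled on the visibility labels used in systems such as Apache Accumulo, and is not the paper's specific mechanism.

```python
# Hypothetical sketch of tag-based security: each cell carries the set of
# authorizations needed to view it, and queries filter on the user's
# authorizations before any data is returned.

cells = [
    {"value": "portfolio balance: $250,000", "visibility": {"finance"}},
    {"value": "investor home address",       "visibility": {"finance", "pii"}},
    {"value": "public market analysis",      "visibility": set()},
]

def query(user_auths):
    """Return only the cells whose required labels the user holds."""
    return [c["value"] for c in cells if c["visibility"] <= set(user_auths)]

print(query(["finance"]))          # balance and public analysis, no PII
print(query(["finance", "pii"]))   # everything
```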
A New Way of Storing Data
With the conventional approach, data storage is
expensive—even in the cloud. The reason is that so
much space is wasted. Imagine a spreadsheet combining
two data sources, an original one with 100 fields and the
other with 50. The process of combining means that
we will be adding 50 new “columns” into the original
spreadsheet. Rows from the original will hold no data
for the new columns, and rows from the new source
will hold no data from the original. The result will be a
great deal of empty cells. This is wasted storage space,
and creates the opportunity for a great many errors.
In the data lake, however, every cell is filled—no space
is wasted. This makes it possible to store vast amounts of
data in far less space than would be required for even
relatively small conventional cloud databases. As a result,
the data lake can cost-effectively scale to an organiza-
tion’s growing data, including multiple outside sources.
The data lake’s almost limitless capacity enables
organizations to store data in a variety of different
forms, to aid in later analysis. A financial institution,
for example, could store records of certain transac-
tions converted into all of the world’s major currencies.
Or, a company could translate every document on
a particular subject into Chinese, and store it until it
might be needed.
One of the more transformative aspects of the
data lake is that it stores every type of data equally—
not just structured and unstructured, but also batch
and streaming. Batch data is typically collected on
an automated basis and then delivered for analysis en
masse—for example, the utility meter readings from
homes. Streaming data is information from a continuous
feed, such as video surveillance.
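
To illustrate the point that batch and streaming data are stored the same way, the sketch below pushes a batch of meter readings and a simulated streaming feed through one ingest path. The function and source names are hypothetical.

```python
# Hypothetical sketch: the same ingest path handles a batch delivery and a
# streaming feed, because the lake does not care how the data arrives.
import time

lake = []

def ingest(source, items):
    for item in items:
        lake.append({"source": source, "value": item, "ingested_at": time.time()})

# Batch: utility meter readings delivered en masse.
ingest("utility_meters", [1523.4, 1601.0, 1498.7])

# Streaming: a continuous feed, modeled here as a generator.
def video_frames(n):
    for i in range(n):
        yield f"frame-{i}"

ingest("surveillance_cam_7", video_frames(3))

print(len(lake))   # -> 6 pieces of data, batch and streaming alike
```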
Formatting unstructured, batch, and streaming data
inevitably strips it of much of its richness. And even if
a portion of the information can be put into a conven-
tional cloud database, we are still constrained by limited,
pre-defined questions. The data lake holds no such
constraints. When unstructured, batch, and streaming
data are ingested, analytics can take advantage of the
tagging approach to begin to look for patterns that
naturally emerge. All types of data, and the value they
hold, now become fully accessible.
The US military is taking advantage of this capability
to help track insurgents and others who are planting
improvised explosive devices (IEDs) and other bombs.
Many of the military’s data sources include unstruc-
tured data, and using the conventional approach—with
its extensive preparation—had proved unwieldy and
time-consuming. With the data lake, the military is
now able to quickly integrate and analyze its vast array
of disparate data sources—including its unstructured
data—giving military commanders unprecedented
situational awareness. This is another example of why
simply amassing large amounts of data does not create
a data lake. The military was collecting an enormous
quantity of data, but without the data lake could not
make full use of it to try to stop IEDs. Commanders
have reported that the current approach—which has the
data lake as its centerpiece—is saving more lives, and at
a lower operating cost than the traditional methods.
Accessing the Data for Analytics
One of the chief drawbacks of the conventional
approach, which the cloud does not ameliorate, is that
it essentially samples the data. When we have questions
(or want to test hypotheses), we select a sample of the
available data and apply analytics to it. The problem is
that we are never quite sure we are pulling the right
sample—that is, whether it is really representative of
the whole. The data lake eliminates sampling. We no
longer have to guess about which data to use, because
we are using it all.
With the data lake, our information is available
for analysis on-demand, when the need arises. The
conventional approach not only requires extensive
data preparation, but it is difficult to change databases
as queries change. Say the pharmaceutical company
wants to add new data sources to identify promising
drug compounds, or perhaps wants to change the type
of financial analyses it uses. With the conventional
approach, analysts would have to tear down the initial
data and analytics structures, and re-engineer new ones.
With the data lake, analysts would simply add the new
data, and ask the new questions.
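
A small sketch of that difference, under the assumption of a schema-free store: adding a new data source is just another ingest call, and a new question is just a new pass over the data, with nothing to tear down or rebuild. The sources and fields follow the pharmaceutical example and are invented for illustration.

```python
# Hypothetical sketch: with no fixed schema, a new data source is simply
# another ingest call, and a new question is simply a new filter.

lake = []

def ingest(source, record):
    lake.append({"source": source, **record})

# Original sources.
ingest("clinical_trials", {"compound": "X-17", "adverse_events": 2})
ingest("patents",         {"compound": "X-17", "expires": 2027})

# Later: a brand-new source arrives; no schema change, no rebuild.
ingest("market_data",     {"compound": "X-17", "projected_demand": "high"})

# A new question is a new pass over the same store: which compounds show
# few adverse events in trials AND high projected demand in the market?
low_risk  = {r["compound"] for r in lake
             if r["source"] == "clinical_trials" and r["adverse_events"] < 5}
in_demand = {r["compound"] for r in lake
             if r["source"] == "market_data" and r["projected_demand"] == "high"}
print(low_risk & in_demand)   # -> {'X-17'}
```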
Because it is not easy to change conventional data
structures, the information they contain can become
outdated and even obsolete fairly quickly. By contrast,
we are able to add new information to the data lake the
moment we need it.
This ease of access sets the stage for the
advanced, high-powered analytics that can point the
way to top-line business growth, and help government
achieve its goals in innovative ways. Analytics that search
for connections and look for patterns have long been
hamstrung by being confined to limited, rigid datasets
and databases. The data lake frees them to search for
knowledge and insight across all of the data. In essence,
it allows the analytics, for the first time, to reach their
true potential.
Because there is no need to continually engineer and
re-engineer data structures, the data lake also becomes
accessible to non-technical subject matter experts.
They no longer need to rely on computer scientists and
others to explore the data—they can ask the questions
themselves. Subject matter experts best understand
how big data can provide value to their businesses and
agencies. The data lake helps put the answers directly in
their hands.
A New Mindset
Virtually every aspect of the data lake creates cost
savings and efficiencies, from freeing up analysts to its
ability to easily and inexpensively scale to an organi-
zation’s growing data. Because the data lake enables
organizations to gather and analyze ever-greater
amounts of data, it also gives them new opportunities
for top-line revenue growth. The data lake enables both
business and government to reach that tipping point at
which data helps us to do things not just cheaper and
better, but in ways we have not yet imagined.
Organizations may believe that because they are
now in the cloud and can put all their data in one place,
they already have a version of the data lake. But greater
amounts of data—no matter how large—will not
necessarily yield more knowledge and insight. The trick
is to connect the data and make it useful—essentially,
to create the kinds of conditions that can turn big data
into opportunity. The data lake and the larger Cloud
Analytics Reference Architecture represent a revolu-
tionary approach—and a new mindset—that make
those conditions possible. Opportunity is out there, if
we have the tools to look for it.
FOR MORE INFORMATION
Mark Herman
herman_mark@bah.com
703-902-5986
Michael Delurey
delurey_mike@bah.com
703-902-6858
www.boozallen.com/cloud
This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud
solutions, challenges, and opportunities across government and business. For media inquiries or more information on reproducing this
document, please contact:
James Fisher—Senior Manager, Media Relations, 703-377-7595, fisher_james_w@bah.com
Carrie Lake—Manager, Media Relations, 703-377-7785, lake_carrie@bah.com