With the data lake, an organization's repository of information, including structured and unstructured data, is consolidated in a single, large table. Every inquiry can make use of the entire body of information stored in the data lake, and it is all available at once.
opportunity. We can now ask more far-reaching and
complex questions, and find the often-hidden patterns
and relationships that can lead to game-changing
knowledge and insight.
More than the Cloud
With the advent of cloud computing, business and
government organizations are now storing and analyzing
far larger amounts of data than ever before. But simply
bringing a great deal of data together in the cloud is
not the same as creating a data lake. Organizations may
have embraced the cloud, but if they continue to use
conventional tools, they still must laboriously prepare
the data and place it in its designated location (i.e., the
silo). Despite its promise to revolutionize data analysis,
the cloud does not truly integrate data—it simply makes
the data silos taller and fatter.
While the data lake relies on cloud computing,
it represents a new and different mindset. Big data
requires organizations to stop thinking in terms of
data mining and data warehouses—the equivalent of
industrial-era processes—and to begin considering how
data can be more fluid and expansive, like in a data lake.
Because the conventional approach makes it difficult to
integrate data, even in the cloud, we tend to use the
cloud mostly for storage, and pull out portions of the data for
analysis. But no matter how powerful our analytics are,
because we are applying them only to discrete datasets at
any time, we never see the full picture. With the data lake,
however, all of our data remains in the cloud, consoli-
dated and connected. We can now apply our analytics to
the whole of the data, and get far deeper insights.
Organizations may be concerned that by consoli-
dating their data, they might be making it more
vulnerable. Just the opposite is true. The data lake incor-
porates a granular level of data security and privacy not
available in conventional cloud computing.1
The data lake was initially created to achieve a
high-stakes goal. The US government needed a way
to integrate many sources and types of intelligence
data, in a secure manner, to search for terrorists and
other threats. Booz Allen assisted the government in
developing the data lake to achieve that goal, as part
of a larger computing framework known as the Cloud
Analytics Reference Architecture.
The data lake and Cloud Analytics Reference Architecture are now being adapted to the larger business and government communities, bringing with them a range of features that have been successfully tested in the most demanding situations.

1 See Booz Allen Viewpoint "Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud," http://www.boozallen.com/media/file/Enabling_Cloud_Analytics_with_Data-Level_Security.pdf
Building the Data Lake
One of the biggest limitations of the conventional
approach to data analysis is that analysts often need
to spend the bulk of their time just readying the data
for use. With each new line of inquiry, a specific data
structure and analytic is custom-built. All information
entered into the data structure must first be converted
into a recognizable format, often a slow, painstaking
task. For example, an analyst might be faced with
merging several different data sources that each use
different fields. The analyst must decide which fields to
use and whether entirely new ones need to be created.
The more complex the query, the more data sources
typically must be homogenized. At some organizations,
analysts may spend as much as 80 percent of their time
preparing the data, leaving just 20 percent for conducting
actual analysis. Formatting also carries the risk of data-
entry errors. With the data lake, there are no individual
data structures—and so there is no need for formal
data formatting. Data from a wide range of sources is
smoothly and easily ingested into the data lake.
One metaphor for the data lake might be a giant
collection grid, like a spreadsheet—one with billions
of rows and billions of columns available to hold
data. Each cell of the grid contains a piece of data—a
document, perhaps, or maybe a paragraph or even a
single word from the document. Cells might contain
names, photographs, incident reports, or Twitter
feeds—anything and everything. It does not matter
where in the grid each bit of information is located. It
also makes no difference where the data comes from,
whether it is formatted, or how it might relate to any
other piece of information in the data lake. The data
simply takes its place in the cell, and after minimal
preparation is ready for use.
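To make the grid metaphor concrete, the short Python sketch below is one way to picture it; all names here are hypothetical, not an actual data-lake interface. Arbitrary pieces of data land in whatever cells happen to be open, with no shared schema, and every query can see the whole store:

```python
# A toy illustration of the "grid" metaphor: a schema-free store where each
# cell holds any piece of data, regardless of format or origin. The cell
# coordinates are arbitrary; where a value lands carries no meaning.
# All names are hypothetical, not a real data-lake API.

import itertools

class GridStore:
    """A sparse, schema-free grid: only occupied cells consume space."""

    def __init__(self):
        self._cells = {}                    # (row, col) -> value
        self._next = itertools.count()      # hands out arbitrary open slots

    def ingest(self, value):
        """Place a value in the next open cell; no formatting required."""
        cell = divmod(next(self._next), 1_000_000)   # arbitrary coordinates
        self._cells[cell] = value
        return cell

    def scan(self):
        """Every query can draw on the entire body of stored data at once."""
        return self._cells.values()

grid = GridStore()
grid.ingest({"name": "A. Investor", "balance": 10_500})   # structured record
grid.ingest("Incident report: suspicious activity ...")   # free text
grid.ingest(b"\x89PNG...")                                 # raw image bytes

print(len(list(grid.scan())), "items stored, no shared schema needed")
```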
The image of the grid helps describe the difference
between data mining and the data lake. If we want to
mine precious metals, we have to find where they are,
then dig deep to retrieve them. But imagine if, when the
Earth was formed, nuggets of precious metals were laid
out in a big grid on top of the ground. We could just
walk along, picking up what we wanted. The data lake
makes information just as readily available.
Placing the data in open cells as it
comes in gives the ingest process remarkable speed.
Large amounts of data that might take 3 weeks to
prepare using conventional cloud computing can be
placed into the data lake in as little as 3 hours. This
enables organizations to achieve substantial savings in
IT resources and manpower. Just as important, it frees
analysts for the more important task of finding connec-
tions and value in the data. Many organizations today
are trying to “do more with less.” That is difficult with
the conventional approach, but becomes possible, for
the first time, with the data lake.
Opening Up the Data
The ingest process of the data lake also removes
another disadvantage of the conventional approach—
the need to pre-define our questions. With conventional
computing techniques, we have to know in advance
what kinds of answers we are looking for and where
in the existing data the computer needs to look to
answer the inquiry. Analysts do not really ask questions
of the data—they form hypotheses well in advance of
the actual analysis, and then create data structures and
analytics that will enable them to test those hypotheses.
The only results that come back are the ones that the
custom-made databases and analytics happen to provide.
What makes this exercise even more constraining is
that the data supporting an analysis typically contains
only a portion of the potentially available information.
Because the process of formatting and structuring the
data is so time-intensive, analysts have no choice but
to cull the data by some method. One of the most
prevalent techniques is to discount (and even ignore)
unstructured data. This simplifies the data ingest, but it
severely reduces the value of the data for analysis.
Hampered by these severe limitations, analysts can
pose only narrow questions of the data. And there
is a risk that the data structures will become closed-
loop systems—echo chambers that merely validate the
original hypotheses. When we ask the system what is
important, it points to the data that we happened to put
in. The fact that a particular piece of data is included
in a database tends to make it de facto significant—it is
important only because the hypothesis sees it that way.
With the data lake, data is ingested with a wide-open
view as to the queries that may come later. Because
there are no structures, we can get all of the data in—
all 100 variables, or 500, or any other number, so that
the data in its totality becomes available. Organizations
may have a great deal of data stored in the cloud, but
without the data lake they cannot easily connect it all,
and discover the often-hidden relationships in the world
around us. It is in those relationships that knowledge
and insight—and opportunity—reside.
Tagging the Data
The data lake also radically differs from conven-
tional cloud computing in the way the data itself is
managed. When a piece of data is ingested, certain
details, called metadata (or “data about the data”), are
added so that the basic information can be quickly
located and identified. For example, an investor’s
portfolio balance (the data) might be stored with the
name of the investor, the account number, the location
of the account, the types of investments, the country
the investor lives in, and so on. These metadata “tags”
serve the same purpose as old-style card catalogues,
which allow readers to find a book by searching the
author, title, or subject. As with the card catalogues, tags
enable us to find particular information from a number
of different starting points—but with today’s tagging
abilities, we can characterize data in nearly limitless
ways. The more tags, the more complex and rich the
analytics can become.
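As a rough illustration of tagging at ingest, the hypothetical sketch below (the names and structure are illustrative assumptions, not a real system) stores each piece of data with arbitrary metadata tags and indexes them, so an item can be found from any starting point, much like the card catalogue:

```python
# A sketch of metadata tagging at ingest. Each piece of data is stored with
# arbitrary key/value tags, and an index lets us locate it from any starting
# point: investor, account, country, and so on.

from collections import defaultdict

class TaggedLake:
    def __init__(self):
        self.items = []                      # the data itself
        self.index = defaultdict(set)        # (tag, value) -> item ids

    def ingest(self, data, **tags):
        item_id = len(self.items)
        self.items.append((data, tags))
        for tag, value in tags.items():
            self.index[(tag, value)].add(item_id)
        return item_id

    def find(self, **tags):
        """Return items matching ALL of the given tags."""
        ids = None
        for tag, value in tags.items():
            matched = self.index.get((tag, value), set())
            ids = matched if ids is None else ids & matched
        return [self.items[i] for i in (ids or set())]

lake = TaggedLake()
lake.ingest(10_500, investor="A. Smith", account="123-456",
            country="US", asset_type="equities")
lake.ingest(22_300, investor="B. Jones", account="789-012",
            country="US", asset_type="bonds")

print(lake.find(country="US", asset_type="equities"))
```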
With the tags, we can look for connections and
patterns not only in the data, but in the tags as well.
To consider how this technology might be applied,
imagine if a pharmaceutical company were able to
fully integrate a wide range of public data to identify
drug compounds with few adverse reactions, and a
high likelihood of clinical and commercial success.
Those sources might include social media and market
data—to help determine the need—and clinical test
data, chemical structure, disease analysis, even infor-
mation about patents—to find where gaps might exist.
In a sense, the pharmaceutical company is looking for a
needle in a haystack, a prohibitively expensive and time-
consuming task with conventional cloud computing.
However, if the structured and unstructured data
is appropriately tagged and placed in the data lake, it
becomes cost-effective to find the essential connections
in all that data, and make the needle stand out brightly.
The data lake allows us to ask questions and
search for patterns using either the data itself, the tags
themselves, or a combination of both. We can begin
our search with any piece of data or tag—for example,
a market analysis or the existing patents on a type of
drug—and pivot off of it in any direction to look for
connections.
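Building on the hypothetical TaggedLake sketch above, pivoting can be pictured as starting from any one item, taking one of its tags, and fanning out to everything else that shares that tag's value:

```python
# Continuing the hypothetical TaggedLake sketch: "pivoting" hops from one
# item to every other item sharing one of its tag values. The search can
# begin with the data itself or with any tag.

def pivot(lake, item_id, tag):
    """From one item, hop to all other items sharing the same tag value."""
    _, tags = lake.items[item_id]
    return [(i, lake.items[i])
            for i in lake.index[(tag, tags[tag])]
            if i != item_id]

# Start from A. Smith's record and pivot on the "country" tag to reach
# every other US-tagged item in the lake.
print(pivot(lake, 0, "country"))
```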
While the process of tagging information is not new,
the data lake uses it in a unique way—as the primary
method of locating and managing the data. With the
tags, the rigid data structures that so limit the conven-
tional approach are no longer needed.
Along with the streamlined ingest process, tags help
give the data lake its speed. When organizations need
to update or search the data in new ways, they do not
have to tear down and rebuild data structures, as in the
conventional method. They can simply update the tags
already in place.
Tagging all of the data, and at a much more
granular level than is possible in the conventional cloud
approach, greatly expands the value that big data can
provide. Information in the data lake is not random and
chaotic, but rather is purposeful. The tags help make
the data lake like a viscous medium that holds the data
in place, and at the same time fosters connections.
The tags also provide a strong new layer of security.
We can tag each piece of data, down to the image or
paragraph in a document, with the relevant restrictions,
authorities, and security and privacy levels. Organizations
can establish rules regarding which information can be
shared, with whom, and under what circumstances.
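A minimal sketch of this idea, assuming a simple label model (cell-level visibility labels of the kind found in Apache Accumulo are one real-world expression of the same concept), might look like this:

```python
# A toy model of tag-based, cell-level security: each piece of data carries
# a set of required labels, and a query returns only the cells the caller's
# authorizations satisfy. Labels and data here are purely illustrative.

def visible(required: set, authorizations: set) -> bool:
    """A cell is visible only if the caller holds every required label."""
    return required <= authorizations

cells = [
    ("troop movement report", {"secret", "ops"}),
    ("public press release", set()),            # no restrictions
    ("investor account detail", {"pii"}),
]

def query(authorizations):
    return [data for data, required in cells if visible(required, authorizations)]

print(query({"pii"}))                    # press release + account detail
print(query({"secret", "ops", "pii"}))   # sees everything
```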
A New Way of Storing Data
With the conventional approach, data storage is
expensive—even in the cloud. The reason is that so
much space is wasted. Imagine a spreadsheet combining
two data sources, an original one with 100 fields and the
other with 50. The process of combining means that
we will be adding 50 new “columns” into the original
spreadsheet. Rows from the original will hold no data
for the new columns, and rows from the new source
will hold no data from the original. The result will be a
great deal of empty cells. This is wasted storage space,
and creates the opportunity for a great many errors.
In the data lake, however, every cell is filled—no space
is wasted. This makes it possible to store vast amounts of
data in far less space than would be required for even
relatively small conventional cloud databases. As a result,
the data lake can cost-effectively scale to an organiza-
tion’s growing data, including multiple outside sources.
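The arithmetic behind that saving can be sketched in a few illustrative lines, reusing the field counts from the spreadsheet example above (row counts are made up for the illustration):

```python
# Dense vs. sparse storage for two merged sources: one with 100 fields,
# the other with 50. A dense, merged table reserves a slot for every column
# in every row; a sparse store keeps only the cells that actually hold data.

rows_a, rows_b = 1_000, 1_000        # hypothetical row counts
fields_a, fields_b = 100, 50         # field counts from the example above

# Dense merged table: every row carries all 150 columns.
dense_slots = (rows_a + rows_b) * (fields_a + fields_b)

# Sparse storage: only occupied cells exist.
sparse_slots = rows_a * fields_a + rows_b * fields_b

print(f"dense:  {dense_slots:,} slots")    # 300,000
print(f"sparse: {sparse_slots:,} slots")   # 150,000 -- half would sit empty
```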
The data lake’s almost limitless capacity enables
organizations to store data in a variety of different
forms, to aid in later analysis. A financial institution,
for example, could store records of certain transac-
tions converted into all of the world’s major currencies.
Or, a company could translate every document on
a particular subject into Chinese, and store it until it
might be needed.
One of the more transformative aspects of the
data lake is that it stores every type of data equally—
not just structured and unstructured, but also batch
and streaming. Batch data is typically collected on
an automated basis and then delivered for analysis en
masse—for example, the utility meter readings from
homes. Streaming data is information from a continuous
feed, such as video surveillance.
Formatting unstructured, batch, and streaming data
inevitably strips it of much of its richness. And even if
a portion of the information can be put into a conven-
tional cloud database, we are still constrained by limited,
pre-defined questions. The data lake holds no such
constraints. When unstructured, batch, and streaming
data are ingested, analytics can take advantage of the
tagging approach to begin to look for patterns that
naturally emerge. All types of data, and the value they
hold, now become fully accessible.
The US military is taking advantage of this capability
to help track insurgents and others who are planting
improvised explosive devices (IEDs) and other bombs.
Many of the military's data sources include unstructured
data, and the conventional approach, with its
extensive preparation, proved unwieldy and
time-consuming. With the data lake, the military is
now able to quickly integrate and analyze its vast array
of disparate data sources—including its unstructured
data—giving military commanders unprecedented
situational awareness. This is another example of why
simply amassing large amounts of data does not create
a data lake. The military was collecting an enormous
quantity of data, but without the data lake could not
make full use of it to try to stop IEDs. Commanders
have reported that the current approach—which has the
data lake as its centerpiece—is saving more lives, and at
a lower operating cost than the traditional methods.
Accessing the Data for Analytics
One of the chief drawbacks of the conventional
approach, which the cloud does not ameliorate, is that
it essentially samples the data. When we have questions
(or want to test hypotheses), we select a sample of the
available data and apply analytics to it. The problem is
that we are never quite sure we are pulling the right
sample—that is, whether it is really representative of
the whole. The data lake eliminates sampling. We no
longer have to guess about which data to use, because
we are using it all.
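A toy example, with purely synthetic numbers, illustrates how a pre-selected sample can hide a pattern that the full dataset makes obvious:

```python
# The sampling problem in miniature: an analytic applied to a slice chosen
# by the hypothesis can miss what the whole dataset plainly shows.
# All figures are synthetic and purely illustrative.

transactions = (
    [{"region": "east", "amount": 100}] * 90 +
    [{"region": "west", "amount": 100}] * 10 +
    [{"region": "west", "amount": 5000}] * 5    # the anomaly lives here
)

# Conventional approach: the hypothesis says "east is what matters",
# so only that slice is structured and analyzed.
sample = [t for t in transactions if t["region"] == "east"]
print(max(t["amount"] for t in sample))         # 100 -- anomaly invisible

# Data-lake approach: the analytic runs over all of the data.
print(max(t["amount"] for t in transactions))   # 5000 -- anomaly found
```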
With the data lake, our information is available
for analysis on-demand, when the need arises. The
conventional approach not only requires extensive
data preparation, but it is difficult to change databases
as queries change. Say the pharmaceutical company
wants to add new data sources to identify promising
drug compounds, or perhaps wants to change the type
of financial analyses it uses. With the conventional
approach, analysts would have to tear down the initial
data and analytics structures, and re-engineer new ones.
With the data lake, analysts would simply add the new
data, and ask the new questions.
Because it is not easy to change conventional data
structures, the information they contain can become
outdated and even obsolete fairly quickly. By contrast,
we are able to add new information to the data lake the
moment we need it.
This ease of access sets the stage for the
advanced, high-powered analytics that can point the
way to top-line business growth, and help government
achieve its goals in innovative ways. Analytics that search
for connections and look for patterns have long been
hamstrung by being confined to limited, rigid datasets
and databases. The data lake frees them to search for
knowledge and insight across all of the data. In essence,
it allows the analytics, for the first time, to reach their
true potential.
Because there is no need to continually engineer and
re-engineer data structures, the data lake also becomes
accessible to non-technical subject matter experts.
They no longer need to rely on computer scientists and
others to explore the data—they can ask the questions
themselves. Subject matter experts best understand
how big data can provide value to their businesses and
agencies. The data lake helps put the answers directly in
their hands.
A New Mindset
Virtually every aspect of the data lake creates cost
savings and efficiencies, from freeing up analysts to its
ability to easily and inexpensively scale to an organi-
zation’s growing data. Because the data lake enables
organizations to gather and analyze ever-greater
amounts of data, it also gives them new opportunities
for top-line revenue growth. The data lake enables both
business and government to reach that tipping point at
which data helps us to do things not just cheaper and
better, but in ways we have not yet imagined.
Organizations may believe that because they are
now in the cloud and can put all their data in one place,
they already have a version of the data lake. But greater
amounts of data—no matter how large—will not
necessarily yield more knowledge and insight. The trick
is to connect the data and make it useful—essentially,
to create the kinds of conditions that can turn big data
into opportunity. The data lake and the larger Cloud
Analytics Reference Architecture represent a revolu-
tionary approach—and a new mindset—that make
those conditions possible. Opportunity is out there, if
we have the tools to look for it.
FOR MORE INFORMATION
Mark Herman
herman_mark@bah.com
703-902-5986
Michael Delurey
delurey_mike@bah.com
703-902-6858
www.boozallen.com/cloud
This document is part of a collection of papers developed by Booz Allen Hamilton to introduce new concepts and ideas spanning cloud
solutions, challenges, and opportunities across government and business. For media inquiries or more information on reproducing this
document, please contact:
James Fisher—Senior Manager, Media Relations, 703-377-7595, fisher_james_w@bah.com
Carrie Lake—Manager, Media Relations, 703-377-7785, lake_carrie@bah.com
12.032.12P