Turning Big Data into Opportunity: The Data Lake
Table of Contents
Introduction
A New Mindset
Ingesting Data into the Data Lake
Opening Up the Data
Tagging the Data
A New Way of Storing Data
Accessing the Data for Analytics
Bottom-Line Savings and Top-Line Growth
Introduction
Big data by itself does not create opportunity. The most
successful, competitive organizations will be the ones
with the ability to turn that data into game-changing
paths to new kinds of value, cost savings, revenue
growth, and operational effectiveness.
Organizations today are amassing so much information
so quickly that they are reaching a tipping point. They
are gaining remarkable potential to use big data in
new ways, and redefine the very nature of how they do
business. Yet none of this is guaranteed. Current tools
cannot easily integrate disparate data collections, or
fully use the kinds of “unstructured” data—such as
photographs, doctors’ examination notes, and social
media posts—that hold the most promise.
The bigger that data gets, the more impractical these
tools become, in terms of time, cost, and analytic
ability. The conventional approaches have, in effect,
created a glass ceiling with big data. Organizations
may be able to envision new opportunities for growth
and effectiveness, and yet have no method of reaching
them. There is, however, a way around the glass ceiling.
Booz Allen Hamilton has developed a revolutionary
approach known as the “data lake” that removes the
current constraints.
With the data lake, an organization’s repository of
information—structured and unstructured, along with
streaming and batch data—is consolidated in a single,
large table. The entire body of information in the data
lake is available for every inquiry, and all at once—a
capability that can create powerful new knowledge and
insight. And because the data lake simplifies virtually
every aspect of the loading, storing, and accessing
of data, it provides business and government with
substantial cost savings and efficiencies.
The data lake is now being used in a wide range of
business and government applications. For example, it
is helping a pharmaceutical company bring to market
successful new drug compounds up to three times
faster than was previously possible. It is enabling
hospitals to more quickly identify and treat life-
threatening infections. And it is helping the US military
integrate its intelligence sources to track insurgents
and others who are planting improvised explosive
devices (IEDs).
In these and other instances, the data lake is creating
the kinds of opportunities that would have been prohibitively
expensive and time-consuming with conventional tools.
Instead of being left behind by big data, organizations
are now using it to compete and win in our digitally
enabled economy.
A New Mindset
Many organizations are now collecting large amounts
of data in the cloud. But the data lake is an entirely
different model. It does not just bring data together;
it helps connect and integrate the data so that its full
value can be realized. Even in the cloud, data is stored
in rigid, regimented data structures—essentially data
silos—that are difficult to connect, limiting our ability to
see the big picture. Despite its promise to revolutionize
data analysis, the cloud does not truly integrate data—
it simply makes the data silos taller and fatter.
The data lake is not an incremental advance, but rather
represents a completely new mindset. Big data requires
organizations to stop thinking in terms of data mining
and data warehouses—the equivalent of industrial
processes—and to consider how data can be more fluid
and expansive, like in a data lake.
Organizations may be concerned that by consolidating
and connecting their data, they might be making it more
vulnerable. Just the opposite is true. The data lake
incorporates a new, granular level of security and privacy
that is not available with conventional techniques.1
1 See the Booz Allen Viewpoint, “Enabling Cloud Analytics with Data-Level Security: Tapping the Full Value of Big Data and the Cloud,” http://www.boozallen.com/media/file/Enabling_Cloud_Analytics_with_Data-Level_Security.pdf

Ingesting Data into the Data Lake
As with much of the conventional approach, the process
of preparing the data for analysis, known as extract/
transform/load (ETL), tends to be highly inefficient in
terms of the resources used. At many organizations,
analysts may spend as much as 80 percent of their
time preparing the data, leaving just 20 percent for
conducting actual analysis. The reason is that with
each new line of inquiry, a specific data structure
and analytic is custom-built. All information entered
into the data structure must first be converted into a
recognizable format, often a slow, painstaking task.
For example, an analyst might be faced with merging
several different data sources that each use different
fields. The analyst must decide which fields to use
and whether new ones need to be created. The more
complex the query, the more data sources that typically
must be homogenized. Formatting also carries the risk
of data-entry errors. By contrast, data from a wide
range of sources is smoothly and easily ingested into
the data lake.
More importantly, there are no requirements for rigid
data structures—and so no need for formal data
formatting as the information is loaded. In a data lake,
indexing is not done en masse at the time of ingestion,
which is a time-consuming part of the traditional ETL
process. Instead, indices and relationships can be
derived over time to enrich the information base, and
applied at query time to create “views” tailored to the
needs of a specific analysis, reducing the time needed
to operationalize the data.
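To make this “schema-on-read” idea concrete, here is a minimal Python sketch of loading records without any up-front formatting and deriving a view only when a question is asked. The in-memory list, record fields, and function names are illustrative assumptions, not the data lake's actual implementation.

```python
# A sketch of "schema-on-read" ingestion: records are stored exactly as
# they arrive, and structure is imposed only when a question is asked.

lake = []  # hypothetical in-memory stand-in for the data lake

def ingest(record):
    """Load a record as-is: no schema, no formatting, no mass indexing."""
    lake.append(record)

# Heterogeneous sources load without reconciling their fields first.
ingest({"patient": "A-1001", "note": "fever, elevated white cell count"})
ingest({"account": "7734", "balance": 125000, "currency": "USD"})
ingest({"tweet": "new compound shows promise", "lang": "en"})

def view(predicate, fields):
    """Derive a tailored 'view' at analysis time, not at ingest time."""
    return [{f: r.get(f) for f in fields} for r in lake if predicate(r)]

# Only now, when the question arrives, is any structure applied.
print(view(lambda r: "balance" in r, ["account", "balance"]))
# -> [{'account': '7734', 'balance': 125000}]
```

The structural work happens once per question, against data that was loaded untouched, rather than once per source at ingest time.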
The data lake might be thought of as a giant collection
grid, like a spreadsheet—with billions of rows and
billions of columns available to hold data. Each cell
of the grid contains a piece of data—a document,
perhaps, or maybe a paragraph, or even a single
word from the document. Cells might contain names,
photographs, incident reports, or Twitter feeds—
anything and everything. It does not matter where in the
grid each bit of information is located. It also makes
no difference where the data comes from, whether it is
formatted, or how it might relate to any other piece of
information in the data lake. The data simply takes its
place in the cell, and after only minimal preparation by
analysts, is ready for use.
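The grid can be pictured as a sparse table in which only occupied cells are stored, each keyed by a row and column address. The short Python sketch below illustrates the idea under that assumption; the cell addresses and contents are invented for illustration.

```python
# A sketch of the "giant grid": a sparse table in which each occupied
# cell is keyed by a (row, column) address. Addresses and contents here
# are invented for illustration.

grid = {}  # (row_id, column_name) -> one piece of data

def put(row_id, column, value):
    """Place a piece of data in a cell; location carries no meaning."""
    grid[(row_id, column)] = value

# Anything can land in any cell: documents, paragraphs, images, posts.
put("doc-42", "paragraph:3", "Patient presented with signs of sepsis.")
put("img-17", "photo", b"...image bytes...")
put("evt-99", "tweet", "outage reported downtown")

# Retrieval needs only the cell address; it does not matter where the
# data came from or how it relates to anything else in the grid.
print(grid[("doc-42", "paragraph:3")])
```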
The image of the grid helps describe the difference
between data mining and the data lake. If we want to
mine precious metals, we have to find where they are,
then dig deep to retrieve them. But imagine if, when
the Earth was formed, nuggets of precious metals were
laid out in a big grid on top of the ground. We could just
walk along, picking up what we wanted. The data lake
makes information just as readily available.
The process of placing the data in open cells as it
comes in gives the ingest process remarkable speed.
Large amounts of data that might take three weeks to
prepare using conventional cloud computing can be
placed into the data lake in as little as three hours. This
enables organizations to achieve substantial savings
in IT resources and manpower. Just as important, it
frees analysts for the more important task of finding
connections and value in the data. Many organizations
today are trying to “do more with less.” That is difficult
with the conventional approach, but becomes possible,
for the first time, with the data lake.
Opening Up the Data
The ingest process of the data lake also removes
another disadvantage of the conventional approach—
the need to pre-define our questions. With conventional
computing techniques, we have to know in advance
what kinds of answers we are looking for and where
in the existing data the computer needs to look to
answer the inquiry. Analysts do not really ask questions
of the data—they form hypotheses well in advance of
the actual analysis, and then create data structures
and analytics that will enable them to test those
hypotheses. The only results that come back are the
ones that the custom-made databases and analytics
happen to provide.
What makes this exercise even more constraining is
that the data supporting an analysis typically contains
only a portion of the potentially available information.
Because the process of formatting and structuring
the data is so time-intensive, analysts have no choice
but to cull the data by some method. One of the most
prevalent techniques is to discount (and even ignore)
unstructured data. This simplifies the data ingest, but it
severely reduces the value of the data for analysis.
Hampered by these severe limitations, analysts can
pose only narrow questions of the data. And there is a
risk that the data structures will become closed-loop
systems—echo chambers that merely validate the
original hypotheses. When we ask the system what is
important, it points to the data that we happened to put
in. The fact that a particular piece of data is included in
a database tends to make it de facto significant—it is
important only because the hypothesis sees it that way.
With the data lake, data is ingested with a wide-open
view as to the queries that may come later. Because
there are no structures, we can get all of the data in—
all 100 variables, or 500, or any other number, so that
the data in its totality becomes available. Organizations
may have a great deal of data stored in the cloud, but
without the data lake they cannot easily connect it all,
and discover the often-hidden relationships in the world
around us. It is in those relationships that knowledge
and insight—and opportunity—reside.
Tagging the Data
The data lake also provides organizations with value in
the way the data itself is managed. When a piece of
data is ingested, certain details, called metadata (or
“data about the data”), are added so that the basic
information can be quickly located and identified. For
example, an investor’s portfolio balance (the data)
might be stored with the name of the investor, the
account number, the location of the account, the types
of investments, the country the investor lives in, and so
on. These metadata “tags” serve the same purpose as
old-style card catalogues, which allow readers to find a
book by author, title, or subject. As with
the card catalogues, tags enable us to find particular
information from a number of different starting points—
but with today’s tagging abilities, we can characterize
data in nearly limitless ways. The more tags, the more
complex and rich the analytics can become.
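A minimal Python sketch of tagging at ingest might look like the following, assuming a simple in-memory inverted index from tags to cells; the field names and the find helper are hypothetical.

```python
# A sketch of metadata tagging at ingest, with a simple inverted index
# from tags to cells so data can be found from any starting point.

from collections import defaultdict

cells = {}                    # cell_id -> the data itself
tag_index = defaultdict(set)  # (tag key, tag value) -> matching cell ids

def ingest(cell_id, data, tags):
    """Store the data and index every tag attached to it."""
    cells[cell_id] = data
    for key, value in tags.items():
        tag_index[(key, value)].add(cell_id)

ingest("bal-001", 250000,
       {"investor": "J. Doe", "account": "7734",
        "country": "US", "asset_class": "equities"})

def find(**tags):
    """Locate data by any combination of tags (the card-catalogue idea)."""
    ids = set.intersection(*(tag_index[item] for item in tags.items()))
    return {cell_id: cells[cell_id] for cell_id in ids}

print(find(country="US", asset_class="equities"))
# -> {'bal-001': 250000}
```

Because every tag is indexed, a search can begin from any tag and pivot to others, which is the behavior the next paragraphs describe.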
With the tags, we can look for connections and patterns
not only in the data, but in the tags themselves. As an
example of how this technology can be applied, tags
were used to help a major pharmaceutical company find
connections in a wide range of public data sources to
identify drug compounds with few adverse reactions,
and a high likelihood of clinical and commercial
success. Those sources included market and social
media data (to help determine the need) as well as
data on clinical development, structural analysis,
disease structures, and patents (to determine where
there might be a gap). Data from those sources
was tagged and ingested into a data lake, enabling the
pharmaceutical company to identify the most promising
compounds. With conventional techniques, those
compounds would have been needles in a haystack, but
tags and the data lake help them stand out brightly.
The data lake allows us to ask questions and search
for patterns using the data itself, the tags
themselves, or a combination of the two. We can begin our
search with any piece of data or tag—for example, a
market analysis or the existing patents on a type
of drug—and pivot off of it in any direction to look
for connections.
While the process of tagging information is not new,
the data lake uses it in a unique way—as the primary
method of locating and managing the data. With
the tags, the rigid data structures that so limit the
conventional approach are no longer needed.
Along with the streamlined ingest process, tags help
give the data lake its speed. When organizations need
to update or search the data in new ways, they do not
have to tear down and rebuild data structures, as in the
conventional method. They can simply update the tags
already in place.
Tagging all of the data, and at a much more granular
level than is possible in the conventional cloud
approach, greatly expands the value that big data can
provide. Information in the data lake is not random and
chaotic, but rather is purposeful. The tags help make
the data lake like a viscous medium that holds the data
in place, and at the same time fosters connections.
The tags also provide a strong new layer of security.
We can tag each piece of data, down to the image
or paragraph in a document, with the relevant
restrictions, authorities, and security and privacy
levels. Organizations can establish rules regarding
which information can be shared, with whom, and
under what circumstances. With the conventional
approach, the primary obstacle to information sharing
is not technology, but rather the concern that secure
information will be compromised. The data lake,
by contrast, makes it possible for business and
government organizations to easily share information,
confident that security, privacy, and other rules
governing the data will be strictly maintained. The
data lake's security model has been proven in highly
secure environments within the US government,
where the highest levels of precision in security
and privacy are required.
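As a rough illustration of cell-level security tags, consider the following Python sketch; the roles, cell identifiers, and rule format are invented for illustration and are not the actual government-grade mechanism.

```python
# A rough illustration of cell-level security: each piece of data
# carries its own access rules, checked at read time. Roles and rule
# format are invented; this is not the actual security mechanism.

cells = {
    "rpt-7:para-2": {"data": "incident summary...",
                     "visibility": {"analyst", "commander"}},
    "rpt-7:photo-1": {"data": b"...image bytes...",
                      "visibility": {"commander"}},
}

def read(cell_id, user_roles):
    """Return the data only if the reader holds a permitted role."""
    cell = cells[cell_id]
    if user_roles & cell["visibility"]:
        return cell["data"]
    raise PermissionError(f"access to {cell_id} denied")

print(read("rpt-7:para-2", {"analyst"}))  # allowed
# read("rpt-7:photo-1", {"analyst"})      # raises PermissionError
```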
A New Way of Storing Data
With the conventional approach, data storage is
expensive—even in the cloud. The reason is that
so much space is wasted. Imagine a spreadsheet
combining two data sources, one with 100 fields and
the other with 50. Combining them means adding 50
new “columns” to the original spreadsheet. Rows from
the original will hold no data in the new columns, and
rows from the new source will hold no data in the
original columns. The result is a great many empty
cells, which waste storage space and create
opportunities for error.
In the data lake, however, every cell is filled—no
space is wasted. This makes it possible to store
vast amounts of data in far less space than would
be required for even relatively small conventional
cloud databases. With the conventional approach,
organizations must continually reinvest in infrastructure
as analytic needs change. Connecting the data silos,
for example, typically requires reconfiguring and even
expanding the infrastructure. But with the data lake, the
infrastructure becomes a stable platform. Organizations
do not need to continually rebuild and reconfigure their
infrastructure. Their initial investment in infrastructure is
both enduring and cost-effective.
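The storage arithmetic from the spreadsheet example above can be made explicit. This short Python calculation, using assumed row counts, compares a dense merge of the 100-field and 50-field sources with sparse, filled-cells-only storage.

```python
# The spreadsheet example, made explicit with assumed row counts:
# a dense merge stores every row against every column, while sparse
# storage keeps only the cells that actually hold data.

rows_a, fields_a = 1_000_000, 100  # original source
rows_b, fields_b = 1_000_000, 50   # newly merged source

# Dense merge: all rows carry all 150 columns, empty or not.
dense_cells = (rows_a + rows_b) * (fields_a + fields_b)

# Sparse, filled-cells-only storage: no empty cells exist at all.
sparse_cells = rows_a * fields_a + rows_b * fields_b

print(f"dense:  {dense_cells:,} cells")   # dense:  300,000,000 cells
print(f"sparse: {sparse_cells:,} cells")  # sparse: 150,000,000 cells
```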
The data lake’s almost limitless capacity also
enables organizations to store data in a variety of
different forms, to aid in later analysis. A financial
institution, for example, could store records of certain
transactions converted into all of the world’s major
currencies. Or, a company could translate every
document on a particular subject into Chinese, and
store it until it might be needed.
One of the more transformative aspects of the data
lake is that it stores every type of data equally—not
just structured and unstructured, but also batch
and streaming. Batch data is typically collected on
an automated basis and then delivered for analysis
en masse—for example, the utility meter readings
from homes. Streaming data is information from a
continuous feed, such as video surveillance.
Formatting unstructured, batch, and streaming data
inevitably strips it of much of its richness. And even
if a portion of the information can be put into a
conventional cloud database, we are still constrained
by limited, pre-defined questions. The data lake
imposes no such constraints. When unstructured, batch,
and streaming data are ingested, analytics can take
advantage of the tagging approach to begin to look for
patterns that naturally emerge. All types of data, and
the value they hold, now become fully accessible.
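One way to picture treating batch and streaming data equally is that both reduce to a sequence of records feeding the same ingest path. Below is a minimal Python sketch under that assumption; the meter readings and video-frame generator are invented examples.

```python
# A sketch of batch and streaming data sharing one ingest path: both
# reduce to a sequence of records; only the delivery cadence differs.

import itertools

def ingest(record):
    print("stored:", record)  # stand-in for placing the record in a cell

# Batch: utility meter readings delivered en masse.
batch = [{"meter": "H-220", "kwh": 31.4},
         {"meter": "H-221", "kwh": 28.9}]

# Streaming: a continuous feed, modeled here as an endless generator.
def video_feed():
    frame = 0
    while True:
        frame += 1
        yield {"camera": "C-7", "frame": frame}

for record in batch:                               # batch path
    ingest(record)
for record in itertools.islice(video_feed(), 3):   # streaming path
    ingest(record)
```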
The US military is taking advantage of this capability
to help track insurgents and others who are planting
improvised explosive devices (IEDs) and other bombs.
Many of the military’s data sources include unstructured
data, and using the conventional approach—with
its extensive preparation—had proved unwieldy and
time-consuming. With the data lake, the military is now
able to quickly integrate and analyze its vast array of
disparate data sources—including its unstructured
data—giving military commanders unprecedented
situational awareness. This is another example of why
simply amassing large amounts of data does not create
a data lake. The military was collecting an enormous
quantity of data, but without the data lake could not
make full use of it to try to stop IEDs. Commanders
have reported that the current approach—which has the
data lake as its centerpiece—is saving more lives, and
at a lower operating cost than the traditional methods.
Accessing the Data for Analytics
One of the chief drawbacks of the conventional
approach, which the cloud does not ameliorate, is
that it essentially samples the data. When we have
questions (or want to test hypotheses), we select a
sample of the available data and apply analytics to it.
The problem is that we are never quite sure we are
pulling the right sample—that is, whether it is really
representative of the whole. The data lake eliminates
sampling. We no longer have to guess about which data
to use, because we are using it all.
With the data lake, our information is available for
analysis on demand, when the need arises. The
conventional approach not only requires extensive data
preparation; it also makes databases difficult to change
as queries change. Say the pharmaceutical company
wants to add new data sources to identify promising
drug compounds, or perhaps wants to change the type
of financial analyses it uses. With the conventional
approach, analysts would have to tear down the initial
data and analytics structures, and re-engineer new
ones. With the data lake, analysts would simply add the
new data, and ask the new questions.
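A small Python sketch can illustrate this on-demand quality: a new source is simply appended, and a brand-new question is posed against all of the data, with no structures to tear down. The sources, fields, and the promising helper are hypothetical.

```python
# A sketch of on-demand analysis: a new source is appended with no
# re-engineering, and a brand-new question runs against all the data.

lake = [
    {"source": "trials", "compound": "BX-12", "adverse_events": 2},
    {"source": "market", "compound": "BX-12", "demand_score": 0.9},
]

# A new data source arrives later: just add it.
lake.append({"source": "patents", "compound": "BX-12", "gap": True})

def promising(compound):
    """A new question, posed without tearing down any structures."""
    records = [r for r in lake if r.get("compound") == compound]
    return (any(r.get("gap") for r in records) and
            all(r.get("adverse_events", 0) < 5 for r in records))

print(promising("BX-12"))  # -> True
```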
Because there is no need to continually engineer
and re-engineer data structures, the data lake also
becomes accessible to non-technical subject matter
experts. They no longer need to rely on computer
scientists and others to explore the data—they can ask
the questions themselves. Subject matter experts best
understand the needs and goals of their organizations,
and the data lake helps make it possible for them to
identify where a specific opportunity may lie. This might
entail pinpointing a promising area for revenue growth
that has been overlooked by competitors, or finding
ways to execute a government agency’s mission faster
and more effectively, as in the military’s search for
insurgents and IEDs.
The data lake sets the stage for the advanced, high-
powered analytics that can point the way to top-line
business growth, and help government agencies
achieve their mission goals in better ways. Analytics
that search for connections and look for patterns have
long been hamstrung by being confined to limited, rigid
datasets and databases. The data lake frees them to
search for knowledge and insight across all of the data.
In essence, it allows the analytics, for the first time, to
reach their true potential.
A version of the data lake, for example, helped
researchers from Booz Allen and a large hospital chain
in the Midwest gain surprising insights into severe
sepsis and septic shock, life-threatening conditions
brought on by serious infection. Using the data lake,
researchers consolidated the electronic medical records
of tens of thousands of past patients with sepsis, and
found unexpected patterns in how their conditions
progressed. Those insights prompted the hospital chain
to begin a program to more quickly identify and treat
current sepsis patients. The program was credited with
saving nearly 100 lives during just its first nine months.
Bottom-Line Savings and Top-Line Growth
Virtually every aspect of the data lake creates cost
savings and efficiencies, from freeing up analysts
to the ability to easily and inexpensively scale to an
organization’s growing data. While the conventional
methods have worked in the past, they are simply too
costly and cumbersome in the age of big data. The data
lake gives organizations a reset, in a sense, allowing
them to distribute their resources to obtain optimal
efficiency and effectiveness. That is critical in today’s
economic climate. Organizations can address budgetary
constraints while significantly expanding, rather than
limiting, their data analysis.
At the same time, the data lake helps organizations
to reach and then exploit the tipping point of
opportunity. Ultimately, the real value of big data lies
in big analytics—the capacity to help us do things
not just cheaper and better, but in ways we have not
yet imagined. For government, this can mean new
paradigms for mission success. For business, it can
show the way to entire new areas of revenue growth.
As big data grows even larger in the coming years, it
will increasingly be used by organizations to differentiate
themselves and compete in the marketplace. The
winners will be the ones with the greatest ability to
extract knowledge and insight from that data, and use it
to remake their futures. The data lake opens that door.
About Booz Allen Hamilton
Booz Allen Hamilton has been at the forefront of
strategy and technology consulting for nearly a century.
Today, Booz Allen is a leading provider of management
and technology consulting services to the US
government in defense, intelligence, and civil markets,
and to major corporations, institutions, and not-for-
profit organizations. In the commercial sector, the firm
focuses on leveraging its existing expertise for clients in
the financial services, healthcare, and energy markets,
and for international clients in the Middle East. Booz
Allen offers clients deep functional knowledge spanning
strategy and organization, engineering and operations,
technology, and analytics—which it combines with
specialized expertise in clients’ mission and domain
areas to help solve their toughest problems.
The firm’s management consulting heritage is the
basis for its unique collaborative culture and operating
model, enabling Booz Allen to anticipate needs and
opportunities, rapidly deploy talent and resources, and
deliver enduring results. By combining a consultant’s
problem-solving orientation with deep technical
knowledge and strong execution, Booz Allen helps
clients achieve success in their most critical missions—
as evidenced by the firm’s many client relationships that
span decades. Booz Allen helps shape thinking and
prepare for future developments in areas of national
importance, including cybersecurity, homeland security,
healthcare, and information technology.
Booz Allen is headquartered in McLean, Virginia,
employs approximately 25,000 people, and had revenue
of $5.86 billion for the 12 months ended March 31,
2012. For over a decade, Booz Allen’s high standing
as a business and an employer has been recognized
by dozens of organizations and publications, including
Fortune, Working Mother, G.I. Jobs, and DiversityInc.
More information is available at www.boozallen.com.
(NYSE: BAH)
Contacts

Mark Herman
Executive Vice President
herman_mark@bah.com
703-902-5986
Michael Delurey
Principal
delurey_mike@bah.com
703-902-6858