DecisionLab.Net
business intelligence is business performance
DecisionLab http://www.decisionlab.net dupton@decisionlab.net
Carlsbad, California, USA
Lean Data Warehouse via Data Vault
Lean Data Warehouse via Data Vault
Written by Daniel Upton
Data Warehouse Architect
Certified ScrumMaster
DecisionLab.Net
Business Intelligence is Business Performance
dupton@decisionlab.net
linkedin.com/in/DanielUpton
Lean Data Warehouse™ is a trademark of Daniel C. Upton. All uses of it throughout this writing are protected by trademark law.
Without my (the writer’s) explicit written permission in advance, the only permissible reproduction or copying of this trademark, or of any
of this written material, is in the form of a review or a brief reference to a specific concept herein, either of which must clearly specify
this writing’s title, author (me), and this web address http://www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault . For
permission to reproduce or copy any of this material other than what is specified above, just email me at the above address.
Data Vault is a term, unregistered as far as I know, which I believe was coined by Daniel Linstedt, and is a central term in books by
Daniel Linstedt, Kent Graziano and Hans Hultgren. References to these books can be found at the end of this writing. I am grateful to
these authors for the deeply insightful concepts expressed therein, and I make no attempt to co-opt the term Data Vault or to claim any
authorship of it. Many of the concepts I learned – from these books or from related social media – are reflected here, but this material
was not reproduced or copied from any other material. All aforementioned concepts are expressed exclusively in my own words and
illustrated using my own data and database. I have attempted to give credit to others where it is due.
Lean Data Warehouse via Data Vault
Daniel Upton
Objective: Interpret the principles of Lean as applied to Data Warehousing, describe the challenges associated with changes in the business, and then present
scenarios in which three data warehouse modelling methods must adapt to accommodate those changes, with varying degrees of difficulty. The three data
modelling methods are 3rd Normal Form, Star Schema (Dimensional) and Data Vault (Ensemble Model). For reference materials, see the end of this article.
Takeaway: This piece demonstrates that Data Vault, being more easily adaptive to real-world changes, as well as to the continually imperfect knowledge within a
business, is the most Lean methodology of the three. This does not mean that either of the other two mainstream methods is without value, but only that
Data Vault’s strengths validate its place alongside them in the field of data warehousing.
Bottom Line: A data warehouse that does not enjoy these Lean strengths – whether built around a traditional data warehouse design best practice or not – will
suffer from inflexibility and a high cost to support and extend over its lifecycle.
Lean Principles as Applied to Data Warehousing
 Focus on the Customer and Eliminate Waste (‘muda’ in Japanese)
o Differentiate efforts based on demand (business requirements) vs. supply (availability and characteristics of source data).
o Data Presentation Layer (downstream): Deliver what is demanded, no less, no more.
o Data Management Layer (upstream): Based on the selection of available source data and its potential for a basic integration (as will soon be
described using Data Vault), model and load that data, no less, no more.
o How Much Source Data is Enough?: Both Lean and Agile principles emphasize the importance of ‘just enough’ thinking: just enough analysis,
design, development, testing, deployment, training, and support. Having said that, ‘just enough’ can be difficult to determine. A Data Vault model,
due to its modular design and associated loading logic, is more easily chunked out into small independent deliverables than traditional data
warehouse models are, so that project managers and leaders have more options in deciding on the scope and frequency of potentially shippable
increments (PSIs). Having said that, it is also true that choosing to integrate all, or at least all eventually desired, database tables, fields and records
from each relevant source system into a data management layer, such as Data Vault, is a responsible choice, because…
 The potential for this basic integration is highest when all relevant source data is integrated.
 A basic integration, such as Data Vault, is quicker to accomplish than an advanced integration attempting a single version of the truth
(SVOT) for all attributes.
 Excessive task-switching is a well-known source of waste (muda) needing elimination. As such, allowing engineers to complete the design
and loading of a basic integration in its entirety -- while their mental energy is directed to a specific logical set of data subject areas -- is
sensible.
 Historical changes of chosen tables will be tracked as soon as this layer goes into production, and excluding specific tables probably means
leaving holes in the otherwise historically tracked source data.
 Multiple simultaneous development tracks are easily established in order to efficiently put more resources onto a focused initiative to
instantiate the data warehouse.
 Plan for Change: Design-in…
o Loose Coupling: Little or no dependency between objects, so that change in one entity does not compromise a related entity. Loose coupling is
robust. Tight coupling is fragile. Data Vault establishes loose coupling among entities.
o High Cohesion: To support loose coupling, each entity is focused, self-sufficient, and not overlapping with related entities. Data Vault does this, too.
 Optimize the Whole
o Be willing to sub-optimize components of a data model if necessary to optimize the scalability and durability of the model as a whole.
 How? Add associative tables to break entity dependencies, thereby allowing actual data records, not a fixed data model, to dictate
relative cardinality between related entities.
 Automate Processes and Deliver Fast
o Establish and re-use generic patterns to quickly, loosely integrate keys from source tables, without YET performing any data re-interpretation; a minimal sketch of such a pattern follows.
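As a concrete illustration of such a pattern, here is a minimal T-SQL sketch of a re-usable hub-load, using the Student example developed later in this piece. All object names here (stage.Student, Hub_Student, StudentBK) are illustrative assumptions, not prescribed Data Vault names.

-- Generic hub-load pattern (sketch): capture only business keys not yet
-- present in the hub; no attribute re-interpretation happens here.
INSERT INTO Hub_Student (StudentBK, LoadDate, RecordSource)
SELECT DISTINCT src.StudentID, GETDATE(), 'StudentRecords'
FROM stage.Student AS src
WHERE NOT EXISTS (SELECT 1
                  FROM Hub_Student AS h
                  WHERE h.StudentBK = src.StudentID);

Because the same statement shape repeats for every hub, it is easily templated and re-used, which is exactly the automation opportunity described above.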
In this piece, I will provide a simple example of the purely data-related challenges typical in the lifecycle of real-world Data Warehouses, interpreted here as
failures to adhere to Lean principles. As routine as this type of change is, one of the big challenges with traditional data warehousing is that these simple data
changes cause large structural problems in a traditional, rigidly integrated, data warehouse. Although process-related challenges, of course, also exist, many of
the toughest ones are actually just challenging accommodations to these same real-world data challenges, large or small.
High-Level Summary of Data Vault Modeling Fundamentals and Resulting Value Proposition
o Note: Business Data Vault, not Raw Data Vault, will be described here. Whereas Raw Vault simply de-constructs each source entity and entity-
relationship into a Hub, Satellite and Link, Business Vault establishes a stronger basic integration by isolating and managing business keys as unique
records, which will be described immediately below.
 Generally, in a Business Data Vault, tables from selected source systems end up in the Business Vault de-constructed into Hubs, Satellites and Links, with all
attributes from multiple source entities within and across source systems now aligned around a common unique business key (Hub).
 Hub: A Hub table manages a unique business key, defined as a key (one or more data elements / fields) which is meaningful not just within a
single system, and not just to a database engineer, but also to the business people familiar with that business process. As an example, in Healthcare, an NPI
(National Provider Identifier) code, CPT (Current Procedural Terminology) code, or Lab Accession Code all have clear meaning to healthcare business people,
whereas a surrogate key or system key is typically meaningful only to the database engineer. Although a Hub includes a standard set of additional fields, as
will be shown below, they are all intended simply to manage the stored business key. (A DDL sketch of all three table types follows this list of definitions.)
 Satellite: A Satellite table, which is always dependent on exactly one Hub, contains the payload of attribute fields which shared a table with the business key
in the source system, but which are now stored separately in order to track attribute changes, essentially providing Type 2 slowly changing dimension
(herein SCD2) functionality without duplicating the Hub’s business key.
 Ensemble: In the common situation in which multiple source tables, either within or across source systems, all share the granularity of the same Business
Key, those tables may end up in Data Vault as multiple Satellites, all dependents of the same single Hub table. This set of one Hub and one or more
Satellites is an Ensemble.
 Link: Link tables relate two or more Hubs together as an association. In its simplest form, a Link is an associative table that joins what, in the source
system, were a dependent table and a parent table, whether or not that relationship was enforced with a foreign key constraint. As an associative table, the
Link gracefully handles not only the expected, one might say desirable, one-to-many relationship among actual data records from the two tables, but also any
real-world many-to-many conditions in the relationship between actual records in the two source tables. As such, the Link affords a looser, more robust
form of referential integrity between Hubs, but importantly, does so without making either Hub actually dependent on the other, since Hubs are
never directly dependent on another table, only related via associative Links.
 Object Naming: The names of source system tables and fields all remain unchanged, except insofar as tables are usually appended with “_hub”, “_sat”, or
“_lnk” suffixes, and additional surrogate keys, date stamps and audit fields are added. As such, the names of all business keys and attribute fields remain
unchanged from the source system. This transparency from source to data vault is consistent with the principle that, at the data vault stage, “interpretive”
data transformations, such as calculation, aggregation, filtering and grouping, have not YET occurred. In fact, those processes occur immediately downstream of
data vault.
Pros, Cons and Disclaimers based on Data Vault fundamentals and from multiple in-the-trench implementation experiences
 Disclaimers: Data Vault…
o Does not solve unsolvable challenges of data integration, such as where two systems lack a set of sufficiently match-able keys with which to achieve a
satisfactory percentage of record matches.
o Does not eliminate the need for data profiling, data quality processing, or the analytics/reporting-driven (conforming) transformation of non-key
attributes, perhaps achieving an SVOT; it merely defers that phase until immediately after the data is landed, loosely integrated, and historized.
o Does not eliminate, only defers, the decision-making associated with interpretive data transformations (i.e. object renaming, selection of one source
field instead of another, calculation, aggregation, grouping, filtering, and data quality processing).
 Cons
o Learning Curve: At first, the Hub, Satellite, Link modeling pattern is unfamiliar and potentially intimidating. As such, it could be a source of
dissatisfaction to a DW team member who rejects the method, whether or not it is well understood.
o Adds tables, and thus adds the joins required for downstream ETL queries. Those who want to avoid a (non-presentational) data management layer of any
kind between source data and a BI data presentation layer will not easily see why this one is better than any other.
 Pros
o Establishes a clear distinction between demand-driven (data presentation) design vs. supply-driven (data management layer) design, and an
associated clear opportunity for multiple simultaneous development tracks.
 Why? As soon as needed source tables are identified, but perhaps long before reporting / analytics requirements are fully understood (if
they ever really are), thus before Data Presentation Layer modelling is done, data vault design AND loading can be started and even
completed, so that historical source data is quickly manageable with full referential integrity in the data warehouse environment.
 As a result, data vault data is easily re-purposable for future requirements for data presentation, reporting, analytics.
o Data Profiling: Data Vault provides a crystal-clear end-point for data profiling. Profiling for data vault is complete when all relevant business keys
have been identified and key-match rates within and across source systems have been measured.
o Modularization: In deferring interpretive data transformations, as described above, it effectively modularizes those processes and clearly
distinguishes them (downstream) from the process of loosely-coupling, loading and capturing historic changes within relevant source data, and
doing so with transparency.
o Inherently Extensible, Durable and Scalable.
 Extensible: Since new data vault features are very easily added onto an existing Vault with little or no refactoring of the existing solution,
multiple simultaneous tracks of data vault development within the same data warehouse solution are also straightforward, because of the
avoidance of entity dependencies within a data vault schema.
 Durable: Data Vault is exceptionally easy to re-factor to accommodate changes in source data schema.
 Scalable: As data volumes grow, each data vault ensemble is a robust, logically independent structure, and each can reside on a specific
partition or node within a cluster or MPP architecture.
o Simplicity: Since Data Vault is a simple set of patterns, the learning curve is fast for willing learners: For design, inbound ETL, and outbound ETL, the
small set of generic patterns, once learned, quickly become familiar and are easily re-used for new data requirements.
o Transparency: Upstream and Downstream Facing
 Upstream Transparency: Views within the Data Vault are easily constructed as a mirror to source data, either as current snapshots or as
time-slices (a sketch of such a view follows this list).
 Downstream Transparency: Using generic data vault patterns, downstream-facing views, special-use tables or user-defined functions, are
also easily constructed to return results identical, with or without object renaming, to downstream structures such as large, complex,
denormalized Type 2 slowly changing dimensions, as well as fact tables ranging from transaction-grained to periodic and accumulating
snapshots. These downstream-facing structures may even be used to standardize on down-stream ETL patterns.
o Data Quality Process Transparency: Data Vault supports the following standardized data quality processing patterns and has an important bonus
feature:
 De-duplication: For a base Hub-Satellite ensemble, we can either create a sister ‘Golden Record’ ensemble, with the Satellite containing the
match-able attributes or, for simpler requirements, we might just add a Boolean ‘IsNotGolden’ field to a base Hub table to identify records
which have a duplicate that was chosen over them as being authoritative.
 Data Quality Satellites: Records in these tables correspond to specific data quality issues among source attributes in the corresponding
base Satellite. These DQ Satellites may contain cleansed values of base Satellite attributes, modified to conform to various standards, but
without altering values in the base Satellite record.
 Support for Data Quality Mart: If desired, direct support, using the above structures, for a custom Data Quality Data Mart, providing closed-loop
reporting, covering all other data marts, back to data stewards about the data quality issues discovered and resolved (or not resolved) over time.
See Kimball’s ETL Toolkit, Chapter 4, ‘Data Quality Design’, which is easily supplied with data by the aforementioned structures.
 Bonus ‘Data Quality Lifecycle’ Feature: Since data vault stores and historizes largely untransformed source data history, data quality project
work can be performed before, during, or after other project work, and will always have all historic data available. In contrast, performing
data quality work only on temporarily staged data pre-supposes that ALL data quality rules are fully defined, not to mention fully coded,
prior to rolling the data warehouse into production. How likely is that?
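As a sketch of the upstream-facing transparency described in the list above, a current-snapshot view can re-assemble an ensemble so that it mirrors the source table. This assumes the illustrative Hub_Student / Sat_Student structures sketched earlier.

-- Upstream-facing view (sketch): current snapshot of the Student
-- ensemble, one row per business key with its latest satellite
-- attributes, mirroring the source Student table.
CREATE VIEW vw_Student_Current AS
SELECT h.StudentBK, s.FirstName, s.LastName
FROM Hub_Student AS h
JOIN Sat_Student AS s
  ON s.Student_SQN = h.Student_SQN
WHERE s.LoadDate = (SELECT MAX(s2.LoadDate)
                    FROM Sat_Student AS s2
                    WHERE s2.Student_SQN = s.Student_SQN);

A time-slice variant simply constrains the subquery to the latest LoadDate on or before a chosen point in time.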
Data Vault
 What it is not: A presentation layer, which is optimized, often de-normalized, to both simplify and accelerate analytic queries for a wide spectrum of users.
 What it is:
o An enterprise data management layer wherein operational data is loaded, permanently stored, historized, and loosely coupled by keys,
but without yet requiring significant data interpretation (object renaming, selection of certain attributes over others, grouping, filtering, de-
duplication or cleansing). As such, project risks such as data misinterpretation, loading of wrong data, loss of transparency with source data, data
loss, or inability to re-load historic data are reduced or eliminated.
o Although all of the above deferred efforts remain vital, they are all done immediately downstream of a data vault, and hence the complexity and
effort to deliver this layer is a fraction of that of a classic EDW. The data vault can then serve as the common enterprise data source for building, re-
populating, or refactoring any other data layer, be it a Star Schema, another reporting or analytics presentation layer, a Data Quality processing
layer, or a closed-loop Master Data Management layer.
 Note: As the number of tables increases, colors depicting two levels of abstraction become crucial for visualization, communication and sanity with Data
Vault.
o Color by data subject area. Demonstrated in the next set of figures.
o Color by table type (Hub, Satellite, Link). Demonstrated later, when additional tables are shown.
Ultra-Simple High-Level Visual Demo: Student and Major Tables in Student Records System
 Assume records in both tables are regularly updated, at which time historical data, while needed for B.I. reporting, is overwritten.
Figure 1: Student and Major Tables in Student Record System (from University OLTP and EDW – Week 1 Day 1.dez)
Figure 2: Data Vault Ensembles for Student and Major, and associated Link (without, then with, color visual by data topic area)
(from University OLTP and EDW – Week 1 Day 1.dez)
Scenario: Four related OLTP source entities, existing in two separate systems, become a set of tables in either a third-normal-form EDW model, a
dimensional model, or a data vault.
Step 1: Profile in Source Systems
 Four Tables (Entities), from Student Records system and Fundraising system
 Three Student Records Tables: Student, Major, Department
 One Fundraising Table: Dept_Fundraiser_Participation
 Notes: The one-to-many relationship between Major and Student is not enforced by a foreign key. Also, the Dept_Fundraiser_Participation table uses a
different key than the Department table. Lastly, although its granularity is ‘one Department’, the subject area of the table is actually more related to
fundraising activity, aggregated at the department level, than to the Department itself. As such, these two department-related tables have a weak
context, by themselves, for a tight integration aimed at an SVOT.
Source Table Screenshots (Figures 3a and 3b) correctly capture their associated business processes.
Figure 3a: Three Tables in Student Enrollment System Fig. 3b: Dept_Fundraiser_Participation in Fundraiser System
(both figures are from University OLTP and EDW – Week 1 Day 1.dez)
Step 2a: Challenges with a Data Model for EDW in 3rd Normal Form
 Note the rigid dependencies between entities. Also, questions remain about a proper SVOT integration.
Figure 4: Third Normal Form of Student, Major, Department and Department Fundraiser Participation
(from University 3NF EDW and Dimensional - Week 1.dez)
Step 2b: Challenges with Data Model for Star Schema
 Notes: Two kinds of rigid dependencies are present here: fact-to-dimension, and within the de-normalized dimension. Lastly, of course,
Dim_Department_Fundraiser_Profile lacks a Fact Table to provide context for integration.
Figure 5: Dimensional Model / Star Schema of Student, Major, Department and Dept Fundraiser Participation
(from University 3NF EDW and Dimensional - Week 1.dez)
Step 2c: Data Model for Data Vault
 Notes:
o Hub Independence: Hubs, with their business keys, are never schema-dependent on another table.
o Link handles the real world data-dependencies, but does so loosely (loose coupling) as an associative table.
o Satellites, with their many attributes, become loosely coupled with related attributes in other tables and systems, with alignment via Business
Keys, and with history-tracking changes captured as versioned rows just like Type 2 slowly-changing dimensions.
 Satellites within a single Ensemble may have very different loading frequencies.
 Colors for table types: In keeping with a general data vault standard, Hubs are depicted in blue in capable modeling tools, Satellites in yellow, and
Links in red. This, in addition to colors by subject area, as also shown in this figure, greatly simplifies visual navigation and communication among
involved team members, especially when the number of tables grows to real-world numbers.
Figure 6: Data Vault Model of Student, Major, Department, and Department Fundraiser Participation
(from Univ… OLTP and EDW - Week 2 with MD5 NoSQL Schema.dez)
Step 3: Business process change involves a source system schema change that will compromise the data warehouse if not quickly adapted.
 Change: University policy changes so that students may now have double majors or a major and a minor, rather than just one major per student, which
was the old business rule.
 The new source schema shown here correctly stores data from this new business process in the Student Records system, in contrast to the original
schema in Figure 3a.
o Note: The OLTP data architect chose to retain the old ‘MajorID’ field, in addition to directing a re-load of all historical foreign key
values into the new ‘StudentMajorsMinor (v2)’ table. Note: all ‘(v2)’, etc. characters are for illustration only.
Figure 7: Student Records System Change: Double Majors and Minors
(from University OLTP and EDW - Week 6 with MD5 NoSQL Schema.dez)
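For orientation, the new source table might look something like the following sketch. The column names are assumptions for illustration only; the figure defines the actual schema.

-- Illustrative sketch of the new associative source table, which now
-- permits multiple majors and minors per student.
CREATE TABLE StudentMajorsMinor (
    StudentID      INT NOT NULL REFERENCES Student (StudentID),
    MajorID        INT NOT NULL REFERENCES Major (MajorID),
    MajorMinorType VARCHAR(10) NOT NULL,  -- e.g. 'Major' or 'Minor'
    PRIMARY KEY (StudentID, MajorID)
);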
Step 4a: Schema change for EDW in 3rd Normal Form
 Modified data model
o Note: The ‘Fundraiser’ table, although not part of our initial four tables, was included in this figure simply for explanation, in order to provide
business context to the fields associated with departmental fundraiser participation.
Figure 8: Modified 3rd Normal Form EDW Model
Modified upstream ETL: Is the required effort for this small change acceptable?
o Department_edw_3nf ETL processing is now dependent on changes to either the Student Records or Fundraising systems. If either source system
delivers records with keys that are not matchable to the other system, the two available choices are both complex.
 One choice is to either discard the unmatched record and load it in a subsequent increment or hold that record in a control table.
 The other choice is to allow most fields to remain NULL and establish a complex ETL rule that does two things:
 Initially inserts records with fields available from one system, while leaving all fields from the other system as NULLs.
 Revisits all such records later with updates from the late-arriving data from the other system.
 Modified downstream schema and ETL or queries for downstream reports, dashboards, perhaps Star Schema
o Is this required effort also acceptable for this small source schema change?
o Problem: This set of unfortunate choices has a simple cause: the optimistic notion that the EDW schema should not only capture a
given business process correctly when first implemented and loaded, but that this process, as well as related business processes, all now fixed
into a transformed, tightly integrated schema, also will not change much over time. Whether or not that notion of relatively unchanging
business processes was valid back in the ’90s, when third-normal-form EDWs were conceived and popularized, the relevant question is really this: Is
it a valid assumption now in 2015?
o Solution: Design Lean data warehouse models that are, along with their associated ETL code, easily adaptive to changing business processes,
even including those business processes that may not reasonably be deeply understood at the time when source data is needed in the data
warehouse, due to a lack of consistent stakeholder participation or authoritative, up to date system documentation.
 With data vault, on the other hand, once we understand just the business keys, we are already past the 50 yard line.
Step 4b: Schema change for Star Schema
Disclaimer: Dimensional models are less often chosen for the purpose of data management per se. As such, this section is more for reference
than critique.
 Modified data model
 Modified upstream ETL
 Modified downstream queries or ETL for downstream objects.
Figure 9: Star Schema after change to Student Records System to support double majors and minors. (from University 3NF EDW and Dimensional - Week 1.dez)
Step 4c: Schema Changes for Data Vault. See Figure 10
 No existing schema objects were modified.
 Existing ETL is completely untouched, except insofar as the loading of Sat_Student_Progress is discontinued, while the table will remain in use for
historical data prior to the business process change.
 Need One New Schema Object and its ETL Code: Added a Sat_Lnk_Student_Major as a dependent to Lnk_Student_Major_Department, which, as an
associative table, already had the flexibility to support the current, or any future, combinations of granularity between students, majors and minors. A DDL sketch of this one new object follows the figure reference below.
Figure 10: Data Vault After Source System Mods (from Univ… OLTP and EDW - Week 6 Clean.dez)
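The following is a minimal DDL sketch of that single new object. The link’s surrogate-key name and the audit columns are illustrative assumptions consistent with the earlier sketches.

-- New satellite on the existing link: records whether each
-- student-major pairing is a major or a minor, historized by LoadDate.
-- Nothing else in the existing schema or ETL changes.
CREATE TABLE Sat_Lnk_Student_Major (
    Student_Major_Department_SQN INT NOT NULL
        REFERENCES Lnk_Student_Major_Department (Student_Major_Department_SQN),
    LoadDate       DATETIME NOT NULL,
    MajorMinorType VARCHAR(10) NULL,  -- e.g. 'Major' or 'Minor'
    RecordSource   VARCHAR(50) NOT NULL,
    PRIMARY KEY (Student_Major_Department_SQN, LoadDate)
);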
Where do the following related processes fit with Data Vault and a Lean Data Warehouse?
 Conforming Dimensions and Facts: These sit downstream of Data Vault as a stored Dimensional Model, re-purposable and re-loadable from it at
any time. Virtualized dimensional models based on stored data vaults are an interesting, much-discussed topic, but I have not (as of early March 2015) seen a
sensible virtual architecture that captures dimensional modelling best practice.
 Master Data Management
o Closed-loop MDM, with data vault (as the main MDM data source) handily recording the good, the bad, and the ugly exactly as it exists, and as it
is either improved over time, or not, in any given source system.
 Predictive Analytics: Data Vault’s fidelity to source data, as well as its historization of that data, should appeal to predictive analysts, who often avoid
highly-transformed data warehouses as data sources. The inclusion in Data Vault of the upstream-facing views may also assist the predictive analyst
accustomed to source system data.
 BigData / NoSQL: Data Vault 2.0 (by D. Linstedt, and not otherwise mentioned in this article), with its MD5 Hash fields as primary keys, offers an
opportunity for massively parallel Data Vault table loading without the typical lookups of surrogate keys in a target’s parent tables, which
otherwise force sequential table loading.
o Assumption: To do this, BigData records must have granularity loosely defined such that a Business Key does define each record, whether or
not the remainder of attributes are -- from an RDBMS perspective – oddly de-normalized or pivoted. As such, the BigData being referred to
here is more akin to an Apache Cassandra or Hive table than to a Hadoop file.
o One trade-off with an MD5 primary key is, of course, that given the MD5 field’s size and datatype (VarBinary in SQL Server), downstream ETL
or queries will perform more slowly than with an Integer primary key used for joins. This, however, may be a worthwhile trade-off in
accommodating the parallel loading of massive BigData.
o If no BigData / NoSQL integration is expected to be required, accepting Data Vault 2.0’s downstream ETL/query performance hit from the required
joins on the lengthy hash PKs makes little sense.
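As a sketch of the hash-key mechanism (the precise Data Vault 2.0 hashing rules are defined in Linstedt’s materials; the names and key-normalization steps here are illustrative, and this assumes a Hub_Student variant keyed on a VARBINARY(16) StudentHK column rather than the integer surrogate sketched earlier):

-- Hash-key sketch: derive the hub key deterministically from the
-- business key itself, so hubs, satellites and links can all be loaded
-- in parallel with no surrogate-key lookups into parent tables.
INSERT INTO Hub_Student (StudentHK, StudentBK, LoadDate, RecordSource)
SELECT DISTINCT
       HASHBYTES('MD5', UPPER(LTRIM(RTRIM(CAST(src.StudentID AS VARCHAR(20)))))),
       src.StudentID, GETDATE(), 'StudentRecords'
FROM stage.Student AS src
WHERE NOT EXISTS (SELECT 1
                  FROM Hub_Student AS h
                  WHERE h.StudentBK = src.StudentID);

Any process loading a satellite or link can compute the same hash from the same business key independently, which is what removes the load-order dependency between parent and child tables.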
Future Topics:
 How to Load: Detailed demonstration of data vault loading-transformation logic.
 After Load:
o Detailed demonstration of Patterns for data vault views and queries for current or historical snapshots of OLTP data.
o Detailed Demonstration of Patterns for data vault views, queries and point-in-time support tables to support downstream Star-Schema
loading transformations.
Data Vault Reference Books:
 Super Charge Your Data Warehouse, Daniel Linstedt, 2008-2011, CreateSpace Publishing
 The New Business SuperModel: The Business of Data Vault Data Modeling, 2nd Edition, Daniel Linstedt, Kent Graziano, Hans Hultgren, 2008-2009, Daniel Linstedt
 Modeling the Agile Data Warehouse with Data Vault, Hans Hultgren, 2012, Brighton Hamilton
DecisionLab.Net
_____________________________________________________
Thank you. I value your feedback.
___________________________________________________________________________
Daniel Upton
DecisionLab http://www.decisionlab.net dupton@decisionlab.net
Direct 760.525.3268 Carlsbad, California, USA
Mais conteúdo relacionado

Mais procurados

Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingAgile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingKent Graziano
 
Agile Data Mining with Data Vault 2.0 (english)
Agile Data Mining with Data Vault 2.0 (english)Agile Data Mining with Data Vault 2.0 (english)
Agile Data Mining with Data Vault 2.0 (english)Michael Olschimke
 
IRM UK - 2009: DV Modeling And Methodology
IRM UK - 2009: DV Modeling And MethodologyIRM UK - 2009: DV Modeling And Methodology
IRM UK - 2009: DV Modeling And MethodologyEmpowered Holdings, LLC
 
Introduction To Data Vault - DAMA Oregon 2012
Introduction To Data Vault - DAMA Oregon 2012Introduction To Data Vault - DAMA Oregon 2012
Introduction To Data Vault - DAMA Oregon 2012Empowered Holdings, LLC
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling
(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling
(OTW13) Agile Data Warehousing: Introduction to Data Vault ModelingKent Graziano
 
Agile Data Warehouse Design for Big Data Presentation
Agile Data Warehouse Design for Big Data PresentationAgile Data Warehouse Design for Big Data Presentation
Agile Data Warehouse Design for Big Data PresentationVishal Kumar
 
Best Practices: Data Admin & Data Management
Best Practices: Data Admin & Data ManagementBest Practices: Data Admin & Data Management
Best Practices: Data Admin & Data ManagementEmpowered Holdings, LLC
 
Data Vault 2.0: Using MD5 Hashes for Change Data Capture
Data Vault 2.0: Using MD5 Hashes for Change Data CaptureData Vault 2.0: Using MD5 Hashes for Change Data Capture
Data Vault 2.0: Using MD5 Hashes for Change Data CaptureKent Graziano
 
Guru4Pro Data Vault Best Practices
Guru4Pro Data Vault Best PracticesGuru4Pro Data Vault Best Practices
Guru4Pro Data Vault Best PracticesCGI
 
Agile data warehouse
Agile data warehouseAgile data warehouse
Agile data warehouseDao Vo
 
Data warehouse design
Data warehouse designData warehouse design
Data warehouse designines beltaief
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lakeCapgemini
 
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...Edureka!
 

Mais procurados (20)

Data Vault Overview
Data Vault OverviewData Vault Overview
Data Vault Overview
 
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data ModelingAgile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
 
Why Data Vault?
Why Data Vault? Why Data Vault?
Why Data Vault?
 
Data Vault Introduction
Data Vault IntroductionData Vault Introduction
Data Vault Introduction
 
Agile Data Mining with Data Vault 2.0 (english)
Agile Data Mining with Data Vault 2.0 (english)Agile Data Mining with Data Vault 2.0 (english)
Agile Data Mining with Data Vault 2.0 (english)
 
IRM UK - 2009: DV Modeling And Methodology
IRM UK - 2009: DV Modeling And MethodologyIRM UK - 2009: DV Modeling And Methodology
IRM UK - 2009: DV Modeling And Methodology
 
Introduction To Data Vault - DAMA Oregon 2012
Introduction To Data Vault - DAMA Oregon 2012Introduction To Data Vault - DAMA Oregon 2012
Introduction To Data Vault - DAMA Oregon 2012
 
Data vault what's Next: Part 2
Data vault what's Next: Part 2Data vault what's Next: Part 2
Data vault what's Next: Part 2
 
Data vault: What's Next
Data vault: What's NextData vault: What's Next
Data vault: What's Next
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Data vault modeling et retour d'expérience
Data vault modeling et retour d'expérienceData vault modeling et retour d'expérience
Data vault modeling et retour d'expérience
 
(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling
(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling
(OTW13) Agile Data Warehousing: Introduction to Data Vault Modeling
 
Agile Data Warehouse Design for Big Data Presentation
Agile Data Warehouse Design for Big Data PresentationAgile Data Warehouse Design for Big Data Presentation
Agile Data Warehouse Design for Big Data Presentation
 
Best Practices: Data Admin & Data Management
Best Practices: Data Admin & Data ManagementBest Practices: Data Admin & Data Management
Best Practices: Data Admin & Data Management
 
Data Vault 2.0: Using MD5 Hashes for Change Data Capture
Data Vault 2.0: Using MD5 Hashes for Change Data CaptureData Vault 2.0: Using MD5 Hashes for Change Data Capture
Data Vault 2.0: Using MD5 Hashes for Change Data Capture
 
Guru4Pro Data Vault Best Practices
Guru4Pro Data Vault Best PracticesGuru4Pro Data Vault Best Practices
Guru4Pro Data Vault Best Practices
 
Agile data warehouse
Agile data warehouseAgile data warehouse
Agile data warehouse
 
Data warehouse design
Data warehouse designData warehouse design
Data warehouse design
 
The technology of the business data lake
The technology of the business data lakeThe technology of the business data lake
The technology of the business data lake
 
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...
Data Warehouse Tutorial For Beginners | Data Warehouse Concepts | Data Wareho...
 

Semelhante a Lean Data Warehouse via Data Vault

Data Vault: What is it? Where does it fit? SQL Saturday #249
Data Vault: What is it?  Where does it fit?  SQL Saturday #249Data Vault: What is it?  Where does it fit?  SQL Saturday #249
Data Vault: What is it? Where does it fit? SQL Saturday #249Daniel Upton
 
Rando Veizi: Data warehouse and Pentaho suite
Rando Veizi: Data warehouse and Pentaho suiteRando Veizi: Data warehouse and Pentaho suite
Rando Veizi: Data warehouse and Pentaho suiteCarlo Vaccari
 
oracle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdforacle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdfssuserf8f9b2
 
Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...
Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...
Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...IRJET Journal
 
Secure Transaction Model for NoSQL Database Systems: Review
Secure Transaction Model for NoSQL Database Systems: ReviewSecure Transaction Model for NoSQL Database Systems: Review
Secure Transaction Model for NoSQL Database Systems: Reviewrahulmonikasharma
 
Make compliance fulfillment count double
Make compliance fulfillment count doubleMake compliance fulfillment count double
Make compliance fulfillment count doubleDirk Ortloff
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperImpetus Technologies
 
DataWarehousingandAbInitioConcepts.ppt
DataWarehousingandAbInitioConcepts.pptDataWarehousingandAbInitioConcepts.ppt
DataWarehousingandAbInitioConcepts.pptPurnenduMaity2
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paperjuly12jana
 
Traditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonTraditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonCapgemini
 
Data Modeling for Integration of NoSQL with a Data Warehouse
Data Modeling for Integration of NoSQL with a Data WarehouseData Modeling for Integration of NoSQL with a Data Warehouse
Data Modeling for Integration of NoSQL with a Data WarehouseDaniel Upton
 
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCESALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCEcscpconf
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designSarita Kataria
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfDatacademy.ai
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report Tom Donoghue
 
Building The Agile Database
Building The Agile DatabaseBuilding The Agile Database
Building The Agile Databaseelliando dias
 

Semelhante a Lean Data Warehouse via Data Vault (20)

Data Vault: What is it? Where does it fit? SQL Saturday #249
Data Vault: What is it?  Where does it fit?  SQL Saturday #249Data Vault: What is it?  Where does it fit?  SQL Saturday #249
Data Vault: What is it? Where does it fit? SQL Saturday #249
 
Rando Veizi: Data warehouse and Pentaho suite
Rando Veizi: Data warehouse and Pentaho suiteRando Veizi: Data warehouse and Pentaho suite
Rando Veizi: Data warehouse and Pentaho suite
 
Course Outline Ch 2
Course Outline Ch 2Course Outline Ch 2
Course Outline Ch 2
 
oracle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdforacle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdf
 
Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...
Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...
Evaluation of Data Auditability, Traceability and Agility leveraging Data Vau...
 
Secure Transaction Model for NoSQL Database Systems: Review
Secure Transaction Model for NoSQL Database Systems: ReviewSecure Transaction Model for NoSQL Database Systems: Review
Secure Transaction Model for NoSQL Database Systems: Review
 
gn-160406200425 (1).pdf
gn-160406200425 (1).pdfgn-160406200425 (1).pdf
gn-160406200425 (1).pdf
 
Make compliance fulfillment count double
Make compliance fulfillment count doubleMake compliance fulfillment count double
Make compliance fulfillment count double
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White Paper
 
ETL QA
ETL QAETL QA
ETL QA
 
DataWarehousingandAbInitioConcepts.ppt
DataWarehousingandAbInitioConcepts.pptDataWarehousingandAbInitioConcepts.ppt
DataWarehousingandAbInitioConcepts.ppt
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paper
 
Traditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A ComparisonTraditional BI vs. Business Data Lake – A Comparison
Traditional BI vs. Business Data Lake – A Comparison
 
Date Analysis .pdf
Date Analysis .pdfDate Analysis .pdf
Date Analysis .pdf
 
Data Modeling for Integration of NoSQL with a Data Warehouse
Data Modeling for Integration of NoSQL with a Data WarehouseData Modeling for Integration of NoSQL with a Data Warehouse
Data Modeling for Integration of NoSQL with a Data Warehouse
 
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCESALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
SALES BASED DATA EXTRACTION FOR BUSINESS INTELLIGENCE
 
Data warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-designData warehouse-dimensional-modeling-and-design
Data warehouse-dimensional-modeling-and-design
 
Top 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdfTop 60+ Data Warehouse Interview Questions and Answers.pdf
Top 60+ Data Warehouse Interview Questions and Answers.pdf
 
Data Warehouse Project Report
Data Warehouse Project Report Data Warehouse Project Report
Data Warehouse Project Report
 
Building The Agile Database
Building The Agile DatabaseBuilding The Agile Database
Building The Agile Database
 

Último

Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 

Último (16)

Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 

Lean Data Warehouse via Data Vault

  • 1. DecisionLab.Net business intelligence is business performance ___________________________________________________________________________________________________________________________________________________________________________________ ____________________________________________________________________________________________________________________________________________________________________________________ DecisionLab http://www.decisionlab.net dupton@decisionlab.net Carlsbad, California, USA Lean Data Warehouse via Data Vault
  • 2. __________________________________________________________________________________________________________________________________________________________________________________ Page 2 of 21 Lean Data Warehouse via Data Vault Written by Daniel Upton Data Warehouse Architect Certified ScrumMaster DecisionLab.Net Business Intelligence is Business Performance dupton@decisionlab.net linkedin.com/in/DanielUpton Lean Data WarehouseTM is a trademark of Daniel C. Upton. All uses of it throughout this writing are protected by trademark law. Without my (the writer’s) explicit written permission in advance, the only permissible reproduction or copying of this trademark, or of any of this written material, is in the form of a review or a brief reference to a specific concept herein, either or which must clearly specify this writing’s title, author (me), and this web address http://www.slideshare.net/DanielUpton/lean-data-warehouse-via-data-vault . For permission to reproduce or copy any of this material other than what is specified above, just email me at the above address. Data Vault is a term, unregistered as far as I know, which I believe was coined by Daniel Linstedt, and is a central term in books by Daniel Linstedt, Kent Graziano and Hans Hultgren. References to these books can be found at the end of this writing. I am grateful to these authors for the deeply insightful concepts expressed therein, and I make no attempt to co-opt, or express any authorship for the term Data Vault. Many of the concepts I learned – from these books or from related social media -- are reflected here, but this material was not reproduced nor copied from any other material. All aforementioned concepts are expressed exclusively in my own words and illustrated using my own data and database. I have attempted to give credit to others where it is due.
Bottom Line: A data warehouse that does not enjoy these Lean strengths – whether built around a traditional data warehouse design best practice or not – will suffer from inflexibility and a high cost to support and extend over its lifecycle.

Lean Principles as Applied to Data Warehousing

 Focus on the Customer and Eliminate Waste ('muda' in Japanese)
  o Differentiate efforts based on demand (business requirements) vs. supply (availability and characteristics of source data).
  o Data Presentation Layer (downstream): Deliver what is demanded, no less, no more.
  o Data Management Layer (upstream): Based on the selection of available source data and its potential for a basic integration (as will soon be described using Data Vault), model and load that data, no less, no more.
  o How Much Source Data is Enough? Both Lean and Agile principles emphasize 'just enough' thinking: just enough analysis, design, development, testing, deployment, training, and support. That said, 'just enough' can be difficult to determine. A Data Vault model, due to its modular design and associated loading logic, is more easily chunked into small, independent deliverables than traditional data warehouse models are, so that project managers and leaders have more options when deciding on the scope and frequency of potentially shippable increments (PSIs). It is also true that choosing to integrate all, or at least all eventually desired, database tables, fields and records from each relevant source system into a data management layer, such as Data Vault, is a responsible choice, because…
      The potential for this basic integration is highest when all relevant source data is integrated.
      A basic integration, such as Data Vault, is quicker to accomplish than an advanced integration attempting a single version of the truth (SVOT) for all attributes.
      Excessive task-switching is a well-known source of waste (muda) needing elimination. As such, allowing engineers to complete the design and loading of a basic integration in its entirety -- while their mental energy is directed to a specific logical set of data subject areas -- is sensible.
      Historical changes of chosen tables will be tracked as soon as this layer goes into production, and excluding specific tables probably means leaving holes in the otherwise historically tracked source data.
      Multiple simultaneous development tracks are easily established in order to efficiently put more resources onto a focused initiative to instantiate the data warehouse.
 Plan for Change: Design in…
  o Loose Coupling: Little or no dependency between objects, so that change in one entity does not compromise a related entity. Loose coupling is robust; tight coupling is fragile. Data Vault establishes loose coupling among entities.
  o High Cohesion: To support loose coupling, each entity is focused, self-sufficient, and not overlapping with related entities. Data Vault does this, too.
 Optimize the Whole
  o Be willing to sub-optimize components of a data model if necessary to optimize the scalability and durability of the model as a whole.
      How? Add associative tables to break entity dependencies, thereby allowing actual data records, not a fixed data model, to dictate relative cardinality between related entities.
 Automate Processes and Deliver Fast
  o Establish and re-use generic patterns to quickly and loosely integrate keys from source tables, without YET performing any data re-interpretation.

In this piece, I will provide a simple example of the purely data-related challenges typical in the lifecycle of real-world data warehouses, interpreted here as failures to adhere to Lean principles. As routine as this type of change is, one of the big challenges with traditional data warehousing is that these simple data changes cause large structural problems in a traditional, rigidly integrated data warehouse. Although process-related challenges also exist, of course, many of the toughest ones are actually just challenging accommodations to these same real-world data challenges, large or small.

High-Level Summary of Data Vault Modeling Fundamentals and Resulting Value Proposition

  o Note: Business Data Vault, not Raw Data Vault, will be described here. Whereas Raw Vault simply de-constructs each source entity and entity-relationship into a Hub, Satellite and Link, Business Vault establishes a stronger basic integration by isolating and managing business keys as unique records, as described immediately below.
 Generally, in a Business Data Vault, tables from selected source systems end up in the Business Vault de-constructed into Hubs, Satellites and Links, with all attributes from multiple source entities within and across source systems now aligned around a common unique business key (Hub).
 Hub: A Hub table manages a unique business key, defined as a key (one or more data elements / fields) which is meaningful not just within a single system, and not just to a database engineer, but also to the business people familiar with that business process. As an example, in healthcare, an NPI (National Provider Identifier) code, CPT (Current Procedural Terminology) code, or Lab Accession Code all have clear meaning to healthcare business people, whereas a surrogate key or system key is typically meaningful only to the database engineer. Although a Hub includes a standard set of additional fields, as will be shown below, they are all intended simply to manage the stored business key.
 Satellite: A Satellite table, which is always dependent on exactly one Hub, contains the payload of attribute fields which shared a table with the business key in the source system, but which are now stored separately in order to track attribute changes, essentially performing the function of a Type 2 slowly changing dimension (herein SCD2) without duplicating the Hub's business key. (A minimal loading sketch for this pattern appears after the pros and cons below.)
 Ensemble: In the common situation in which multiple source tables all exist with the granularity of the same business key, either within or across source systems, those tables may end up in Data Vault as multiple Satellites, all as dependents of the same single Hub table. This set of one Hub and one or more Satellites is an Ensemble.
 Link: Link tables relate two or more Hubs together as an association. In its simplest form, a Link table is an associative table that joins what, in the source system, was a dependent table and a parent table, whether or not that relationship was enforced with a foreign key constraint. As an associative table, the Link gracefully handles not only the expected, one might say desirable, one-to-many relationship among actual data records from the two tables, but also any real-world many-to-many conditions in the relationship between actual records in the two source tables. As such, the Link affords a looser, more robust form of referential integrity between Hubs and, importantly, does so without making either Hub actually dependent on the other, since Hubs are never directly dependent on another table, only associated via Links.
 Object Naming: The names of source system tables and fields all remain unchanged, except insofar as tables are usually appended with "_hub", "_sat", or "_lnk" suffixes, and surrogate keys, date stamps and audit fields are added. As such, the names of all business keys and attribute fields remain unchanged from the source system. This transparency from source to data vault is consistent with the principle that, at the data vault stage, "interpretive" data transformations, such as calculation, aggregation, filtering and grouping, have not YET occurred. Those processes occur immediately downstream of data vault.

Pros, Cons and Disclaimers based on Data Vault fundamentals and from multiple in-the-trenches implementation experiences

 Disclaimers: Data Vault…
  o Does not solve unsolvable challenges of data integration, as where two systems lack a set of sufficiently match-able keys to achieve a satisfactory percentage of record matches.
  o Does not eliminate the need for data profiling, data quality processing, or the work of analytics/reporting-driven (conforming) data transformation of non-key attributes, perhaps achieving an SVOT; it merely defers that phase until immediately after the data is landed, loosely integrated, and historized.
  o Does not eliminate, only defers, the decision-making associated with interpretive data transformations (i.e., object renaming, selection of one source field instead of another, calculation, aggregation, grouping, filtering, and data quality processing).
 Cons
  o Learning Curve: At first, the Hub, Satellite, Link modeling pattern is unfamiliar and potentially intimidating. As such, it could be a source of dissatisfaction to a DW team member who rejects the method, whether or not it is well understood.
  o Adds tables, and thus adds joins required for downstream ETL queries. Those who want to avoid a (non-presentational) data management layer of any kind between source data and a BI data presentation layer cannot easily see how this one is better than any other.
 Pros
  o Establishes a clear distinction between demand-driven (data presentation) design vs. supply-driven (data management layer) design, and an associated clear opportunity for multiple simultaneous development tracks.
      Why? As soon as needed source tables are identified, but perhaps long before reporting / analytics requirements are fully understood (if they ever really are), and thus before Data Presentation Layer modeling is done, data vault design AND loading can be started and even completed, so that historical source data is quickly manageable with full referential integrity in the data warehouse environment.
      As a result, data vault data is easily re-purposable for future data presentation, reporting and analytics requirements.
  o Data Profiling: Data Vault provides a crystal-clear end-point for data profiling. Profiling for data vault is complete when all relevant business keys have been identified and key-match rates within and across source systems have been measured.
  o Modularization: In deferring interpretive data transformations, as described above, Data Vault effectively modularizes those processes and clearly distinguishes them (downstream) from the process of loosely coupling, loading and capturing historic changes within relevant source data, and doing so with transparency.
  o Inherently Extensible, Durable and Scalable.
      Extensible: New data vault features are very easily added onto an existing Vault with little or no refactoring of the existing solution. Multiple simultaneous tracks of data vault development within the same data warehouse solution are also straightforward, because entity dependencies within a data vault schema are avoided.
      Durable: Data Vault is exceptionally easy to re-factor to accommodate changes in source data schema.
      Scalable: As data volumes grow, each data vault ensemble is a robust, logically independent structure, each of which can reside on a specific partition or node within a cluster or MPP architecture.
  o Simplicity: Since Data Vault is a simple set of patterns, the learning curve is fast for willing learners. For design, inbound ETL, and outbound ETL, the small set of generic patterns, once learned, quickly becomes familiar and is easily re-used for new data requirements.
  o Transparency: Upstream- and Downstream-Facing
      Upstream Transparency: Views within the Data Vault are easily constructed as a mirror to source data, either as current snapshots or as time-slices.
      Downstream Transparency: Using generic data vault patterns, downstream-facing views, special-use tables or user-defined functions are also easily constructed to return results identical, with or without object renaming, to downstream structures such as large, complex, denormalized Type 2 slowly changing dimensions, as well as fact tables ranging from transaction-grained to periodic and accumulating snapshots. These downstream-facing structures may even be used to standardize downstream ETL patterns.
  o Data Quality Process Transparency: Data Vault supports the following standardized data quality processing patterns and has an important bonus feature:
      De-duplication: For a base Hub-Satellite ensemble, we can either create a sister 'Golden Record' ensemble, with the Satellite containing the match-able attributes, or, for simpler requirements, we might just add a Boolean 'IsNotGolden' field to a base Hub table to identify records which have a duplicate that was chosen over them as authoritative.
      Data Quality Satellites: Records in these tables correspond to specific data quality issues among source attributes in the corresponding base Satellite. These DQ Satellites may contain cleansed values of base Satellite attributes, modified to conform to various standards, but without altering values in the base Satellite record.
      Support for a Data Quality Mart: If desired, the above structures directly support a custom Data Quality Data Mart, providing closed-loop reporting, covering all other data marts, back to data stewards about detailed data quality issues discovered and resolved (or not resolved) over time. See Kimball's ETL Toolkit, Chapter 4, 'Data Quality Design', which is easily supplied with data by the aforementioned structures.
      Bonus 'Data Quality Lifecycle' Feature: Since data vault stores and historizes largely untransformed source data history, data quality project work can be performed before, during, or after other project work, and will always have all historic data available. In contrast, performing data quality work only on temporarily staged data pre-supposes that ALL data quality rules are fully defined, not to mention fully coded, prior to rolling the data warehouse into production. How likely is that?
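To make the Hub/Satellite historization pattern concrete, here is a minimal sketch in Python with sqlite3 of the insert-only, SCD2-style satellite loading just described. All table and column names (hub_customer, sat_customer, city, status) are hypothetical, and real implementations typically batch by load date and compare hash-diffs rather than raw attributes; this only illustrates the principle.

```python
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (
    customer_hkey INTEGER PRIMARY KEY,
    customer_bk   TEXT NOT NULL UNIQUE,   -- the business key, stored once
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer (                -- descriptive payload, versioned by load_dts
    customer_hkey INTEGER NOT NULL REFERENCES hub_customer,
    load_dts      TEXT NOT NULL,
    city          TEXT,
    status        TEXT,
    PRIMARY KEY (customer_hkey, load_dts)
);
""")

def load_customer(bk, city, status, source="CRM"):
    """Insert-only load: a new business key creates a hub row; a changed
    payload creates a new satellite version. Nothing is ever updated."""
    now = datetime.now(timezone.utc).isoformat()
    row = con.execute("SELECT customer_hkey FROM hub_customer WHERE customer_bk = ?",
                      (bk,)).fetchone()
    if row is None:
        cur = con.execute(
            "INSERT INTO hub_customer (customer_bk, load_dts, record_source) VALUES (?, ?, ?)",
            (bk, now, source))
        hkey = cur.lastrowid
    else:
        hkey = row[0]
    latest = con.execute(
        """SELECT city, status FROM sat_customer
           WHERE customer_hkey = ? ORDER BY load_dts DESC LIMIT 1""", (hkey,)).fetchone()
    if latest != (city, status):  # write a new version only when the payload changed
        con.execute("INSERT INTO sat_customer VALUES (?, ?, ?, ?)", (hkey, now, city, status))

load_customer("CUST-001", "Carlsbad", "active")
load_customer("CUST-001", "Carlsbad", "active")   # no change -> no new satellite row
load_customer("CUST-001", "San Diego", "active")  # change -> a second satellite version
print(con.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0])  # -> 2
```

Note how the hub never grows when the payload changes and the satellite never duplicates the business key, which is exactly the separation of concerns described above.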
Data Vault

 What it is not: A presentation layer, which is optimized, often de-normalized, to both simplify and accelerate analytic queries for a wide spectrum of users.
 What it is:
  o An enterprise data management layer wherein operational data is loaded, permanently stored, historized, and loosely coupled by keys, but without yet requiring significant data interpretation (object renaming, selection of certain attributes over others, grouping, filtering, de-duplication or cleansing). As such, the potential for data misinterpretation, loading of wrong data, loss of transparency with source data, data loss, or inability to re-load historic data is, as a set of project risks, reduced or eliminated.
  o Although all of the above not-yet-done efforts remain vital, they are all done immediately downstream of a data vault. Hence the complexity and effort to deliver this layer is a fraction of a classic EDW's, and the data vault can serve as the common enterprise data source for building, re-populating, or refactoring any other data layer, be it a Star Schema, another reporting or analytics presentation layer, a Data Quality processing layer, or a closed-loop Master Data Management layer.
 Note: As the number of tables increases, colors depicting two levels of abstraction become crucial for visualization, communication and sanity with Data Vault.
  o Color by data subject area, demonstrated in the next set of figures.
  o Color by table type (Hub, Satellite, Link), demonstrated later, when additional tables are shown.

Ultra-Simple High-Level Visual Demo: Student and Major Tables in a Student Records System

 Assume records in both tables are regularly updated, at which time historical data, while needed for B.I. reporting, is overwritten. (A code sketch of these source tables follows the figure.)

Figure 1: Student and Major Tables in Student Record System (from University OLTP and EDW – Week 1 Day 1.dez)
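For readers following along without the modeling tool, here is a sqlite3 rendition of the two source tables in Figure 1. The column lists beyond the keys are assumptions for illustration; the .dez model is the authority.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
-- Student Records system, original business rule: at most one major per student.
CREATE TABLE Major (
    MajorID   INTEGER PRIMARY KEY,
    MajorName TEXT NOT NULL
);
CREATE TABLE Student (
    StudentID INTEGER PRIMARY KEY,
    MajorID   INTEGER REFERENCES Major,  -- the single-major rule lives in the schema
    FirstName TEXT,
    LastName  TEXT
);
""")
```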
Figure 2: Data Vault Ensembles for Student and Major, and associated Link (without, then with, color visual by data topic area) (from University OLTP and EDW – Week 1 Day 1.dez)
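And here is a sketch of the same two tables de-constructed into the ensembles of Figure 2, again in sqlite3 with assumed column lists and assumed hub, satellite and link names. Note that neither hub depends on the other; the link alone carries the relationship.

```python
import sqlite3

vault = sqlite3.connect(":memory:")
vault.executescript("""
CREATE TABLE Hub_Student (
    Student_HKey INTEGER PRIMARY KEY,
    StudentID    TEXT NOT NULL UNIQUE,   -- business key, name unchanged from source
    LoadDTS      TEXT NOT NULL,
    RecordSource TEXT NOT NULL
);
CREATE TABLE Sat_Student (                -- descriptive payload, versioned by LoadDTS
    Student_HKey INTEGER NOT NULL REFERENCES Hub_Student,
    LoadDTS      TEXT NOT NULL,
    FirstName    TEXT,
    LastName     TEXT,
    PRIMARY KEY (Student_HKey, LoadDTS)
);
CREATE TABLE Hub_Major (
    Major_HKey   INTEGER PRIMARY KEY,
    MajorID      TEXT NOT NULL UNIQUE,
    LoadDTS      TEXT NOT NULL,
    RecordSource TEXT NOT NULL
);
CREATE TABLE Sat_Major (
    Major_HKey   INTEGER NOT NULL REFERENCES Hub_Major,
    LoadDTS      TEXT NOT NULL,
    MajorName    TEXT,
    PRIMARY KEY (Major_HKey, LoadDTS)
);
CREATE TABLE Lnk_Student_Major (          -- associative: tolerates any real cardinality
    Student_Major_LKey INTEGER PRIMARY KEY,
    Student_HKey INTEGER NOT NULL REFERENCES Hub_Student,
    Major_HKey   INTEGER NOT NULL REFERENCES Hub_Major,
    LoadDTS      TEXT NOT NULL,
    UNIQUE (Student_HKey, Major_HKey)
);
""")
```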
Scenario: Four related OLTP source entities, existing in two separate systems, become a set of tables in either a third-normal-form EDW model, a dimensional model, or a data vault.

Step 1: Profile in Source Systems

 Four Tables (Entities), from the Student Records system and the Fundraising system
 Three Student Records Tables: Student, Major, Department
 One Fundraising Table: Dept_Fundraiser_Participation
 Notes: The one-to-many relationship between Major and Student is not enforced by a foreign key. Also, the Dept_Fundraiser_Participation table uses a different key than the Department table (a profiling sketch for measuring this key-match rate follows the figures below). Lastly, although its granularity is 'one Department', the subject area of the table is actually more related to fundraising activity, aggregated at the department level, than to the Department itself. As such, these two department-related tables have, by themselves, a weak context for a tight integration aimed at an SVOT. The source table screenshots (Figures 3a and 3b) correctly capture their associated business processes.

Figure 3a: Three Tables in Student Enrollment System
Figure 3b: Dept_Fundraiser_Participation in Fundraiser System
(both figures are from University OLTP and EDW – Week 1 Day 1.dez)
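Step 1's end-point, measuring key-match rates across the two systems, reduces to a simple outer-join query once both tables are staged side by side. A sketch, with hypothetical key names and sample rows chosen to show a deliberate mismatch:

```python
import sqlite3

stg = sqlite3.connect(":memory:")
stg.executescript("""
CREATE TABLE Department (DepartmentID TEXT PRIMARY KEY, DepartmentName TEXT);
CREATE TABLE Dept_Fundraiser_Participation (DeptKey TEXT PRIMARY KEY, PledgeTotal REAL);
INSERT INTO Department VALUES ('BIO', 'Biology'), ('CHM', 'Chemistry'), ('PHY', 'Physics');
INSERT INTO Dept_Fundraiser_Participation VALUES ('BIO', 50000), ('CHEM', 12000);
""")

# Match rate: what share of fundraising keys join to the Department business key?
matched, total = stg.execute("""
    SELECT SUM(CASE WHEN d.DepartmentID IS NOT NULL THEN 1 ELSE 0 END), COUNT(*)
    FROM Dept_Fundraiser_Participation f
    LEFT JOIN Department d ON d.DepartmentID = f.DeptKey
""").fetchone()
print(f"key-match rate: {matched / total:.0%}")   # -> 50% ('CHEM' does not match 'CHM')
```

When all relevant business keys have a measured match rate like this one, profiling for the data vault is done.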
Step 2a: Challenges with a Data Model for EDW in 3rd Normal Form

 Note the rigid dependencies between entities. Also, questions remain about a proper SVOT integration.

Figure 4: Third Normal Form of Student, Major, Department and Department Fundraiser Participation (from University 3NF EDW and Dimensional - Week 1.dez)
Step 2b: Challenges with a Data Model for Star Schema

 Notes: Two kinds of rigid dependencies exist here: fact-to-dimension, and within the de-normalized dimension. Lastly, of course, Dim_Department_Fundraiser_Profile lacks a Fact Table to provide context for integration.

Figure 5: Dimensional Model / Star Schema of Student, Major, Department and Dept Fundraiser Participation (from University 3NF EDW and Dimensional - Week 1.dez)
Step 2c: Data Model for Data Vault

 Notes:
  o Hub Independence: Hubs, with their business keys, are never schema-dependent on another table.
  o The Link handles the real-world data dependencies, but does so loosely (loose coupling) as an associative table.
  o Satellites, with their many attributes, become loosely coupled with related attributes in other tables and systems, with alignment via business keys, and with history-tracking changes captured as versioned rows, just like Type 2 slowly changing dimensions.
      Satellites within a single Ensemble may have very different loading frequencies.
      Colors for table types: In keeping with a general data vault standard, Hubs are depicted in blue in capable modeling tools, Satellites in yellow, and Links in red. This, in addition to colors by subject area, as also shown in this figure, greatly simplifies visual navigation and communication among involved team members, especially when the number of tables grows to real-world numbers.

Figure 6: Data Vault Model of Student, Major, Department, and Department Fundraiser Participation (from Univ… OLTP and EDW - Week 2 with MD5 NoSQL Schema.dez)
Step 3: A business process change involves a source system schema change that will compromise the data warehouse if not quickly adapted.

 Change: University policy changes so that students may now have double majors, or a major and a minor, rather than just one major per student, which was the old business rule.
 The new source schema shown here correctly stores data from this new business process in the Student Records system, in contrast to the original schema in Figure 3a. (A sketch of the new schema follows the figure.)
  o Note: The retention of the old 'MajorID' field was chosen by the OLTP data architect, in addition to directing a re-load of all historical foreign key values into the new 'StudentMajorsMinor (v2)' table. Note: All '(v2)', etc. characters are for illustration only.

Figure 7: Student Records System Change: Double Majors and Minors (from University OLTP and EDW - Week 6 with MD5 NoSQL Schema.dez)
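A sketch of the schema change in Figure 7, with assumed column names and an assumed major-vs-minor encoding: the new bridge table turns the student-major relationship into many-to-many.

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE Major   (MajorID INTEGER PRIMARY KEY, MajorName TEXT);
CREATE TABLE Student (StudentID INTEGER PRIMARY KEY,
                      MajorID   INTEGER,            -- retained legacy column (see note)
                      FirstName TEXT, LastName TEXT);
-- New in v2: one row per student per major or minor
CREATE TABLE StudentMajorsMinor (
    StudentID INTEGER NOT NULL REFERENCES Student,
    MajorID   INTEGER NOT NULL REFERENCES Major,
    IsMinor   INTEGER NOT NULL DEFAULT 0,           -- 0 = major, 1 = minor (assumed)
    PRIMARY KEY (StudentID, MajorID)
);
""")
```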
Step 4a: Schema change for EDW in 3rd Normal Form

 Modified data model
  o Note: The 'Fundraiser' table, although not part of our initial four tables, was included in this figure simply for explanation, in order to provide business context to the fields associated with departmental fundraiser participation.

Figure 8: Modified 3rd Normal Form EDW Model
 Modified upstream ETL: Is the required effort for this small change acceptable?
  o Department_edw_3nf ETL processing is now dependent on changes to either the Student Records or Fundraising systems. If either source system delivers records with keys that are not matchable to the other system, the two choices are both complex.
      One choice is to either discard the unmatched record and load it in a subsequent increment, or hold that record in a control table.
      The other choice is to allow most fields to remain NULL and establish a complex ETL rule that does two things (a sketch of this pattern appears at the end of this section):
        Initially insert records with the fields available from one system, while leaving all fields from the other system NULL.
        Revisit all such records later with updates from the late-arriving data from the other system.
 Modified downstream schema, and ETL or queries for downstream reports, dashboards, perhaps a Star Schema
  o Is this required effort also acceptable for this small source schema change?
  o Problem: This set of unfortunate choices has a simple cause: the optimistic notion that the EDW schema should not only capture a given business process correctly when first implemented and loaded, but that this process, as well as related business processes, all now fixed into a transformed, tightly integrated schema, also will not change much over time. Whether or not that notion of relatively unchanging business processes was valid back in the 90's, when third-normal-form EDWs were conceived and popularized, the relevant question is really this: Is it a valid assumption now, in 2015?
  o Solution: Design Lean data warehouse models that are, along with their associated ETL code, easily adaptive to changing business processes, even including those business processes that may not reasonably be deeply understood at the time when source data is needed in the data warehouse, due to a lack of consistent stakeholder participation or of authoritative, up-to-date system documentation.
      With data vault, on the other hand, once we understand just the business keys, we are already past the 50-yard line.
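To make the cost of that second choice concrete, here is a minimal sketch, with hypothetical table and column names, of the insert-NULLs-then-revisit pattern it implies. Every attribute from the late-arriving system needs a correlated update pass, and the ETL must track which rows are still incomplete; this is exactly the stateful, conditional logic the Lean approach tries to avoid.

```python
import sqlite3

edw = sqlite3.connect(":memory:")
edw.executescript("""
CREATE TABLE Department_edw_3nf (
    DepartmentID   TEXT PRIMARY KEY,  -- from Student Records
    DepartmentName TEXT,
    PledgeTotal    REAL               -- from Fundraising; NULL until that feed arrives
);
-- Pass 1: Student Records arrives first; Fundraising columns are left NULL
INSERT INTO Department_edw_3nf VALUES ('BIO', 'Biology', NULL);

-- Pass 2 (a later batch): revisit incomplete rows once Fundraising data lands
CREATE TABLE stg_fundraising (DeptKey TEXT PRIMARY KEY, PledgeTotal REAL);
INSERT INTO stg_fundraising VALUES ('BIO', 50000);
UPDATE Department_edw_3nf
SET PledgeTotal = (SELECT s.PledgeTotal FROM stg_fundraising s
                   WHERE s.DeptKey = Department_edw_3nf.DepartmentID)
WHERE PledgeTotal IS NULL
  AND DepartmentID IN (SELECT DeptKey FROM stg_fundraising);
""")
print(edw.execute("SELECT * FROM Department_edw_3nf").fetchall())
```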
Step 4b: Schema change for Star Schema

Disclaimer: Dimensional models are less often chosen for the purpose of data management per se. As such, this section is more for reference than critique.

 Modified data model
 Modified upstream ETL
 Modified downstream queries or ETL for downstream objects

Figure 9: Star Schema after the change to the Student Records System to support double majors and minors (from University 3NF EDW and Dimensional - Week 1.dez)
Step 4c: Schema Changes for Data Vault. See Figure 10.

 No existing schema objects were modified.
 Existing ETL is completely untouched, except insofar as the loading of Sat_Student_Progress is discontinued, while that table remains in use for historical data prior to the business process change.
 Need one new schema object and its ETL code: Added Sat_Lnk_Student_Major as a dependent of Lnk_Student_Major_Department, which, as an associative table, already had the flexibility to support the current, or any future, combinations of granularity between students, majors and minors. (A sketch of this one addition follows the figure.)

Figure 10: Data Vault After Source System Mods (from Univ… OLTP and EDW - Week 6 Clean.dez)
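A sketch of the single addition in Step 4c, with assumed column names: one new satellite hung off the existing link, carrying the major-vs-minor designation, loaded insert-only. Nothing already in the vault is altered.

```python
import sqlite3

vault = sqlite3.connect(":memory:")
# Pre-existing link from the original build, shown only so the sketch runs standalone
vault.execute("""CREATE TABLE Lnk_Student_Major_Department (
    Student_Major_Dept_LKey INTEGER PRIMARY KEY,
    Student_HKey INTEGER NOT NULL,
    Major_HKey   INTEGER NOT NULL,
    Dept_HKey    INTEGER NOT NULL,
    LoadDTS      TEXT NOT NULL)""")

# The ONLY new object: a satellite dependent on the existing link
vault.execute("""CREATE TABLE Sat_Lnk_Student_Major (
    Student_Major_Dept_LKey INTEGER NOT NULL REFERENCES Lnk_Student_Major_Department,
    LoadDTS TEXT NOT NULL,
    IsMinor INTEGER,   -- the new business fact introduced by the policy change (assumed)
    PRIMARY KEY (Student_Major_Dept_LKey, LoadDTS))""")
```

Because the link was already associative, double majors simply arrive as additional link rows; the schema change is absorbed rather than refactored.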
Where do the following related processes fit with Data Vault and a Lean Data Warehouse?

 Conforming Dimensions and Facts: Downstream of Data Vault as a stored Dimensional Model, re-purposable and reloadable at any time from it. Virtualized dimensional models based on stored data vaults are an interesting, much-discussed topic, but I have not (as of early March 2015) seen a sensible virtual architecture that captures dimensional modeling best practice.
 Master Data Management
  o Closed-loop MDM, with data vault (as the main MDM data source) handily recording the good, the bad, and the ugly exactly as it exists, and as it is either improved over time, or not improved, in any given source system.
 Predictive Analytics: Data Vault's fidelity to source data, as well as its historization of it, should appeal to predictive analysts, who often avoid highly transformed data warehouses as data sources. The inclusion in Data Vault of the upstream-facing views may also assist the predictive analyst accustomed to source system data.
 BigData / NoSQL: Data Vault 2.0 (by D. Linstedt, and not otherwise covered in this article), with its MD5 hash fields as primary keys, offers an opportunity for massively parallel Data Vault table loading without the typical lookups of surrogate keys in a target's parent tables, which otherwise force sequential table loading. (A sketch of the hash-key idea follows below.)
  o Assumption: To do this, BigData records must have granularity loosely defined such that a business key does define each record, whether or not the remainder of attributes are -- from an RDBMS perspective -- oddly de-normalized or pivoted. As such, the BigData referred to here is more akin to an Apache Cassandra or Hive table than to a Hadoop file.
  o One trade-off with an MD5 primary key is, of course, that given the MD5 field's size and datatype (VarBinary in SQL Server), downstream ETL or queries will perform more slowly than with an Integer primary key used for joins. This, however, may be a worthwhile trade-off in accommodating the parallel loading of massive BigData.
  o If no BigData / NoSQL integration is expected, Data Vault 2.0's downstream ETL/query performance hit, from the required joins on the lengthy hash PKs, makes little sense.
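The hash-key idea in a minimal sketch: because the key is a deterministic MD5 of the business key, hubs, links and satellites can all compute their own keys independently and load in parallel, with no lookup into a parent table's surrogate-key sequence. Data Vault 2.0 prescribes specific hashing conventions; the delimiter and normalization below are assumptions for illustration only.

```python
import hashlib

def hash_key(*business_key_parts: str) -> bytes:
    """Deterministic MD5 over the (normalized) business key.
    Any loader, on any node, derives the same 16-byte key
    with no parent-table lookup."""
    normalized = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).digest()

# Hub key and link key computed independently, in any order, on any node
student_hkey = hash_key("S-1001")
major_hkey   = hash_key("MATH")
link_hkey    = hash_key("S-1001", "MATH")   # a link hashes the combined business keys
print(student_hkey.hex(), link_hkey.hex())
```

The 16-byte digest corresponds to the VarBinary(16) column mentioned above, which is also the source of the join-performance trade-off versus an Integer key.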
Future Topics:

 How to Load: Detailed demonstration of data vault loading-transformation logic.
 After Load:
  o Detailed demonstration of patterns for data vault views and queries for current or historical snapshots of OLTP data.
  o Detailed demonstration of patterns for data vault views, queries and point-in-time support tables to support downstream Star Schema loading transformations.

Data Vault Reference Books:

 Super Charge Your Data Warehouse, Daniel Linstedt, 2008-2011, CreateSpace Publishing
 The New Business SuperModel: The Business of Data Vault Data Modeling, 2nd Edition, Daniel Linstedt, Kent Graziano, Hans Hultgren, 2008-2009, Daniel Linstedt
 Modeling the Agile Data Warehouse with Data Vault, Hans Hultgren, 2012, Brighton Hamilton
DecisionLab.Net

Thank you. I value your feedback.

Daniel Upton
DecisionLab | http://www.decisionlab.net | dupton@decisionlab.net
Direct 760.525.3268
Carlsbad, California, USA