Introduction to Data Vault
        Modeling
              Kent Graziano
    Data Vault Master and Oracle ACE
          TrueBridge Resources
               OOW 2011
             Session #05923
My Bio
• Kent Graziano
   – Certified Data Vault Master
   – Oracle ACE (BI/DW)
   – Data Architecture and Data Warehouse Specialist
       • 30 years in IT
       • 20 years of Oracle-related work
       • 15+ years of data warehousing experience
   – Co-Author of
       • The Business of Data Vault Modeling (2008)
       • The Data Model Resource Book (1st Edition)
       • Oracle Designer: A Template for Developing an Enterprise
         Standards Document
   – Past-President of Oracle Development Tools User Group
     (ODTUG) and Rocky Mountain Oracle User Group
   – Co-Chair BIDW SIG for ODTUG

                              (C) Kent Graziano
What Is a Data Warehouse?

“A subject-oriented, integrated, time-variant,
non-volatile collection of data in support of
management’s decision making process.”
                       W.H. Inmon

“The data warehouse is where we publish
used data.”
                     Ralph Kimball

                  (C) Kent Graziano
Inmon’s Definition
• Subject oriented
   – Developed around logical data groupings (subject areas)
     not business functions
• Integrated
   – Common definitions and formats from multiple systems
• Time-variant
   – Contains historical view of data
• Non-volatile
   – Does not change over time
   – No updates

                           (C) Kent Graziano
Data Vault Definition
The Data Vault is a detail oriented, historical
tracking and uniquely linked set of normalized
tables that support one or more functional areas
of business.

It is a hybrid approach encompassing the best of
breed between 3rd normal form (3NF) and star
schema. The design is flexible, scalable, consistent,
and adaptable to the needs of the enterprise. It is a
data model that is architected specifically to meet
the needs of today’s enterprise data warehouses.

                       Dan Linstedt: Defining the Data Vault
                       TDAN.com Article



                      (C) TeachDataVault.com
Why Bother With Something New?
       Old Chinese proverb:
       'Unless you change direction, you're apt to
       end up where you're headed.'




                (C) TeachDataVault.com
Why do we need it?

• We have seen issues in constructing (and
  managing) an enterprise data warehouse model
  using 3rd normal form, or Star Schema.
   – 3NF – Complex PKs when cascading snapshot
     dates (time-driven PKs)
   – Star – difficult to re-engineer fact tables for
     granularity changes
• These issues lead to breakdowns in
  flexibility, adaptability, and even scalability

                        (C) Kent Graziano
Data Vault Time Line
(Timeline, 1960 to 2000)
• E.F. Codd invented relational modeling
• Chris Date and Hugh Darwen maintained and refined modeling
• Mid 60’s – Dimension & Fact Modeling presented by General Mills and Dartmouth University
• Early 70’s – Bill Inmon began discussing Data Warehousing
• Mid 70’s – AC Nielsen popularized Dimension & Fact terms
• 1976 – Dr Peter Chen created E-R Diagramming
• Mid 80’s – Bill Inmon popularizes Data Warehousing
• Mid to late 80’s – Dr Kimball popularizes Star Schema
• Late 80’s – Barry Devlin and Dr Kimball release “Business Data Warehouse”
• 1990 – Dan Linstedt begins R&D on Data Vault Modeling
• 2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling
                                          (C) TeachDataVault.com
Data Vault Evolution
• The work on the Data Vault approach began in the early
  1990s and was completed around 1999.
• Throughout 1999, 2000, and 2001, the Data Vault design was
  tested, refined, and deployed into specific customer sites.
• In 2002, the industry thought leaders were asked to review
  the architecture.
   – This is when I attended my first DV seminar in Denver and met Dan!
• In 2003, Dan began teaching the modeling techniques to the
  mass public.




                              (C) Kent Graziano
Data Vault Modeling…




      (C) TeachDataVault.com
Where does a Data Vault Fit?




          (C) TeachDataVault.com
Where does a Data Vault Fit?
Oracle’s Next Generation Data Warehouse Reference Architecture




                           Data Vault goes here

                         (C) Oracle Corp
3 Simple Structures




     (C) TeachDataVault.com
Hub and Spoke = Scalability




                      http://www.nature.com/ng/journal/v29/n2/full/ng1001-105.html

If nature uses Hub & Spoke, why shouldn’t we?
        Genetics scale to billions of cells,
    the Data Vault scales to Billions of records
                   (C) TeachDataVault.com
Hubs = Neurons

                                 Hub




 Very similar to a neural network,
The Hubs create the base structure

        (C) TeachDataVault.com
Links = Dendrite + Synapse




             In neural networks,
Dendrites & Synapses fire to pass messages,
The Links dictate associations, connections
                (C) TeachDataVault.com
Satellites = Memories




       Perception, understanding and processing
            These all describe the memory
Satellites house descriptors that can change over time

                  (C) TeachDataVault.com
National Drug Codes + Orange Book of Drug Patent Applications

A WORKING EXAMPLE
            http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm
            http://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm

                           (C) TeachDataVault.com
1. Hub = Business Keys
                                                    Product Number
Drug Label Code

                               NDA Application #
           Firm Name
                                                    Dose Form Code

                  Drug Listing
                                              Patent Number
   Patent Use Code


       Hubs = Unique Lists of Business Keys
           Business Keys are used to
       TRACK and IDENTIFY key information

                     (C) TeachDataVault.com
Business Keys = Ontology
Business Keys should be arranged in an ontology
in order to learn the dependencies of the data set.

Firm Name
    Drug Listing
        Product Number
        Dose Form Code
        NDA Application #
            Drug Label Code
                Patent Number
                    Patent Use Code

NOTE: Different Ontologies represent different views of the data!
                            (C) TeachDataVault.com
Hub Entity
          A Hub is a list of unique business keys.
         Hub Structure         Hub Product
         Primary Key           Product Sequence ID
         <Business Key>        Product Number          (Unique Index / Primary Index)
         Load DTS              Product Load DTS
         Record Source         Prod Record Source

Note:
• A Hub’s Business Key is a unique index.
• A Hub’s Load Date represents the FIRST TIME the EDW saw the data.
• A Hub’s Record Source represents the “Master” data source (on collisions); if no master
  is available, it holds the origination source of the actual key.
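
A minimal sketch of the Hub Product structure above, using Python's built-in sqlite3 module. The table and column names are illustrative assumptions, not taken from the slides, and dates are written in ISO form so they sort correctly as text. It shows the business key acting as a unique index: a second arrival of the same key is rejected, so the stored load date stays the first one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hub_product (
        product_sqn    INTEGER PRIMARY KEY,   -- surrogate sequence ID
        product_number TEXT NOT NULL UNIQUE,  -- business key (unique index)
        load_dts       TEXT NOT NULL,         -- first time the EDW saw the key
        record_source  TEXT NOT NULL          -- master or originating source
    )
    """
)
conn.execute(
    "INSERT INTO hub_product (product_number, load_dts, record_source) "
    "VALUES ('MFG-PRD123456', '2000-06-01', 'MANUFACT')"
)

# A second insert of the same business key violates the unique index.
try:
    conn.execute(
        "INSERT INTO hub_product (product_number, load_dts, record_source) "
        "VALUES ('MFG-PRD123456', '2001-01-01', 'FINANCE')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

hub_rows = conn.execute("SELECT * FROM hub_product").fetchall()
```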



                                   (C) TeachDataVault.com
Business Keys
• What exactly are Business Keys?
  – Example 1:
       • Siebel has a “system generated” customer key
       • Oracle Financials has a “system generated” customer key
       • These are not business keys. These are keys used by each respective
         system to track records.
   – Example 2:
       • Siebel tracks customer name and address as unique elements.
       • Oracle Financials tracks name and address as unique elements.
       • These are business keys.
• What we want in the hub, are sets of natural business keys
  that uniquely identify the data – across systems.
• Stay away from “system generated” keys if possible.
   – System Generated keys will cause damage in the integration cycle if they are
     not unique across the enterprise.
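
The integration risk can be sketched with a small example. All rows, names, and the `sys_id` field below are hypothetical; the point is only that two systems can issue the same system-generated number to different customers, while the natural business key (here name plus address) identifies the customer across systems.

```python
# Hypothetical extracts from two source systems. The sys_id values are
# system-generated; (name, address) is the natural business key.
siebel = [
    {"sys_id": 1001, "name": "ABC Suppliers", "address": "1 Main St"},
    {"sys_id": 1002, "name": "WorldPart", "address": "9 Elm Ave"},
]
oracle_financials = [
    {"sys_id": 1001, "name": "WorldPart", "address": "9 Elm Ave"},
    {"sys_id": 1002, "name": "DEF Inc", "address": "5 Oak Rd"},
]


def business_key(row):
    """Return the natural business key for a source row."""
    return (row["name"], row["address"])


# Integrating on sys_id would merge different customers that share a number.
siebel_by_id = {r["sys_id"]: business_key(r) for r in siebel}
conflicting_ids = sorted(
    r["sys_id"]
    for r in oracle_financials
    if r["sys_id"] in siebel_by_id and siebel_by_id[r["sys_id"]] != business_key(r)
)

# Integrating on the business key yields one hub entry per real customer.
hub_keys = sorted({business_key(r) for r in siebel + oracle_financials})
```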

                                (C) TeachDataVault.com
Hub Definition
• What Makes a Hub Key?
   – A Hub is based on an identifiable business key.
   – An identifiable business key is an attribute that is used in
     the source systems to locate data.
   – The business key has a very low propensity to change, and
     usually is not editable on the source systems.
   – The business key has the same semantic meaning, and the
     same granularity across the company, but not necessarily
     the same format.
• Attributes and Ordering
   – All attributes are mandatory.
   – Sequence ID 1st, Business Key 2nd, Load Date 3rd, Record
     Source last (4th).
   – All attributes in the Business Key form a UNIQUE Index.


                         (C) TeachDataVault.com
The technical objective of the Hub is to:
• Uniquely list all possible business keys, good, bad, or indifferent, regardless of
  where they originated.
• Tie the business keys in a 1:1 ratio with surrogate keys (giving
  meaning to the surrogate generated sequences).
• Provide a consolidation and attribution layer for clear horizontal
  definition of the business functionality.
• Track the arrival of data, the first time it appears in the warehouse.
• Provide right-time / real-time systems the ability to load
  transactions without descriptive data.
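
One way to meet these objectives is an insert-if-new hub load. The sketch below assumes a `hub_customer` table and a `load_hub` helper (both hypothetical names) and relies on SQLite's `INSERT OR IGNORE`, so a key that is already present keeps its original load date and record source. Only keys are loaded; no descriptive data is required.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hub_customer (
        customer_sqn    INTEGER PRIMARY KEY,
        customer_number TEXT NOT NULL UNIQUE,
        load_dts        TEXT NOT NULL,
        record_source   TEXT NOT NULL
    )
    """
)


def load_hub(conn, business_keys, load_dts, record_source):
    """Insert only business keys the hub has not seen; existing rows are untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer "
        "(customer_number, load_dts, record_source) VALUES (?, ?, ?)",
        [(key, load_dts, record_source) for key in business_keys],
    )


load_hub(conn, ["ABC123456"], "2000-10-12", "MANUFACT")
# The second batch repeats one key and brings one new key.
load_hub(conn, ["ABC123456", "DKEF"], "2001-01-25", "CONTRACTS")

hub_rows = conn.execute(
    "SELECT customer_sqn, customer_number, load_dts, record_source "
    "FROM hub_customer ORDER BY customer_sqn"
).fetchall()
```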




                             (C) TeachDataVault.com
Hub Table Structures




               SQN = Sequence (insertion order)
   LDTS = Load Date (when the Warehouse first sees the data)
RSRC = Record Source (System + App where the data ORIGINATED)
                    (C) TeachDataVault.com
Sample Hub Product
         ID    PRODUCT #              LOAD DTS            RCRD SRC
          1    MFG-PRD123456          6-1-2000            MANUFACT
          2    P1235                  6-2-2000            CONTRACTS
          3    *P1235                 2-15-2001           CONTRACTS
          4    MFG-1235               5-17-2001           MANUFACT
          5    1235-MFG               7-14-2001           FINANCE
          6    1235                   10-13-2001          FINANCE
          7    PRD128582              4-12-2002           MANUFACT
          8    PRD125826              4-12-2002           MANUFACT
          9    PRD128256              4-12-2002           MANUFACT
          10   PRD929929-*            4-12-2002           MANUFACT

                        Unique
                        Index
Notes:
• ID is the surrogate sequence number (Primary Key)
• What does the load date tell you?
• Do you notice any overloaded uses for the product number?
• Are there similar keys from different systems?
• Can you spot entry errors?
• Are any patterns visually present?
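
One possible way to explore these questions is to profile the keys. The sketch below groups the sample product numbers by their first run of digits; this grouping rule is an assumption chosen for illustration, not a Data Vault rule, and the hub itself is left unchanged. Keys that share a numeric core across sources are candidates for review.

```python
import re
from collections import defaultdict

# (ID, PRODUCT #, RCRD SRC) from the sample Hub Product table.
hub_product = [
    (1, "MFG-PRD123456", "MANUFACT"),
    (2, "P1235", "CONTRACTS"),
    (3, "*P1235", "CONTRACTS"),
    (4, "MFG-1235", "MANUFACT"),
    (5, "1235-MFG", "FINANCE"),
    (6, "1235", "FINANCE"),
    (7, "PRD128582", "MANUFACT"),
    (8, "PRD125826", "MANUFACT"),
    (9, "PRD128256", "MANUFACT"),
    (10, "PRD929929-*", "MANUFACT"),
]


def numeric_core(product_number):
    """Return the first run of digits in the key, or None if there is none."""
    match = re.search(r"\d+", product_number)
    return match.group() if match else None


candidates = defaultdict(list)
for _sqn, number, source in hub_product:
    candidates[numeric_core(number)].append((number, source))

# Numeric cores that appear under more than one business key.
suspects = {core: keys for core, keys in candidates.items() if len(keys) > 1}
```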

                                 (C) TeachDataVault.com
2. Links = Associations
 Firms Generate                      Firms Generate
     Labels                          Product Listings


                                                    Listings Contain
       Firms Manufacture                             Labeler Codes
            Products


                             Listings for Products
                           are in NDA Applications



    Links = Transactions and Associations
   They are used to hook together multiple
        sets of information (i.e., Hubs)

                  (C) TeachDataVault.com
Associations = Ontological Hooks
   Firm Name

    Firms Generate
    Product Listings              Drug Listing

                         Firms Manufacture
                                                 Product Number
                              Products

                         Listings for Products
                                                 NDA Application #
                       are in NDA Applications


    Business Keys are associated by many
    linking factors; these links comprise the
          associations in the hierarchy.


                (C) TeachDataVault.com
Link Definitions

• What Makes a Link?
   – A Link is based on identifiable business element
     relationships.
      • Otherwise known as a foreign key,
      • AKA a business event or transaction between business keys,
   – The relationship shouldn’t change over time
      • It is established as a fact that occurred at a specific point in time and will
        remain that way forever.
   – The link table may also represent a hierarchy.
• Attributes
   – All attributes are mandatory


                                (C) TeachDataVault.com
Link Entity
            A Link is an intersection of business keys.
         It can contain Hub Keys and Other Link Keys.
     Link Structure               Link Line-Item
     Primary Key                  Link Line Item Sequence ID
     {Hub Surrogate Keys 1..N}    Hub Product Sequence ID, Hub Order Sequence ID    (Unique Index / Primary Index)
     Load DTS                     Load DTS
     Record Source                Record Source

Note:
• A Link’s Business Key is a Composite Unique Index
• A Link’s Load Date represents the FIRST TIME the EDW saw the relationship.
• A Link’s Record Source represents: First – the “Master” data source (on collisions), if
  not available, it holds the origination source of the actual key.
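
A minimal sketch of the Link Line-Item structure above in SQLite. Table and column names are assumptions made for illustration. The composite unique index covers the two hub surrogate keys, and foreign keys tie the link to its hubs (SQLite enforces them only after `PRAGMA foreign_keys = ON`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE hub_product (
        product_sqn    INTEGER PRIMARY KEY,
        product_number TEXT NOT NULL UNIQUE,
        load_dts       TEXT NOT NULL,
        record_source  TEXT NOT NULL
    );
    CREATE TABLE hub_order (
        order_sqn      INTEGER PRIMARY KEY,
        order_number   TEXT NOT NULL UNIQUE,
        load_dts       TEXT NOT NULL,
        record_source  TEXT NOT NULL
    );
    CREATE TABLE link_line_item (
        line_item_sqn  INTEGER PRIMARY KEY,
        product_sqn    INTEGER NOT NULL REFERENCES hub_product (product_sqn),
        order_sqn      INTEGER NOT NULL REFERENCES hub_order (order_sqn),
        load_dts       TEXT NOT NULL,
        record_source  TEXT NOT NULL,
        UNIQUE (product_sqn, order_sqn)  -- composite unique index on the hub keys
    );
    INSERT INTO hub_product VALUES (100, 'PRD128582', '2000-10-14', 'MFG');
    INSERT INTO hub_order VALUES (1, 'ORD0001', '2000-10-12', 'MFG');
    INSERT INTO link_line_item VALUES (1000, 100, 1, '2000-10-14', 'FINANCE');
    """
)


def try_insert_link(row):
    """Return True if the link row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO link_line_item VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


same_pair_again = try_insert_link((1001, 100, 1, "2000-10-15", "FINANCE"))
unknown_hub_key = try_insert_link((1002, 999, 1, "2000-10-15", "FINANCE"))
link_count = conn.execute("SELECT COUNT(*) FROM link_line_item").fetchone()[0]
```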


                                      (C) TeachDataVault.com
Modeling Links - 1:1 or 1:M?
• Today:
   – Relationship is a 1:1 so why model a Link?
• Tomorrow:
   – The business rule can change to a 1:M.
   – You discover new data later.
• With a Link in the Data Vault:
   – No need to change the EDW structure.
   – Existing data is fine.
   – New data is added.
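
The point can be illustrated with plain data. In the sketch below a link is just a set of (left hub key, right hub key) pairs; the key values and the `max_fanout` helper are hypothetical. When the business rule moves from 1:1 to 1:M, a new row is added and nothing about the structure changes.

```python
from collections import Counter


def max_fanout(link_rows):
    """Largest number of right-hand keys linked to any one left-hand key."""
    counts = Counter(left for left, _right in link_rows)
    return max(counts.values(), default=0)


# Today: every left-hand key relates to exactly one right-hand key (1:1).
link_rows = {(1, 10), (2, 20)}
fanout_today = max_fanout(link_rows)

# Tomorrow: the rule changes, and key 1 relates to a second right-hand key.
# The existing rows are untouched; only a new row is added.
link_rows.add((1, 30))
fanout_tomorrow = max_fanout(link_rows)
```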


                              (C) Kent Graziano
Link Table Structures




               SQN = Sequence (insertion order)
   LDTS = Load Date (when the Warehouse first sees the data)
RSRC = Record Source (System + App where the data ORIGINATED)
                    (C) TeachDataVault.com
Sample Link Entity - Relationship
Hub Customer
CSID     CUST #        LOAD DTS         RCRD SRC
  1      ABC123456     10-12-2000       MFG
  2      DKEF          1-25-2001        CONTRACTS

Hub Order
OrdID    ORDER #       LOAD DTS         RCRD SRC
  1      ORD0001       10-12-2000       MFG
  2      ORD0002       10-2-2000        CONTRACTS

Link Cust Order
LSEQID     CSID    OrdID   LOAD DTS     RCRD SRC
  1000       1     1       10-14-2000   FINANCE
  1001       1     2       10-14-2000   FINANCE

Link Order-Details
LSEQID     OrdID   PID     LIT     LOAD DTS      RCRD SRC
  1000       1     100     1       10-14-2000    FINANCE
  1001       1     101     2       10-14-2000    FINANCE

Hub Product
PID      PRODUCT #     LOAD DTS         RCRD SRC
  100    PRD128582     10-14-2000       MFG
  101    PRD128256     10-14-2000       MFG

Satellites shown in the diagram: Order Satellite, Order Details Satellite, Product Satellite




                                                     (C) Kent Graziano
Sample Link Entity - Hierarchy
Link Customer Rollup
From CSID   To CSID   LOAD DTS     RCRD SRC
 1          NULL      10-14-2000   FINANCE
 2          1         10-22-2000   FINANCE
 3          1         2-15-2001    FINANCE
 4          2         4-3-2001     HR
 5          2         6-4-2001     SALES

Hub Customer
ID    CUSTOMER #    LOAD DTS     RCRD SRC
 1    ABC123456     10-12-2000   MANUFACT
 2    ABC925_24FN   10-22-2000   CONTRACTS
 3    DKEF          1-25-2001    CONTRACTS
 4    KKO92854_dd   3-7-2001     CONTRACTS
 5    LLOA_82J5J    6-4-2001     SALES
 6    HUJI_BFIOQ    8-3-2001     SALES
 7    PPRU_3259     2-2-2002     FINANCE
 8    PAFJG2895     2-2-2002     CONTRACTS
 9    929ABC2985    2-2-2002     CONTRACTS
 10   93KFLLA       2-2-2002     CONTRACTS

 Note:
 • If you have logic – you can roll together customers, or companies, or sub-assemblies,
   bill of materials, etc.
 • We do not want to disturb the facts (underlying data in the hub), but we do want to
   rearrange hierarchies at different points over time.
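
As a sketch, the Link Customer Rollup rows can be read as a child-to-parent map and walked to the top of the hierarchy. The `path_to_root` helper is a hypothetical name; the From/To values come from the sample table, with NULL represented as `None`. The hub rows are never touched; only the link describes the hierarchy.

```python
# From CSID -> To CSID, taken from the Link Customer Rollup sample.
rollup = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}


def path_to_root(csid, parent_of):
    """Return the chain of customer IDs from csid up to the top-level customer."""
    path = [csid]
    seen = {csid}
    while parent_of.get(path[-1]) is not None:
        parent = parent_of[path[-1]]
        if parent in seen:
            raise ValueError(f"cycle detected at customer {parent}")
        seen.add(parent)
        path.append(parent)
    return path


path_for_4 = path_to_root(4, rollup)
path_for_1 = path_to_root(1, rollup)
```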



                                        (C) Kent Graziano
Link To Link (Link Sale Component)
                                                                  Sat Totals
                                                 Hub Invoice
  Link
                                                                  Sat Dates
 Product
Hierarchy


             Hub                                   Link Sale                     Hub
            Product                                Line Item                   Customer

    Sat
  Product                    Link Sale               Sat                 Sat            Sat
   Desc.                    Component              Quantity           Cust Active     Address
                                                  Sub-Totals

Note:
• Link Sale Component provides a shift in grain.
• Link Sale Component allows for configurable options of products tracked on a single line-item
   product sold.
• Link Sale Component provides for sub-assembly tracking.


                                         (C) Kent Graziano
3. Satellites = Descriptors
     Firm                                               Patent
   Locations                                        Expiration Info
                           Listing
                        Formulation


                                   Listing Medication
      Product                            Dosages
    Ingredients
                                                        Drug Packaging
                                                            Types



            Satellites = Descriptors
 These data provide context for the keys (Hubs)
        And for the associations (Links)

                  (C) TeachDataVault.com
Satellite Definitions
• What Makes a Satellite?
   – A Satellite is based on non-identifying business elements.
       • Attributes that are descriptive data, often in the source systems known as
         descriptions, or free-form entry, or computed elements.
   – The Satellite data changes, sometimes rapidly, sometimes
     slowly.
       • The Satellites are separated by type of information and rate of change.
   – The Satellite is dependent on the Hub or Link key as a parent,
       • Satellites are never dependent on more than one parent table.
       • The Satellite is never a parent table to any other table (no snowflaking).

• Attributes and Ordering
   – All attributes are mandatory – EXCEPT END DATE.
   – Parent ID 1st, Load Date 2nd, Load End Date 3rd, Record Source
     Last.
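
A minimal SQLite sketch of a Satellite following these rules. Names are illustrative assumptions. The Satellite has exactly one parent (a foreign key to the Hub), its primary key is the parent ID plus the load date, and the load end date is the only nullable column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE hub_customer (
        customer_sqn    INTEGER PRIMARY KEY,
        customer_number TEXT NOT NULL UNIQUE,
        load_dts        TEXT NOT NULL,
        record_source   TEXT NOT NULL
    );
    CREATE TABLE sat_customer_name (
        customer_sqn  INTEGER NOT NULL REFERENCES hub_customer (customer_sqn),
        load_dts      TEXT NOT NULL,
        load_end_dts  TEXT,              -- the only optional attribute
        name          TEXT NOT NULL,
        record_source TEXT NOT NULL,
        PRIMARY KEY (customer_sqn, load_dts)
    );
    INSERT INTO hub_customer VALUES (1, 'ABC123456', '2000-10-12', 'MANUFACT');
    INSERT INTO sat_customer_name VALUES
        (1, '2000-10-12', NULL, 'ABC Suppliers', 'MANUFACT');
    """
)


def try_insert_sat(row):
    """Return True if the satellite row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO sat_customer_name VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


new_version = try_insert_sat((1, "2000-10-14", None, "ABC Suppliers, Inc", "MANUFACT"))
same_load_date = try_insert_sat((1, "2000-10-14", None, "Other Name", "MANUFACT"))
orphan_row = try_insert_sat((99, "2000-10-14", None, "No Parent", "MANUFACT"))
```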


                                 (C) TeachDataVault.com
Descriptors = Context
                           Firm
Firm Name
                         Locations

Firms Generate                                     Listing
Product Listings                Drug Listing
                                                Formulation

                     Firms Manufacture
                                               Product Number
                          Products
                                                   Product
                        Start & End of           Ingredients
                        manufacturing



      Context specific point in time
         warehousing portion

                   (C) TeachDataVault.com
Satellite Entity
A Satellite is a time-dimensional table housing detailed information
               about the Hub’s or Link’s business keys.


  Satellite Structure       Example (Customer)
  Hub Primary Key           Customer #
  Load DTS                  Load DTS
  Extract DTS               Extract DTS
  Load End Date             Load End Date
  Detail Business Data      Customer Name, Customer Addr1, Customer Addr2
  <Aggregation Data>
  {Update User}             {Update User}
  {Update DTS}              {Update DTS}
  Record Source             Record Source

• Satellites are defined by TYPE of data and RATE OF CHANGE
• Mathematically – this reduces redundancy and decreases storage requirements over
  time (compared to a Star Schema)



                           (C) TeachDataVault.com
Satellite Entity- Details
• A Satellite has only 1 foreign key; it is dependent on the
  parent table (Hub or Link)
• A Satellite may or may not have an “Item Numbering”
  attribute.
• A Satellite’s Load Date represents the date the EDW saw
  the data (must be a delta set).
   – This is not Effective Date from the Source!
• A Satellite’s Record Source represents the actual source
  of the row (unit of work).
• To avoid Outer Joins, you must ensure that every
  satellite has at least 1 entry for every Hub Key.
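
The "delta set" requirement can be sketched as a load routine that appends a row only when the descriptive attributes differ from the parent's latest row. The `load_satellite` helper and the dictionary layout are assumptions for illustration; the load date here is the warehouse's date, not a source effective date.

```python
def load_satellite(rows, parent_sqn, load_dts, attributes, record_source):
    """Append a satellite row only when the attributes changed (a delta).

    Returns True if a row was added.
    """
    history = [r for r in rows if r["parent_sqn"] == parent_sqn]
    if history:
        latest = max(history, key=lambda r: r["load_dts"])
        if latest["attributes"] == attributes:
            return False  # nothing changed, so nothing is stored
    rows.append(
        {
            "parent_sqn": parent_sqn,
            "load_dts": load_dts,
            "attributes": attributes,
            "record_source": record_source,
        }
    )
    return True


sat_customer_name = []
results = [
    load_satellite(sat_customer_name, 1, "2000-10-12", {"name": "ABC Suppliers"}, "MANUFACT"),
    # Same name arrives again the next day: not a delta, so it is skipped.
    load_satellite(sat_customer_name, 1, "2000-10-13", {"name": "ABC Suppliers"}, "MANUFACT"),
    load_satellite(sat_customer_name, 1, "2000-10-14", {"name": "ABC Suppliers, Inc"}, "MANUFACT"),
]
```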


                               (C) TeachDataVault.com
Satellite Table Structures




           SQN = Sequence (parent identity number)
   LDTS = Load Date (when the Warehouse first sees the data)
        LEDTS = End of lifecycle for superseded record
RSRC = Record Source (System + App where the data ORIGINATED)
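
A sketch of how LEDTS could be derived, assuming a superseded row ends exactly when the next row for the same parent was loaded (the slides do not state the exact rule, and some implementations subtract a small interval). The `with_load_end_dates` helper is a hypothetical name; the current row keeps an open end date, shown as `None`.

```python
from itertools import groupby
from operator import itemgetter


def with_load_end_dates(rows):
    """Take (parent_sqn, load_dts) pairs and return (parent_sqn, load_dts, load_end_dts)."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for _parent, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Each row ends at the next row's load date; the last row stays open.
        next_dates = [row[1] for row in group[1:]] + [None]
        result.extend(
            (parent, ldts, ledts) for (parent, ldts), ledts in zip(group, next_dates)
        )
    return result


end_dated = with_load_end_dates(
    [(1, "2000-10-12"), (2, "2000-10-02"), (1, "2000-10-14")]
)
```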
                      (C) TeachDataVault.com
Satellite Entity – Hub Related
             Hub Customer          ID      CUSTOMER #            LOAD DTS     RCRD SRC
                                       0   N/A                   10-12-2000   SYSTEM
                                       1   ABC123456             10-12-2000   MANUFACT
                                       2   ABC925_24FN           10-2-2000    CONTRACTS

                                       3   ABC5525-25            10-1-2000    FINANCE



                                       CUSTOMER NAME SATELLITE
                    CSID   LOAD DTS         NAME                               RCRD SRC
                      0    10-12-2000       N/A                                SYSTEM
                      1    10-12-2000       ABC Suppliers                      MANUFACT
                      1    10-14-2000       ABC Suppliers, Inc                 MANUFACT
                      1    10-31-2000       ABC Worldwide Suppliers, Inc       MANUFACT
                      1    12-2-2000        ABC DEF Incorporated               CONTRACTS
                      2    10-2-2000        WorldPart                          CONTRACTS
                      2    10-14-2000       Worldwide Suppliers Inc            CONTRACTS
                      3    10-1-2000        N/A                                FINANCE

Dummy satellite record eliminates need for outer joins during extract.
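
The effect of the dummy record can be sketched with the sample rows loaded into SQLite (table and column names are assumptions, dates rewritten in ISO form). Because customer 3 has an N/A satellite row, a plain inner join returns every hub key with its latest name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hub_customer (
        csid INTEGER PRIMARY KEY, customer_number TEXT,
        load_dts TEXT, record_source TEXT
    );
    CREATE TABLE sat_customer_name (
        csid INTEGER, load_dts TEXT, name TEXT, record_source TEXT,
        PRIMARY KEY (csid, load_dts)
    );
    """
)
conn.executemany(
    "INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
    [
        (0, "N/A", "2000-10-12", "SYSTEM"),
        (1, "ABC123456", "2000-10-12", "MANUFACT"),
        (2, "ABC925_24FN", "2000-10-02", "CONTRACTS"),
        (3, "ABC5525-25", "2000-10-01", "FINANCE"),
    ],
)
conn.executemany(
    "INSERT INTO sat_customer_name VALUES (?, ?, ?, ?)",
    [
        (0, "2000-10-12", "N/A", "SYSTEM"),
        (1, "2000-10-12", "ABC Suppliers", "MANUFACT"),
        (1, "2000-10-14", "ABC Suppliers, Inc", "MANUFACT"),
        (1, "2000-10-31", "ABC Worldwide Suppliers, Inc", "MANUFACT"),
        (1, "2000-12-02", "ABC DEF Incorporated", "CONTRACTS"),
        (2, "2000-10-02", "WorldPart", "CONTRACTS"),
        (2, "2000-10-14", "Worldwide Suppliers Inc", "CONTRACTS"),
        (3, "2000-10-01", "N/A", "FINANCE"),  # dummy row for a key with no name yet
    ],
)

# Inner join only: no hub key is lost, because each has at least one satellite row.
latest_names = conn.execute(
    """
    SELECT h.customer_number, s.name
    FROM hub_customer h
    JOIN sat_customer_name s ON s.csid = h.csid
    WHERE s.load_dts = (
        SELECT MAX(load_dts) FROM sat_customer_name WHERE csid = h.csid
    )
    ORDER BY h.csid
    """
).fetchall()
```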



                                       (C) Kent Graziano
Satellite Entity – Link Related
        Link Order Details          ID    Product ID      OrdID    LOAD DTS     RCRD SRC
                                     0    0               0        10-12-2000   SYSTEM
                                     1    PRD102          1        10-12-2000   MANUFACT
                                     2    PRD103          1        10-2-2000    CONTRACTS



                                                    Satellite Order Totals
                    ID       LOAD DTS         Tax        Total    RCRD SRC
                         0   10-12-2000       <NULL>     <NULL>   SYSTEM
                         1   10-12-2000       3.00       0.00     MANUFACT
                         1   10-14-2000       4.00       12.00    MANUFACT
                         1   10-31-2000       3.69       14.02    MANUFACT
                         1   12-2-2000        4.69       13.69    CONTRACTS
                         2   10-2-2000        2.45       10.00    CONTRACTS
                         2   10-14-2000       1.22       14.00    CONTRACTS

Dummy satellite record eliminates need for outer joins during extract.




                                     (C) Kent Graziano
Satellite Splits – Type of Information
                                                     ID    CUSTOMER #           LOAD DTS        RCRD SRC
                    Hub Customer                      0    N/A                  10-12-2000      SYSTEM
                                                      1    ABC123456            10-12-2000      MANUFACT
                                                      2    ABC925_24FN          10-2-2000       CONTRACTS
                                                      3    ABC5525-25           10-1-2000       FINANCE




                                                     CUSTOMER SATELLITE
CSID   LOAD DTS       NAME                                Contact   Sales Rgn   Cust Score   RCRD SRC
  0    10-12-2000     N/A                                 N/A       N/A         0            SYSTEM

  1    10-12-2000     ABC Suppliers                       Jen F.    SE          102          MANUFACT
  1    10-14-2000     ABC Suppliers, Inc                  Jen F.    SE          120          MANUFACT

  1    10-31-2000     ABC Worldwide Suppliers, Inc        Jen F.    SE          130          MANUFACT
  1    12-2-2000      ABC DEF Incorporated                Jack J.   SC          85           CONTRACTS

  2    10-2-2000      WorldPart                           Jenny     SE          99           CONTRACTS

  2    10-14-2000     Worldwide Suppliers Inc             Jenny     SE          102          CONTRACTS

  3    10-1-2000      N/A                                 N/A       N/A         0            FINANCE




                                                      (C) Kent Graziano
Satellite Splits – Type of Information
                                ID   CUSTOMER #      LOAD DTS     RCRD SRC
             Hub Customer        0   N/A             10-12-2000   SYSTEM
                                 1   ABC123456       10-12-2000   MANUFACT
                                 2   ABC925_24FN     10-2-2000    CONTRACTS
                                 3   ABC5525-25      10-1-2000    FINANCE




   Customer Name Satellite   Customer Sales Satellite
        (name Info)               (Sales Info)

• Because the type of information is different, we split the logical groups
  into multiple Satellites.
• This provides sheer flexibility in representation of the information.
• We may have one more problem with Rate Of Change…



                                 (C) Kent Graziano
Satellite Splits – Rate of Change
                                                     ID    CUSTOMER #           LOAD DTS        RCRD SRC
                    Hub Customer                      0    N/A                  10-12-2000      SYSTEM
                                                      1    ABC123456            10-12-2000      MANUFACT
                                                      2    ABC925_24FN          10-2-2000       CONTRACTS
                                                      3    ABC5525-25           10-1-2000       FINANCE




                                                     CUSTOMER SATELLITE
CSID   LOAD DTS       NAME                                Contact   Sales Rgn   Cust Score   RCRD SRC
  0    10-12-2000     N/A                                 N/A       N/A         0            SYSTEM

  1    10-12-2000     ABC Suppliers                       Jen F.    SE          102          MANUFACT
  1    10-14-2000     ABC Suppliers, Inc                  Jen F.    SE          120          MANUFACT

  1    10-31-2000     ABC Worldwide Suppliers, Inc        Jen F.    SE          130          MANUFACT
  1    12-2-2000      ABC DEF Incorporated                Jack J.   SC          85           CONTRACTS

  2    10-2-2000      WorldPart                           Jenny     SE          99           CONTRACTS

  2    10-14-2000     Worldwide Suppliers Inc             Jenny     SE          102          CONTRACTS

  3    10-1-2000      N/A                                 N/A       N/A         0            FINANCE




                                                      (C) Kent Graziano
Satellite Splits – Rate of Change
Hub Customer
   ID   CUSTOMER #      LOAD DTS     RCRD SRC
    0   N/A             10-12-2000   SYSTEM
    1   ABC123456       10-12-2000   MANUFACT
    2   ABC925_24FN     10-2-2000    CONTRACTS
    3   ABC5525-25      10-1-2000    FINANCE

Satellites on Hub Customer:
   Customer Name Satellite (name Info)
   Customer Sales Satellite (Sales Info)
   Customer Scoring Satellite


• Assume the data to score customers begins arriving in the warehouse
  every 5 minutes… We then separate the scoring information from the
  rest of the satellites.
• IF we end up with data that (over time) doesn’t change as much as we
  thought, we can always re-combine Satellites to eliminate joins.
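
Whether a split (or a re-combination) is worthwhile depends on how often each attribute actually changes. As a rough profiling sketch, the code below counts value changes per attribute in the combined customer satellite sample; the `count_changes` helper is a hypothetical name and the counts are only one input to the design decision.

```python
# (CSID, load date, name, contact, sales region, score) from the combined
# customer satellite sample, with dates rewritten in ISO form.
rows = [
    (1, "2000-10-12", "ABC Suppliers", "Jen F.", "SE", 102),
    (1, "2000-10-14", "ABC Suppliers, Inc", "Jen F.", "SE", 120),
    (1, "2000-10-31", "ABC Worldwide Suppliers, Inc", "Jen F.", "SE", 130),
    (1, "2000-12-02", "ABC DEF Incorporated", "Jack J.", "SC", 85),
    (2, "2000-10-02", "WorldPart", "Jenny", "SE", 99),
    (2, "2000-10-14", "Worldwide Suppliers Inc", "Jenny", "SE", 102),
]
attributes = ["name", "contact", "sales_rgn", "cust_score"]


def count_changes(rows):
    """Count, per attribute, how often the value differs from the previous row."""
    changes = {attr: 0 for attr in attributes}
    previous = {}
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        csid, values = row[0], row[2:]
        if csid in previous:
            for attr, old, new in zip(attributes, previous[csid], values):
                if old != new:
                    changes[attr] += 1
        previous[csid] = values
    return changes


change_counts = count_changes(rows)
```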

                               (C) Kent Graziano
Satellites Split By Source System
SAT_SALES_CUST               SAT_FINANCE_CUST                 SAT_CONTRACTS_CUST
PARENT SEQUENCE               PARENT SEQUENCE                 PARENT SEQUENCE
LOAD DATE                     LOAD DATE                       LOAD DATE
<LOAD-END-DATE>               <LOAD-END-DATE>                 <LOAD-END-DATE>
<RECORD-SOURCE>               <RECORD-SOURCE>                 <RECORD-SOURCE>
Name                          First Name                      Contact Name
Phone Number                  Last Name                       Contact Email
Best time of day to reach     Guardian Full Name              Contact Phone Number
Do Not Call Flag              Co-Signer Full Name
                              Phone Number
                              Address
                              City
                              State/Province
                              Zip Code



                            Satellite Structure
                            PARENT SEQUENCE                  (Primary Key, part 1)
                            LOAD DATE                        (Primary Key, part 2)
                            <LOAD-END-DATE>
                            <RECORD-SOURCE>
                            {user defined descriptive data}
                            {or temporal based timelines}
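
A minimal Python sketch of how the optional LOAD-END-DATE could be derived from this structure. It assumes one common convention, not stated on the slide: each row is end-dated with the load date of the next row for the same parent, and the current row keeps an empty end date. The helper `end_date_rows` and the dictionary keys are hypothetical names; the sample rows reuse the customer names shown earlier.

```python
from datetime import date


def end_date_rows(rows):
    """Return copies of Satellite rows with load_end_date filled in.

    Assumption: a row ends when the next row for the same parent is loaded;
    the most recent row per parent gets None.
    """
    by_parent = {}
    for row in sorted(rows, key=lambda r: (r["parent_seq"], r["load_date"])):
        by_parent.setdefault(row["parent_seq"], []).append(dict(row))

    result = []
    for history in by_parent.values():
        for current, following in zip(history, history[1:] + [None]):
            current["load_end_date"] = following["load_date"] if following else None
            result.append(current)
    return result


rows = [
    {"parent_seq": 1, "load_date": date(2000, 10, 14), "name": "ABC Suppliers, Inc"},
    {"parent_seq": 1, "load_date": date(2000, 10, 31), "name": "ABC Worldwide Suppliers, Inc"},
    {"parent_seq": 1, "load_date": date(2000, 12, 2), "name": "ABC DEF Incorporated"},
]

# The primary key (parent sequence + load date) must be unique.
assert len({(r["parent_seq"], r["load_date"]) for r in rows}) == len(rows)

dated = end_date_rows(rows)
for r in dated:
    print(r["load_date"], r["load_end_date"], r["name"])
```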

                               (C) TeachDataVault.com
World's Smallest Data Vault
• The Data Vault doesn’t have to be “BIG”.
• A Data Vault can be built incrementally.
• Reverse engineering one component of the
  existing models is not uncommon.
• Building one part of the Data Vault, then
  changing the marts to feed from that vault,
  is a best practice.
• The smallest Enterprise Data Warehouse
  consists of two tables:
   – One Hub
   – One Satellite

   Hub Customer
      Hub_Cust_Seq_ID
      Hub_Cust_Num
      Hub_Cust_Load_DTS
      Hub_Cust_Rec_Src

   Satellite Customer Name
      Hub_Cust_Seq_ID
      Sat_Cust_Load_DTS
      Sat_Cust_Load_End_DTS
      Sat_Cust_Name
      Sat_Cust_Rec_Src
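
The two-table vault above can be tried out with Python's built-in `sqlite3` module. This is a sketch, not the presenter's DDL: the column names come from the slide, while the data types, constraints, ISO date strings, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    hub_cust_seq_id   INTEGER PRIMARY KEY,       -- surrogate sequence
    hub_cust_num      TEXT NOT NULL UNIQUE,      -- business key
    hub_cust_load_dts TEXT NOT NULL,
    hub_cust_rec_src  TEXT NOT NULL
);
CREATE TABLE sat_customer_name (
    hub_cust_seq_id       INTEGER NOT NULL
                          REFERENCES hub_customer (hub_cust_seq_id),
    sat_cust_load_dts     TEXT NOT NULL,
    sat_cust_load_end_dts TEXT,                  -- NULL marks the current row
    sat_cust_name         TEXT,
    sat_cust_rec_src      TEXT NOT NULL,
    PRIMARY KEY (hub_cust_seq_id, sat_cust_load_dts)
);
""")

conn.execute(
    "INSERT INTO hub_customer VALUES (1, 'ABC123456', '2000-10-12', 'MANUFACT')"
)
conn.executemany(
    "INSERT INTO sat_customer_name VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2000-10-14", "2000-10-31", "ABC Suppliers, Inc", "MANUFACT"),
        (1, "2000-10-31", None, "ABC Worldwide Suppliers, Inc", "MANUFACT"),
    ],
)

# Current name per customer: join Hub to Satellite, keep the open-ended row.
current_name = conn.execute(
    """
    SELECT h.hub_cust_num, s.sat_cust_name
    FROM hub_customer AS h
    JOIN sat_customer_name AS s ON s.hub_cust_seq_id = h.hub_cust_seq_id
    WHERE s.sat_cust_load_end_dts IS NULL
    """
).fetchone()
print(current_name)
```

The history stays in the Satellite: the earlier name is still there, end-dated rather than updated in place.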




Top 10 Rules for DV Modeling
Business keys with a low propensity for change become Hub keys.
Transactions and integrated keys become Link tables.
Descriptive data always fits in a Satellite.
1.    A Hub table always migrates its primary key outwards.
2.    Hub to Hub relationships are allowed only through a link structure.
3.    Recursive relationships are resolved through a link table.
4.    A Link structure must have at least 2 FK relationships.
5.    A Link structure can have a surrogate key representation.
6.    A Link structure has no limit to the number of hubs it integrates.
7.    A Link to Link relationship is allowed.
8.    A Satellite can be dependent on a link table.
9.    A Satellite can only have one parent table.
10. A Satellite cannot have any foreign key relationships except the primary key to
      the parent table (hub or link).
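
Several of these rules are mechanical enough to check in code. The sketch below is a hypothetical checker, not part of any Data Vault tool: it assumes a model is described as a dictionary mapping each table name to its type ("hub", "link", or "sat") and the list of parent tables it references, and it covers only rules 2, 4, 9, and 10.

```python
def check_model(tables):
    """Return a list of rule violations for a {name: {"type", "parents"}} model.

    Assumption: every table named in a "parents" list is itself defined.
    """
    problems = []
    for name, spec in tables.items():
        kind = spec["type"]
        parents = spec["parents"]
        parent_kinds = [tables[p]["type"] for p in parents]
        if kind == "hub":
            if parents:
                problems.append(f"{name}: Hubs relate only through a Link (rule 2)")
        elif kind == "link":
            if len(parents) < 2:
                problems.append(f"{name}: a Link needs at least 2 FKs (rule 4)")
        elif kind == "sat":
            if len(parents) != 1:
                problems.append(f"{name}: a Satellite has exactly one parent (rule 9)")
            elif parent_kinds[0] not in ("hub", "link"):
                problems.append(f"{name}: parent must be a Hub or Link (rule 10)")
    return problems


good_model = {
    "hub_customer": {"type": "hub", "parents": []},
    "hub_product": {"type": "hub", "parents": []},
    "link_sale": {"type": "link", "parents": ["hub_customer", "hub_product"]},
    # Rule 3: a recursive relationship goes through a Link.
    "link_cust_hierarchy": {"type": "link", "parents": ["hub_customer", "hub_customer"]},
    # Rule 7: Link-to-Link is allowed.
    "link_sale_adjust": {"type": "link", "parents": ["link_sale", "hub_customer"]},
    "sat_customer_name": {"type": "sat", "parents": ["hub_customer"]},
    # Rule 8: a Satellite may hang off a Link.
    "sat_sale": {"type": "sat", "parents": ["link_sale"]},
}

bad_model = {
    "hub_customer": {"type": "hub", "parents": []},
    "hub_order": {"type": "hub", "parents": ["hub_customer"]},
    "link_orphan": {"type": "link", "parents": ["hub_customer"]},
    "sat_mixed": {"type": "sat", "parents": ["hub_customer", "hub_order"]},
}

print(check_model(good_model))
for problem in check_model(bad_model):
    print(problem)
```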


NOTE: Automating the Build
• DV is a repeatable methodology with rules and standards
• Standard templates exist for:
    – Loading DV tables
    – Extracting data from DV tables
• RapidAce (www.rapidace.com – now Open Source)
    – Software that applies these rules to:
        • Convert 3NF models to DV
        • Convert DV to Star Schema
• This could save us lots of time and $$
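
To give a feel for what a load template might look like, here is a small Python sketch. It is not RapidAce output and not an official Data Vault template: the `HUB_LOAD` template, the staging table `stg_customer`, and all column names are assumptions. It shows one repeatable pattern, inserting into a Hub only those business keys that are not already there.

```python
import sqlite3
from string import Template

# Hypothetical reusable template: load new business keys into any Hub.
HUB_LOAD = Template("""
INSERT INTO $hub ($key_col, $load_col, $src_col)
SELECT DISTINCT stg.$stg_key, :load_dts, :rec_src
FROM $staging AS stg
WHERE stg.$stg_key NOT IN (SELECT $key_col FROM $hub)
""")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    hub_cust_seq_id   INTEGER PRIMARY KEY,   -- assigned automatically by SQLite
    hub_cust_num      TEXT NOT NULL UNIQUE,
    hub_cust_load_dts TEXT NOT NULL,
    hub_cust_rec_src  TEXT NOT NULL
);
CREATE TABLE stg_customer (customer_num TEXT, customer_name TEXT);
INSERT INTO stg_customer VALUES
    ('ABC123456', 'ABC Suppliers, Inc'),
    ('ABC123456', 'ABC Worldwide Suppliers, Inc'),
    ('ABC925_24FN', 'WorldPart');
""")

sql = HUB_LOAD.substitute(
    hub="hub_customer",
    key_col="hub_cust_num",
    load_col="hub_cust_load_dts",
    src_col="hub_cust_rec_src",
    staging="stg_customer",
    stg_key="customer_num",
)
params = {"load_dts": "2000-10-12", "rec_src": "MANUFACT"}

conn.execute(sql, params)
conn.execute(sql, params)  # re-running the same load adds nothing new

hub_rows = conn.execute(
    "SELECT hub_cust_num FROM hub_customer ORDER BY hub_cust_num"
).fetchall()
print(hub_rows)
```

Because only the table and column names vary, the same template could be filled in for every Hub in the model, which is where the time savings would come from.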




In Review…
• Data Vault is…
   – A Data Warehouse Modeling Technique (& Methodology)
   – Hub and Spoke Design
   – Simple, Easy, Repeatable Structures
   – Comprised of Standards, Rules & Procedures
   – Made up of Ontological Metadata
   – AUTOMATABLE!!!
• Hubs = Business Keys
• Links = Associations / Transactions
• Satellites = Descriptors
The Experts Say…
  “The Data Vault is the optimal choice
  for modeling the EDW in the DW 2.0
  framework.” Bill Inmon

   “The Data Vault is foundationally
   strong and exceptionally scalable
   architecture.”   Stephen Brobst



       “The Data Vault is a technique which some industry
       experts have predicted may spark a revolution as the
       next big thing in data modeling for enterprise
       warehousing....”                    Doug Laney
More Notables…

   “This enables organizations to take control of
   their data warehousing destiny, supporting
   better and more relevant data warehouses in
   less time than before.” Howard Dresner



  “[The Data Vault] captures a practical body of
  knowledge for data warehouse development
  which both agile and traditional practitioners
  will benefit from..” Scott Ambler
Who’s Using It?
Growing Adoption…
• The number of Data Vault users in the US
  surpassed 500 in 2010 and is growing rapidly
  (http://danlinstedt.com/about/dv-customers/)




Conclusion?




  Changing the direction of the river
takes less effort than stopping the flow
                of water
Where To Learn More
     The Technical Modeling Book: http://LearnDataVault.com

      On YouTube: http://www.youtube.com/LearnDataVault

          On Facebook: www.facebook.com/learndatavault

                 Dan’s Blog: www.danlinstedt.com

The Discussion Forums: http://LinkedIn.com – Data Vault Discussions

       World wide User Group (Free): http://dvusergroup.com

              The Business of Data Vault Modeling
          by Dan Linstedt, Kent Graziano, Hans Hultgren
                  (available at www.lulu.com )
Contact Information


     Kent Graziano
 Kent.graziano@att.net

 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 

Introduction to Data Vault Modeling

  • 1. Introduction to Data Vault Modeling Kent Graziano Data Vault Master and Oracle ACE TrueBridge Resources OOW 2011 Session #05923
  • 2. My Bio • Kent Graziano – Certified Data Vault Master – Oracle ACE (BI/DW) – Data Architecture and Data Warehouse Specialist • 30 years in IT • 20 years of Oracle-related work • 15+ years of data warehousing experience – Co-Author of • The Business of Data Vault Modeling (2008) • The Data Model Resource Book (1st Edition) • Oracle Designer: A Template for Developing an Enterprise Standards Document – Past-President of Oracle Development Tools User Group (ODTUG) and Rocky Mountain Oracle User Group – Co-Chair BIDW SIG for ODTUG (C) Kent Graziano
  • 3. Membership Special: Join by October 15 to become a member for only $99!
  • 4. What Is a Data Warehouse? “A subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making process.” W.H. Inmon “The data warehouse is where we publish used data.” Ralph Kimball (C) Kent Graziano
  • 5. Inmon’s Definition • Subject oriented – Developed around logical data groupings (subject areas) not business functions • Integrated – Common definitions and formats from multiple systems • Time-variant – Contains historical view of data • Non-volatile – Does not change over time – No updates (C) Kent Graziano
  • 6. Data Vault Definition The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses. Dan Linstedt: Defining the Data Vault TDAN.com Article (C) TeachDataVault.com
  • 7. Why Bother With Something New? Old Chinese proverb: 'Unless you change direction, you're apt to end up where you're headed.' (C) TeachDataVault.com
  • 8. Why do we need it? • We have seen issues in constructing (and managing) an enterprise data warehouse model using 3rd normal form, or Star Schema. – 3NF – Complex PKs when cascading snapshot dates (time-driven PKs) – Star – difficult to re-engineer fact tables for granularity changes • These issues lead to breakdowns in flexibility, adaptability, and even scalability (C) Kent Graziano
  • 9. Data Vault Time Line (1960 to 2000) • Mid 60’s: Dimension & Fact Modeling presented by General Mills and Dartmouth University • E.F. Codd invented relational modeling • Early 70’s: Bill Inmon Began Discussing Data Warehousing • Mid 70’s: AC Nielsen Popularized Dimension & Fact Terms • 1976: Dr Peter Chen Created E-R Diagramming • Chris Date and Hugh Darwen Maintained and Refined Modeling • Mid 80’s: Bill Inmon Popularizes Data Warehousing • Mid – Late 80’s: Dr Kimball Popularizes Star Schema • Late 80’s: Barry Devlin and Dr Kimball Release “Business Data Warehouse” • 1990: Dan Linstedt Begins R&D on Data Vault Modeling • 2000: Dan Linstedt releases first 5 articles on Data Vault Modeling (C) TeachDataVault.com
  • 10. Data Vault Evolution • The work on the Data Vault approach began in the early 1990s, and completed around 1999. • Throughout 1999, 2000, and 2001, the Data Vault design was tested, refined, and deployed into specific customer sites. • In 2002, the industry thought leaders were asked to review the architecture. – This is when I attended my first DV seminar in Denver and met Dan! • In 2003, Dan began teaching the modeling techniques to the mass public. (C) Kent Graziano
  • 11. Data Vault Modeling… (C) TeachDataVault.com
  • 12. Where does a Data Vault Fit? (C) TeachDataVault.com
  • 13. Where does a Data Vault Fit? Oracle’s Next Generation Data Warehouse Reference Architecture Data Vault goes here (C) Oracle Corp
  • 14. 3 Simple Structures (C) TeachDataVault.com
  • 15. Hub and Spoke = Scalability http://www.nature.com/ng/journal/v29/n2/full/ng1001-105.html If nature uses Hub & Spoke, why shouldn’t we? Genetics scale to billions of cells, the Data Vault scales to Billions of records (C) TeachDataVault.com
  • 16. Hubs = Neurons Hub Very similar to a neural network, The Hubs create the base structure (C) TeachDataVault.com
  • 17. Links = Dendrite + Synapse In neural networks, Dendrites & Synapses fire to pass messages, The Links dictate associations, connections (C) TeachDataVault.com
  • 18. Satellites = Memories Perception, understanding and processing These all describe the memory Satellites house descriptors that can change over time (C) TeachDataVault.com
  • 19. National Drug Codes + Orange Book of Drug Patent Applications A WORKING EXAMPLE http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm http://www.fda.gov/Drugs/InformationOnDrugs/ucm129662.htm (C) TeachDataVault.com
  • 20. 1. Hub = Business Keys Product Number Drug Label Code NDA Application # Firm Name Dose Form Code Drug Listing Patent Number Patent Use Code Hubs = Unique Lists of Business Keys Business Keys are used to TRACK and IDENTIFY key information (C) TeachDataVault.com
  • 21. Business Keys = Ontology Business Keys should be arranged in an ontology in order to learn the dependencies of the data set. Keys shown: Firm Name, Drug Listing, Product Number, Dose Form Code, NDA Application #, Drug Label Code, Patent Number, Patent Use Code NOTE: Different Ontologies represent different views of the data! (C) TeachDataVault.com
  • 22. Hub Entity A Hub is a list of unique business keys. Hub Structure: Primary Key; Unique Index (Primary Index) on <Business Key>; Load DTS; Record Source. Example, Hub Product: Product Sequence ID (Primary Key); Product Number (Unique Index); Product Load DTS; Prod Record Source. Note: • A Hub’s Business Key is a unique index. • A Hub’s Load Date represents the FIRST TIME the EDW saw the data. • A Hub’s Record Source represents: First – the “Master” data source (on collisions), if not available, it holds the origination source of the actual key. (C) TeachDataVault.com
  • 23. Business Keys • What exactly are Business Keys? – Example 1: • Siebel has a “system generated” customer key • Oracle Financials has a “system generated” customer key • These are not business keys. These are keys used by each respective system to track records. – Example 2: • Siebel tracks customer name and address as unique elements. • Oracle Financials tracks name and address as unique elements. • These are business keys. • What we want in the hub are sets of natural business keys that uniquely identify the data across systems. • Stay away from “system generated” keys if possible. – System generated keys will cause damage in the integration cycle if they are not unique across the enterprise. (C) TeachDataVault.com
  • 24. Hub Definition • What Makes a Hub Key? – A Hub is based on an identifiable business key. – An identifiable business key is an attribute that is used in the source systems to locate data. – The business key has a very low propensity to change, and usually is not editable on the source systems. – The business key has the same semantic meaning, and the same granularity across the company, but not necessarily the same format. • Attributes and Ordering – All attributes are mandatory. – Sequence ID 1st, Busn. Key 2nd, Load Date 3rd, Record Source Last (4th). – All attributes in the Business Key form a UNIQUE Index. (C) TeachDataVault.com
  • 25. The technical objective of the Hub is to: • Uniquely list all possible business keys, good, bad, or indifferent, regardless of where they originated. • Tie the business keys in a 1:1 ratio with surrogate keys (giving meaning to the surrogate generated sequences). • Provide a consolidation and attribution layer for clear horizontal definition of the business functionality. • Track the arrival of data, the first time it appears in the warehouse. • Provide right-time / real-time systems the ability to load transactions without descriptive data. (C) TeachDataVault.com
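The Hub objectives above (a unique list of business keys, a 1:1 tie to a surrogate key, and a load date that records first arrival) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the deck's own load template: the table name `hub_product`, its column names, the helper `load_hub_product`, and the ISO-formatted dates are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hub_product (
        product_seq_id INTEGER PRIMARY KEY,   -- surrogate sequence (SQN)
        product_number TEXT NOT NULL UNIQUE,  -- business key, unique index
        load_dts       TEXT NOT NULL,         -- first time the warehouse saw the key
        record_source  TEXT NOT NULL          -- system where the key originated
    )
    """
)


def load_hub_product(conn, product_number, load_dts, record_source):
    """Insert the business key only if it is new; return its surrogate key."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_product "
        "(product_number, load_dts, record_source) VALUES (?, ?, ?)",
        (product_number, load_dts, record_source),
    )
    row = conn.execute(
        "SELECT product_seq_id FROM hub_product WHERE product_number = ?",
        (product_number,),
    ).fetchone()
    return row[0]


# The same key arriving twice maps to one surrogate key, and the
# original load date and record source are left untouched.
first = load_hub_product(conn, "PRD128582", "2002-04-12", "MANUFACT")
again = load_hub_product(conn, "PRD128582", "2002-05-01", "FINANCE")
```

Because the second arrival is ignored, the Hub row keeps the load date of the first sighting, which is what the Load DTS is meant to record.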
  • 26. Hub Table Structures SQN = Sequence (insertion order) LDTS = Load Date (when the Warehouse first sees the data) RSRC = Record Source (System + App where the data ORIGINATED) (C) TeachDataVault.com
  • 27. Sample Hub Product
      ID  PRODUCT #      LOAD DTS    RCRD SRC
      1   MFG-PRD123456  6-1-2000    MANUFACT
      2   P1235          6-2-2000    CONTRACTS
      3   *P1235         2-15-2001   CONTRACTS
      4   MFG-1235       5-17-2001   MANUFACT
      5   1235-MFG       7-14-2001   FINANCE
      6   1235           10-13-2001  FINANCE
      7   PRD128582      4-12-2002   MANUFACT
      8   PRD125826      4-12-2002   MANUFACT
      9   PRD128256      4-12-2002   MANUFACT
      10  PRD929929-*    4-12-2002   MANUFACT
    Unique Index: PRODUCT #
    Notes: • ID is the surrogate sequence number (Primary Key) • What does the load date tell you? • Do you notice any overloaded uses for the product number? • Are there similar keys from different systems? • Can you spot entry errors? • Are any patterns visually present? (C) TeachDataVault.com
  • 28. 2. Links = Associations • Firms Generate Labels • Firms Generate Product Listings • Listings Contain Labeler Codes • Firms Manufacture Products • Listings for Products are in NDA Applications Links = Transactions and Associations They are used to hook together multiple sets of information (i.e., Hubs) (C) TeachDataVault.com
  • 29. Associations = Ontological Hooks [Diagram: Business Keys Firm Name, Drug Listing, Product Number, and NDA Application #, connected by the Links Firms Generate Product Listings, Firms Manufacture Products, and Listings for Products are in NDA Applications] Business Keys are associated by many linking factors, these links comprise the associations in the hierarchy. (C) TeachDataVault.com
  • 30. Link Definitions • What Makes a Link? – A Link is based on identifiable business element relationships. • Otherwise known as a foreign key • AKA a business event or transaction between business keys – The relationship shouldn’t change over time • It is established as a fact that occurred at a specific point in time and will remain that way forever. – The link table may also represent a hierarchy. • Attributes – All attributes are mandatory (C) TeachDataVault.com
  • 31. Link Entity A Link is an intersection of business keys. It can contain Hub Keys and Other Link Keys. Link Structure: Primary Key; Unique Index (Primary Index) on {Hub Surrogate Keys 1..N}; Load DTS; Record Source. Example, Link Line-Item: Link Line Item Sequence ID (Primary Key); Hub Product Sequence ID and Hub Order Sequence ID (Unique Index); Load DTS; Record Source. Note: • A Link’s Business Key is a Composite Unique Index • A Link’s Load Date represents the FIRST TIME the EDW saw the relationship. • A Link’s Record Source represents: First – the “Master” data source (on collisions), if not available, it holds the origination source of the actual key. (C) TeachDataVault.com
  • 32. Modeling Links - 1:1 or 1:M? • Today: – Relationship is a 1:1 so why model a Link? • Tomorrow: – The business rule can change to a 1:M. – You discover new data later. • With a Link in the Data Vault: – No need to change the EDW structure. – Existing data is fine. – New data is added. (C) Kent Graziano
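The point about 1:1 versus 1:M can be illustrated with a small sqlite3 sketch: the same Link table holds one order per customer today and several tomorrow, with no structural change. The table `link_cust_order`, its column names, the helper `load_link`, and the ISO-formatted dates are assumptions made for this example; the key values follow the sample Link Cust Order data in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE link_cust_order (
        link_seq_id   INTEGER PRIMARY KEY,  -- surrogate key for the relationship
        cust_seq_id   INTEGER NOT NULL,     -- surrogate key from Hub Customer
        order_seq_id  INTEGER NOT NULL,     -- surrogate key from Hub Order
        load_dts      TEXT NOT NULL,        -- first time the relationship was seen
        record_source TEXT NOT NULL,
        UNIQUE (cust_seq_id, order_seq_id)  -- composite unique index on hub keys
    )
    """
)


def load_link(conn, cust_seq_id, order_seq_id, load_dts, record_source):
    """Record a customer-to-order relationship once; replays are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO link_cust_order "
        "(cust_seq_id, order_seq_id, load_dts, record_source) VALUES (?, ?, ?, ?)",
        (cust_seq_id, order_seq_id, load_dts, record_source),
    )


load_link(conn, 1, 1, "2000-10-14", "FINANCE")  # today: one order per customer
load_link(conn, 1, 2, "2000-10-14", "FINANCE")  # later: same customer, second order
load_link(conn, 1, 1, "2000-10-15", "FINANCE")  # replayed relationship, ignored

orders_for_customer_1 = [
    row[0]
    for row in conn.execute(
        "SELECT order_seq_id FROM link_cust_order "
        "WHERE cust_seq_id = 1 ORDER BY order_seq_id"
    )
]
```

The second order is simply a new row; existing rows and the table definition stay as they were.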
  • 33. Link Table Structures SQN = Sequence (insertion order) LDTS = Load Date (when the Warehouse first sees the data) RSRC = Record Source (System + App where the data ORIGINATED) (C) TeachDataVault.com
  • 34. Sample Link Entity - Relationship
    Hub Customer
      CSID  CUST #     LOAD DTS    RCRD SRC
      1     ABC123456  10-12-2000  MFG
      2     DKEF       1-25-2001   CONTRACTS
    Hub Order
      OrdID  ORDER #  LOAD DTS    RCRD SRC
      1      ORD0001  10-12-2000  MFG
      2      ORD0002  10-2-2000   CONTRACTS
    Link Cust Order
      LSEQID  CSID  OrdID  LOAD DTS    RCRD SRC
      1000    1     1      10-14-2000  FINANCE
      1001    1     2      10-14-2000  FINANCE
    Link Order-Details
      LSEQID  OrdID  PID  LIT  LOAD DTS    RCRD SRC
      1000    1      100  1    10-14-2000  FINANCE
      1001    1      101  2    10-14-2000  FINANCE
    Hub Product
      PID  PRODUCT #  LOAD DTS    RCRD SRC
      100  PRD128582  10-14-2000  MFG
      101  PRD128256  10-14-2000  MFG
    Satellites shown in the diagram: Order Satellite, Order Details Satellite, Product Satellite (C) Kent Graziano
  • 35. Sample Link Entity - Hierarchy
    Hub Customer
      ID  CUSTOMER #   LOAD DTS    RCRD SRC
      1   ABC123456    10-12-2000  MANUFACT
      2   ABC925_24FN  10-22-2000  CONTRACTS
      3   DKEF         1-25-2001   CONTRACTS
      4   KKO92854_dd  3-7-2001    CONTRACTS
      5   LLOA_82J5J   6-4-2001    SALES
      6   HUJI_BFIOQ   8-3-2001    SALES
      7   PPRU_3259    2-2-2002    FINANCE
      8   PAFJG2895    2-2-2002    CONTRACTS
      9   929ABC2985   2-2-2002    CONTRACTS
      10  93KFLLA      2-2-2002    CONTRACTS
    Link Customer Rollup
      From CSID  To CSID  LOAD DTS    RCRD SRC
      1          NULL     10-14-2000  FINANCE
      2          1        10-22-2000  FINANCE
      3          1        2-15-2001   FINANCE
      4          2        4-3-2001    HR
      5          2        6-4-2001    SALES
    Note: • If you have logic – you can roll together customers, or companies, or sub-assemblies, bill of materials, etc. • We do not want to disturb the facts (underlying data in the hub), but we do want to rearrange hierarchies at different points over time. (C) Kent Graziano
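As a rough illustration of how a hierarchy Link can be read, the sketch below walks the From/To pairs of the Link Customer Rollup sample upward until it reaches a customer with no parent. The dictionary representation and the function name `rollup_path` are assumptions made for the example; in a warehouse this would more likely be a recursive query over the Link table.

```python
# From CSID -> To CSID, taken from the Link Customer Rollup sample.
# None stands for the NULL parent of the top-level customer.
rollup = {1: None, 2: 1, 3: 1, 4: 2, 5: 2}


def rollup_path(csid, rollup):
    """Return the chain of customer IDs from csid up to its top-level parent."""
    path = [csid]
    seen = {csid}
    while rollup.get(path[-1]) is not None:
        parent = rollup[path[-1]]
        if parent in seen:  # guard against a cyclic hierarchy
            raise ValueError(f"cycle detected at customer {parent}")
        path.append(parent)
        seen.add(parent)
    return path


path_for_4 = rollup_path(4, rollup)  # customer 4 rolls up to 2, then to 1
path_for_1 = rollup_path(1, rollup)  # the top-level customer has no parent
```

Rearranging the hierarchy then means loading new rows into the Link; the Hub rows for the customers themselves are not touched.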
  • 36. Link To Link (Link Sale Component) [Diagram: Hub Invoice, Hub Product, and Hub Customer; Link Sale Line Item, Link Product Hierarchy, and Link Sale Component; Satellites for Totals, Dates, Product Desc., Quantity, Sub-Totals, Cust Active, and Address] Note: • Link Sale Component provides a shift in grain. • Link Sale Component allows for configurable options of products tracked on a single line-item product sold. • Link Sale Component provides for sub-assembly tracking. (C) Kent Graziano
  • 37. 3. Satellites = Descriptors • Firm Locations • Patent Expiration Info • Listing Formulation • Listing Medication Dosages • Product Ingredients • Drug Packaging Types Satellites = Descriptors These data provide context for the keys (Hubs) And for the associations (Links) (C) TeachDataVault.com
  • 38. Satellite Definitions • What Makes a Satellite? – A Satellite is based on non-identifying business elements. • Attributes that are descriptive data, often in the source systems known as descriptions, or free-form entry, or computed elements. – The Satellite data changes, sometimes rapidly, sometimes slowly. • The Satellites are separated by type of information and rate of change. – The Satellite is dependent on the Hub or Link key as a parent. • Satellites are never dependent on more than one parent table. • The Satellite is never a parent table to any other table (no snowflaking). • Attributes and Ordering – All attributes are mandatory – EXCEPT END DATE. – Parent ID 1st, Load Date 2nd, Load End Date 3rd, Record Source Last. (C) TeachDataVault.com
  • 39. Descriptors = Context [Diagram: Business Keys Firm Name, Drug Listing, and Product Number; Links Firms Generate Product Listings and Firms Manufacture Products; Satellites Firm Locations, Listing Formulation, Product Ingredients, and Start & End of manufacturing] Context specific point in time warehousing portion (C) TeachDataVault.com
  • 40. Satellite Entity A Satellite is a time-dimensional table housing detailed information about the Hub’s or Link’s business keys. Satellite Structure: Hub Primary Key; Load DTS; Extract DTS; Load End Date; Detail Business Data <Aggregation Data>; {Update User}; {Update DTS}; Record Source. Example: Customer #; Load DTS; Extract DTS; Load End Date; Customer Name; Customer Addr1; Customer Addr2; {Update User}; {Update DTS}; Record Source. • Satellites are defined by TYPE of data and RATE OF CHANGE • Mathematically – this reduces redundancy and decreases storage requirements over time (compared to a Star Schema) (C) TeachDataVault.com
  • 41. Satellite Entity- Details • A Satellite has only 1 foreign key; it is dependent on the parent table (Hub or Link) • A Satellite may or may not have an “Item Numbering” attribute. • A Satellite’s Load Date represents the date the EDW saw the data (must be a delta set). – This is not Effective Date from the Source! • A Satellite’s Record Source represents the actual source of the row (unit of work). • To avoid Outer Joins, you must ensure that every satellite has at least 1 entry for every Hub Key. (C) TeachDataVault.com
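The Satellite details above (one parent key, a load date set by the warehouse, and delta-only loading) can be sketched as follows. This is one possible reading, assuming that a superseded row is closed by setting only its Load End Date; the table `sat_cust_name`, its columns, the helper `load_sat_cust_name`, and the ISO-formatted dates are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sat_cust_name (
        cust_seq_id   INTEGER NOT NULL,  -- the only foreign key: the parent Hub key
        load_dts      TEXT NOT NULL,     -- when the warehouse saw this version
        load_end_dts  TEXT,              -- set when a newer version supersedes it
        name          TEXT NOT NULL,
        record_source TEXT NOT NULL,
        PRIMARY KEY (cust_seq_id, load_dts)
    )
    """
)


def load_sat_cust_name(conn, cust_seq_id, load_dts, name, record_source):
    """Insert a new version only when the name changed; return True if stored."""
    current = conn.execute(
        "SELECT load_dts, name FROM sat_cust_name "
        "WHERE cust_seq_id = ? AND load_end_dts IS NULL",
        (cust_seq_id,),
    ).fetchone()
    if current is not None and current[1] == name:
        return False  # not a delta, nothing to store
    if current is not None:
        # Close the superseded version at the new load date.
        conn.execute(
            "UPDATE sat_cust_name SET load_end_dts = ? "
            "WHERE cust_seq_id = ? AND load_dts = ?",
            (load_dts, cust_seq_id, current[0]),
        )
    conn.execute(
        "INSERT INTO sat_cust_name (cust_seq_id, load_dts, name, record_source) "
        "VALUES (?, ?, ?, ?)",
        (cust_seq_id, load_dts, name, record_source),
    )
    return True


results = [
    load_sat_cust_name(conn, 1, "2000-10-12", "ABC Suppliers", "MANUFACT"),
    load_sat_cust_name(conn, 1, "2000-10-13", "ABC Suppliers", "MANUFACT"),
    load_sat_cust_name(conn, 1, "2000-10-14", "ABC Suppliers, Inc", "MANUFACT"),
]
```

The unchanged name on the second load is skipped, so the Satellite holds two versions: the first closed on the date the second arrived, the second still open.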
  • 42. Satellite Table Structures SQN = Sequence (parent identity number) LDTS = Load Date (when the Warehouse first sees the data) LEDTS = End of lifecycle for superseded record RSRC = Record Source (System + App where the data ORIGINATED) (C) TeachDataVault.com
  • 43. Satellite Entity – Hub Related
    Hub Customer
      ID  CUSTOMER #   LOAD DTS    RCRD SRC
      0   N/A          10-12-2000  SYSTEM
      1   ABC123456    10-12-2000  MANUFACT
      2   ABC925_24FN  10-2-2000   CONTRACTS
      3   ABC5525-25   10-1-2000   FINANCE
    CUSTOMER NAME SATELLITE
      CSID  LOAD DTS    NAME                          RCRD SRC
      0     10-12-2000  N/A                           SYSTEM
      1     10-12-2000  ABC Suppliers                 MANUFACT
      1     10-14-2000  ABC Suppliers, Inc            MANUFACT
      1     10-31-2000  ABC Worldwide Suppliers, Inc  MANUFACT
      1     12-2-2000   ABC DEF Incorporated          CONTRACTS
      2     10-2-2000   WorldPart                     CONTRACTS
      2     10-14-2000  Worldwide Suppliers Inc       CONTRACTS
      3     10-1-2000   N/A                           FINANCE
    Dummy satellite record eliminates need for outer joins during extract. (C) Kent Graziano
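To see why the dummy Satellite record matters, the sketch below loads a subset of the sample rows into sqlite3 and fetches the latest name per customer with a plain inner join. The table and column names are assumptions, and the dates are rewritten in ISO format so that `MAX` orders them correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hub_customer (csid INTEGER PRIMARY KEY, customer_num TEXT);
    CREATE TABLE sat_customer_name (csid INTEGER, load_dts TEXT, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO hub_customer VALUES (?, ?)",
    [(1, "ABC123456"), (2, "ABC925_24FN"), (3, "ABC5525-25")],
)
conn.executemany(
    "INSERT INTO sat_customer_name VALUES (?, ?, ?)",
    [
        (1, "2000-10-12", "ABC Suppliers"),
        (1, "2000-12-02", "ABC DEF Incorporated"),
        (2, "2000-10-14", "Worldwide Suppliers Inc"),
        (3, "2000-10-01", "N/A"),  # placeholder row: no real name has arrived yet
    ],
)

# Inner join only: customer 3 still appears because of its placeholder row.
latest = conn.execute(
    """
    SELECT h.customer_num, s.name
    FROM hub_customer h
    JOIN sat_customer_name s ON s.csid = h.csid
    WHERE s.load_dts = (
        SELECT MAX(s2.load_dts) FROM sat_customer_name s2 WHERE s2.csid = h.csid
    )
    ORDER BY h.csid
    """
).fetchall()
```

Without the placeholder row, the inner join would silently drop customer 3, and the extract would need an outer join to keep it.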
  • 44. Satellite Entity – Link Related
    Link Order Details
      ID  Product ID  OrdID  LOAD DTS    RCRD SRC
      0   0           0      10-12-2000  SYSTEM
      1   PRD102      1      10-12-2000  MANUFACT
      2   PRD103      1      10-2-2000   CONTRACTS
    Satellite Order Totals
      ID  LOAD DTS    Tax     Total   RCRD SRC
      0   10-12-2000  <NULL>  <NULL>  SYSTEM
      1   10-12-2000  3.00    0.00    MANUFACT
      1   10-14-2000  4.00    12.00   MANUFACT
      1   10-31-2000  3.69    14.02   MANUFACT
      1   12-2-2000   4.69    13.69   CONTRACTS
      2   10-2-2000   2.45    10.00   CONTRACTS
      2   10-14-2000  1.22    14.00   CONTRACTS
    Dummy satellite record eliminates need for outer joins during extract. (C) Kent Graziano
  • 45. Satellite Splits – Type of Information
    Hub Customer
      ID  CUSTOMER #   LOAD DTS    RCRD SRC
      0   N/A          10-12-2000  SYSTEM
      1   ABC123456    10-12-2000  MANUFACT
      2   ABC925_24FN  10-2-2000   CONTRACTS
      3   ABC5525-25   10-1-2000   FINANCE
    CUSTOMER SATELLITE
      CSID  LOAD DTS    NAME                          Contact  Sales Rgn  Cust Score  RCRD SRC
      0     10-12-2000  N/A                           N/A      N/A        0           SYSTEM
      1     10-12-2000  ABC Suppliers                 Jen F.   SE         102         MANUFACT
      1     10-14-2000  ABC Suppliers, Inc            Jen F.   SE         120         MANUFACT
      1     10-31-2000  ABC Worldwide Suppliers, Inc  Jen F.   SE         130         MANUFACT
      1     12-2-2000   ABC DEF Incorporated          Jack J.  SC         85          CONTRACTS
      2     10-2-2000   WorldPart                     Jenny    SE         99          CONTRACTS
      2     10-14-2000  Worldwide Suppliers Inc       Jenny    SE         102         CONTRACTS
      3     10-1-2000   N/A                           N/A      N/A        0           FINANCE
    (C) Kent Graziano
  • 46. Satellite Splits – Type of Information
    Hub Customer
      ID  CUSTOMER #   LOAD DTS    RCRD SRC
      0   N/A          10-12-2000  SYSTEM
      1   ABC123456    10-12-2000  MANUFACT
      2   ABC925_24FN  10-2-2000   CONTRACTS
      3   ABC5525-25   10-1-2000   FINANCE
    Satellites: Customer Name Satellite (name Info), Customer Sales Satellite (Sales Info)
    • Because the type of information is different, we split the logical groups into multiple Satellites. • This provides sheer flexibility in representation of the information. • We may have one more problem with Rate Of Change… (C) Kent Graziano
  • 47. Satellite Splits – Rate of Change
    Hub Customer
      ID  CUSTOMER #   LOAD DTS    RCRD SRC
      0   N/A          10-12-2000  SYSTEM
      1   ABC123456    10-12-2000  MANUFACT
      2   ABC925_24FN  10-2-2000   CONTRACTS
      3   ABC5525-25   10-1-2000   FINANCE
    CUSTOMER SATELLITE
      CSID  LOAD DTS    NAME                          Contact  Sales Rgn  Cust Score  RCRD SRC
      0     10-12-2000  N/A                           N/A      N/A        0           SYSTEM
      1     10-12-2000  ABC Suppliers                 Jen F.   SE         102         MANUFACT
      1     10-14-2000  ABC Suppliers, Inc            Jen F.   SE         120         MANUFACT
      1     10-31-2000  ABC Worldwide Suppliers, Inc  Jen F.   SE         130         MANUFACT
      1     12-2-2000   ABC DEF Incorporated          Jack J.  SC         85          CONTRACTS
      2     10-2-2000   WorldPart                     Jenny    SE         99          CONTRACTS
      2     10-14-2000  Worldwide Suppliers Inc       Jenny    SE         102         CONTRACTS
      3     10-1-2000   N/A                           N/A      N/A        0           FINANCE
    (C) Kent Graziano
  • 48. Satellite Splits – Rate of Change
    Hub Customer
      ID  CUSTOMER #   LOAD DTS    RCRD SRC
      0   N/A          10-12-2000  SYSTEM
      1   ABC123456    10-12-2000  MANUFACT
      2   ABC925_24FN  10-2-2000   CONTRACTS
      3   ABC5525-25   10-1-2000   FINANCE
    Satellites: Customer Name Satellite (name Info), Customer Sales Satellite (Sales Info), Customer Scoring Satellite
    • Assume the data to score customers begins arriving in the warehouse every 5 minutes… We then separate the scoring information from the rest of the satellites. • IF we end up with data that (over time) doesn’t change as much as we thought, we can always re-combine Satellites to eliminate joins. (C) Kent Graziano
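The effect of splitting by rate of change can be checked with a little arithmetic on the sample history for customer 1. The sketch below counts how many delta rows each Satellite would store if it held only a given group of columns. The function name `delta_rows` and the tuple layout are assumptions made for the example, and dates are written in ISO format.

```python
# (load date, name, contact, sales region, score) for customer 1,
# taken from the combined Customer Satellite sample.
history = [
    ("2000-10-12", "ABC Suppliers", "Jen F.", "SE", 102),
    ("2000-10-14", "ABC Suppliers, Inc", "Jen F.", "SE", 120),
    ("2000-10-31", "ABC Worldwide Suppliers, Inc", "Jen F.", "SE", 130),
    ("2000-12-02", "ABC DEF Incorporated", "Jack J.", "SC", 85),
]


def delta_rows(history, columns):
    """Count the rows a Satellite holding only `columns` would store."""
    rows = 0
    previous = None
    for record in history:
        values = tuple(record[i] for i in columns)
        if values != previous:  # store a row only when something changed
            rows += 1
            previous = values
    return rows


combined_rows = delta_rows(history, (1, 2, 3, 4))  # one wide Satellite
name_rows = delta_rows(history, (1,))              # Customer Name Satellite
sales_rows = delta_rows(history, (2, 3))           # Customer Sales Satellite
score_rows = delta_rows(history, (4,))             # Customer Scoring Satellite

# Descriptive values stored: rows times the number of descriptive columns.
combined_values = combined_rows * 4
split_values = name_rows * 1 + sales_rows * 2 + score_rows * 1
```

In this small sample the combined Satellite repeats the unchanged contact and sales region on every row, so the split design stores fewer descriptive values. The gap widens as one column group, such as the score, changes far more often than the others.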
  • 49. Satellites Split By Source System
    Satellite Structure (common to all three): PARENT SEQUENCE and LOAD DATE (Primary Key), <LOAD-END-DATE>, <RECORD-SOURCE>, {user defined descriptive data} {or temporal based timelines}
    SAT_SALES_CUST: Name, Phone Number, Best time of day to reach, Do Not Call Flag
    SAT_FINANCE_CUST: First Name, Last Name, Guardian Full Name, Co-Signer Full Name, Phone Number, Address, City, State/Province, Zip Code
    SAT_CONTRACTS_CUST: Contact Name, Contact Email, Contact Phone Number
    (C) TeachDataVault.com
  • 51. World’s Smallest Data Vault Hub Customer: Hub_Cust_Seq_ID, Hub_Cust_Num, Hub_Cust_Load_DTS, Hub_Cust_Rec_Src. Satellite Customer Name: Hub_Cust_Seq_ID, Sat_Cust_Load_DTS, Sat_Cust_Load_End_DTS, Sat_Cust_Name, Sat_Cust_Rec_Src. • The Data Vault doesn’t have to be “BIG”. • A Data Vault can be built incrementally. • Reverse engineering one component of the existing models is not uncommon. • Building one part of the Data Vault, then changing the marts to feed from that vault is a best practice. • The smallest Enterprise Data Warehouse consists of two tables: – One Hub, – One Satellite (C) TeachDataVault.com
  • 52. Top 10 Rules for DV Modeling Business keys with a low propensity for change become Hub keys. Transactions and integrated keys become Link tables. Descriptive data always fits in a Satellite. 1. A Hub table always migrates its primary key outwards. 2. Hub to Hub relationships are allowed only through a link structure. 3. Recursive relationships are resolved through a link table. 4. A Link structure must have at least 2 FK relationships. 5. A Link structure can have a surrogate key representation. 6. A Link structure has no limit to the number of hubs it integrates. 7. A Link to Link relationship is allowed. 8. A Satellite can be dependent on a link table. 9. A Satellite can only have one parent table. 10. A Satellite cannot have any foreign key relationships except the primary key to the parent table (hub or link). (C) TeachDataVault.com
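Because the rules are mechanical, several of them can be checked automatically against a description of a model. The sketch below covers rules 2, 4, and 9, plus the earlier statement that a Satellite is never a parent table. The `Table` class and `check_model` function are hypothetical, and the sketch assumes every parent name refers to a table in the list.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    kind: str                                    # "hub", "link", or "satellite"
    parents: list = field(default_factory=list)  # tables it holds foreign keys to


def check_model(tables):
    """Return a list of rule violations; an empty list means none were found."""
    kinds = {t.name: t.kind for t in tables}
    problems = []
    for t in tables:
        parent_kinds = [kinds[p] for p in t.parents]
        if "satellite" in parent_kinds:
            problems.append(f"{t.name}: a Satellite is never a parent table")
        if t.kind == "hub" and parent_kinds:
            problems.append(f"{t.name}: Hubs relate only through a Link (rule 2)")
        if t.kind == "link" and len(parent_kinds) < 2:
            problems.append(f"{t.name}: a Link needs at least 2 FKs (rule 4)")
        if t.kind == "satellite" and len(parent_kinds) != 1:
            problems.append(f"{t.name}: a Satellite has one parent (rule 9)")
    return problems


good = [
    Table("hub_customer", "hub"),
    Table("hub_order", "hub"),
    Table("link_cust_order", "link", ["hub_customer", "hub_order"]),
    Table("sat_cust_name", "satellite", ["hub_customer"]),
    Table("sat_order_totals", "satellite", ["link_cust_order"]),
]
bad = good + [
    Table("link_orphan", "link", ["hub_order"]),
    Table("sat_two_parents", "satellite", ["hub_customer", "hub_order"]),
]

good_problems = check_model(good)  # no violations expected
bad_problems = check_model(bad)    # one violation per added table
```

A recursive relationship (rule 3) passes this check naturally: a Link whose two parents are the same Hub still has two foreign key relationships.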
  • 53. NOTE: Automating the Build • DV is a repeatable methodology with rules and standards • Standard templates exist for: – Loading DV tables – Extracting data from DV tables • RapidAce (www.rapidace.com – now Open Source) – Software that applies these rules to: • Convert 3NF models to DV • Convert DV to Star Schema • This could save us lots of time and $$ (C) Kent Graziano
  • 54. In Review… • Data Vault is… – A Data Warehouse Modeling Technique (& Methodology) – Hub and Spoke Design – Simple, Easy, Repeatable Structures – Comprised of Standards, Rules & Procedures – Made up of Ontological Metadata – AUTOMATABLE!!! • Hubs = Business Keys • Links = Associations / Transactions • Satellites = Descriptors (C) TeachDataVault.com
  • 55. The Experts Say… “The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon “The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst “The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
  • 56. More Notables… “This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner “[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from..” Scott Ambler
  • 58. Growing Adoption… • The number of Data Vault users in the US surpassed 500 in 2010 and grows rapidly (http://danlinstedt.com/about/dv-customers/) (C) Kent Graziano
  • 59. Conclusion? Changing the direction of the river takes less effort than stopping the flow of water (C) TeachDataVault.com
  • 61. Where To Learn More • The Technical Modeling Book: http://LearnDataVault.com • On YouTube: http://www.youtube.com/LearnDataVault • On Facebook: www.facebook.com/learndatavault • Dan’s Blog: www.danlinstedt.com • The Discussion Forums: http://LinkedIn.com – Data Vault Discussions • Worldwide User Group (Free): http://dvusergroup.com • The Business of Data Vault Modeling by Dan Linstedt, Kent Graziano, Hans Hultgren (available at www.lulu.com)
  • 63. Contact Information Kent Graziano Kent.graziano@att.net