The Data Warehouse Toolkit - Ch1

Note: This reading note only covers the basic introduction of:

  • DW/BI
  • Dimensional modeling

For detailed tutorial about dimension/fact, please refer to the original book The Data Warehouse Toolkit.

What is DW/BI?

Data warehousing and business intelligence system.

  • DW/BI system must consider the needs of the business.

  • You'll have an appreciation for the need to be half DBA (database administrator) and half MBA (business analyst) as you tackle your DW/BI project.

Different Worlds of Data Capture and Data Analysis

The DW/BI system has profoundly different needs, clients, structures, and rhythms than the operational systems of record.

Data Capture

  • Users of an operational system turn the wheels of the organization. They take orders, sign up new customers, monitor the status of operational activities, and log complaints.
  • The operational systems are optimized to process transactions quickly.
  • It typically do not maintain history, but rather update data to reflect the most current state.

Data Analysis

  • Users of a DW/BI system, on the other hand, watch the wheels of the organization turn to evaluate performance. They count the new orders and compare them with last week's orders, and ask why the new customers signed up, and what the customers complained about. DW/BI users almost never deal with one transaction at a time.
  • These systems are optimized for high-performance queries as users' questions often require that hundreds or hundreds of thousands of transactions be searched and compressed into an answer set.
  • Users of a DW/BI system typically demand that historical context be preserved to accurately evaluate the organization's performance over time.

Goals of Data Warehousing and Business Intelligence

Successful data warehousing and business intelligence demands more than being a stellar architect, technician, modeler, or database administrator.
With a DW/BI initiative, you have one foot in your IT comfort zone while your other foot is on the unfamiliar turf of business users. You must straddle the two, modifying some tried-and-true skills to adapt to the unique demands of DW/BI and behave like you're a hybrid DBA/MBA.

  1. SIMPLE and FAST: The DW/BI system must make information easily accessible.
    • The business intelligence tools and applications that access the data must be simple and easy to use.
    • They also must return query results to the user with minimal wait times.
  2. CONSISTENCY: The DW/BI system must present information consistently.
    • The data in the system must be credible. Data must be carefully assembled from a variety of sources, cleansed, quality assured, and released only when it is fit for user consumption.
    • Common labels and definitions for the DW/BI system's contents are used across data sources.
      • If 2 performance measures have the same name, they must mean the same thing. Conversely, if 2 measures don't mean the same thing, they should be labeled differently.
  3. ADAPT to CHANGE: User needs, business conditions, data, and technology are all subject to change.
    • The DW/BI system must be designed to handle this inevitable change gracefully so that it doesn't invalidate existing data or applications.
    • If descriptive data in the DW/BI system must be modified, you must appropriately account for the changes and make these changes transparent to the users.
  4. TIMELINESS: Present information in a timely way.
    • Raw data may need to be converted into actionable information within hours, minutes, or even seconds.
    • Need to have realistic expectations for what it means to deliver data when there is little time to clean or validate it.
  5. SECURITY: The DW/BI system must be a secure bastion that protects the information assets.
    • The warehouse likely contains potential harmful details in the hands of the wrong people.
    • The DW/BI system must effectively control access to the organization's confidential information.
  6. AUTHORITY and TRUSTWORTHY: The DW/BI system must serve as the authoritative and trustworthy ofundation for improved decision making.
    • The data warehouse must have the right data to support decision making.
    • The original label that predates DW/BI is still the best description of what you are designing: a decision support system.
  7. The business community must accept the DW/BI system to deem it successful.
    • If the business community does not embrace the DW/BI environment and actively use it, you have failed the acceptance test.
    • DW/BI usage is sometimes optional. Business users will embrace the DW/BI system if it is the "simple and fast" source for actionable information.

DW/BI Manager's Responsibilities

  1. Understand the business users
  2. Deliver high-quality, relevant, and accessible information and analytics to the business users
  3. Sustain the DW/BI environment

Dimensional Modeling Introduction

Dimensional modeling is widely accepted as the preferred technique for presenting analytic data because it addresses 2 simultaneous requirements:

  1. Deliver data that's understandable to the business users.
  2. Deliver fast query performance.

    • Dimensional modeling is a longstanding technique for making databases simple.
      • "Make everything as simple as possible, but not simpler."
      • A data model that starts simple has a chance of remaining simple at the end of the design.
      • A model that starts complicated surely will be overly complicated at the end, resulting in slow query performance and business user rejection.
    • The ability to visualize something as abstract as a set of data in a concrete and tangible way is the secret of understandability.
    • Dimensional models are quite different from 3NF (third normal form) which seek to remove data redundancies.
      • Normalized 3NF structures are immensely useful in operational processing because an update or insert transaction touches the database in only one place. However, they are too complicated for BI queries.
        • Most relational database management systems can't efficiently query a normalized model.
        • The complexity of users' unpredictable queries overwhelms the database optimizers, resulting in disastrous query performance.
      • Both 3NF and dimensional models can be represented in Entity-Relationship Diagrams (ERDs) because both consist of joined relational tables.
      • The key difference between 3NF and dimensional models is the degree of normalization.
      • Dimensional modeling addresses the problem of overly complex schemas in the presentation area.

Star Schemas Versus OLAP Cubes

  • Star schemas: Dimensional models implemented in relational database management systems.
  • OLAP (OnLine Analytical Processing) cubes: Dimensional models implemented in multidimensional database environments.

Stars and cubes have a common logical design with recognizable dimensions; however, the physical implementation differs.

OLAP cube

  • Pros:

    • Data loaded into an OLAP cube is stored and indexed using formats and techniques that are designed for dimensional data.
    • Performance aggregations or precalculated summary tables are often created and managed by the OLAP cube engine.
    • Cuves deliver superior query performance because of the precalculations, indexing strategies, and other optimizations.
    • Cubes also provide more analytically robust functions that exceed those available with SQL.
  • Cons:

    • You pay a load performance price for these capabilities, especially with large data sets.
  • General recommendation:

    • Detailed, atomic information be loaded into a star schema;
    • Optional OLAP cubes are then populated from the star schema.