Ensuring Data Integrity, Part I
Back

Ensuring Data Integrity Through the Use of Prevent and Detect Controls, Part I

This is the first part of a three part article by Jonathan Wu. Each section deals with the importance of data integrity and it's effect on business intelligence. Please look for the next two sections to be released shortly.

Introduction

There are numerous factors that can cause a data warehouse project to fail. From poorly gathered functional business requirements to lack of adequately skilled project team members, no factor is more devastating to the organization than the failure of the user community to accept the data warehouse because the information that it contains is of questionable quality. Understanding why there are problems with the information in the data warehouse can be a daunting task for the individual(s) responsible for resolving it. The problem may be due to any one or combination of the following items:

How does one approach and resolve data problems within a data warehouse? The purpose of this article is to address data integrity through the establishment of a control environment that mitigates and identifies data problems within a data warehouse.

Data Integrity

What is data integrity? For the purposes of this article, data integrity is defined as information that adheres to a strict standard of value and completeness. That is, the data is both accurate and the entire population of relevant data is contained within the data warehouse.

The credibility of the data warehouse solely rests with the integrity of its data. Therefore, understanding the impact of perceptions is important to appreciating the requirement of data integrity within the data warehouse. Usage of the data warehouse solely rests with the users’ perceptions of the data warehouse. Negative users’ perceptions will lead to a decline in its usage and value to the organization. First and foremost, the data warehouse should be reconciled to establish confidence in both the user and Information Systems communities that the data has integrity. Second, reconciling provides the means of addressing user questions concerning their data extracted from the warehouse.

The End-User Perspective

Throughout the life of a data warehouse, users will be questioning the information that they obtain from it unless the warehouse has established credibility within the user community.

During the weeks after a data warehouse is placed into production, users are going to ask the question "Can I rely on the information from the data warehouse"? In addition, the most common user question we encounter "Is the information in my query or report correct"? How do you address these questions? Your responses and resolutions to these questions will have an immense impact on users’ perceptions of the data warehouse.

The IS Perspective

After you have completed the initial load or an incremental update of the data warehouse, how do you know that all the required information loaded? You can check to see if the load process completed successfully or if there is any data in the warehouse tables. Assuming that the load process completed successfully, how do you know that the data is correctly loaded? If you do not have the ability to substantiate your gut reaction that everything is "OK", how can you address the users of the data warehouse? Remember that your responses and resolutions to their questions will directly impact their perceptions of the data warehouse.

Top

Data Integrity Controls

How do you prevent meaningless, corrupt or redundant data from entering the data warehouse? How do you know that all the required information is loaded? By establishing an environment that incorporates controls into the data warehouse process flow, the integrity of the data will be ensured. Data integrity controls can be grouped into two categories:

  1. Prevent Controls,
  2. Detect Controls,

Each category addresses data at different stages of the process flow.

Prevent Controls

Prevent controls aid in the integrity of data before it is loaded into the data warehouse. These controls are incorporated into the data migration, cleansing, transformation and loading processes and are the primary means of preventing meaningless, corrupt or redundant data from entering the data warehouse.

Detect Controls

Detect controls are those controls which assess the accuracy and completeness of data after it is loaded into the data warehouse or at each stage of the data process flow. These controls are incorporated into the reconciliation process and are the primary means of detecting incorrect or insufficient information within the data warehouse.

Data Control Points

At each stage of the data migration, cleansing, transformation and loading process, there exists an opportunity to ensure the integrity of the data before it is entered into the data warehouse. In addition, the reconciliation process identifies any discrepancies in the data after it has entered the data warehouse or at each stage of the data process flow. These data control points are highlighted in the following diagram:

 

 

 

 

 

 

 

Top

Data Process Flow Stages:

  1. Data Migration
  2. Cleansing
  3. Transformation
  4. Loading
  5. Reconciliation

William H. Inmon, a.k.a. the Father of Data Warehousing, has estimates that, on average, 80% of the efforts of building a data warehouse go into the data process flow stages.

 

Ensuring Data Integrity, Part II
Back

Ensuring Data Integrity Through the Use of Prevent and Detect Controls, Part II

This is the second part of a three part article by Jonathan Wu. Each section deals with the importance of data integrity and it’s effect on business intelligence. Please look for the next section to be released shortly.

Overview of Data Migration [Prevent Control]

The purpose of data migration is to move data from selected source systems into the staging area of the data warehouse. Only data that is required by the users for reporting or is used during the transformation and loading processes should be moved. Meaningless information can be prevented from entering the data warehouse by not moving unnecessary information from source systems to the staging area of the data warehouse.

Data that should be moved into the staging area of the data warehouse would include reference and transaction data. For example, in a sales data warehouse, reference data would refer to customer information and transaction data would be the information associated with a sale to a customer. Information that typically should not be moved into the staging area of the data warehouse includes the RDBMS system administrator tables and source system application tables containing meta data or temporary processing data. For example, in an Oracle database, the data dictionary tables, such as ALL_TABLES, contains information that is not meaningful to end-users and would not be used in the transformation or loading process of the data warehouse. The following table highlights the type of data that should and should not be migrated to the staging area of the data warehouse.

Content of Table

Migrate Data

Do Not

Migrate Data

Source System Reference Data:

  • Required for reporting purposes
  • Not required for reporting purposes


X



X

Source System Transaction Data:

  • Required for reporting purposes
  • Not required for reporting purposes


X




X

RDBMS System Data

 

X

Source System Application Meta Data or Temporary Processing Data

 

X

Tip #1
Identifying and understanding where th data is located and what data to move is a difficult task. Do not underestimate the challenges of this task.

Tip #2
When in doubt, leave it out until you know what you want to do with the data.

Top

 

 

Overview of Data Cleansing [Prevent Control]

The purpose of data cleansing is to correct, to standardize the format, and to fill in any values required by the data warehouse. This process also helps to identify redundant data, which will be prevented from entering the data warehouse during the loading process. The following is a simplistic example of data cleansing.

Before cleansing:

The Customer Information table displayed below contains data that is in a non-standard format. The data elements of FIRST_NAME, LAST_NAME, AND COMPANY_NAME contain information that is in an inconsistent format. In addition, the data element of STATE has missing information.

Customer Information Table

FIRST_NAME

LAST_NAME

COMPANY_NAME

AREA_CODE

PHONE

STATE

sAM

Adams

boston beer co.

617

3685000

MA

Sam

Adams

Boston beer co,.

617

3685000

MA

Samuel

Adams

Boston Beer Co.

617

3685000

 

SAMUEL

ADAMS

BOSTON BEER

617

3685000

MA

           

Martin

Zweig

Zweig Funds

800

2722700

NY

After cleansing:

The cleansing process standardized the data in the Customer Information Table into a common format that has identified redundant information, which will be excluded from the data warehouse during the load process.

Customer Information Table

FIRST_NAME

LAST_NAME

COMPANY_NAME

AREA_CODE

PHONE

STATE

Samuel

Adams

Boston Beer Co.

617

3685000

MA

Samuel

Adams

Boston Beer Co.

617

3685000

MA

Samuel

Adams

Boston Beer Co.

617

3685000

MA

Samuel

Adams

Boston Beer Co.

617

3685000

MA

           

Martin

Zweig

Zweig Funds

800

2722700

NY

Tip #3
Utilize software tools that migrate, clense and transform data. The Return on Investment (ROI) justifies purchasing software tools verses in-house developed SQL scripts. The costs associated with maintaning and enhancing the in-house developed SQL scripts will significantly exceed the cost of purchasing third party software tools.

Overview of Data Transformation [Prevent Control]

The purpose of data transformation is to manipulate the data into the format and structure required by the data warehouse. The transformation process should reduce the number of data elements that are loaded from the staging area into the data warehouse. While developing the transformation rules for this process, only those data elements that are required by the data warehouse are utilized. If there are data elements that are not needed for reporting or loading of the data warehouse, prevent them from entering the data warehouse by not incorporating them into the transformation or load statements. The following is a simplistic example of data mapping from the source system to the data warehouse format.

Top

The transformation rules incorporated into the data mapping example above are as follows:

Data Element from Source System

Transformation Rule

Data Element in the Data Warehouse

 

Sequence created in data warehouse.

CUSTOMER_ID

CUSTOMER_ID

Data element not loaded into warehouse.

 

CUSTOMER_NUM

One to one mapping.

CUSTOMER_NUMBER

FIRST_NAME

One to one mapping and concatenation with LAST_NAME.

FIRST_NAME

FULL_NAME

LAST_NAME

One to one mapping and concatenation with FIRST_NAME.

LAST_NAME

FULL NAME

AREA_CODE

Concatenation with PHONE.

PHONE

PHONE

Concatenation with AREA_CODE.

PHONE

ADDRESS1

Concatenation with ADDRESS2 AND ADDRESS3.

ADDRESS

ADDRESS2

Concatenation with ADDRESS1 AND ADDRESS3.

ADDRESS

ADDRESS3

Concatenation with ADDRESS1 AND ADDRESS2.

ADDRESS

CITY

One to one mapping.

CITY

STATE

One to one mapping.

STATE

ZIP_CODE

One to one mapping.

ZIP_CODE

LAST_UPDATE_BY

Data element not loaded into warehouse.

 

LAST_UPDATE_DATE

Data element not loaded into warehouse.

 

ATTRIBUTE1

Data element not loaded into warehouse.

 

ATTRIBUTE2

Data element not loaded into warehouse.

 

Tip #4
Validate transformation rules with the user community.

 

Top

Ensuring Data Integrity, Part III
Back

Ensuring Data Integrity Through the Use of Prevent and Detect Controls, Part III

This is the final part of a three part article by Jonathan Wu. Each section deals with the importance of data integrity and it’s effect on business intelligence.

There are two primary approaches to loading a data warehouse: 1) full or, 2) incremental refreshes. The full refresh approach begins by truncating the tables in the data warehouse and then loading them with all the required data. The incremental refresh approach identifies changes to the source data from the last time the data warehouse was loaded and then inserts, updates or deletes rows of data in each table of the data warehouse, as required. Either loading approach can prevent unwanted data from entering the data warehouse by containing: 1) conditions in the load statements, and 2) unique indexes on the target tables within the data warehouse.

Full Refresh

The full refresh loading approach can prevent unwanted data from entering the data warehouse by containing conditions in the load statements. By placing a broad range of conditions on each of the load statements, unwanted data is prevented from entering the data warehouse. For example, the data warehouse contains financial information in US dollars only. The loading statement should contain a condition that filters for only financial information recorded in US dollars and would prevent any non-US Dollar financial information from entering the data warehouse. After the load process is complete, unique indexes are places on each of the tables of the data warehouse. In the event that there are redundant values on data elements that are defined by the unique index, the RDBMS will display an error message or the unique index will not be created. In such an event, the redundant rows of data need to be identified within the data warehouse table and eliminated. Note: For performance purposes of the load statement, there should not be any indexes on the target tables during the loading. After the loading process is complete, create all necessary indexes.

Incremental Refresh

The incremental refresh loading approach can prevent unwanted data from entering the data warehouse by containing: 1) conditions in the load statements and, 2) unique indexes on the target tables within the data warehouse. As with the full refresh approach, conditions on each of the load statements prevents unwanted data from entering the data warehouse. In addition, unique indexes that are placed on the target tables within the data warehouse will prevent redundant data from entering the data warehouse. The RDBMS will either load only unique rows of data into the data warehouse or display an error message.

Tip #5
As the data warehouse grows, the full refresh approach becomes resource and time intensive. Plan on using the incremental refresh approach.

Overview of Reconciling the Data Warehouse [Detect Control]

The reconciliation process identifies data issues that slipped past the prevent controls. The reconciliation process is designed to provide assurance as well as identify data that does not agree with the information that is contained within the source systems. Reconciling data determines the accuracy and completeness of the information.

  • Quality of Data

Accuracy is assessed by the use of control totals on selected data elements, which are then compared to the anticipated results.

  • Quantity of Data

Completeness is determined by quantifying the number of records and comparing the results to the number of records anticipated.

There are two primary approaches to reconciling the data warehouse: 1) stage or, 2) complete reconciliation. The stage approach to reconciling data takes place after each data process flow stage while the complete reconciling approach occurs only at the completion of the load process. Which ever approach is utilized, reconciling the data warehouse will provide a safety net for identifying data exceptions as well as assist you with addressing questions and perspectives from all interested parties within your organization.

Complete Reconciliation

At the completion of every load process, a complete reconciliation is performed which compares information from the data warehouse to that of the corresponding source systems. The report displayed below is an example of a report that we produce for a complete reconciliation. Please note that the report identifies a data integrity problem that will require investigation.

Top

Stage Reconciliation

After each data process flow stage, reconciliation is performed. This approach is utilized when a complete reconciliation is not feasible due to the number of source systems or the complexities of the cleansing or transformation processes. With the stage reconciliation approach, accuracy and completeness of the data is determined after each of the following stages:

  • Data Migration

After the data from the source systems has been migrated to the staging area of the data warehouse, reconciliation is performed between source systems data and staging area data.

  • Cleansing

After the cleansing process has completed, reconciliation is performed between un-cleansed data, the exception listing and cleansed data in the staging area.

  • Transformation

After the transformation process has completed, reconciliation is performed between cleansed data, the exception listing and transformed data in the staging area.

  • Loading

After the loading process has completed, reconciliation is performed between transformed data in the staging area and the data in the data warehouse.

A report similar to the complete reconciliation report previously displayed should be developed after each stage reconciliation.

Tip #6After reconciliation has been performed, produce a report that is distributed to all individuals accessing or associated with the data warehouse. The report helps to establish credibility for the warehouse. After credibility has been established, reduce the distribution of the report to all interested parties or create an exception- based report.

Conclusion

As data warehouses become mission critical applications for an organization, the need to ensure the integrity of the data will increase. The success of a data warehouse rests with the users’ perceptions of it. If the data is inaccurate or incomplete, user confidence and use of the data warehouse will diminish. By establishing a control environment that incorporates prevent and detect controls, you will have created the means of providing user confidence and detecting abnormalities in the data.

Top