|
![]() |
|
Ensuring Data Integrity, Part I Back |
Ensuring Data Integrity Through the Use of Prevent and Detect Controls, Part I
This is the first part of a three part article by Jonathan Wu. Each section deals with the importance of data integrity and it's effect on business intelligence. Please look for the next two sections to be released shortly.
Introduction
There are numerous factors that can cause a data warehouse project to fail. From poorly gathered functional business requirements to lack of adequately skilled project team members, no factor is more devastating to the organization than the failure of the user community to accept the data warehouse because the information that it contains is of questionable quality. Understanding why there are problems with the information in the data warehouse can be a daunting task for the individual(s) responsible for resolving it. The problem may be due to any one or combination of the following items:
How does one approach and resolve data problems within a data warehouse? The purpose of this article is to address data integrity through the establishment of a control environment that mitigates and identifies data problems within a data warehouse.
Data Integrity
What is data integrity? For the purposes of this article, data integrity is defined as information that adheres to a strict standard of value and completeness. That is, the data is both accurate and the entire population of relevant data is contained within the data warehouse.
The credibility of the data warehouse solely rests with the integrity of its data. Therefore, understanding the impact of perceptions is important to appreciating the requirement of data integrity within the data warehouse. Usage of the data warehouse solely rests with the users’ perceptions of the data warehouse. Negative users’ perceptions will lead to a decline in its usage and value to the organization. First and foremost, the data warehouse should be reconciled to establish confidence in both the user and Information Systems communities that the data has integrity. Second, reconciling provides the means of addressing user questions concerning their data extracted from the warehouse.
The End-User Perspective
Throughout the life of a data warehouse, users will be questioning the information that they obtain from it unless the warehouse has established credibility within the user community.
During the weeks after a data warehouse is placed into production, users are going to ask the question "Can I rely on the information from the data warehouse"? In addition, the most common user question we encounter "Is the information in my query or report correct"? How do you address these questions? Your responses and resolutions to these questions will have an immense impact on users’ perceptions of the data warehouse.
The IS Perspective
After you have completed the initial load or an incremental update of the data warehouse, how do you know that all the required information loaded? You can check to see if the load process completed successfully or if there is any data in the warehouse tables. Assuming that the load process completed successfully, how do you know that the data is correctly loaded? If you do not have the ability to substantiate your gut reaction that everything is "OK", how can you address the users of the data warehouse? Remember that your responses and resolutions to their questions will directly impact their perceptions of the data warehouse.
Data Integrity Controls
How do you prevent meaningless, corrupt or redundant data from entering the data warehouse? How do you know that all the required information is loaded? By establishing an environment that incorporates controls into the data warehouse process flow, the integrity of the data will be ensured. Data integrity controls can be grouped into two categories:
Each category addresses data at different stages of the process flow.
Prevent Controls
Prevent controls aid in the integrity of data before it is loaded into the data warehouse. These controls are incorporated into the data migration, cleansing, transformation and loading processes and are the primary means of preventing meaningless, corrupt or redundant data from entering the data warehouse.
Detect Controls
Detect controls are those controls which assess the accuracy and completeness of data after it is loaded into the data warehouse or at each stage of the data process flow. These controls are incorporated into the reconciliation process and are the primary means of detecting incorrect or insufficient information within the data warehouse.
Data Control Points
At each stage of the data migration, cleansing, transformation and loading process, there exists an opportunity to ensure the integrity of the data before it is entered into the data warehouse. In addition, the reconciliation process identifies any discrepancies in the data after it has entered the data warehouse or at each stage of the data process flow. These data control points are highlighted in the following diagram:

Data Process Flow Stages:
William H. Inmon, a.k.a. the Father of Data Warehousing, has estimates that, on average, 80% of the efforts of building a data warehouse go into the data process flow stages.
|
Ensuring Data Integrity, Part II Back |
This is the second part of a three part article by Jonathan Wu. Each section deals with the importance of data integrity and it’s effect on business intelligence. Please look for the next section to be released shortly
.Overview of Data Migration [Prevent Control]
The purpose of data migration is to move data from selected source systems into the staging area of the data warehouse. Only data that is required by the users for reporting or is used during the transformation and loading processes should be moved. Meaningless information can be prevented from entering the data warehouse by not moving unnecessary information from source systems to the staging area of the data warehouse.
Data that should be moved into the staging area of the data warehouse would include reference and transaction data. For example, in a sales data warehouse, reference data would refer to customer information and transaction data would be the information associated with a sale to a customer. Information that typically should not be moved into the staging area of the data warehouse includes the RDBMS system administrator tables and source system application tables containing meta data or temporary processing data. For example, in an Oracle database, the data dictionary tables, such as ALL_TABLES, contains information that is not meaningful to end-users and would not be used in the transformation or loading process of the data warehouse. The following table highlights the type of data that should and should not be migrated to the staging area of the data warehouse.
|
Content of Table |
Migrate Data |
Do Not Migrate Data |
|
Source System Reference Data:
|
X |
X |
|
Source System Transaction Data:
|
X |
X |
|
RDBMS System Data |
X |
|
|
Source System Application Meta Data or Temporary Processing Data |
X |
|
Tip #1 |
|
Tip #2 |
Overview of Data Cleansing
[Prevent Control]The purpose of data cleansing is to correct, to standardize the format, and to fill in any values required by the data warehouse. This process also helps to identify redundant data, which will be prevented from entering the data warehouse during the loading process. The following is a simplistic example of data cleansing.
Before cleansing:
The Customer Information table displayed below contains data that is in a non-standard format. The data elements of FIRST_NAME, LAST_NAME, AND COMPANY_NAME contain information that is in an inconsistent format. In addition, the data element of STATE has missing information.
Customer Information Table
|
FIRST_NAME |
LAST_NAME |
COMPANY_NAME |
AREA_CODE |
PHONE |
STATE |
|
sAM |
Adams |
boston beer co. |
617 |
3685000 |
MA |
|
Sam |
Adams |
Boston beer co,. |
617 |
3685000 |
MA |
|
Samuel |
Adams |
Boston Beer Co. |
617 |
3685000 |
|
|
SAMUEL |
ADAMS |
BOSTON BEER |
617 |
3685000 |
MA |
|
Martin |
Zweig |
Zweig Funds |
800 |
2722700 |
NY |
After cleansing:
The cleansing process standardized the data in the Customer Information Table into a common format that has identified redundant information, which will be excluded from the data warehouse during the load process.
Customer Information Table
|
FIRST_NAME |
LAST_NAME |
COMPANY_NAME |
AREA_CODE |
PHONE |
STATE |
|
Samuel |
Adams |
Boston Beer Co. |
617 |
3685000 |
MA |
|
Samuel |
Adams |
Boston Beer Co. |
617 |
3685000 |
MA |
|
Samuel |
Adams |
Boston Beer Co. |
617 |
3685000 |
MA |
|
Samuel |
Adams |
Boston Beer Co. |
617 |
3685000 |
MA |
|
Martin |
Zweig |
Zweig Funds |
800 |
2722700 |
NY |
|
Tip #3 |
Overview of Data Transformation
[Prevent Control]The purpose of data transformation is to manipulate the data into the format and structure required by the data warehouse. The transformation process should reduce the number of data elements that are loaded from the staging area into the data warehouse. While developing the transformation rules for this process, only those data elements that are required by the data warehouse are utilized. If there are data elements that are not needed for reporting or loading of the data warehouse, prevent them from entering the data warehouse by not incorporating them into the transformation or load statements. The following is a simplistic example of data mapping from the source system to the data warehouse format.

The transformation rules incorporated into the data mapping example above are as follows:
|
Data Element from Source System |
Transformation Rule |
Data Element in the Data Warehouse |
|
Sequence created in data warehouse. |
CUSTOMER_ID |
|
|
CUSTOMER_ID |
Data element not loaded into warehouse. |
|
|
CUSTOMER_NUM |
One to one mapping. |
CUSTOMER_NUMBER |
|
FIRST_NAME |
One to one mapping and concatenation with LAST_NAME. |
FIRST_NAME FULL_NAME |
|
LAST_NAME |
One to one mapping and concatenation with FIRST_NAME. |
LAST_NAME FULL NAME |
|
AREA_CODE |
Concatenation with PHONE. |
PHONE |
|
PHONE |
Concatenation with AREA_CODE. |
PHONE |
|
ADDRESS1 |
Concatenation with ADDRESS2 AND ADDRESS3. |
ADDRESS |
|
ADDRESS2 |
Concatenation with ADDRESS1 AND ADDRESS3. |
ADDRESS |
|
ADDRESS3 |
Concatenation with ADDRESS1 AND ADDRESS2. |
ADDRESS |
|
CITY |
One to one mapping. |
CITY |
|
STATE |
One to one mapping. |
STATE |
|
ZIP_CODE |
One to one mapping. |
ZIP_CODE |
|
LAST_UPDATE_BY |
Data element not loaded into warehouse. |
|
|
LAST_UPDATE_DATE |
Data element not loaded into warehouse. |
|
|
ATTRIBUTE1 |
Data element not loaded into warehouse. |
|
|
ATTRIBUTE2 |
Data element not loaded into warehouse. |
|
Tip #4 |
|
Ensuring Data Integrity, Part III Back |
This is the final part of a three part article by Jonathan Wu. Each section deals with the importance of data integrity and it’s effect on business intelligence
.There are two primary approaches to loading a data warehouse: 1) full or, 2) incremental refreshes. The full refresh approach begins by truncating the tables in the data warehouse and then loading them with all the required data. The incremental refresh approach identifies changes to the source data from the last time the data warehouse was loaded and then inserts, updates or deletes rows of data in each table of the data warehouse, as required. Either loading approach can prevent unwanted data from entering the data warehouse by containing: 1) conditions in the load statements, and 2) unique indexes on the target tables within the data warehouse.
Full Refresh
The full refresh loading approach can prevent unwanted data from entering the data warehouse by containing conditions in the load statements. By placing a broad range of conditions on each of the load statements, unwanted data is prevented from entering the data warehouse. For example, the data warehouse contains financial information in US dollars only. The loading statement should contain a condition that filters for only financial information recorded in US dollars and would prevent any non-US Dollar financial information from entering the data warehouse. After the load process is complete, unique indexes are places on each of the tables of the data warehouse. In the event that there are redundant values on data elements that are defined by the unique index, the RDBMS will display an error message or the unique index will not be created. In such an event, the redundant rows of data need to be identified within the data warehouse table and eliminated. Note: For performance purposes of the load statement, there should not be any indexes on the target tables during the loading. After the loading process is complete, create all necessary indexes.
Incremental Refresh
The incremental refresh loading approach can prevent unwanted data from entering the data warehouse by containing: 1) conditions in the load statements and, 2) unique indexes on the target tables within the data warehouse. As with the full refresh approach, conditions on each of the load statements prevents unwanted data from entering the data warehouse. In addition, unique indexes that are placed on the target tables within the data warehouse will prevent redundant data from entering the data warehouse. The RDBMS will either load only unique rows of data into the data warehouse or display an error message.
|
Tip #5 |
Overview of Reconciling the Data Warehouse [Detect Control]
The reconciliation process identifies data issues that slipped past the prevent controls. The reconciliation process is designed to provide assurance as well as identify data that does not agree with the information that is contained within the source systems. Reconciling data determines the accuracy and completeness of the information.
Accuracy is assessed by the use of control totals on selected data elements, which are then compared to the anticipated results.
Completeness is determined by quantifying the number of records and comparing the results to the number of records anticipated.
There are two primary approaches to reconciling the data warehouse: 1) stage or, 2) complete reconciliation. The stage approach to reconciling data takes place after each data process flow stage while the complete reconciling approach occurs only at the completion of the load process. Which ever approach is utilized, reconciling the data warehouse will provide a safety net for identifying data exceptions as well as assist you with addressing questions and perspectives from all interested parties within your organization.
Complete Reconciliation
At the completion of every load process, a complete reconciliation is performed which compares information from the data warehouse to that of the corresponding source systems. The report displayed below is an example of a report that we produce for a complete reconciliation. Please note that the report identifies a data integrity problem that will require investigation.

Stage Reconciliation
After each data process flow stage, reconciliation is performed. This approach is utilized when a complete reconciliation is not feasible due to the number of source systems or the complexities of the cleansing or transformation processes. With the stage reconciliation approach, accuracy and completeness of the data is determined after each of the following stages:
After the data from the source systems has been migrated to the staging area of the data warehouse, reconciliation is performed between source systems data and staging area data.
After the cleansing process has completed, reconciliation is performed between un-cleansed data, the exception listing and cleansed data in the staging area.
After the transformation process has completed, reconciliation is performed between cleansed data, the exception listing and transformed data in the staging area.
After the loading process has completed, reconciliation is performed between transformed data in the staging area and the data in the data warehouse.
A report similar to the complete reconciliation report previously displayed should be developed after each stage reconciliation.
|
Tip #6After reconciliation has been performed, produce a report that is distributed to all individuals accessing or associated with the data warehouse. The report helps to establish credibility for the warehouse. After credibility has been established, reduce the distribution of the report to all interested parties or create an exception- based report. |
Conclusion
As data warehouses become mission critical applications for an organization, the need to ensure the integrity of the data will increase. The success of a data warehouse rests with the users’ perceptions of it. If the data is inaccurate or incomplete, user confidence and use of the data warehouse will diminish. By establishing a control environment that incorporates prevent and detect controls, you will have created the means of providing user confidence and detecting abnormalities in the data.