DISCOVERING DATA LINEAGE IN DATA WAREHOUSE: METHODS AND TECHNIQUES FOR TRACING THE ORIGINS OF DATA IN DATA-WAREHOUSE

(1)

METHODS AND TECHNIQUES FOR TRACING THE ORIGINS OF DATA IN DATA-WAREHOUSE

By

Roselie B. Webjornsen

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE AT

UNIVERSITY OF OSLO OSLO, NORWAY

AUGUST 2005

Copyright by Roselie B. Webjornsen , 2005c

(2)

Date: August 2005 Author: Roselie B. Webjornsen

Supervisors: Naci Akkok and Judith Gregory

Title: Discovering Data Lineage in Data Warehouse:

Methods and techniques for Tracing the origins of data in data-warehouse

Department: Informatics Degree: M.Sc.

Signature of Author

iii

(3)

Table of Contents iv

List of Tables vii

List of Figures viii

Abstract x

Acknowledgements xi

1 Introduction 1

1.1 Architecture of Data Warehousing System . . . 2

1.1.1 Data Acquisition, Integration and Maintenance . . . 3

1.1.2 Reporting and Analysis . . . 4

1.2 Data Lineage . . . 5

1.2.1 Data Lineage Tracing . . . 6

1.2.2 Data Lineage Benets and Application . . . 6

1.3 Research Problem and Contribution to Knowledge . . . 7

1.3.1 Data Lineage Problem . . . 7

1.3.2 Academic Research . . . 8

1.3.3 Industry Implementation . . . 9

1.3.4 Data Lineage Solution . . . 10

1.4 Related Works . . . 11

1.5 Thesis Scope and Delimitation . . . 13

1.6 Thesis Outline . . . 14

2 Existing Lineage Tracing Mechanisms 15 2.1 Sample Case . . . 16

2.2 Data Lineage Granularity Level . . . 19 iv

(4)

2.3.1 Identifying Lineage information . . . 35

2.3.2 Storing Lineage information . . . 38

2.4 Reporting the Data Lineage . . . 42

2.4.1 Metadata Query . . . 43

2.4.2 View Tracing Query . . . 44

2.4.3 Tracing Procedure . . . 47

2.4.4 Inverse functions . . . 47

3 Industry Implementations 48 3.1 Standardization Eorts . . . 49

3.1.1 Organization of Standard Metamodel . . . 50

3.1.2 Metamodels for Data Lineage Support . . . 53

3.2 Industry Solutions . . . 57

3.2.1 Industry's view of Data Lineage . . . 58

3.2.2 ETL and Metadata Services . . . 58

3.2.3 Data Lineage Reporting . . . 79

4 Characterizing Data Lineage 86 4.1 Aspects of Data Lineage Problem . . . 86

4.1.1 Source Systems . . . 87

4.1.2 Sources of data . . . 88

4.1.3 Data Transformation . . . 90

4.1.4 Performance . . . 92

4.1.5 Maintenance . . . 93

4.2 Composition of Data Lineage . . . 94

4.2.1 Data Lineage Containers . . . 95

4.2.2 Metadata for Lineage Tracing . . . 96

4.3 Data lineage Information . . . 98

4.3.1 Mapping Information . . . 99

4.3.2 Lineage Data Values . . . 100

5 Data Lineage Solution 103 5.1 Data Lineage Tracing Mechanism . . . 103

5.1.1 Map . . . 104

5.1.2 Reconstruct . . . 105

5.1.3 Step-by-step . . . 106 v

(5)

5.3 Design . . . 117

5.3.1 Lineage Components . . . 117

5.3.2 Design considerations . . . 119

5.3.3 Design Components . . . 120

5.4 Implementation . . . 122

5.4.1 Recording . . . 123

5.4.2 Reporting . . . 123

6 Conclusion and Future Work 127

Bibliography 129

vi

(6)

2.1 Summary of Cui's Transformation Properties - source from [7] page 160 31 2.2 Transformation Summary for AbsenceStatistics Warehouse . . . 32

vii

(7)

1.1 Basic data warehousing architecture - adapted from [7], page 2 . . . 2

2.1 Source and Target Tables . . . 16

2.2 Source Tables Content . . . 17

2.3 Transformation Steps - adapted from Cui [7] . . . 18

2.4 Transformation Graph - adapted from Cui [7] . . . 19

2.5 Target Content . . . 19

2.6 Multi-dimensional Schema . . . 22

2.7 Mapping Scenario - adapted from [52] . . . 23

2.8 Meta Data Storage Schema - Source from [52] page 9 . . . 24

2.9 Lineage for View V2 . . . 28

2.10 Transformation Instance - source from [7] . . . 29

2.11 Transformation Classes - source from [7] . . . 29

2.12 Metadata Storage Implementation from m1 mapping - adapted from [52] . . . 41

3.1 The Metadata Standard Metamodel - source from [40] . . . 50

3.2 Sample Transformation Package - source from [40] . . . 55

3.3 Transformation Classes and Associations - source from [40] . . . 57

3.4 SAP BW MetaData Repository - source from [28] . . . 60

3.5 SAP BW Metadata Objects in context - source from [28] . . . 61

3.6 SAP BW ETL Services architecture - source from [28] . . . 63

3.7 Oracle Warehouse Builder Basic Architecture - source from [42] . . . 66 viii

(8)

3.10 DTS Package illustration - source from [37] . . . 70

3.11 Microsoft Metadata Services interfaces - source from [37] . . . 72

3.12 Typical TBW Load Architecture - source from [49] . . . 74

3.13 LineageModel within Models of Teradata MDS - source from [46] . . 75

3.14 Ascential Data Stage Designer - source from [39] . . . 77

3.15 SAP BW Example of Data Load Dataow Report . . . 81

3.16 An example of OBW Lineage Report - - source from [34] . . . 82

3.17 Teradata Additional Lineage Summary - source from [49] . . . 83

3.18 Data Lineage Menu - source [39] . . . 84

3.19 Data Lineage Path - source [39] . . . 85

5.1 Summary of Data Lineage Tracing Mechanism . . . 104

5.2 Lineage System Conceptual Framework Component . . . 108

5.3 Data lineage within the warehouse environment . . . 118

5.4 Designing Data lineage . . . 121

5.5 Lineage Data ow . . . 124

ix

(9)

A data warehouse enables enterprise-wide analysis and reporting functionality that is usually used to support decision-making. Data warehousing system integrates data from dierent data sources. Typically, the data are extracted from dierent data sources, then transformed several times and integrated before they are nally stored in the central repository. The extraction and transformation processes vary widely - both in theory and between solution providers. Some are generic, others are tailored to users' transformation and reporting requirements through hand-coded solutions. Most research related to data integration is focused on this area, i.e., on the transformation of data. Since data in a data warehouse undergo various complex transformation processes, often at many dierent levels and in many stages, it is very important to be able to ensure the quality of the data that the data warehouse contains. The objective of this thesis is to study and compare existing approaches (methods and techniques) for tracing data lineage, and to propose a data lineage solution specic to a business enterprise data warehouse.

x

(10)

First and foremost, I thank God for blessing me with family, friends and people who helped me make this thesis a realization.

I thank my family for their love and support; my parents for teaching me the value of hardwork, my brothers for believing in me, my sons for understanding my busy schedule and most especially my dear husband, Henry, who sacriced his desires and convenience so that I could write and complete this thesis.

My heartfelt thanks to my supervisors; Naci Akkok who unselshly shared his knowledge and insights with depth and humor and Judith Gregory for helping and guiding me and teaching me how to write a thesis. If it wasn't for both of you, I would not have come this far.

I thank my leaders and managers, especially Morten Morch who believes in me, motivates me and causes me to perceive things in their proper perspective.

I thank my colleagues and friends, Ragnar, Cathy, Armida, Aurea and many others who have inspired me in many dierent ways.

xi

(11)

Introduction

A data warehouse enables enterprise-wide analysis and reporting functionality that is usually used to support decision-making. Data warehousing system integrates data from dierent data sources. Typically, the data are extracted from dierent data sources, then transformed several times and integrated before they are nally stored in the central repository. The extraction and transformation processes vary widely - both in theory and between solution providers. Some are generic, others are tailored to users' transformation and reporting requirements through hand-coded solutions. Most research related to data integration is focused on this area, i.e., on the transformation of data. Since data in a data warehouse undergo various complex transformation processes, often at many dierent levels and in many stages, it is very important to be able to ensure the quality of the data that the data warehouse contains. The objective of this thesis is to study and compare existing approaches (methods and techniques) for tracing data lineage, and to propose a data lineage solution specic to a business enterprise data warehouse.

1

(12)

Figure 1.1: Basic data warehousing architecture - adapted from [7], page 2

1.1 Architecture of Data Warehousing System

Data warehousing is the process of extracting, cleaning, transforming, and loading data into a data warehouse for reporting and analysis [6]. Figure 1.1 (improvised dia- gram from Cui's thesis [7]), illustrates a basic architecture of data warehousing system consisting of two major components, namely; the Data Acquisition, Integration and Maintenance, and the Reporting and Analysis. The data Acquisition Integration and Maintenance component encompasses complex and numerous processes which mainly involve data extraction, cleaning, transformation and loading processes. The data in the data warehouse act as the central point of data integration (data are gathered and integrated from dierent sources) as well as the point of distribution (data are delivered to consumers of information for reporting, analysis and data mining) [26].

(13)

1.1.1 Data Acquisition, Integration and Maintenance

Data acquisition and integration is an important process in data warehousing. The data in the data warehouse passes through cleansing and integration processes before it enters the data warehouse [27]. Melding data from heterogenous and disparate sources is a major challenge (e.g. given the dierences in naming, domain denitions, identication numbers, and the like) [18]. Data integration processes are indeed complex and prone to error. Since data in a data warehouse is often integrated from a variety of sources and transformed by complex processes, the original source is often obscured.

Data Transformation

Data Acquisition and Integration is generally part of Extraction Transformation and Loading, ETL. ETL is generally referred to as Data Transformation and is a well known process cycle inherent in warehousing environment [13]. The typical end result of an ETL process in a warehousing environment are data stored in multidimensional schema. The ETL typically contains Transformation programs which perform data cleansing, integration, and summarization tasks over the source data before loading them into the warehouse [7]. In this thesis, we evaluated existing data transformation solutions and studied their approaches to data lineage. The result of the evaluation forms part of the basis of the proposed data lineage solution.

Maintenance

Maintaining the objects, processes and data is a major administrative task in data warehousing systems. Data lineage reporting plays a very important role in data

(14)

warehouse maintenance tasks. Being able to view the origins of data is very important from design to implementation. A common warehouse administrative task for example is changing, updating or just plain analyzing a specic transformation task (e.g. analysis of sources and targets mapping). Another common example for administrative task is analysing the data quality uploaded to the warehouse. This thesis, presents the dierent approaches of data lineage maintenance implemented in existing solutions. These existing data lineage solutions are considered to form part of our proposed data lineage solution.

1.1.2 Reporting and Analysis

Providing integrated access to multiple, distributed, heterogenous databases and other information sources has become one of the leading issues in database research and industry [54]. The main objective of the data warehouse is to provide "integrated"

enterprise-wide reporting and analysis. A data warehouse does not just facilitate data- intensive and frequent ad hoc queries [18], it is also used for further data exploration and data mining. In a business enterprise context, a typical example of a data mining application is enabling statistical analysis (on integrated and mined data) of the past behavior and actions of the enterprise in order to understand where the organization has been, determine its current situation, and predict or change what will happen in the future [26].

The mission of a data warehouse is to provide information to most eectively support decision-making [6, 28, 18]. One of the important factors to achieve this mission is to ensure data quality. Having the ability to navigate and "drill-down"

reports to certain levels of detail supports ecient data analysis, but being able to

(15)

"drill-through" to data origin provides for a deeper level of understanding of the data in question. A question may be raised about the origin of data for validation or for legal purposes. This thesis considers Reporting and Analysis which answers questions related to data origin and data transformation history, the data lineage.

Illustrations of existing lineage tracing mechanisms are presented in this thesis and the same mechanisms are considered to form part of our proposed data lineage system components.

1.2 Data Lineage

Discovering the origin of a specic warehouse data element is known as the data lineage problem [12]. Basically data lineage is the answer to the question, "where did this data come from?" [17]. The answer to this question is often not straightforward because the data in question can be either a derivation from a complex view¹ or a result of complex transformation [7]. Transformed data usually pass through dierent stages and at some point of each stage, they may undergo calculation based on complex procedural code. The "where" question does not just refer to the sources of data, but it also describes how the data was derived or transformed from its source [3, 55, 16]. In general, the data lineage of a datum consists of its source and its entire processing history [56]. Knowledge about the data sources and subsequent processing history of a piece of data is fundamental to assess the data quality and reliability [23]. For some applications, data lineage traceability is imperative (e.g.

for data warehouses that produce ocial reports where legal dispute over data is a possibility).

1Often several layers of simple to complex views.

(16)

1.2.1 Data Lineage Tracing

The ability to visualize data in a reverse order, following its transformation step by step, is called lineage data process [12]. Dierent approaches in lineage tracing are proposed in dierent studies [52, 25, 4, 2, 19, 56, 7, 9, 11, 13, 55]. Each of these approaches are evaluated to form part of the basis for the data lineage solution proposed in this thesis.

1.2.2 Data Lineage Benets and Application

Supporting lineage tracing in data warehousing environments brings several benets and applications. Some of these benets are enumerated below:

In-depth data analysis - Being able to trace back to the origins of data is useful and sometimes even necessary for in-depth data analysis. To some application where data produced in the data warehouse is used for legal purposes, one important requirement would be data traceability.

Investigation of anomalies - Sometimes data in data warehouse may seem anomalous. The analyst may pose a question, "where did this data come from?"

and may require an explanation of how it is derived. In the absence of data lineage, it would be dicult and costly if not impossible to answer this question.

Debugging - Data Lineage can be used to investigate the source data programs that produced anomalous data [56]. Having the possibility to report on the source-to-target mappings and seeing the transformation description (i.e.

derivation process, operators) is an important aspect for data warehouse maintenance.

(17)

Impact Analysis - Dierent forces or inuences may impose changes to transformation processes or maybe even to data warehouse schemas. Businesses evolve and internal and external requirement may change over time. The impact of these changes to the data warehouse can be analyzed prior to implementation in the existence of data lineage. Impact Analysis determines which tables, columns and processes are aected by changes. In addition, lineage information allows the user to trace the impact of faulty source data or buggy programs on derived sets.

Facilitating data mining and data discovery - Knowing about the sources of data can enhance data mining and data discovery processes. It also improves condence on data used for data mining.

Correct Data in the Operational System - Data Lineage can trace back to cleansed data and can be used as a feedback loop to correct data back in the operational system.

1.3 Research Problem and Contribution to Knowl- edge

This section discusses the data lineage problem and the overview of our approach in solving the problem.

1.3.1 Data Lineage Problem

In a real data warehouse implementation, data are not only derived by xed sets of operators or algebraic properties. Most often data are transformed by a complex programming procedure. In practice, these mean sometimes as many as 60 or more

(18)

transformations [7]. Transformation can often be so complex that implementing data lineage tracing is considerably dicult and open-ended [8, 7]. In addition, some of the transformation logic is customized and hand-coded and may not adhere to any dened formal transformation. As the data traverse through dierent transformations, the traces to the data origin become obscure (e.g. the format is changed) and somewhere in between the complexity of transformation process, an anomaly may occur. With- out data lineage traces, discovering this anomaly would be dicult if not impossible.

These and many other aspects in data lineage make implementing a formal approach to lineage tracing a big challenge. This challenge stirs the interest of the research community resulting in a number of studies presenting formal calculation (i.e. algorithms and standard procedures) for tracing data lineage [7, 56, 55, 21, 2, 25, 4, 19, 8, 11, 10].

Existing solutions in the industry provide lineage solutions at higher granularity level [34, 14, 53, 44, 39, 24, 17, 38, 28, 49], but lack the eectiveness of the nest level of lineage granularity that academe presents. In addition, most business enterprise data warehouse implementation blueprints do not include a clear strategy for tracing data lineage [51]. This thesis combines and distills the dierent data lineage existing solutions both from university research and industry and formulates a conceptual framework for a data lineage system.

1.3.2 Academic Research

In academic research, dierent approaches in lineage tracing are introduced, from coarse-grained [4, 2, 25] to ne-grained [56, 45, 7] data lineage. Most existing works on data lineage in the academe focus on providing algorithms and procedures to trace back data from sources. Although, this thesis does not go into detail on algorithms and procedures, we provide illustrations on the existing tracing mechanisms

(19)

introduced in dierent data lineage studies. The purpose of these illustrations is to present some details of pertinent concepts behind the data lineage solution introduced in this thesis. Data lineage problem is a challenge that needs more than just lineage tracing mechanisms; it requires strategic initiatives [51]. This thesis provides a higher level of data lineage solution by introducing a conceptual framework for data lineage systems components.

1.3.3 Industry Implementation

This thesis presents dierent data lineage solutions implemented by some prominent players in industry. The intention is to provide an overview of existing features that we combine to our data lineage solution - a solution which attempts to cover the dierent aspects of the data lineage problem at a strategic level. The prominent solution providers² of data integration which we included in our evaluation show data lineage features. However, to the best of my knowledge, these lineage features do not oer the nest level of granularity, which the academic works introduce. For example, the industry solutions provide data lineage reporting functionality that reports only against the metadata [34, 14, 53, 44, 39, 24, 17, 38, 28, 49]. These features can provide reporting about sources and targets (i.e. sources-to-target mapping) and may allow

"drill-down" to transformation descriptions. However having the ability to "drill- through", or to trace back the data lineage in step-wise fashion, oers an avenue of improvement. This is one of the important considerations in data lineage solution introduced in this thesis, although this thesis does not detail data lineage reporting

2Data integration solution providers included in this thesis are SAP Business Information Ware- house, Oracle, Teradata, Microsoft and Ascential Software.

(20)

mechanisms. Metadata³ plays an important role in data lineage solutions. For this reason we also evaluated the metamodel⁴ standardization eorts provided by OMG⁵, the Common Warehouse MetaModel (CWM)⁶ [40]. CWM provides constructs that support data lineage solutions. This thesis considered the metamodel constructs as a basis for the part of the data lineage solution covering transformation metadata design.

1.3.4 Data Lineage Solution

The data lineage problem is seen in dierent angles and perspective in the academe and industry. Dierent data lineage approaches show partial solutions addressing some aspects of the data lineage problem. Most of the academic works consider the data lineage on a data value granularity level, while the industry focuses more on schema mappings. In industry, the schema mapping report is considered as a data lineage report, but in the academe, the data lineage report not only produce schema, schema elements and transformation descriptions, it also produces lineage data values themselves. The academe details formal approaches for data lineage computation (algorithms and procedures), while the industry provides features and tools that support data lineage. Academic research may seem to provide better solutions in terms of data lineage solution concepts; however industry provides tangible solutions as evidenced by real data warehouse implementations. In this thesis, we combine and

3Metadata is generally known as the data about the data. A typical example is the system catalog in relational DBMS. The system catalog describes data denition (i.e. it describes the tables and columns that contain your data.) Some tools can manipulate system catalog as you would manipulate any other data. Examples of manipulating metadata include viewing data lineage or table information. In this case, metadata are treated as ordinary data.

4Metadata model

5Object Management Group.

6Metamodel Standard specication mainly focusing on interchanging metadata.

(21)

distill dierent approaches to data lineage problem. First, we describe data lineage by presenting the dierent aspects of the data lineage problem and classifying dierent types of data lineage information. This thesis presents a higher level but holistic data lineage solution by providing a conceptual framework for data lineage systems components relevant to business enterprise contexts. This work can be used as a reference or a basis for designing and integrating data lineage solution to the business enterprise data warehouse.

1.4 Related Works

The data lineage problem attracts interest in the research community. In this section, we present the dierent works related to data lineage. Most of these works generally provide data lineage tracing approach which help constructs the solution proposed in this thesis.

Fan and Poulovassilis [21], introduced lineage tracing approach based on schema transformation pathways. In this work, they show how individual transformation steps is used to trace the derivation of the integrated data in a step-wise fashion.

However their work focuses on a Hypergraph Data Model, (HDM) which is designed to suit automed integration system [20, 22], this thesis focuses more on Relational Data Models.

Other earlier works oered coarse-grained or schema-level lineage tracing. The approaches propose lineage tracing based on annotations or attributions [25, 4, 2]. [25, 2] use data derivation information which is stored in metadata. Faloutsos, Jagadish and Sidiropoulos in [19] also use metadata in addition to condensed information (e.g.

summary data) to trace data lineage. Statistical calculation is performed against

(22)

the metadata and condensed information to reconstruct the estimated value of data origin.

The approach proposed by Woodru and Stonebraker does not use metadata.

Rather than relying on metadata, their approach computes lineage using a limited amount of information about the processing operators and about the base data [56].

[56] introduced lineage tracing based on weak inversion. Weak inversion does not perfectly invert the data, it uses weak inversion and verication to provide a number of guarantees about the lineage it generates. The weak inversion and verication functions are to be registered by the user to the DBMS. When tracing a data lineage, an inversion planner determines which weak inversion and verication function to invoke, constructs a plan and then executes the plan by calling the corresponding sequence of functions within the DBMS.

All of the data lineage approaches above do not guarantee accurate and exact tuple or tuples of the origin, but these approaches are potentially feasible in certain situations and can be practical solutions to some business enterprise-specic requirements.

Cui and Widom oer lineage tracing algorithms for relational views with aggregation [12]. These algorithms provide a means for users to select a data warehouse view tuple and "drill-through" to examine the exact source tuple or tuples that produced the view. Lineage tracing involves other aspects also, such as performance and maintenance. In [10], Cui and Widom provide a way of optimizing lineage tracing by introducing auxiliary views. Auxiliary views are parts of data lineage which are physically stored in the data warehouse. Cui and Widom, in [8] further introduced

(23)

algorithms based on general transformation properties⁷. For each transformation property, an equivalent tracing procedure is used to enable lineage tracing. Cui's and Widom's works gained signicant contribution in data lineage problem and are cited in many data lineage related works.

Bose does not introduce a data lineage tracing mechanism but rather presents a framework for composing and managing data lineage specic to scientic data [3].

This thesis provides lineage solution for business enterprise data warehouse by characterizing data lineage and providing framework for data lineage system components.

1.5 Thesis Scope and Delimitation

This thesis evaluates and describes dierent lineage tracing mechanisms in theory as well as the existing implementations in the industry. Existing theories which form part of the data lineage solution presented in this thesis are illustrated. Additionally, existing data lineage implementations in the industry are presented. Based on the evaluations, (i.e. both in theory and industry), a data lineage solution which is specic to business enterprise context, is introduced. This thesis attempts to provide a holistic view of the dierent aspects of the data lineage problem and aims to introduce a solution which covers these dierent aspects. However this thesis does not go into detail on algorithms and procedures, although illustrations on some algorithms and procedures extracted from dierent academic research are provided.

7Transformation property tells the type of transformation applied to the data.

(24)

1.6 Thesis Outline

The thesis is organized as follows. Chapter 2 elaborates dierent tracing mechanism based on dierent data lineage related studies. Chapter 3 discusses the Standard metamodel specication which provides metamodel constructs that support data lineage. Additionally, the overview of data lineage implementation for each solution providers included in the evaluation in the industry is presented. Base on the evaluations, Chapter 4 characterizes data lineage by describing the dierent aspects of the data lineage problem and classifying the dierent types of data lineage. Chapter 5 proposes a data lineage solution covering the aspects of the data lineage problem discussed in chapter 4. In Chapter 5 we introduce the conceptual framework for lineage system components and discuss data lineage design and implementation. In our dis- cussion we emphasize that lineage system Components must be seamlessly integrable with the data warehousing system. Finally, Chapter 6 discusses the conclusion and the possible future works related to the proposed solution.

(25)

Existing Lineage Tracing Mechanisms

This chapter presents dierent tracing mechanisms which are gathered from previous works related to data lineage. Section 2.1 provides a sample case which shall be used as reference for illustrating the dierent data lineage approach. Section 2.2 discusses the basic lineage granularity levels as commonly described in academic works related to data lineage. In addition, section 2.2 also illustrates examples for each granularity level. Section 2.3 highlights dierent approaches in storing data lineage physically and section 2.4 presents the dierent ways for physically retrieving the data lineage for reporting purposes.

The lineage tracing mechanisms presented in this chapter are mostly based on the previous works of [25, 4, 2, 19, 56, 12, 10, 8, 52]. The main intention of this chapter is to describe and highlight the dierent concepts of existing tracing mechanisms.

However this only focuses on the overview of each approaches and does not detail formal calculations (i.e. algorithms and procedures) presented in the referred previous works. Each or a combination of each tracing mechanisms may be applicable to certain enterprise data warehouse tracing requirements. Therefore, knowing the overview of

15

(26)

Figure 2.1: Source and Target Tables

these approaches is useful to determine which specic solution is applicable to which problem in a data warehouse environment. For details on algorithms and procedure, refer to [25, 4, 2, 19, 56, 12, 10, 8, 52].

2.1 Sample Case

This section is prepared for illustration purposes. This illustration shall be used in dierent tracing mechanisms presented in this chapter.

Source Data. Warehouse data AbsenceStatistics are produced based on three source tables: Employee, WorkShedule and Attendance. See to gure 2.1 which presents the source tables and target tables. Employee table is mostly self-explanatory.

WorkSchedule table stores the number of hours that the employees are supposed to be working each month. The Attendance Table is also self-explanatory except for the AttendanceType which indicates if hours are entered as ordinary working hours, sick leaves or overtime. Figure 2.2 shows the content of the source tables.

(27)

Figure 2.2: Source Tables Content

(28)

Figure 2.3: Transformation Steps - adapted from Cui [7]

Warehouse Data. Suppose that the Head of Health and Safety Environment, (HSE) Division would like to analyze healthy working environment for each department. Sickness Absence may not always be directly related to healthy working environment but may also, for example, indicate employees' work satisfaction. Therefore this warehouse view may also serve the department head to analyze the absences of his or her employees. To materialize a view in the data warehouse that has this information, the transformation steps shown in gure 2.3 are implemented.

Target Views represented by Vi where i is the sequence number of Views in each transformation steps, are views that store transformed data which may function as an intermediate view for further transformation or the ultimate data target. The transformation steps traversed by the warehouse data result in the intermediate views V1,V2,V3,V4 , refer to gure 2.3. To simplify the transformation illustration, we use the transformation graph shown in gure 2.4, similar to Cui's illustration in [7].

(29)

Figure 2.4: Transformation Graph - adapted from Cui [7]

Figure 2.5: Target Content

The data shown in gure 2.5 are loaded to the warehouse after a series of transformation. Refer to gures 2.3 and 2.4.

2.2 Data Lineage Granularity Level

Lineage granularity pertains to the level of details of the data lineage when tracing back to its origin. Dierent data lineage related works in the academe describe data lineage granularity in two main categories; Schema level and Instance level. The rst refers to the origin of data in terms of data structures (e.g. column to column mapping), the latter refers to the original values themselves from which the data are

(30)

derived. In this section, we look in to the dierent academic works and describe each of their data lineage tracing approaches. Section 2.2.1. presents the Schema level way of tracing the data lineage and section 2.2.2. presents the Instance level approach.

2.2.1 Schema-level approach

Schema-level or coarse-grained tracing approach refer to the origin of data not in terms of data values but in terms of schema elements and transformations from which they are derived. Schema-level approaches utilize metadata repository to store lineage information (e.g. transformation or derivation set) involving a warehouse data item [2, 52]. [2, 52] introduce a possibility to tag each warehouse instance with identier of the derivation set that produced it. The identier stored in the warehouse instance and the derivation set stored in the Meta data repository are then used for query computation during lineage reporting.

In [2], the equivalent of a derivation set is named a package. Each package is a workow that denes a transformation. To enable lineage traceability, a package identier is stored in a warehouse item. When a user asks about the data lineage of a warehouse item, the package identier is used to view the package transformation description, (e.g. the sources and targets of the transformations and the computation applied to the data in transformation process).

Applying the method of [2, 52] in our example, we then assume that the transformation information such as transformation steps, intermediate view denitions etc., are already stored in Meta data repository . Referring to our sample case in section 2.1, the transformation Steps T1 to T5 and View Denitions V1 to V5 are assumed to be already stored in the Meta data repository.

(31)

Schema Level Tracing Scenarios

The transformation process shown in gure 2.2 and gure 2.3 can be seen as the transformation package as described in [2]. Additionally to complete the Target Schema, according to [2], we add the PId attribute which stands for Package identier. The resulting table would then be AbsenceStatistics (Department, CalMonth, PAb- sence, PId). To simplify our example, we refer only to the warehouse target as a one dimensional schema (typically, data in the data warehouse for OLAP are loaded in a Multidimensional Schema). Assuming that our data target is a multidimensional schema, when we add package identier, PId, we are actually adding a dimension to the schema. We name it package dimension for illustration purposes. Dimensions are simply tables linked to the main table which is known in data warehouse as the fact table. The package dimension that we add may contains pertinent information about the data, for example package descriptions among others. Figure 2.6 illustrates an example of the Multidimensional Schema that has a package identier, PId. Package dimension is similar to the idea of audit dimension mentioned in [6]. Although the intention of audit dimension in [6] is to indicate data quality, it does capture important ETL processing and contains a description of the xes and changes that have been applied to a specic row in a fact table. In addition, it may also contain more attributes that may describe the data lineage.

Reporting data lineage using the package dimension may initially use the data in the Package dimension if the user would only require lineage overview (e.g. data source, timestamps). From here, the user may trace back the transformations and queries regarding the source (including intermediate views) to target mappings. In

(32)

Figure 2.6: Multi-dimensional Schema

each transformation step, the user may also view information that describes the transformation step (e.g. calculation applied to the source or intermediate views resulting to intermediate or target views).

A schema-level related work in [52] details on the mapping mechanisms utilizing the Meta data repository. [52] introduced mappings in the form of foreachQs and existQt. Qs and Qt stands for source and target queries respectively. These Queries are used to describe the mappings. The source query describes what to retrieve from the source and the target query describes how the retrieved data will be structured to the destination schema. For each mapping, schema or schemas may be transformed in terms of element name or element types and the data may be aggregated or calculated.

Figure 2.7 shows an example of this mapping to populate the target schema using our sample case in section 2.1. Consider the mapping m1 in our example shown in gure 2.7. The source column Ahours of this example is transformed to AsumHours in the target schema while the data value is aggregated to summarize the "SickLeave"

hours for each employee in each month.

In addition to this mapping mechanism, [52] introduced a Meta data storage schema which includes seven tables shown in gure 2.8. Mapping table is the main

(33)

Figure 2.7: Mapping Scenario - adapted from [52]

(34)

Figure 2.8: Meta Data Storage Schema - Source from [52] page 9

responsible table for linking the source-to-target schemas. A mapping table consists of mid, forQ and conQ columns. The mid is the unique key identifying a mapping instance. The two columns, forQ and conQ contain the source and target select clause respectively. The source query entered in foreach describes what to retrieve from the source and the target query entered in exists describes how the retrieved data will be structured to conform to target schema.

In addition to the mapping table, the source-to-target schema elements are stored in the Element relation which consists of element ID, name, type (e.g. string), parent (i.e. the origin of the element) and db (i.e. the data source). Three tables are used to contain the schema-to-target schema denitions: Query, Binding and Condition. Each query has an indentication which is stored in a Query table. The Binding table stores the from clause and Condition table stores the where clause of the source-to-target queries.

Looking back at our sample case in section 2.1., what if, for example, the employees within the same department serve dierent projects and each project has its own time- sheet application database. This scenario extends our sample case because the data source now originates from dierent databases. Our mapping example in gure 2.7 is extended to record the source databases of the source items.

(35)

Suppose the head of the department investigates the source of the specic tuple in the data warehouse. The information stored in Meta data repository will make it possible to query on schema elements and transformations which will show the manager how the data are transformed and which source databases the values in the warehouse came from.

2.2.2 Instance-level approach

Instance-level or ne-grained tracing attempts to nd the specic values in the base tables that justify the appearance of data in a warehouse view. While schema-level tracing refers to the origin of data in terms of schema elements and transformations description, instance-level tracing refers to the data value itself. The following sub- sections describe each dierent method of tracing data lineage and provide examples.

The examples given are based on the works [19, 7, 55, 9, 11, 12, 8, 7].

Views and Transformations

Widom and Cui in their works [9, 11, 12, 8, 7] describe tracing mechanisms by which the exact or nearly exact data lineage for a given warehouse data item can be produced. In their works, they provide algorithms for lineage tracing for data warehouse views and lineage tracing for general data warehouse transformations. When a warehouse materialized view is populated using standard relational operators, then it is possible to trace the exact lineage tuples for a given warehouse item using the algorithms Widom and Cui provided in [12, 9, 7]. However, in practice, a data warehouse is often populated through a complex transformation process. In this case, Widom and Cui introduced lineage tracing for general warehouse transformations. In their

(36)

works, they identied eleven transformation properties which become the basis for selecting a tracing procedure to be used in lineage tracing.

Lineage tracing for Views. This lineage tracing mechanism considers a relational scenario. The data warehouse in this scenario is populated through simple to complex relational views which are specied using select, project join, aggregation, union, intersection and dierence operators. [12, 9, 7] develop tracing algorithms for this scenario. The tracing algorithms use tracing procedure that is automatically created based on the view denition.

Computing the lineage of data warehouse view generally requires not only the view denition but also the original data and sometimes even additional information.

In data warehousing implementation, this requirement is not an appealing solution because data may come from dierent sources. For example, data may come from dierent systems (e.g. legacy systems) or databases, or may reside in an inaccessi- ble logical location (e.g. source data maybe archived) among many other practical reasons.

[7, 12] introduced view lineage tracing without requiring the data source with certain practical trade-os (e.g. loading against tracing performance). To achieve this goal, [7, 10] proposed the idea of auxiliary views. These views reside within the data warehouse environment and contains some part of lineage information (see to section 2.3.2). In addition to this, [7, 10] introduced dierent options to store auxiliary views by providing algorithms for their implementation and maintenance.

To illustrate view lineage tracing introduced in [7, 12], we consider our sample case in section 2.1. To simplify the illustration we consider only the transformation T2 shown in gure 2.3 and 2.4 which produce the View V2 . Figure 2.9 a) shows

(37)

the source data which produced the View V2 . If we apply the Tracing Query as introduced in [7, 12] to compute the lineage of V2 tuple(D70, 172.5, 3.2005), then we will get the result shown in gure 2.9b. Refer to section 2.4.2 for the resulting query based on algorithms in [7, 12].

Lineage tracing for general data warehouse transformations. In real data warehouse implementations, data in the data warehouse are often populated through a series of complex transformation. In this case, tracing the data lineage cannot be just calculated using the standard relational operators. Additional logic needs to be incorporated in order to produce the data lineage for a particular data warehouse item. This problem is identied in [7, 8]. [7, 8] introduced lineage tracing basing on general transformation. The intention of this mechanism is to produce the exact or nearly exact tuples that are responsible for the appearance of the warehouse item produced by data warehouse general transformation¹.

To be able to trace back to the exact origin of data requires knowledge on the transformation path, the data traversed as well as the computation applied to the data within each transformation. Producing a data lineage as accurate as possible, requires knowledge of the transformation description. It may not be emphasized in [2, 52], but their work clearly shows, that transformation information are used to enable lineage tracing. To emphasize the notion that data lineage tracing is dependent on Transformation description, a simple example is given, shown in gure 2.10 (see [7, 8]). This gure shows a source table I that has tuples (a,-1), (a,2), and (b,0) and Transformed data having the tuples (a,2),(b,0). In this example, the data in question are the tuple (a,2). Consider two transformation scenarios 1 and 2 below:

1These are the common transformations that occurs in data warehouse environment.

(38)

Figure 2.9: Lineage for View V2

(39)

Figure 2.10: Transformation Instance - source from [7]

Figure 2.11: Transformation Classes - source from [7]

Scenario 1: Transformation that lters out input with negative Y value

Scenario 2: Transformation that groups input data based on X values and computes the sum of their Y values multiplied by 2

Tracing the data lineage for this data in question depends on how the data is transformed. If data are transformed as described in Scenario 1, then the data lineage for this transformation would only be the tuples (a,2) and (b,0). However, if the data are transformed as described in Scenario 2, the data lineage would be the entire table, because all tuples in the entire table participates in the transformation computation.

Transformation Properties. In reality data transformations often involve

(40)

more than just standard relational operators. Tracing the data lineage for such transformations requires some known transformation structure or properties that can be used to determine and trace the data lineage [8, 7]. [8, 7] develop lineage tracing mechanisms using the transformation properties. First, they identied transformation properties for general data warehouse transformations.

Transformation Classes. [7, 8] classies General Transformation into three namely, the Dispatcher, Aggregator and Black-box, refer to gure 2.11. These three classes are based on how they map input data items to output items.

A Dispatcher produces zero or more output data items independently. Figure 2.11(a) illustrates a dispatcher, in which an input items I consisting of 1, 2 and 3 produces output items O consisting of 1, 2, 3, 4, 5 and 6 independent to each other. An input item 2 produces nothing and an input items 1 and 3 produces more outputs. A dispatcher can be identied as lter if each input item produces either itself or nothing. The data lineage for any output through transformation with lter property is the same item as in the input.

An Aggregator transforms one or more input data items and produces one output data item. Figure 2.11(b) illustrates an aggregator in which output data items 1 and 2 and 3 are transformed using one or more input data items. Output data items 1 and 3 are the transformed from multiple input items while output item 2 is a result of single input item. A more ecient tracing pocedure for aggregator is develop by subclassifying it in to context-free aggregatorand key-preserving aggregator.

A Black-box illustrated in gure 2.11c) uses all combination of data set to trans- form one or more data sets, therefore tracing procedure for this type of transformation simply returns the entire input.

(41)

Table 2.1: Summary of Cui's Transformation Properties - source from [7] page 160

Property Tracing Procedure

dispatcher TraceDS

lter return O

aggregator TraceAG

context-free aggregator TraceCF

key-preserving aggregator TraceKP

black-box return I

forward key-map TraceFM

backward key-map TraceBM

backward total-map TraceTM

tracing procedure requiring input TP tracing procedure not requiring input TP

Schema Mappings. Cui and Windom [8, 7] consider schema information to improve lineage tracing but do not entirely depend on it as Velegrakis, Miller and Mylopoulos [52] do. Schema mappings specify the link of input attributes to the output attributes. [52] details this by introducing a Meta data storage schema that contains this information. Our sample case shows schema mapping refer to gure 2.3. This is further demonstrated in gure 2.7. [8, 7] develop a lineage tracing procedure using schema mapping to improve the tracing procedure for dispatcher and aggregator. However, [8, 7] procedure does not report on mappings as [52] does.

[8, 7] uses schema mappings to identify specic transformation properties based on how the source tuples attribute values is mapped to the tuples of target attributes.

Three schema mapping properties are classied, namely forward key map, backward key-map and backward total-map (see [8, 7] for details). A lineage tracing procedure is developed for each of these properties which allows tracing the lineage values of the source, rather than to the lineage schema of the source.

(42)

Table 2.2: Transformation Summary for AbsenceStatistics Warehouse Name Description

T1 Select on Attendance type

T2 Aggregate Employee and WorkSchedule T3 Join and aggregate Employee and V1 T4 Union views V2 and V3

T5 calculate percentage of absence and add column PAbsence

A Transformation may exhibit more than one properties. Table 2.1 lists the transformation properties. Some properties are better than the others. This means that the tracing procedure formulated for better properties may give a more ecient tracing or may give a more accurate lineage result. In which case, [8, 7] determines the best one to exploit for lineage tracing and show the hierarchy according to importance.

Lineage Tracing Procedure. Tracing procedure take specied output items and may take Input items as parameters to produce the data lineage. The tracing procedures formulated in [8, 7] introduce a mechanism in which a data lineage is traced by identifying which transformation property or properties have transformed a specic data in the data warehouse. The lineage tracing procedure is composed for transformation sequence and transformation graphs.

Widom [8] formulated a tracing procedure for each of the transformation properties. Consider gure 2.4, the transformation graph in our sample case. Each transformation step is represented as Ti where i refers to the sequence number of transformation. If we consider the work of [8], each of the transformation in our sample case T1 to T5 has a transformation property. The summary of transformations in our sample case in section 2.1 is presented in table 2.2 based on the work of [8].

(43)

Consider the warehouse data in our sample case in gure 2.5. Suppose the Head of HSE Division would like to know who are the employees in a particular department were having sick leaves for a specic month. Specically the HSE Head may wish to list the employees that produce the warehouse data tuple(D70, 3.2005, 43.5). Tracing back the data lineage is like traversing the path in the transformation graph shown in gure 2.4 and reconstructing the views in the relations shown in gure 2.3. In our sample case, HSE head may "drill-through" V4 to V1 then nally to the source. [7, 8]

proposed lineage tracing mechanism that allows the user to do exactly this. Lineage tracing will traverse back the transformation path to produce the exact tuples of the origin of the data in question.

Statistical Method

Faloutsos, Jagadish and Sidiropoulos in [19] introduce a statistical method to reconstruct as good as an estimate of the original base data. [19] describes a formal approach to the recovery of information from a summary data in form of constraints and utilizes the well-developed "inverse problem" theory.

We include this method not to detail a statistical calculation approach but to recognize the relevance of this approach for practical application on data warehousing environment. Ralph Kimball [32] describes a practical example for using statistical method to answer the question about the the correctness of data loaded in a warehouse . If there is a need to respond to queries that can be answered accurately only from the base data, the views and transformation approach of lineage tracing of [8, 10, 7]

are applicable. This type of question requires to "drill-through" to the exact tuple of the data origin. In some cases it may be too expensive if not impossible to trace

(44)

the exact or almost exact tuples of the origin of data. The reason may be that the source data reside on an unreachable location (e.g. o-line, archive etc.) However, some data origin related questions can be answered quickly from the summarized data. [19] proposed a mechanism to meet this requirement.

Inverse Method

Some transformation may be accompanied by a tracing procedure or inverse transformation, which is the best case for lineage tracing according to Widom [8]. Because it is the inverse of the transformation it could be assumed that it basically inverts data perfectly. Intuitively, if inverse procedure or function inverts the data perfectly, (assuming the data data is transformed in several layers and levels), then a certain amount of lineage information should have been stored within the data warehouse. In practice, inversion function is seldom available [8, 56]. ETL development is complex and time consuming that preparing an inverse function for each transformation is often not considered during the development.

Woodru and Stonebraker [56] introduced a general framework for computing the data lineage using weak inversion and verication. This approach is to compute lineage on-demand using a limited amount of information about the processing operators and the base data. Weak inversion does not perfectly invert the data, it uses weak inversion and verication to provide a number of guarantees about the lineage it generates. The weak inversion and verication functions is to be registered by the user to the DBMS. When tracing a data lineage, an inversion planner determines which weak inversion and verication function to invoke, constructs a plan and then executes the plan by calling the corresponding sequence of functions within the DBMS.

(45)

2.3 Recording the Data Lineage

How and where to record data lineage is essentially part of the lineage problem solution. If data lineage is to be traceable, then information about data lineage or part of data lineage itself needs to be stored where lineage tracing process can reach it.

Intuitively, the problem of recording data lineage involves balancing trade-os (e.g.

performance). Recording data lineage or part of data lineage after every transformation is sure to penalize the loading process in data warehouse. Thus, deciding whether to store data lineage or to reconstruct it on demand requires extreme cau- tion. Dierent lineage tracing mechanism views lineage recording dierently. Cui and Widom [7, 12, 10, 8] for example describe algorithms to determine which part of data lineage information to store or store nothing at all. Woodru and Stonebraker [56] propose to register weak inversion function for each transformation and use it to reconstruct data lineage. Faloutsos, Jagadish and Sidiropoulos [19] uses limited knowledge about the data to reconstruct the data from their base origin. Whether the option is to store or reconstruct data, all tracing mechanism still use some form of information about the data lineage. This section discusses the dierent approaches to recording data lineage based on my analytical survey made on the existing research [25, 4, 2, 19, 56, 12, 10, 8, 52].

2.3.1 Identifying Lineage information

Based on the survey of this study, there are dierent types of lineage information necessary to enable lineage tracing. Lineage information can be data about the source systems, source and intermediate schemas, functions or procedures used for calculations, or source data values themselves. This section classies some of these types

(46)

and describe each type.

Schema Information

Any information describing the composition of the schema that participates on data transformation. This information may be in the form of the following:

View denition - Views that participates in transformation and may form part of lineage tracing queries used [52, 7].

Schema Name - Names of views, tables, or record or attributes. relation [52, 56, 55].

Schema Elements - Attributes or column names [52, 56, 55]

Element or Attribute Types - the source element or attribute types and the intermediate target element or attribute type [52, 56, 55].

It may also be pieces and parts of the schema denition such as: select clause, from clause or where clause of the statement, refer to the work of [52] describe in section 2.2.1. which describes how to map source-to-target using this information.

Functional and Execution Information

Any information pertaining to the tracing program, procedure or functions that will be used in lineage tracing specic to a particular warehouse data. This may consist of the name and the description of the program. This may also consist of information that describes the transformation name and process and links to one procedural code to another that comprise the lineage tracing instruction invoked during lineage query.

(47)

Woodroof and Stonebraker [56, 55] propose the idea of function registration co- herent to its proposal to trace data lineage basing on weak inversion, refer to section 2.2.2. Several pieces of information about weak inversion and verication function is to be registered to trace data lineage using the weak inversion. During lineage tracing, an inversion planner, (i.e. a program which generate an execution plan to trace the data lineage), infers which functions are to be used for weak inversion verication.

[56, 55] describe the information which is to be registered; the Inverting Function (i.e.

which perform the weak inversion or verication) and the Function to be inverted. Cui and Widom [8, 7] introduce a tracing procedure which is invoked during the lineage tracing. The properties of the procedure which dictate which programming procedure to execute during lineage tracing and the reference to the procedure itself are stored within the data warehouse environment. Bernstein and Bergstraesser [2] introduce the transformation package which works as a workow; starting from extraction up to loading process. The identier that references this package is stored within the data warehouse.

Lineage Intermediate Values

It is not usual but it may be necessary to store lineage intermediate values to enhance lineage tracing performance. Storing these values in data warehouse is resource intensive and requires serious consideration. The works of Cui and Widom [8, 7], describe dierent algorithms to store intermediate lineage values. During lineage tracing, lineage intermediate values is accessed to produced a lineage view instead of reconstructing from nothing. The presence of the lineage view in the data warehouse for tracing, intuitively improves performance during lineage queries but penalizes the

(48)

data warehouse loading process. Additionally, data in the warehouse grows in vol- umes of magnitude and storing data lineage alongside it poses a serious storage and administration challenge.

Other Data Lineage Related Information

Other lineage related information is stored in a dimension table. [1, 2, 6] introduced similar ideas about an audit dimension. The audit dimension is a table that contains pertinent information about the row of the fact table. For example it may contain the identier of derivations applied to the data before the data arrives to the fact table, the source systems, the date and time it was loaded.

2.3.2 Storing Lineage information

In this section we look into the mechanisms for storing lineage information. We pick one example for each of the two main lineage granularity levels for illustration.

Specically we look on the work of [52] for storing schema information, and on the works of [7, 10] for storing lineage intermediate data and the works of [55, 7, 8] for storing function or procedural code.

Schema Information

Schema information is stored to support tracing the data origin or mapping target data to its source. A good illustration for storing schema information is the ones introduced by [52]. [52] presented Meta data storage schema which includes seven tables, refer to gure 2.8. In this section we present how these Meta data storage schema are used.

Looking back to our sample case in section 2.1. Three source tables; Employee,

(49)

Attendance and WorkSchedule, produce the warehouse data AbsenceStatistics through transformation steps T1 to T5. The transformation steps are shown in gure 2.3. In section 2.2.1, we presented schema level approach of data lineage as introduced in [52]. Section 2.2.1 presented transformation steps in the form of mappings. Figure 2.7 shows the mappings scenario using our sample case in section 2.1. To illustrate how the Meta data storage schema is used, refer to gure 2.7, particularly the m1 mapping. Mapping m1 involves the Attendance table expressed in foreach query and the intermediate view V1 expressed in exist query. We assume that the source database for Attendance table is sdb and V1 is stored in a transient database tdb.

The schema elements which are involved in mapping m1 is stored in the Meta data storage as shown in gure 2.12. Element table stores the schema name, its type (e.g. string) its parent element, and the databases where it resides. Query maintains the query identiers, in our example, q0 and q1 for the foreach and exist queries respectively. The Binding table records the from clause of each query as a list of bindings. In our example the information recorded are the binding identier a within the query q0 , the schema element where it starts (entered as r1) and the schema which it refers (entered as e3). The Condition table records the elements in the where clause. In our example shown in gure 2.12, the recorded information are the query id q0, the binding id a and element id, e3 which participate in the expression with operator being "=" to a constant value "Sickleave.". The column eid2 would have been an element id if the value in bid2 is not a constant value. The Mapping table is used to encode mappings. It records the mapping id m1 the query identiers q0 and q1 for foreach and exist queries respectively. The expression in the select clause is recorded in the Correspondence table. The binding id and the element

(50)

id for the foreach and exist queries are encoded in forbid, forEid, conBid and conEid columns respectively.

Data Sources and Lineage Intermediate Values

Figure 2.3 in our sample case, illustrated transformation steps. Each transformation step produces an intermediate view before the nal transformation step which populates the data warehouse. The intermediate views² in this transformation steps may be created and populated dynamically during execution time, or may be stored physically in a disk. Storing the intermediate views can be very expensive as the data in the data warehouse is usually huge. There are dierent options for storing lineage data values in the data warehouse. The maintenance aspect of this option (storing lineage data values) can also be complex and this topic does not escape the curiosity of the research community [22, 20, 9, 10, 45, 12, 7, 5].

Cui [7] details on dierent mechanisms of storing intermediate lineage values by presenting dierent algorithms of storing auxiliary information. The auxiliary view which stores the intermediate lineage values is generated during view specication.

When the data from source are loaded to the warehouse through data integrator, the data integrator also populates the auxiliary view if view is tagged to be traceable.

Cui and Widom [7, 10] describe dierent options to store the lineage auxiliary views. First is Store Nothing option. This option, as its name implies, stores nothing in the auxiliary view. This scheme retrieves all information from the data source tables during lineage tracing. This saves storage and storage maintenance but provides poor lineage tracing performance. Another option is Store Base Tables.

2Intermediate or Auxiliary views contain transformed data that will be used for further transformation.

(51)

Figure 2.12: Metadata Storage Implementation from m1 mapping - adapted from [52]

(52)

Within the data warehouse, create a copy of each source table that the view is dened.

View maintenance in this scheme is uncomplicated, but the base table can be large and may contain data that are irrelevant to warehouse data, and thus are not usable for lineage tracing. Another method is to create and implement Store Lineage View. This scheme stores all lineage information for all tuples in the primary view.

This schema signicantly simplies the tracing query and potentially reduce the query cost. But lineage views can be large and costly to maintain. Another option is Store Split Lineage. This option split the lineage view and store a set of tables instead.

Split lineage tables contain no irrelevant source data, since every row in this table contributes to some data warehouse row. Furthermore, the size of the split lineage tables can be much smaller than lineage view. The rest of the options are Store partial base table and Store Base Projection. These options store specic portions of base tables only which reduces the size of Base Table method. See [7, 10]

for details about the dierent options in storing the lineage auxiliary views.

2.4 Reporting the Data Lineage

In the previous section, we presented the data lineage granularity levels, we classify dierent types of data lineage basing on the dierent data lineage related studies, and we look on dierent approaches in storing and preparing the data lineage information for possible queries about the origin of data in the data warehouse. No matter how brilliant the algorithms and procedures for storing data lineage and preparing it for tracing, they amount to nothing if the user is unable to view and use the data lineage itself. In this section, the process for tracing or drilling-through the data lineage is presented. We look to specic approaches in the dierent works on how the lineage

(53)

is constructed. The tangible part of all the mechanisms discussed above, is the data lineage produced and made visible for the user who ask questions about the data origin of a warehouse data item.

2.4.1 Metadata Query

In section Schema Information, under section 2.3.2, we illustrated an example on how schema elements are stored basing on [52]. [52] proposes Meta data storage implementation consisting of seven tables which stores schema information to allow data lineage tracing. Furthermore, we also presented how the schema is stored by using our sample case in section 2.1. (see to gure 2.12).

Velegrakis, Miller and Mylopoulos [52] provide a formal basis for realizing schema mapping which is a form of tracing data lineage. This formal basis builds on the Meta data storage implementation where schema elements and transformations are available for querying. To realize this mapping, [52] extend standard query language with special operators to utilize the Meta data. This extended language is referred to as MXQL which stands for meta-data extended query language. Schemas and mappings are to be stored as described in section 2.3.2 and illustrated in gure 2.12, in order to be queried and returned in answer sets as regular data. MXQL queries can be executed to exploit the Meta data storage schema while hiding the details of the meta-data storage implementation. Tools for invoking the queries and presenting information in the report can then be used to show the result of the queries to the users.

(54)

2.4.2 View Tracing Query

A section Source and Lineage Intermediate Values under section 2.3.2, describes the overview of dierent approaches in storing the values of data lineage within the data warehouse. The primary objective of this approach is to optimize "drill-through"

or step-wise query for the data lineage of the data warehouse items that are transformed via SQL views or dierent levels of views. Furthermore, this approach ensures to return accurate lineage values when queried in step-wise fashion.

[7, 10] formulate tracing algorithms which enable tracing the origin of data for a specic data warehouse item. A tracing query is built based on the target view query. To illustrate an example, let us go back to our sample case in section 2.1 which presents a data warehouse view AbsenceStatistics, refer to gure 2.5. Ab- senceStatistics view is populated from data coming from the three source tables which are transformed via ve transformations (see gures 2.2, 2.3 and 2.4). Prior to the transformation process, each target view in the ve transformations is assumed to be already dened as part of the of the transformation description. Assuming that the data in each of the intermediate views are permanently stored for lineage traceability, then drilling-through to the data origin is possible using the tracing query described in [7, 10]. To simplify our illustration, we utilize the example shown in gure 2.9. Figure 2.9a) presents the source tables and the intermediate target view V2. Furthermore, this gure also highlights the view item in question, row (D70, 172.5, 3.2005). Figure 2.9b) illustrates the data lineage for row (D70, 172.5, 3.2005).

The data lineage is derived using the tracing query created based on [7, 10]. The tracing query is built using the view denition. View V1 denition is shown below.

(55)

CREATE VIEW AS V1, SELECT e.Department,

sum(w.WSHours) as WSHours, w.WSCalMonth

FROM Employee e WorkSchedule w WHERE e.EmpID = w.EmpID

To trace the data lineage of row (D70, 172.5, 3.2005), the tracing query splits the Relations that are involved in the view denition, in this particular View. In our view example the relations are the Employee and WorkSchedule. After the split, a query is made against each relation basing on the view denition condition and the row in question. Implementing the tracing query produces the exact data lineage for the row in question as shown in gure 2.9b). The tracing queries created for the row in question (D70, 172.5, 3.2005) appears as follows: