Thursday, January 29, 2009

ETL

Extract, Transform, and Load (ETL)

Extract, transform, and load (ETL) is a process in data warehousing that involves:

· Extracting data from outside sources,

· Transforming it to fit business needs, and ultimately

· Loading it into the data warehouse.

ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.

Extract

The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as IMS, or other data structures such as VSAM or ISAM. Extraction converts the data into a format suitable for transformation processing.
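As a rough illustration, here is a minimal Python sketch that pulls rows from a CSV flat file and from a relational (SQLite) table into one common in-memory format. It uses only the standard library; the file, database, and table names are hypothetical.

```python
import csv
import sqlite3

def extract_from_flat_file(path):
    """Read a CSV flat file into a list of dictionaries (one per row)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_from_database(db_path, table):
    """Read every row of a relational table into the same dictionary format."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row          # rows behave like mappings
    rows = [dict(r) for r in conn.execute(f"SELECT * FROM {table}")]
    conn.close()
    return rows

# Both sources now share one in-memory format, ready for the transform stage.
# "orders.csv", "legacy.db", and "orders" are hypothetical names.
source_rows = (extract_from_flat_file("orders.csv")
               + extract_from_database("legacy.db", "orders"))
```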

Transform

The transform stage applies a series of rules or functions to the extracted data to derive the data to be loaded. Some data sources require very little manipulation of the data. In other cases, one or more of the following transformation types may be required (a small sketch combining several of these follows the list):

· Selecting only certain columns to load (or selecting null columns not to load)

· Translating coded values (e.g., if the source system stores M for male and F for female, but the warehouse stores 1 for male and 2 for female)

· Encoding free-form values (e.g., mapping "Male" and "M" and "Mr." onto 1)

· Deriving a new calculated value (e.g., sale_amount = qty * unit_price)

· Joining together data from multiple sources (e.g., lookup, merge, etc.)

· Summarizing multiple rows of data (e.g., total sales for each region)

· Generating surrogate key values

· Transposing or pivoting (turning multiple columns into multiple rows or vice versa)

· Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
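Here is a minimal Python sketch combining a few of the transformation types above: translating coded values, deriving a calculated value, splitting a comma-separated column, and generating surrogate keys. The field names and sample row are hypothetical.

```python
# One extracted row (hypothetical field names) standing in for the extract output.
source_rows = [{"gender": "M", "qty": "3", "unit_price": "19.99",
                "phones": "555-0100, 555-0101"}]

def transform(row):
    """Apply a few of the transformation types listed above to one row."""
    out = {}

    # Translate coded values: the source stores M/F, the warehouse stores 1/2.
    out["gender_code"] = {"M": 1, "F": 2}.get(row.get("gender"))

    # Derive a new calculated value: sale_amount = qty * unit_price.
    out["sale_amount"] = float(row["qty"]) * float(row["unit_price"])

    # Split a comma-separated string column into individual columns.
    phones = [p.strip() for p in row.get("phones", "").split(",") if p.strip()]
    out["phone_1"] = phones[0] if len(phones) > 0 else None
    out["phone_2"] = phones[1] if len(phones) > 1 else None

    return out

# Generate surrogate key values: here, a simple running integer per row.
transformed_rows = [dict(transform(r), sale_key=i)
                    for i, r in enumerate(source_rows, start=1)]
```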

Load

The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses merely overwrite old information with new data. More complex systems can maintain a history and audit trail of all changes to the data.
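As a rough illustration of these two loading styles, here is a minimal sketch using Python's standard sqlite3 module. The database file, table names, and sample rows are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# A couple of already-transformed rows standing in for the transform output:
# (sale_key, gender_code, sale_amount).
transformed_rows = [(1, 1, 59.97), (2, 2, 120.00)]

conn = sqlite3.connect("warehouse.db")   # hypothetical warehouse database

# Style 1: merely overwrite old information with the new data.
conn.execute("CREATE TABLE IF NOT EXISTS sales_current "
             "(sale_key INTEGER PRIMARY KEY, gender_code INTEGER, sale_amount REAL)")
conn.execute("DELETE FROM sales_current")
conn.executemany("INSERT INTO sales_current VALUES (?, ?, ?)", transformed_rows)

# Style 2: keep a history/audit trail by stamping every batch with its load time.
conn.execute("CREATE TABLE IF NOT EXISTS sales_history "
             "(sale_key INTEGER, gender_code INTEGER, sale_amount REAL, loaded_at TEXT)")
loaded_at = datetime.now(timezone.utc).isoformat()
conn.executemany("INSERT INTO sales_history VALUES (?, ?, ?, ?)",
                 [row + (loaded_at,) for row in transformed_rows])
conn.commit()
conn.close()
```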

Challenges

ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL systems.

The range of data values or data quality in an operational system may be outside the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transform rules specifications.
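As a rough illustration of data profiling, the sketch below computes per-column null counts, distinct counts, and minimum/maximum values for a CSV source. It uses only the Python standard library, and the file name is hypothetical.

```python
import csv
from collections import defaultdict

def profile(path):
    """Basic profiling of a CSV source: nulls, distinct values, min/max per column."""
    stats = defaultdict(lambda: {"nulls": 0, "values": set()})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for col, val in row.items():
                if val is None or val.strip() == "":
                    stats[col]["nulls"] += 1
                else:
                    stats[col]["values"].add(val)
    for col, s in stats.items():
        vals = s["values"]
        print(f"{col}: nulls={s['nulls']}, distinct={len(vals)}, "
              f"min={min(vals) if vals else None}, max={max(vals) if vals else None}")

profile("orders.csv")   # hypothetical source file
```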

The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes understanding the volumes of data that will have to be processed within service level agreements (SLAs). The time available to extract from source systems may change, which may mean the same amount of data may have to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to integration with message queues for continuous transformation and update.

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data.

There are three main types of parallelism as implemented in ETL applications (a small sketch of data parallelism follows below):

· Data: By splitting a single sequential file into smaller data files to provide parallel access.

· Pipeline: Allowing the simultaneous running of several components on the same data stream. An example would be looking up a value on record 1 at the same time as adding together two fields on record 2.

· Component: The simultaneous running of multiple processes on different data streams in the same job. Sorting one input file while performing a deduplication on another file would be an example of component parallelism.

All three types of parallelism are usually combined in a single job.
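As a rough illustration of the first type (data parallelism), the sketch below splits an in-memory data set into chunks and transforms the chunks in parallel with Python's multiprocessing module. The transformation itself and the input rows are placeholders.

```python
from multiprocessing import Pool

def transform_chunk(rows):
    """Transform one chunk of rows; the calculation here is just a stand-in."""
    return [{"sale_amount": r["qty"] * r["unit_price"]} for r in rows]

def split(rows, n):
    """Split a single sequential data set into n smaller pieces (data parallelism)."""
    size = (len(rows) + n - 1) // n
    return [rows[i:i + size] for i in range(0, len(rows), size)]

if __name__ == "__main__":
    rows = [{"qty": i, "unit_price": 2.5} for i in range(1, 1001)]   # dummy input
    with Pool(processes=4) as pool:
        results = pool.map(transform_chunk, split(rows, 4))          # chunks run in parallel
    transformed = [r for chunk in results for r in chunk]
    print(len(transformed), "rows transformed")
```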

An additional difficulty is making sure the data being uploaded is relatively consistent. Since multiple source databases all have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points is necessary.

Tools

While an ETL process can be created using almost any programming language, creating them from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.

A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.

Steps that need to be followed when testing a manual ETL process:

1. Source data may come from any of these, or a combination of them: flat files, CSVs, legacy database systems.

2. The fast-load script picks up the data from the source and loads it into temporary stage tables.

3. Once the stage tables are loaded, verify that all of the source data has actually made it into the stage tables (a reconciliation sketch follows these steps).

4. Once the data has been extracted from the different sources into the stage tables, the transformation of the data begins.

5. The data is transformed based on the required fields and the business requirements.

6. Check the log files for the process run against the stage tables to confirm that all required transformations were performed correctly and that the data is in the required format.

7. The transformed data is then loaded into the temp tables, and data that does not meet the required business criteria is moved to the exception tables.

8. Data from the temp tables is loaded as-is into the master tables.

9. To clear the exception tables, the records should be recycled by resolving the reasons for the exceptions.

10. Once the exceptions are cleared, the data is moved to the temp tables and from there to the master tables.

11. Once the data is in the master tables, report-generation tools use it to produce reports for strategic decision making by management and other stakeholders.
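As a rough illustration of the reconciliation checks mentioned in steps 3 and 7-9, the sketch below compares row counts between a flat-file source, the stage table, and the master/exception tables. All file and table names are hypothetical; adapt them to the actual objects.

```python
import csv
import sqlite3

def count_csv_rows(path):
    """Row count of a flat-file source (header excluded)."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.DictReader(f))

def count_table_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = sqlite3.connect("warehouse.db")              # hypothetical database
source_count = count_csv_rows("orders.csv")         # hypothetical source file
stage_count = count_table_rows(conn, "stage_orders")
master_count = count_table_rows(conn, "master_orders")
exception_count = count_table_rows(conn, "exception_orders")

# Step 3: everything extracted from the source should land in the stage tables.
assert source_count == stage_count, "source and stage row counts do not match"

# Steps 7-9: every stage row should end up in either the master or the exception tables.
assert stage_count == master_count + exception_count, "rows lost between stage and master/exceptions"
```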

Pros and Cons

There are pros and cons to both tool-based ETL and hand-coded ETL. Tool-based ETL provides maintainability, ease of development, and a graphical view of the flow. It also reduces the learning curve for the team.

Hand-coded ETL is good when there is minimal transformation logic involved. It is also good when the sources and targets are in the same environment. However, depending on the skill level of the team, it can extend the overall development time.

ETL tools are meant to extract, transform, and load the data into the data warehouse for decision making. Before the evolution of ETL tools, the ETL process described above was done manually using SQL code written by programmers. This task was tedious and cumbersome in many cases, since it involved many resources, complex coding, and more work hours. On top of that, maintaining the code posed a great challenge for the programmers.

ETL tools eliminate these difficulties: they are very powerful and, compared to the old method, offer many advantages at every stage of the ETL process, from extraction, data cleansing, data profiling, and transformation through debugging and loading into the data warehouse.

One advantage of manual ETL testing is that the tester does not need knowledge of or training in any particular tool; knowledge of the ETL process is enough.

QA vs. QC

Quality Assurance:
1. Concentrates on the process of producing the products.
2. Defect-prevention oriented.
3. Usually done throughout the life cycle.
4. This is usually a staff function.
5. Examples: reviews and audits.

Quality Control:
1. Concentrates on specific products.
2. Defect-detection and correction oriented.
3. Usually done after the product is built.
4. This is usually a line function.
5. Example: software testing at various levels.


QA is process oriented and QC is product oriented.
QC comes after QA.
QA is defined as a set of activities designed to ensure that the development and maintenance process is adequate and that the system will meet its requirements.
QC is defined as a set of activities designed to evaluate a developed work product.