Data Extraction, Transformation, and
Migration Tools:
This information is intended to facilitate data extraction,
cleansing and migration methods selection, with particular emphasis on tools.
The main body of the document briefly summarizes issues and makes
recommendations. The appendix presents an overview of applicable commercial
tools. Our recommendation is to select a suite of tools, each one of which meet
specific needs.
The Issue.
The process of developing several data warehouses and other database projects
has highlighted the need to effectively and efficiently manage the extraction,
cleansing, transformation and migration of data from legacy systems.
Effectiveness is necessary whenever the data is of great value (which is
usually). Efficiency is necessary, because the long term investment of resources
in these activities can be high.
Why tools? Tools can help achieve technical efficiency, and can
perform data quality tasks that would otherwise be impractical. Just as
important, tools such as Prism also provide a structure to the migration
process that can be leveraged as an effective management tool.
The Question.
What tools and techniques are most appropriate to assist with data extraction,
Cleansing and Migration to support Data Warehouse, Database Consolidation, and
Systems Reengineering Projects, given the business problem and environment?
The Decision Process. The purpose is not to document a decision that has been made but
to lay out the issues to facilitate discussion, and to make recommendations in
support of the decision process.
There is not one decision to make, but several in a series. Most
of the decisions are not about technical choices, but about management choices.
On the face of it, the key decision seems to be: do we need a
- Quick solution, to get
information out ASAP; or a
- Systematic solution to handle
routine processes smoothly, efficiently, effectively.
But in fact, both are needed. In an emergency situation, the
"quick solution" is necessary. Also, if it is not clear if the
finished warehouse or database as initially conceived will meet needs, the
"quick solution" functions as a test or prototype, that can be
followed by a more systematic solution. However, systematic data extraction and
cleaning solution of some sort will be needed for most problems (problems
discussed in the *Data Quality Methods* and Procedures white paper). The only
real question is when, and how capable and complicated it must be. A cost
benefits analysis can help address how automated and capable the eventual
solution should be.
- The base of knowledge about system data is adequate -
or - Is inadequate.
- Extracts/loads are one-time - or - Ongoing/repetitive.
- Focus of concern is quality of data content - or -
Expedient delivery of data.
- Data quality issues are specific (e.g. address, or
person-name) - or - General (several categories).
- Does a suitable tool exist - or - Do tools have to be
acquired or created.
- Is extraction/transport to be managed centrally - or -
Managed through distributed control?
- Is extraction/transport (and scheduling) to be
controlled via Parameters - or - Custom coded extracts and
transformations.
Figures 1 and 2 model a decision flow that incorporates the above
questions. The purpose of this model is to demonstrate the rational behind the
selection of a set of tools with varying strengths and capabilities. Figure 1
highlights the differences in tool applicability among three different tasks:
on-line data entry, on-line querying, and batch transport. On-line data entry
and querying are not the focus of this paper. Batch transport is expanded upon
in Figure 2, featuring decisions relevant to batch data extraction, cleaning,
transformation and transport.
Each branch in the diagram represents a decision branch, and each
box represents a tool category. A specific recommended tool is listed in the
diagram for each tool category. The sections below list other tools that fit
within that category. Figure 2 also features three tool "super
categories": Analysis, Data Quality and Scrubbing, and Extract Transport.
No comments:
Post a Comment