Data Profiling before loading source data into Data warehouse – A real Time saver!


Understanding the source data before loading it into the data warehouse is critical, since it has a significant impact on how you design the structures and build the ETL.

Data profiling is a great way to analyze the source data and understand what condition it is in, in terms of patterns, nulls, dirtiness, and so on. SSIS 2008 introduces a new task called the 'Data Profiling Task', which we can leverage for data warehouse ETL.

There are three main parts to the data profiling task.

  1. Running the Data Profiling Task in SSIS
  2. Providing the output generated by the task to the 'Data Profile Viewer'
  3. Observing and analyzing the report, and then designing the ETL on the basis of it

Running the Data Profiling Task in SSIS:

  1. The Data Profiling Task needs a connection to the source database, so we need to create one in the Connection Managers window.
  2. Create a connection to the server where the database resides.
  3. Save the connection information and return to the SSIS package designer.
  4. Drag a Data Profiling Task from the Toolbox onto the Control Flow and double-click the new task to open the Data Profiling Task Editor.


  5. The Data Profiling Task includes a wizard that will create your profiling scenario quickly, so click the Quick Profile button on the General tab.


  6. In the Single Table Quick Profile Form, choose the connection, and in the Table or View drop-down, select the desired table or view from the list. Enable all the checkboxes in the Compute list and change the Functional Dependency Profile to use 2 columns as determinant columns.


  7. Select OK to save the changes, which will populate the Requests list in the Data Profiling Task Editor as shown in the figure below.


  8. Return to the General tab of the editor and in the Destination property box, choose New File Connection. This is where you will define the XML file where the profile output is stored.
  9. In the File Connection Manager Editor, change the Usage type drop-down to "Create file" and enter a name for the XML file in the File text box. Select OK to save your changes to the connection, and OK to save your changes in the Data Profiling Task Editor.
  10. Run this simple package in BIDS, which will initiate several queries against the source table or view. This may take several minutes depending on the number of records; with about 30,000 records, the package took about half an hour to run.
  11. The results are stored in the XML file created above.
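Since the task writes its output as XML, the results can also be consumed programmatically rather than only through the viewer. Below is a minimal sketch of parsing null counts out of such a file with Python's standard library. Note that the XML here is a simplified, hypothetical stand-in: the real file follows Microsoft's DataDebugger schema, so the element names below are assumptions for illustration only. Inspect your own generated file for the exact structure.

```python
import xml.etree.ElementTree as ET

# Simplified, HYPOTHETICAL XML resembling the Data Profiling Task output.
# The real file uses Microsoft's DataDebugger schema; the element names
# below are illustrative assumptions, not the actual schema.
PROFILE_XML = """\
<DataProfile>
  <Profiles>
    <ColumnNullRatioProfile>
      <Column Name="NationalIDNumber"/>
      <NullCount>0</NullCount>
    </ColumnNullRatioProfile>
    <ColumnNullRatioProfile>
      <Column Name="MiddleName"/>
      <NullCount>145</NullCount>
    </ColumnNullRatioProfile>
  </Profiles>
</DataProfile>
"""

def null_counts(xml_text):
    """Return {column name: null count} from a profile document."""
    root = ET.fromstring(xml_text)
    result = {}
    for profile in root.iter("ColumnNullRatioProfile"):
        name = profile.find("Column").get("Name")
        result[name] = int(profile.find("NullCount").text)
    return result

print(null_counts(PROFILE_XML))
```

Parsing the output like this is handy if you want to fail an ETL run automatically when, say, a null count exceeds a threshold.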

Providing the output generated by the task to the 'Data Profile Viewer':

After running the data profile task, we need to evaluate the results.

  1. Open the Data Profile Viewer from Start > All Programs > SQL Server 2008 > Integration Services > Data Profile Viewer.
  2. Click Open and select the XML file generated above.



  3. In the Profiles navigation tree, first click the table icon at the top left to put the tree viewer into "Column View". Then drill down into the details by expanding the Data Sources, server (local), Databases, and Columns objects.
  4. The first profiling output to observe is the Candidate Key Profile, so click this item under the Columns list, which will open the results in the viewer on the right. You will see that the Data Profiling Task has identified six columns that are unique across the entire table (with 100% uniqueness), as shown below.
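The logic behind the Candidate Key Profile is straightforward: a column qualifies when the percentage of distinct values approaches 100%. A small illustrative sketch of that idea (the rows below are made-up sample data, not from the actual table):

```python
# Illustrative re-creation of the Candidate Key Profile idea: a column is
# a key candidate when its values are unique across all rows.
rows = [
    {"EmployeeID": 1, "NationalIDNumber": "14417807", "Title": "Engineer"},
    {"EmployeeID": 2, "NationalIDNumber": "24756624", "Title": "Engineer"},
    {"EmployeeID": 3, "NationalIDNumber": "30845", "Title": "Manager"},
]

def key_strength(rows, column):
    """Percentage of rows holding a distinct value in `column`."""
    values = [r[column] for r in rows]
    return 100.0 * len(set(values)) / len(values)

for col in rows[0]:
    print(col, key_strength(rows, col))
```

Columns scoring 100% are the ones the profile flags as candidate keys.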


  5. Next, in the profile tree, expand the NationalIDNumber column and highlight the Column Length Distribution, and then in the distribution profile on the right, double-click the length distribution of 1, as shown below. Here we can see the minimum and maximum length of NationalIDNumber, the distribution of lengths across the column, and the actual data.
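What the Column Length Distribution computes is essentially a histogram of string lengths. A quick sketch of the same calculation (the sample values are made up for illustration):

```python
from collections import Counter

# Illustrative equivalent of the Column Length Distribution profile:
# count how many values have each string length.
values = ["14417807", "24756624", "30845", "509647174", "112457891"]

length_distribution = Counter(len(v) for v in values)
minimum, maximum = min(length_distribution), max(length_distribution)
print(length_distribution, minimum, maximum)
```

Outliers in this histogram (e.g., one value of length 1 in a column of 8-digit IDs) are exactly the rows worth double-clicking into.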



  6. Under NationalIDNumber in the profile browser, select the Column Null Ratio Profile and then double-click the row in the profile viewer on the right to view the detail rows. As seen, there are no null values here, but if there were, the distribution would be shown accordingly.
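The null ratio itself is just the fraction of rows where the column is NULL, as this small sketch with made-up data shows:

```python
# Illustrative equivalent of the Column Null Ratio profile: the fraction
# of rows where the column is NULL (None in Python).
column = ["14417807", None, "30845", None, "509647174"]

null_ratio = sum(1 for v in column if v is None) / len(column)
print(f"{null_ratio:.0%}")  # prints 40%
```

A non-zero ratio on a column you expected to be mandatory is an early warning to add NULL handling to the ETL.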



  7. Next, select the Column Value Distribution Profile on the left under the Title column and observe the output in the results viewer. How many unique values are there in the entire table? Here, it is 67.
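The Column Value Distribution Profile amounts to a distinct count plus a frequency per value, which is easy to sketch (again with made-up sample titles):

```python
from collections import Counter

# Illustrative equivalent of the Column Value Distribution profile:
# number of distinct values and how often each occurs.
titles = ["Engineer", "Engineer", "Manager", "Engineer", "Accountant"]

distribution = Counter(titles)
print(len(distribution), distribution.most_common())
```

A low distinct count like this often signals a good candidate for a dimension lookup table.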


  8. The Column Pattern Profile can also help create regular expressions for validations, as shown below. It uses regular-expression syntax to display the pattern, or range of patterns, that the data in the column contains.
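A pattern reported by this profile can be reused almost verbatim as an ETL validation rule. A small sketch, assuming the profile reported a pattern like `\d{8}` (the pattern and values here are assumptions for illustration; take yours from the actual profile output):

```python
import re

# ASSUMED pattern for illustration -- use the one the Column Pattern
# Profile actually reports for your column.
PATTERN = re.compile(r"^\d{8}$")

def is_valid(value):
    """Reject values that do not match the profiled pattern."""
    return bool(PATTERN.match(value))

print(is_valid("14417807"), is_valid("ABC123"))
```

Rows failing the check could be routed to an error output in the ETL instead of being loaded.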


  9. The final data profiling type to review is the Column Statistics Profile. This is only applicable to data types related to numbers (integer, float, decimal, numeric) and dates (dates only allow minimum and maximum calculations). The maximum, minimum, mean, and standard deviation of the field are calculated using this profile.
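These four statistics are standard aggregates, sketched below on made-up numeric sample data:

```python
import statistics

# Illustrative equivalent of the Column Statistics profile for a numeric
# column: min, max, mean, and standard deviation.
vacation_hours = [21, 42, 80, 2, 48]

stats = {
    "min": min(vacation_hours),
    "max": max(vacation_hours),
    "mean": statistics.mean(vacation_hours),
    "stdev": statistics.stdev(vacation_hours),
}
print(stats)
```

Knowing the real min/max up front lets you size warehouse column types correctly instead of guessing.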


Thus, as seen above, we can use the output of the Data Profiling Task to identify the range, quality, and patterns of the source data, which helps formulate the ETL design quickly and reliably. Isn't it a real time saver? Please let me know your feedback by commenting on this article.

Happy Datawarehousing :) !
