Tuesday, October 1, 2013

What is Data profiling?

What is data profiling?

Data profiling is process of exploring the data available in order to get some idea about it. We can perform several tasks in this step like,
  • Find number of missing values from each column (count)
  • Find null values from each column (these are not missing value)
  • Find count of useful values from each column (not-Null and not-missing)
  • Find percentage of useful values from each column
  • Find cardinality of each column (number unique values)
  • Find distinctness as percent calculated by dividing cardinality with actual value count
In addition we can also consider,
  • Column names and significance
  • For low cardinality columns actual values and their significance
Why should we do it?

As you can see clearly from the discussion above it helps to understand the data and its quality. According wikipedia,

"Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data"  

When should we do it?
  • When we are dealing with new data set and want to get idea about it
  • Several stages in data warehouse development to make sure we are getting the right data with right quality
How should we do it?
  • There are several tools available in market which offer data profiling like DQ Analyzer, datamartist etc.
  • Most of them will have in-build profiling feature for normal scenarios
  • Gartner have classified data quality tools into 4 magic quadrants based of ability to execute and vision. We can see it in the image below,

 

No comments:

Post a Comment