Data standardization: define, test and transform

As organizations shift toward establishing a data culture across the enterprise, many still struggle to get their data right. Retrieving data from disparate sources and getting different formats and representations of what appears to be the same information creates serious roadblocks in your data journey. Teams experience delays and errors when performing their routine tasks or extracting insights from datasets. Such issues force organizations to adopt a data standardization mechanism that ensures data exists in a consistent and unified view across the enterprise.

Let’s take a closer look at the data standardization process: what it means, what steps it involves, and how you can achieve a unified view of data across your organization.

What is data standardization?

Simply put, data standardization is the process of converting data values ​​from an incorrect format to a correct one. To enable a standardized, unified and consistent view of data across the enterprise, data values ​​must conform to the required standard – in the context of the data fields to which they belong.

Example of data standardization errors

For example, the record of the same customer residing in two different locations should not contain discrepancies in first and last name, email address, phone number, and residential address:

source 1
Surname Email-address phone number Date of birth gender residential address
John Oneel [email protected] 5164659494 2/14/1987 M 11400W Olimpic BL #200
source 2
first name Surname Email-address phone number Date of birth gender residential address
John O’neal +1 516-465-9494 2/14/1987 Masculine 11400W Olympic 200

In the example above, you can see the following types of inconsistencies:

  1. Structure: The first source covers the customer name as a single field, while the second stores it as two fields – first and last name.
  1. Sample: The first source has a valid email pattern enforced in the email address field, while the second source visibly lacks that @ Symbol.
  1. Data type: The first source only allows digits in the phone number field, while the second has a string type field that also contains symbols and spaces.
  1. Format: The first source has the date of birth in MM/DD/YYYY format while the second has it in MM/DD/YYYY format.
  1. Domain value: The first source allows storing the gender value as M or F, while the second source stores the full form – male or female.

Such data inconsistencies cause you to make serious mistakes that can cost your business a lot of time, cost and effort. Because of this, implementing an end-to-end mechanism to standardize data is critical to maintaining your data hygiene.

How is data standardized?

Data standardization is a simple four-step process. But depending on the type of inconsistencies present in your data and what you want to achieve, the methods and techniques used for standardization can vary. Here we present a general rule of thumb that any organization can use to overcome their standardization mistakes.

  1. Define what the default is

In order to achieve a state, you must first define what the state actually is. The first step in any data standardization process is to identify what needs to be accomplished. The best way to know what you need is to understand the business needs. You need to scan your business processes to see what data is needed and in what format. This will help you set a baseline for your data needs.

A data standard definition helps identify:

  • The databases that are crucial for your business process,
  • The required data fields of these assets,
  • The data type, format, and pattern that their values ​​must conform to,
  • The range of acceptable values ​​for these fields, etc.
  1. Test records against the defined standard

Once you have a standard definition, the next step is to test how well your data sets perform against them. One way to assess this is to use data profiling tools, which produce comprehensive reports and find information such as the percentage of values ​​that fit the data field’s requirements, such as: e.g.:

  • Do the values ​​match the required data type and format?
  • Are the values ​​outside the acceptable range?
  • Do values ​​use short forms such as abbreviations and nicknames?
  • Are addresses standardized as needed – such as USPS standardization for US addresses?
  1. Transform non-conforming values

Now it’s finally time to transform values ​​that don’t conform to the defined standard. Let’s take a look at common data transformation techniques that are used.

Some data fields must first be parsed to obtain the required data components. For example, parse the name field to separate the first, middle, and last name and any prefixes or suffixes included in the value.

  • Data type and format conversion

You may need to remove non-compliant characters during conversion, e.g. B. Removing symbols and letters from a phone number that consists only of digits.

  • Pattern matching and validation

The pattern conversion is done by configuring a regular expression for the pattern. For example, an email address can be validated with the regex: ^[a-zA-Z0-9+_.-][email protected][a-zA-Z0-9.-]+$. Values ​​that don’t match the regex must be parsed and converted to the defined pattern.

Company names, addresses, and personal names often contain abbreviated forms, which can result in your record containing different representations of the same information. For example, you may need to expand country states, e.g. B. Convert NY to New York.

  • Noise reduction and spell correction

Certain words don’t really add meaning to a value and instead introduce a lot of noise into a data set. Such values ​​can be identified in a dataset by comparing it to a dictionary containing those words, labeling them, and deciding which to permanently remove. The same process can be followed to find misspellings and typos.

  1. Test the dataset again against the defined standard

In the final step, the transformed dataset is tested again against the defined standard to determine the percentage of data standardization errors resolved. For the errors that still remain in your dataset, you can tweak or reconfigure your methods and run the data through the process again.

Wrap up

The amount of data being generated today – and the variety of tools and technologies used to collect that data – is leading organizations to confront the terrible data chaos. They have everything they need, but aren’t quite sure why the data isn’t in an acceptable and usable form. Adopting data standardization tools can help correct such inconsistencies and enable a much-needed data culture in your organization.

Originally published at

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée. Les champs obligatoires sont indiqués avec *