What Does Dirty Data Mean?
Dirty data is data that contains erroneous information. The term may also refer to data that is held in memory and has not yet been loaded into a database. Completely removing dirty data from a source is impractical, if not virtually impossible.
The following can be considered dirty data:
- Misleading data
- Duplicate data
- Incorrect data
- Inaccurate data
- Non-integrated data
- Data that violates business rules
- Data without standardized formatting
- Incorrectly punctuated or spelled data
Techopedia Explains Dirty Data
In addition to incorrect data entry, dirty data can result from improper data management and data storage methods. Some types of dirty data are explained below:
- Incorrect data - For data to be valid, or correct, the value entered must be one of the field's valid values. For instance, the value entered in a month field should range from 1 to 12, and an individual's age should be less than 130. Correctness can be enforced programmatically with lookup tables or edit checks.
- Inaccurate data - A data value can be correct but not accurate. Sometimes it is practical to check a value against other files or fields to determine whether it is accurate in the context in which it is used. Still, accuracy can often be validated only by manual verification.
- Business rule violations - Data that violates a business rule is another type of dirty data. For instance, an effective date must always come before an expiry date. Another example is a Medicare insurance claim for a patient who is still under the retirement age and so is not yet entitled to Medicare.
- Inconsistent data - Unchecked data redundancy leads to data inconsistencies. Nearly every organization is affected by inconsistent and repetitive data, and the problem is particularly common with customer data.
- Incomplete data - Data with missing values is the main type of incomplete data.
- Duplicate data - Duplicate data may occur due to repeated submissions, improper data joining or user error.
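The edit checks mentioned above for incorrect data and business rule violations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (month, age, effective_date, expiry_date) and the function name find_violations are hypothetical, and the lower bound of 0 for age is an added assumption; only the 1 to 12 month range, the 130 age limit and the date ordering rule come from the text.

```python
from datetime import date

VALID_MONTHS = range(1, 13)  # acts as a small lookup of valid month values
MAX_AGE = 130                # upper bound taken from the text above


def find_violations(record):
    """Return a list of rule violations for one record (a dict).

    Field names are hypothetical; a missing field counts as a violation.
    """
    violations = []

    # Edit check: month must be one of the valid values 1..12.
    if record.get("month") not in VALID_MONTHS:
        violations.append("month must be between 1 and 12")

    # Edit check: age must be below 130 (non-negative is an assumption).
    age = record.get("age")
    if age is None or not (0 <= age < MAX_AGE):
        violations.append("age must be between 0 and 129")

    # Business rule: the effective date must come before the expiry date.
    effective = record.get("effective_date")
    expiry = record.get("expiry_date")
    if effective is None or expiry is None or not effective < expiry:
        violations.append("effective date must come before expiry date")

    return violations


clean = {
    "month": 7,
    "age": 42,
    "effective_date": date(2024, 1, 1),
    "expiry_date": date(2024, 12, 31),
}
dirty = {
    "month": 13,
    "age": 150,
    "effective_date": date(2024, 12, 31),
    "expiry_date": date(2024, 1, 1),
}

print(find_violations(clean))
print(find_violations(dirty))
```

Checks like these catch incorrect values and rule violations, but they cannot tell whether a valid value is accurate; as noted above, that often still needs manual verification.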
To improve data quality and prevent dirty data, organizations should adopt methodologies that ensure the completeness, validity, consistency, and correctness of their data.
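As one possible starting point for such a methodology, the sketch below flags incomplete and duplicate records in a small batch of customer data. The required fields (name, email), the normalization rule (collapse whitespace, ignore case) and the function names are assumptions made for illustration; real matching rules depend on the organization's data.

```python
REQUIRED_FIELDS = ("name", "email")  # hypothetical required fields


def normalize(value):
    """Collapse runs of whitespace and ignore case so near-identical values match."""
    return " ".join(str(value).split()).lower()


def quality_report(records):
    """Return indexes of incomplete records and (first, repeat) index pairs of duplicates."""
    incomplete = []
    duplicates = []
    first_seen = {}  # normalized key -> index of first record with that key

    for index, record in enumerate(records):
        # Completeness: every required field must be present and non-empty.
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            incomplete.append(index)
            continue  # incomplete records are not compared for duplicates

        # Duplicate detection: compare records on their normalized required fields.
        key = tuple(normalize(record[field]) for field in REQUIRED_FIELDS)
        if key in first_seen:
            duplicates.append((first_seen[key], index))
        else:
            first_seen[key] = index

    return {"incomplete": incomplete, "duplicates": duplicates}


records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace", "email": "ADA@example.com"},  # same customer, different formatting
    {"name": "Alan Turing", "email": ""},                   # missing email
]

print(quality_report(records))
# prints {'incomplete': [2], 'duplicates': [(0, 1)]}
```

Reporting problem records rather than deleting them leaves the decision about how to repair or merge them to a person or a later cleanup step.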