Definition - What does Dirty Data mean?
Dirty data refers to data that contains erroneous information. The term may also refer to data that is held in memory and has not yet been loaded into a database. Completely removing dirty data from a source is impractical or virtually impossible.
The following can be considered dirty data:
- Misleading data
- Duplicate data
- Incorrect data
- Inaccurate data
- Non-integrated data
- Data that violates business rules
- Data without a generalized formatting
- Incorrectly punctuated or spelled data
Techopedia explains Dirty Data
- Incorrect data - For data to be valid, each value entered must fall within the field's set of valid values. For instance, a value in the month field should range from 1 to 12, and an individual's age should be less than 130. Correctness can be enforced programmatically with lookup tables or edit checks.
- Inaccurate data - A data value can be correct but still inaccurate. Sometimes a value can be checked against other files or fields to determine whether it is accurate in the context where it is used. Often, however, accuracy can only be confirmed by manual verification.
- Business rule violations - Data that violates a business rule is another type of dirty data. For instance, an effective date must always come before an expiry date. Another example is a Medicare insurance claim for a patient who is still under the retirement age and therefore not entitled to Medicare.
- Inconsistent data - Unchecked data redundancy leads to inconsistencies. Every organization is affected by inconsistent and repetitive data, and the problem is especially common with customer data.
- Incomplete data - Data with missing values is the main type of incomplete data.
- Duplicate data - Duplicate data may occur due to repeated submissions, improper data joining or user error.
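Several of the checks described above can be automated. The following is a minimal Python sketch of such edit checks, not a definitive implementation. The record layout (dictionaries with `month`, `age`, `effective_date` and `expiry_date` keys) and the function names `find_problems` and `find_duplicates` are assumptions made for illustration; only the month range, the age limit and the effective/expiry date rule come from the examples in the text.

```python
from datetime import date

# Limits taken from the examples in the text.
VALID_MONTHS = range(1, 13)   # months 1 through 12
MAX_AGE = 130                 # age must be less than 130


def find_problems(record):
    """Return a list of problems found in one record (hypothetical dict layout)."""
    problems = []

    # Incomplete / incorrect data: month must be present and within 1..12.
    month = record.get("month")
    if month is None:
        problems.append("incomplete: month missing")
    elif month not in VALID_MONTHS:
        problems.append("incorrect: month out of range")

    # Incomplete / incorrect data: age must be present and below the limit.
    age = record.get("age")
    if age is None:
        problems.append("incomplete: age missing")
    elif not 0 <= age < MAX_AGE:
        problems.append("incorrect: age out of range")

    # Business rule: the effective date must come before the expiry date.
    effective = record.get("effective_date")
    expiry = record.get("expiry_date")
    if effective is not None and expiry is not None and effective >= expiry:
        problems.append("business rule: effective date not before expiry date")

    return problems


def find_duplicates(records, key_fields):
    """Return records whose key fields repeat an earlier record's values."""
    seen = set()
    duplicates = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    record = {
        "month": 13,
        "age": 45,
        "effective_date": date(2021, 1, 1),
        "expiry_date": date(2020, 1, 1),
    }
    for problem in find_problems(record):
        print(problem)
```

Checks like these catch incorrect, incomplete and duplicate values and simple rule violations. They cannot tell whether a value that passes is accurate, which, as noted above, often requires manual verification.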