Dirty Data

Definition - What does Dirty Data mean?

Dirty data refers to data that contains erroneous information. The term may also refer to data that has been modified in memory but not yet written to the database. Completely removing dirty data from a source is impractical, if not virtually impossible.

The following data can be considered as dirty data:

  • Misleading data
  • Duplicate data
  • Incorrect data
  • Inaccurate data
  • Non-integrated data
  • Data that violates business rules
  • Data without consistent formatting
  • Incorrectly punctuated or spelled data

Techopedia explains Dirty Data

In addition to incorrect data entry, dirty data can be generated by improper data management and storage methods. Some dirty data types are explained below:
  • Incorrect data - To be valid, a value must comply with the field's set of valid values. For instance, a value in a month field should range from 1 to 12, and an individual's age should be less than 130. Correctness of a data value can be enforced programmatically with lookup tables or edit checks.
  • Inaccurate data - A data value can be correct yet not accurate. At times it is practical to compare a value against other files or fields to determine whether it is accurate in the context in which it is used. Still, accuracy can often only be confirmed by manual verification.
  • Business rule violations - Data that violates a business rule is another type of dirty data. For instance, an effective date must always come before an expiry date. Another example is a Medicare insurance claim filed for a patient who is still under retirement age and therefore not entitled to Medicare.
  • Inconsistent data - Unchecked data redundancy leads to data inconsistencies. Every organization is affected by inconsistent and repetitive data. This is particularly common with customer data.
  • Incomplete data - Data with missing values is the main type of incomplete data.
  • Duplicate data - Duplicate data may occur due to repeated submissions, improper data joining or user error.
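The checks described above can be sketched in code. The following is a minimal, illustrative example, not a production data-quality tool: the field names (`month`, `age`, `customer_id`, `effective_date`, `expiry_date`) and the choice of duplicate key are assumptions made for the sketch.

```python
from datetime import date

def validate_record(record, seen_keys):
    """Collect dirty-data problems found in a single record.

    `seen_keys` is a set carried across calls so that duplicate
    submissions can be detected.
    """
    errors = []

    # Edit check: a month value must fall within the field's valid range.
    if not 1 <= record.get("month", 0) <= 12:
        errors.append("incorrect data: month outside 1-12")

    # Edit check: a plausibility limit on age.
    if not 0 <= record.get("age", -1) < 130:
        errors.append("incorrect data: age outside 0-129")

    # Business rule: the effective date must come before the expiry date.
    if record["effective_date"] >= record["expiry_date"]:
        errors.append("business rule violation: effective date not before expiry")

    # Duplicate check: key the record on an assumed natural key and
    # flag it if the same key was already submitted.
    key = (record.get("customer_id"), record["effective_date"])
    if key in seen_keys:
        errors.append("duplicate data: record already submitted")
    seen_keys.add(key)

    return errors

if __name__ == "__main__":
    seen = set()
    clean = {"month": 5, "age": 42, "customer_id": 1,
             "effective_date": date(2020, 1, 1),
             "expiry_date": date(2021, 1, 1)}
    print(validate_record(clean, seen))   # a clean record yields no errors
```

In practice such edit checks run at data entry or during batch loading, so that dirty records are rejected or flagged before they reach the database.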
To improve data quality and prevent dirty data, organizations should adopt methodologies that ensure the completeness, validity, consistency, and correctness of their data.