Data Wrangling


What Does Data Wrangling Mean?

Data wrangling is a process that data scientists and data engineers use to locate new data sources and convert the acquired information from its raw data format to one that is compatible with automated and semi-automated analytics tools.


Data wrangling, which is sometimes referred to as data munging, is arguably the most time-consuming and tedious aspect of data analytics. The wrangler's goal is to create strategies for selecting and managing large, aggregated datasets in order to produce a semantic data model.

The exact tasks required in data wrangling depend on what transformations the analyst needs to make a dataset usable. The basic steps involved in data wrangling include:

Discovery — learn what information is contained in a data source and decide if the information has value.

Structuring — standardize the data format for disparate types of data so it can be used for downstream processes.

Cleaning — remove incomplete and redundant data that could skew analysis.

Enriching — decide whether you have enough data or need to seek out additional internal and/or third-party sources.

Validating — conduct tests to expose data quality and consistency issues.

Publishing — make wrangled data available to stakeholders in downstream projects.
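The steps above can be sketched in a few lines of Python with the pandas library. This is a minimal illustration, not a production pipeline: the column names, raw records, and validation rules below are invented for the example.

```python
import pandas as pd

# Discovery: inspect a small raw extract (inlined here for illustration).
# Note the duplicate record, the missing date, and the unparseable spend value.
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104, 105],
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10",
                    None, "2023-03-22", "2023-04-01"],
    "monthly_spend": ["49.99", "19.99", "19.99", "99.00", "120.50", "n/a"],
})

# Structuring: standardize formats so downstream tools can parse them.
wrangled = raw.copy()
wrangled["signup_date"] = pd.to_datetime(wrangled["signup_date"], errors="coerce")
wrangled["monthly_spend"] = pd.to_numeric(wrangled["monthly_spend"], errors="coerce")

# Cleaning: drop duplicate customers and rows missing required fields.
wrangled = wrangled.drop_duplicates(subset="customer_id")
wrangled = wrangled.dropna(subset=["signup_date", "monthly_spend"])

# Validating: assert quality rules before handing the data off.
assert wrangled["customer_id"].is_unique
assert (wrangled["monthly_spend"] > 0).all()

# Publishing: expose the tidy table to downstream consumers
# (in practice this might be a database table or a shared file).
print(wrangled.to_string(index=False))
```

Running this leaves three clean rows (customers 101, 102, and 104): the duplicate, the record with no signup date, and the record with an unparseable spend value are all filtered out before publishing.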

In the past, wrangling required the analyst to have a strong background in scripting languages such as Python or R. Today, an increasing number of data wrangling tools use machine learning (ML) algorithms to carry out wrangling tasks with very little human intervention.

Techopedia Explains Data Wrangling

Unlike "cowboy coding," a derogatory term for programming that skips quality assurance (QA) testing, "data wrangler" is a legitimate job title for employees who work in data management.

The job requires a data engineer who has the technical skills to extract value from raw or unstructured data, and the business skills to ensure that the organization's data models are reliable, reproducible, accessible, interoperable and analyzable for multiple purposes.

In a cloud-first organization, the Chief Data Officer's or Chief Data Scientist's responsibilities often revolve around addressing the problem of how to aggregate distributed data into a series of processing pipelines so it can be ingested, curated and indexed, while the data wranglers determine how the pipelines should acquire and clean the data.



Margaret Rouse
Senior Editor

Margaret is an award-winning technical writer and teacher known for her ability to explain complex technical subjects to a non-technical business audience. Over the past twenty years, her IT definitions have been published by Que in an encyclopedia of technology terms and cited in articles by the New York Times, Time Magazine, USA Today, ZDNet, PC Magazine, and Discovery Magazine. She joined Techopedia in 2011. Margaret's idea of a fun day is helping IT and business professionals learn to speak each other’s highly specialized languages.