Data engineers are very much in demand these days, but too many executives and others have big questions about what these professionals do.
There is significant confusion about the difference between software engineers and data engineers, along with questions about how data scientists and data engineers work together. Factor all kinds of new big data projects into the mix, including machine learning and business insight tools, and the role of a data engineer, along with what their day-to-day work might consist of, becomes even harder to pin down.
A Concrete Data Refining Role
Speaking generally, the data engineer is responsible for working with data systems and refining data to fit into those systems, whereas a data scientist has a slightly different role, working directly on cleaning and organizing big data sets.
If there were one easy way to distinguish between what data scientists and data engineers typically do, you could say that the data scientist looks at the data through a comprehensive lens, while the data engineer looks at the data through the eyes of a database or big data processing system.
“Data engineers … specialize in translating the work of data scientists into hardened, data-driven software solutions for the business,” says Nima Negahban, CTO and founder of Kinetica, describing why data engineers will be in high demand in coming years. “This involves creating in-depth AI development, testing, devops and auditing processes that enable a company to incorporate AI and data pipelines at scale across the enterprise. That job of creating those hard and data-driven software solutions is a major part of what concerns data engineers in a modern enterprise.”
That delineation, the idea that data engineers work directly with big data systems, is a key way to understand what the data engineer offers an employer.
Data Engineers and a Changing Big Data Landscape
As the maintainers of big data systems and database setups, data engineers will often be knowledgeable in specific technologies like Apache Hadoop.
But they will also tend to know a lot about how these big data processing systems have evolved and which contenders are gaining popularity in today’s enterprise world.
Just a few years ago, Apache Hadoop was the gold standard for big data processing. Data engineers used Hadoop components like YARN and MapReduce to produce clustered, structured data handling systems.
Now, Hadoop seems to be losing out to other types of systems.
In an article published just a few months ago at The New Stack, titled “Will Kubernetes Sink the Hadoop Ship?”, writer Yaron Haviv notes that competitors Cloudera and Hortonworks have now merged, and that newer Apache tools like Spark push Hadoop toward a kind of obsolescence.
In addition, cloud vendors have their own big data processing systems, which may also fit well into a data engineer’s workflow.
A third and very large movement is toward container virtualization. In a container setup, containers share the host operating system’s kernel and present thin attack surfaces, while maximizing efficiency throughout the platform. Container orchestration platforms like Kubernetes have taken over many of the projects that used to run on Hadoop, and before that, on simple relational database servers.
“One of Kubernetes’ greatest advantages is its portability,” writes Haviv, “enabling users to build clusters which span multiple clouds or are distributed across locations. Portability also facilitates the development or testing of microservices in the cloud and deployment in one or many edge locations automatically.”
Data Engineers: Refining Data
Data engineers also have key roles related to taking raw data and making it structured. Data scientists may do some of this, too, but data engineers will typically focus on refining raw data and filtering it into a specific database system. You can think of them as the “system operators” or “system owners” in the data refinement process: they often think about data cleansing in the context of a specific environment. (For more on data scientists, see Job Role: Data Scientist.)
An informative introductory resource at DataScienceGraduatePrograms.com describes this aspect of data engineering:
“Data engineers focus on the applications and harvesting of big data. Their role doesn’t include a great deal of analysis or experimental design. Instead, they are out where the rubber meets the road … creating interfaces and mechanisms for the flow and access of information.”
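To make the idea of refining raw data into a specific database system more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample records, the users table, and the clean and load helpers are all hypothetical, and SQLite stands in for whatever database an organization actually runs.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from an upstream export.
RAW_ROWS = [
    {"id": "1", "email": " Ada@Example.com ", "signup": "2019-01-05"},
    {"id": "2", "email": "", "signup": "2019-01-06"},                  # missing email
    {"id": "x3", "email": "bob@example.com", "signup": "2019-01-07"},  # malformed id
    {"id": "4", "email": "CAROL@example.com", "signup": "2019-01-08"},
]


def clean(row):
    """Return an (id, email, signup) tuple, or None if the row is unusable."""
    email = row["email"].strip().lower()
    if not row["id"].isdigit() or not email:
        return None
    return int(row["id"]), email, row["signup"]


def load(rows, conn):
    """Create the target table and insert only the rows that pass cleaning."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, signup TEXT NOT NULL)"
    )
    cleaned = [r for r in map(clean, rows) if r is not None]
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return len(cleaned)


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
loaded = load(RAW_ROWS, conn)
stored = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(loaded, stored)
```

The cleaning rules here are shaped by the target table (an integer key, a required email), which is the sense in which a data engineer thinks about data cleansing in the context of a specific environment.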
Storing Data
Companies also have myriad choices in how to store data, and the data engineer may be responsible for assessing them. For instance, it may be more helpful to use cloud storage services from Amazon or other vendors. Amazon’s S3 object storage model offers ways to handle stored information that depart from the traditional redundant array of independent disks (RAID) systems that were the norm just a few years ago.
Data Engineers as Matchmakers
Like people in other roles, data engineers also have a part to play within the organizational structure: they help move the business forward by making sure that goals and objectives match the structures that are in place.
Some of this requires seeking buy-in from executives or other stakeholders. Some of it requires making sure that middleware plugs into a data repository, or that big data systems can do their magic unencumbered by bottlenecks. All of this is often within the purview of the data engineer, who moves refined and curated data through specific, concrete IT systems and database models in a way that facilitates core business goals.
All of this shows how data engineers are very much the “guardians of the data storehouse” – when issues intersect the nature of big data and the systems that utilize or store it, they are often front and center in the org chart’s response. Think about how the data engineer fits into today’s, and tomorrow’s, business world.