Becoming a Data Scientist: What You Need to Know
Today data science is at the heart of nearly every business and organization. As the streams of data keep growing, there is a greater need than ever before not only to collect data, but to sift through it and analyze it to direct decisions. Consequently, organizations need the skills and expertise of a data scientist, and many even build whole data science teams.
That demand for data scientists is still generally ahead of the supply, which accounts for both the large number of openings and the higher-than-average salary. According to Glassdoor’s figures, the median base salary for a data scientist is $108,000. The high pay is not compensation for a job people don’t enjoy, either. In fact, data scientist ranks as the best job in America, with a job satisfaction rating of 4.3 out of 5.
Defining the Role of a Data Scientist
Far more than a mere quant, the successful data scientist is a creative thinker and problem solver with domain understanding. In light of the fact that extracting value from data entails not just skill but art, some years ago, Venture Beat suggested that “data artist” may be more accurate: “Perhaps these scientists are not the Einsteins and Edisons but the Van Goghs and Picassos of the big data revolution.”
Data scientists don’t merely observe and quantify, but come up with creative approaches to extracting insight and value from data. A successful data scientist is not just someone who has checked off the list of hard skills. He or she has to have the ability to think about how to approach a problem in a new way that opens the way to a solution and then effectively communicate what worked and why.
The question is: What does one have to do to get on track to launch a career in data science? There are core skills that most people agree on, but there is also the question of the capabilities a data scientist has to possess to do more than merely crunch numbers and program models. In the upcoming sections of this tutorial, some experts offer their insight on what it takes to prepare for a career in data science.
Preparing to Get Qualified as a Data Scientist: Be Ready for Change
How does one go about training to become a data scientist?
One of the sources most cited for practical direction in pursuing a career in data science is a KDnuggets article that was originally published in 2014 but was regularly updated and expanded through 2018, though it retains the number of the original title. So 9 Must-have skills you need to become a Data Scientist, in fact, lists 13 skills, though whether or not all should count as “skills” is debatable.
KDnuggets’ list of must-have skills begins with formal education in terms of university degrees. The article points out that the majority of data scientists possess advanced degrees: 46% have PhDs, and 88% hold at least a master’s-level degree. They also build up their core skills at the undergraduate level. The most popular choice for this career track is a bachelor’s in math and statistics, which accounts for about a third. The next most popular is a computer science degree, which is held by 19%. The third choice, which accounts for 16%, is engineering.
Any of those choices would contribute to requisite skills for the field, though some shift into it from hard sciences or even from the arts. Certainly, a number of students at the NYC Data Science Academy enter the program with a degree in another field and get up to speed on coding and math prior to taking the plunge into the data science focus. Such students already understand the necessity of learning new skills to adapt to the needs of the workplace.
As the coding and other technical skills that data scientists need to know can vary over time — something we will look into a bit further on — a data scientist must above all retain the motivation and adaptability to acquire new skills and languages. Given the rapid pace of technology, the techniques involved in data science seven years down the road will look quite different from the ones that are currently in use.
This kind of change is inevitable, according to a recent article on what is involved in staying “relevant in the future of work.” In contrast to the old normal in which people would qualify for their professions at the beginning of their careers and then just keep doing the same thing, people who want to stay in the game tomorrow will have to keep learning new skills. “The half-life of a skill has dropped from 30 years to an average of 6 years,” it explains.
What that means for those who are currently seeking to get qualified is that they should not expect their training ever to be finished. The old normal of “learn at school and do at work” is no longer sustainable in the corporate world. Learning new technology and processes in order to keep up with the changes at work is not just a matter of getting ahead; it is a matter of survival.
Given the reality of today’s world, the education of a data scientist must encompass more than obtaining a degree in computer science and a certificate in data science or taking the courses for the various tools used in the profession. It’s a matter of learning how to approach problems like a data scientist and then using the various tools available to obtain the best insight and models to suit an organization’s goals. Keeping on top of the game will require keeping up with new techniques that emerge.
But you still will have to begin with a core list of skills, and we will look at that in the next section.
The Technical Skills Needed by a Data Scientist and How to Acquire Them
What are the technical skills of a data scientist? The answers to this question do vary. In the interest of taking cues from real life rather than just from curricula, we’ll look at the 15 skills that a Kaggle survey identified as the most used in the field.
As you can see from the graph of the top 15 choices below, Python is by far the top skill, claiming more than 76%. Second place goes to R at just under 60%. SQL is somewhat behind that, coming in under 54%. There’s also a significant showing for fourth place, which goes to Jupyter Notebooks with just over 40%.
The shares then drop below 29% for TensorFlow, followed by Amazon Web Services, which is just a bit ahead of Unix shell, as both top 23%. The categories that follow are all hovering under the 20% mark, and that encompasses: Tableau, C/C++ and NoSQL. The next two that are very close are MATLAB/Octave and Java, both with over 18%. Likewise, Hadoop/Hive/Pig is just barely ahead of Spark, with both just over 17%. One more skill just makes the cut, and that’s Microsoft’s Excel Data Mining, a tool used by a bit under 14%.
Python Is #1
Python’s pride of place in data science is established not just by Kaggle but by other surveys, as attested to by the attention it has been getting in media. Last year, for example, The Economist ran the headline “Python is becoming the world’s most popular coding language.” Even though C++ has had a type of resurgence, as reported by Tech Republic, Python still retains pride of place for data science. In fact, a recent Dice article reported, “Python is on the precipice of becoming the programming language to know if you want a well-paid engineering job on Wall Street.”
So clearly Python is very important, and, as we’ll see in the next section on what a Python course includes, it encompasses some of the other skills that made the list, as well.
Learning the Languages and Skills
People who want to master Python have a number of options. Self-motivated individuals can learn on their own by referring to books, YouTube tutorials, and self-directed practice. Those who want more instruction and direction can sign up for courses either in colleges or at specialized coding schools. Both of these will often include an online option.
There are some beginner level courses that are free. However, typically, there is some charge for the more advanced courses, as well as the ones that offer a certificate that can be added to a resume.
A thorough course will not just provide instruction in the language itself, but in supplementary packages. That means that individuals who complete a Python data science course shouldn’t just learn the basics of Python coding, but also the following:
- An in-depth understanding of data science processes, data wrangling, data exploration, data visualization, and hypothesis building and testing, including knowledge on how to install the Python environment and its auxiliary tools and libraries
- Understanding and application of the concepts of Python and associated packages, including NumPy, SciPy, Pandas, Scikit-Learn and the matplotlib library
- Expertise in machine learning and natural language processing with open-source Jupyter Notebooks
- Knowledge of how to use web scraping to extract useful data from websites
- Insight on how to integrate Python with Hadoop, Spark and MapReduce
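To make the first items on that list more concrete, here is a minimal sketch of a wrangling and exploration step. It deliberately uses only the Python standard library rather than Pandas, so that it runs anywhere; a course would more likely do the same thing with a DataFrame. The sample data, the column names and the helper functions are all hypothetical and exist only for illustration.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical sample data: a tiny CSV in which one row is missing a value.
RAW_CSV = """region,units_sold
north,12
south,7
north,
south,9
north,15
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Drop rows with a missing units_sold value and convert the rest to int."""
    cleaned = []
    for row in rows:
        value = row["units_sold"].strip()
        if value:
            cleaned.append({"region": row["region"], "units_sold": int(value)})
    return cleaned


def mean_by_region(rows):
    """Group the cleaned rows by region and compute the mean units sold."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["region"]].append(row["units_sold"])
    return {region: statistics.mean(values) for region, values in grouped.items()}


rows = clean_rows(load_rows(RAW_CSV))
summary = mean_by_region(rows)
print(summary)
```

Dropping the incomplete row is only one possible choice. Depending on the project, filling in a default or an estimated value may be more appropriate, and deciding between those options is part of the wrangling work itself.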
While not all of the above are listed explicitly among the top skills in the Kaggle survey, they are generally considered part of the data science toolkit. There also is some overlap in the purpose of the skills identified on Kaggle and the ones obtained in Python courses. Tableau, for example, is a data visualization tool that some data scientists use, though the thoroughly trained ones will also master other tools to be used according to the requirements of the particular project they are working on.
A number of data science programs will teach only Python because of its immense popularity. There are also some that offer a course of study based only on R. But for the person who aspires to really be as well-grounded as possible, a mastery of both is the way to go. As Drace Zhan, a data scientist at NYC Data Science Academy, observed in 12 Key Tips for Learning Data Science, “Python is ideal but R is a great fall back tool. It’s best to have both in your arsenal.”
For those who are not enrolled in a university or data science course of study, there are additional options recommended in The 5 Most Effective Ways to Learn R. They include taking an online course, reading books, watching instructional videos and reading blogs. It particularly recommends the following:
- Revolutions (Microsoft’s R blog)
- Civil Statistician
- Flowing Data
- Datazar Blog
Zhan considers SQL to be “extremely important for a Data Analyst.”
There are a number of free or very low cost online courses available on the subject. Javarevisited recommends five options.
The first is Udemy’s course offerings, particularly Complete SQL Bootcamp. The second is SQLZOO, which is described as “the most popular website for learning SQL online.” The third is a free SQL course provided by Stanford University. The fourth is Khan Academy’s “Intro to SQL: Querying and managing databases.” The fifth is SQL Bolt, which is presented as a very good bet even for those with no coding background. It offers “20 lessons starting from a basic SQL query to more advanced and confusing Join queries, aggregation, filtering and dealing with nulls.”
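For a sense of what those topics look like in practice, the following sketch runs a join, an aggregation, a filter and some null handling against an in-memory SQLite database through Python’s built-in sqlite3 module. It is not taken from any of the courses mentioned; the tables, column names and values are made up for illustration.

```python
import sqlite3

# An in-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES
    (1, 'Ada', 'London'), (2, 'Grace', NULL), (3, 'Linus', 'Helsinki');
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 15.0);
""")

# Join + aggregation: a LEFT JOIN keeps customers with no orders, and
# COALESCE replaces the resulting NULLs with readable defaults.
totals = conn.execute("""
    SELECT c.name,
           COALESCE(c.city, 'unknown') AS city,
           COUNT(o.id) AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.city
    ORDER BY total DESC
""").fetchall()

# Filtering on nulls: NULL must be tested with IS NULL, not with "=".
missing_city = conn.execute(
    "SELECT name FROM customers WHERE city IS NULL"
).fetchall()

for row in totals:
    print(row)
print(missing_city)
conn.close()
```

The same queries would work with little or no change in most other SQL databases, which is part of why the language transfers so well between jobs.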
Rounding Out the Technical Skills
Zhan added that math skills enter into thorough comprehension of popular data science techniques, including “generalized linear models, decision tree, K-means, and statistical tests.”
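To show where the math enters, here is a minimal sketch of K-means, one of the techniques Zhan names, written in plain Python. It alternates between assigning each point to its nearest centroid (a Euclidean distance calculation) and moving each centroid to the mean of its points. The sample points and the naive choice of the first k points as starting centroids are assumptions made for illustration; in practice most people would use a library implementation such as Scikit-Learn’s KMeans, which uses a more careful initialization.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centroids.append(old)  # leave an empty cluster where it was
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; stop once the centroids no longer move."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two visually obvious groups of three points each.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(POINTS, k=2)
print(centroids)
print(labels)
```

Even in this small form, the algorithm relies on distance, averaging and a convergence check, which is the kind of understanding that lets a data scientist judge when a result can be trusted.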
Most of the rest of the top-ranked Kaggle skills are either included in the applications data scientists learn in the course of mastering Python or R, or are languages unto themselves that can be studied formally in school, online or through the self-directed means discussed for R. The same goes for Excel, though it isn’t a language but a component of the Microsoft Office suite. While it isn’t regarded as a true data science tool, it likely is used by businesses because it is familiar to people working there and has some built-in visualization tools. Many people learn Excel in college or just by working with it on the job and checking for tutorials on techniques.
But That’s Not All
There are still other skills entailed in being a data scientist, though. We’ll explore those in the last section.
Getting the Right Mix: Data Science Takes More than Math and Coding
As noted at the beginning, even though hard skills make up the core of data science, there are also some soft skills involved in bridging the gap between the data and its meaning, and between the information presented and the actionable insight. This is why a mix of skills, including the technical and the creative, is needed to be successful in the field.
A mix of skills for the professional data scientist is what emerges from the list Roger Huang presents in Every Data Science Interview Boiled Down To Five Basic Questions. Those five questions work out to 60% hard skills, 20% soft skills and 20% ability to apply knowledge to the situation. The hard skills make up three of the questions: one on math, one on coding and one on statistics.
Soft skills come into play in answering what Huang calls “behavioral questions,” which assess the applicant’s fitness for the company culture. Then there is what he calls the “scenario question,” the one that challenges applicants to demonstrate their ability to apply what they’ve learned to a particular situation and outline an approach that could work. Mastering scenario questions draws on the power of imagination or creativity, as well as communication skills and business acumen, soft skills that KDnuggets includes in the list of mandatory skills for a data scientist.
What’s Creativity Got to Do with It?
Bill Pardi explained why creativity is essential for data science success in an article on Medium. He clarified as follows: “What I mean by creativity in this context is the process of asking questions and experimenting. Creativity allows us to take the data we have, question our starting assumptions about what the data is telling us, and experiment until we make something useful out of it.”
Pardi offered the analogy of a chef who is the one who has the vision and skill to take the raw food and turn it into a spectacular dish. Without the chef’s cooking skills, the ingredients will not reach their potential. Data itself is a raw ingredient — not the finished product of data science, which is insight.
“For data to support truly creative or innovative outcomes, we must allow it to inform us of the facts so we can ask questions and experiment with the ‘adjacent possible’ to discover the insights and potential that the raw data doesn’t provide” is the gist of his argument.
Using Both Sides of the Brain for Success in Data Science
Pardi’s take on the need for creativity agrees with the insight Olivia Parr-Rud shared in 12 Key Tips for Learning Data Science. She insisted that data scientists need to use “art as much as a science.” She added that it is a mistake to consider “data science as a career that primarily uses the left-brain” when, in fact, “data scientists must use their whole brain.”
Integrating both parts of the brain is what makes it possible to do more than merely observe patterns, she explained:
Most left-brain/linear tasks can be automated or out-sourced. To offer a competitive edge as data scientists, we must be able to recognize patterns and synthesize large quantities of information using both sides of our brain. And we must be innovative thinkers.
It’s not just a matter of thinking creatively but conveying the ideas in a way that makes sense to the intended audience. That means data scientists have to be able to put themselves in the shoes of the decision-makers to see things from their perspective and explain the significance of the analytics in their terms.
As Parr-Rud put it: “Most executives don’t understand what we do or how we do it. So we need to think like leaders and communicate our findings and recommendations in language that our stakeholders understand and trust.”
This is where the data scientist needs to draw on three of the four soft skills identified by KDnuggets: teamwork, communication skills and business acumen. Some substitute domain expertise for business acumen. That refers to understanding what the particular context for the data and the goals of the analytics are.
Without deep domain expertise, Dean Abbott, co-founder and chief data scientist at SmarterHQ, observed in an interview, “you don’t know what you’re looking for.” Data scientists have to communicate clearly with the people in the business who know the ins and outs of its operations to learn “which metrics are significant.”
What It’s All About
What about the fourth soft skill KDnuggets included? That’s intellectual curiosity, which underlies all motivation to frame questions and set up the process of finding answers.
This is what brings us to the very essence of science as Einstein described it: “The mere formulation of a problem is far more essential than its solution, which may be merely a matter of mathematical or experimental skills. To raise new questions, new possibilities, to regard old problems from a new angle requires creative imagination and marks real advances in science.”