Data Science – Part 2 – Data Scientist

Chirag Sanghavi

Preface

After my article on data science, you might wonder who a data scientist is and what the roles and responsibilities of a data scientist are. This article answers both questions.

DJ Patil (Chief Data Scientist of the United States Office of Science and Technology Policy) and Jeff Hammerbacher (a prominent data scientist as well as chief scientist and cofounder at Cloudera) coined the term “data scientist” in 2008, which is when it emerged as a job title. (The term “data science” entered Wikipedia in 2012.)

In 2012, Harvard Business Review declared data scientist “the sexiest job of the 21st century.”

Who is a Data Scientist?

So who is a data scientist? Why is there so much buzz about this role today, and what does a data scientist actually do? Let’s look at statements by well-known people already in the field.

According to Dr. Usama M. Fayyad, a veteran data scientist, “A data scientist is someone who knows a lot more statistics than a software engineer and a lot more software engineering than a statistician.”

But when I started digging into the field of data science, the role turned out to be much more than a software engineer plus a statistician. It also calls for a great storyteller and a bold decision maker who acts on the data. The following definitions come from data science experts Shlomo Argamon (Ph.D., Professor of Computer Science and Program Director, Master of Data Science, at the Illinois Institute of Technology) and Monica Rogati (data science advisor, former VP of Data at Jawbone and former LinkedIn data scientist).

“Data Scientist = Statistician + programmer + coach + storyteller + artist” – Shlomo Argamon

“A data scientist is half hacker and half analyst, and he uses data to build and derive insights” – Monica Rogati

So what exactly does a data scientist do? The rest of this article answers that question, starting with a clear distinction between the terms data engineer and data scientist.

Data Engineer Vs Data Scientist

What is the exact role of a data scientist in terms of actual practical work? To understand the roles and responsibilities of a data scientist, we first need to clear up the confusion between the terms data scientist and data engineer.

The data engineer works with database systems, data APIs and ETL tools, and is involved in data modeling and setting up data warehouse solutions. The data scientist, on the other hand, needs to know statistics, mathematics and machine learning in order to build predictive models, automate the work and tell the story to the key stakeholders.
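To make the split concrete, here is a minimal sketch in Python that uses only the standard library. The table name `daily_sales`, the column names and all the numbers are invented for illustration; they do not come from any real system. The first function plays the data engineer's part (a tiny ETL step into a SQLite "warehouse"), and the second plays the data scientist's part (fitting a simple predictive model on the cleaned data).

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the values are made up for illustration.
RAW_CSV = """day,ad_spend,sales
1,10,25
2,20,45
3,,
4,40,85
"""


def run_etl(raw_text, conn):
    """Data engineer side: extract rows, drop incomplete ones, load a table."""
    conn.execute(
        "CREATE TABLE daily_sales (day INTEGER, ad_spend REAL, sales REAL)"
    )
    rows = csv.DictReader(io.StringIO(raw_text))
    clean = [
        (int(r["day"]), float(r["ad_spend"]), float(r["sales"]))
        for r in rows
        if r["ad_spend"] and r["sales"]  # skip rows with missing values
    ]
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", clean)
    conn.commit()
    return len(clean)


def fit_line(xs, ys):
    """Data scientist side: least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


conn = sqlite3.connect(":memory:")
loaded = run_etl(RAW_CSV, conn)

# The data scientist reads the cleaned table, not the raw export.
data = conn.execute(
    "SELECT ad_spend, sales FROM daily_sales ORDER BY day"
).fetchall()
slope, intercept = fit_line([d[0] for d in data], [d[1] for d in data])

# Predicted sales for a (hypothetical) ad spend of 30.
predicted_sales = slope * 30 + intercept
print(loaded, round(slope, 2), round(intercept, 2), round(predicted_sales, 2))
```

In practice the two halves would live in different systems and usually in different teams; the point of the sketch is only that the data scientist starts from the table the data engineer has already loaded.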

Figure: Data Scientist Vs Data Engineer

I will not go into detail about the responsibilities of data engineers here; instead, I will explain the responsibilities of a data scientist.

Responsibilities of a Data Scientist

1. Data scientists usually receive data that has already passed a first round of cleaning and manipulation. They feed it to sophisticated analytics programs and to machine learning and statistical methods to prepare it for use in predictive and prescriptive modeling.

2. To build models, they need to research industry and business questions, and they need to leverage large volumes of data from internal and external sources to answer business needs. This sometimes also involves exploring and examining data to find hidden patterns.

3. Once data scientists have done the analyses, they need to present a clear story to the key stakeholders. When the results are accepted, they need to make sure that the work is automated so that the insights can be delivered to the business stakeholders on a daily, monthly or yearly basis.

4. The data scientist needs to be aware of distributed computing, because he or she will need access to the data that has been processed by the data engineering team. He or she will also need to be able to report to the business stakeholders, so a focus on storytelling and visualization is essential.
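The responsibilities above (explore, model, tell the story, automate) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the "two standard deviations" rule for unusual values, the moving-average forecast and the `weekly_orders` numbers are all hypothetical choices, not a prescribed method.

```python
from statistics import mean, stdev


def explore(values):
    """Explore: summarise the data and flag values far from the mean."""
    mu, sigma = mean(values), stdev(values)
    outliers = [v for v in values if abs(v - mu) > 2 * sigma]
    return {"mean": mu, "stdev": sigma, "outliers": outliers}


def forecast_next(values, window=3):
    """Model: a naive moving-average forecast of the next value."""
    return mean(values[-window:])


def build_report(name, values):
    """Story: turn the numbers into a sentence a stakeholder can read."""
    summary = explore(values)
    forecast = forecast_next(values)
    return (
        f"{name}: average {summary['mean']:.1f}, "
        f"{len(summary['outliers'])} unusual value(s), "
        f"next period forecast {forecast:.1f}"
    )


# Automate: once the stakeholders accept the analysis, a scheduler can
# call build_report on fresh data every day, month or year.
weekly_orders = [100, 102, 98, 101, 99, 160, 103]  # made-up numbers
report = build_report("Weekly orders", weekly_orders)
print(report)
```

A real project would replace the moving average with a properly validated model and the one-line report with charts, but the shape stays the same: a repeatable function that goes from data to a message for the business.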

The image below shows what this means in terms of focus on the steps of the data science process workflow.

Figure: Data Science Process flow

Languages, Tools and Software used by a Data Scientist

So what technology and tools does a data scientist use? The following overview includes both commercial and open source alternatives.

Data scientists use languages and statistical packages such as SPSS, R, Python, SAS, Stata and Julia to build models. The most popular of these are, without a doubt, Python and R. When you work with them for data science, you will most often reach for packages such as ggplot2, to make data visualizations in R, or Pandas, the Python data manipulation library. Of course, many more packages come in handy on data science projects, such as Scikit-Learn, NumPy, Matplotlib and Statsmodels.

In industry, you will also find that the commercial products SAS and SPSS do well, and that other tools such as Tableau, Rapidminer, Matlab, Excel and Gephi find their way into the data scientist’s toolbox.

Figure: Languages, Tools and Software

The figure above displays the languages, tools and software used by a data engineer (in the blue circle) and a data scientist (in the yellow circle).

You see again that one of the main distinctions between data engineers and data scientists, the emphasis on data visualization and storytelling, is reflected in the tools that are mentioned.

Tools, languages, and software that both parties have in common, as you might have already guessed, are Scala, Java, and C#.

These languages are not necessarily equally popular with data scientists and data engineers: you could argue that Scala is more popular with data engineers, because its integration with Spark is especially handy for setting up large ETL flows.

The same is partly true of Java: at the moment its popularity is on the rise with data scientists, but overall it is not widely used on a daily basis by professionals. The same can also be said about tools that both parties could have in common, such as Hadoop, Storm, and Spark.

Conclusion

I hope this article has clearly defined the roles and responsibilities of a data scientist, and that you enjoyed reading it.

Stay connected for more articles on data science.

