In recent months and years, many blog posts discussing what a “data scientist” is have been trending on the social Web, as the term has become very popular. Because there is so much hype around it, some people have even jokingly said that a “data scientist is a statistician who lives in San Francisco”.
In this blog post, I will talk about this recent discussion of what a data scientist is, which has led some people to claim that there are some easy signs for recognizing a bad or fake data scientist. In particular, I will discuss the blog post “10 signs of a bad data scientists” and explain why this discussion is going the wrong way.
10 signs of a bad data scientists
In that blog post, the authors claim that a data scientist must be good at math/statistics, good at coding, good at business, and know most of the tools, from Spark, Scala, and Python to SAS and Matlab.
What is wrong with this? The obvious problem is that it is not really necessary to know all these technologies to analyze data. For example, a person may never have to use Spark to analyze data, and will rarely use all these technologies in the same environment. But more importantly, that blog post seems to imply that a single person should replace a team of three people: (1) a coder with data mining experience, (2) a statistician, and (3) someone good at business. Do we really need a person who can replace these three people? The problem with this idea is that a person will always be stronger on one of these three dimensions and weaker on the other two. A person who possesses skills in all three dimensions, and is also excellent in all three, is quite rare. Hence, I call this person the data scientist unicorn: someone so skilled that they can replace a whole team.
In my opinion, instead of trying to find that unicorn, the discussion should be about creating a good data science team, consisting of three people who are respectively good at statistics, computer science, and business, and who each have enough background or experience in the other areas to be able to discuss with the other team members. Thus, perhaps we should move the discussion from what makes a good data scientist to what makes a good data science team.
An example
I will now use my own case as an example to illustrate this point. I am a researcher in data mining. I have a background in computer science, and I have worked for 8 years on designing highly efficient data mining algorithms. I am very good at this (I am even the founder of the popular Java SPMF data mining library). But I am less good at statistics, and I have almost no knowledge of business.
But this is fine because I am an expert at what I do, in one of these three dimensions, and I can still collaborate with a statistician or someone good at business when I need to. I should not replace a statistician. And it would be wrong to ask a statistician to do my job of designing highly efficient algorithms, as it requires many years of programming experience and excellent knowledge of algorithmics.
A risk with the vision of a “data scientist unicorn” who is good at everything is that such a person may not be an expert at any of those things.
Perhaps a solution for training good data scientists is those new “data science” degrees that aim to teach a little bit of everything. I will not say whether these degrees are good or not, as I have not looked at these programs. But there is always the risk of training people who can do everything but are not experts at anything. Thus, why not try instead to build a strong data science team?
==
Philippe Fournier-Viger is a full professor and also the founder of the open-source data mining software SPMF, offering more than 110 data mining algorithms.
If you like this blog, you can tweet about it and/or follow my Twitter account @philfv to get notified about new posts.