Is machine learning the actual focus of data scientists’ everyday work? Do you need to learn all the things to be a data scientist? And, most importantly: Do data scientists have a sense of humor?
On the Data Science Mixer podcast, I always ask our expert guests the same question in our “Alternative Hypothesis” segment: “What’s one thing that people think is true about data science or about being a data scientist that you have found to be incorrect?” (Be sure to check out the first roundup of debunked myths, too.)
Amazingly, we always get a fresh response. It seems there are enough myths about data science out there that there’s always something new for our guests to highlight.
Check out these responses from some of our recent episodes. It’s fascinating to see what these experts highlight from their experiences in data science.
One thing that people often think is true is that if you’re a data scientist, you’re just doing machine learning all the time, and that’s it. And I’ll say that I’ve done a ton of exploratory data analysis, and I also did quality analysis on models built by outside companies.
The machine learning part, when you understand it, is actually very easy and very fast. The hard part is understanding and learning the data, as my boss says … sometimes called cleaning the data! But it’s really about understanding it and learning it — and you end up also cleaning it up while you’re in there.
For me, it would be that data science still is not magic. Different companies have different levels of maturity. We’ll talk to a company and they’ll have audio that’s recorded at an extremely low bit rate with a ton of background noise. You can barely understand what somebody’s saying. And there’s an expectation that somehow you would get accurate transcription from that.
There’s certain things you can do, obviously; you can run a bunch of noise reduction. You can try and pop the signal a little bit. But at the end of the day, it’s an algorithm somebody wrote, operationalized, turned into a model, trained — and so while you’re looking for signal in that noise, your noise can’t be garbage.
I would say a good percentage of the time, there’s just a bunch of corrupt stuff, and there’s very little you can do with that. I think dealing with those types of data sources, like grainy video feeds, is going to be a real thing. That’s going to be a reality for quite some time to come.
Learn a lot of basics, learn a lot of depth in a specific area — but you can’t, and shouldn’t, expect to develop depth in all areas.
The biggest myth I see is people think you need to learn everything before you can call yourself a data scientist. And there’s no such thing as learning everything. I’m years into this, and my bookmarked list of things to learn has gotten longer and not shorter.
Learning the basics, and understanding how to evaluate models, and understanding the statistics behind what you’re doing are all really important. Understanding the data going into your model and having that domain knowledge are really important.
But you don’t have to know every technique. You could find a subset of skills that you want to become good at. If there’s a certain industry you want to go into, make sure you have that domain knowledge and specialize.
So learn a lot of basics, learn a lot of depth in a specific area, but you can’t, and shouldn’t, expect to develop depth in all areas. It’s impossible. You don’t have to conquer all of that.
I frequently hear a belief that we can’t arm knowledge workers with data science tools and capabilities, that somehow providing people with tools will lead them to cause damage and mayhem in their business. I sometimes hear this from IT teams. I sometimes hear this even from data science teams. I’ve found very few examples, if any, where this has actually proven to be true.
The reality is what we’re talking about here is math. This is like thinking about when the calculator first came out. There were probably people who said, “We can’t give people calculators. They could make terrible mistakes because they don’t understand the math that’s going on inside the calculator. We should make them continue to do their math by hand, like accountants using the abacus.”
Most of us now use calculators and have for quite some time. We realized that having a calculator has probably meant fewer mistakes, not more. It’s not that you can’t make a mistake with a calculator. But you’re probably going to make fewer mistakes with the calculator than if you were on an abacus. I would propose that data science is very similar, and too many people worry about it.
We need to document the real value that we get from doing all of this quote-unquote “grunt work.”
A lot of people think that data science is just for research, for finding new models. It’s not to say that that doesn’t happen, but there’s a lot more work that goes into this whole process before you even get to the modeling stage or pre-processing of the data. People have a conception that this is grunt work, and maybe they sort of feel discouraged by that, especially people who are entering the field — it’s like 80% or 90% what they call grunt work.
We need to start debunking that a little bit, being a bit more transparent about perhaps what our day looks like — documenting the real value that we get from doing all of this quote-unquote “grunt work.” That’s where I’ve found so many more insights than trying out a whole bunch of models. Just really getting in with the data and mucking around, seeing where there are missing pieces of information, why certain assumptions have been made, how that data was captured.
Data scientists have no sense of humor.
People often think that data scientists have no sense of humor, that they’re all very strait-laced and very nerdy, and that’s not really true. I’ve hung out with some data scientists who are pretty funny and kind of a little out there in their behavior. So the stereotypes are beginning to break down.