In this post, we'll try to define data science by looking at something that's really well known in the field: the data science Venn diagram. If you want, you can think of it as the ingredients of data science. First, thanks go to Drew Conway, the guy who came up with it; if you want to see the original article, you can go to this address. What Drew said is that data science is made of three things, and we can draw them as overlapping circles because it is the intersection that's important.
Figure: The Data Science Venn Diagram
Here on the top left is coding, or computer programming, or as he calls it, hacking. On the top right is statistics, or stats and mathematics, or quantitative abilities in general. On the bottom is domain expertise: intimate familiarity with a particular field of practice, such as business, health, or education. The intersection in the middle is data science. So it is the combination of coding, statistics and math, and domain knowledge.
Now let's say a little more about coding. Coding is important because it helps you gather and prepare the data. A lot of data comes from novel sources, isn't necessarily ready for you to gather, and can be in very unusual formats. It can take some real creativity to get the data from those sources into your analysis.
Now there are a few kinds of coding that are important. For instance, there's statistical coding. A couple of major languages here are R and Python, two free, open-source programming languages. R is designed specifically for data. Python is a general-purpose programming language, but well adapted to data.
The ability to work with databases is important too, because that's where the data is. The most common language there is SQL, usually pronounced "sequel", which stands for Structured Query Language.
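To make the database point a bit more concrete, here is a minimal sketch of what asking a database a question can look like. It uses Python's built-in `sqlite3` module with an in-memory database; the table name `results`, the function `average_score_by_field`, and the sample rows are all made up for illustration.

```python
import sqlite3


def average_score_by_field(rows):
    """Load (field, score) rows into an in-memory SQLite table and
    return the average score per field, ordered by field name."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table, created only for this example
        conn.execute("CREATE TABLE results (field TEXT, score REAL)")
        conn.executemany("INSERT INTO results VALUES (?, ?)", rows)
        query = """
            SELECT field, AVG(score)
            FROM results
            GROUP BY field
            ORDER BY field
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Made-up sample data, not taken from any real study
sample_rows = [("health", 70.0), ("health", 90.0), ("education", 60.0)]
print(average_score_by_field(sample_rows))
```

The SQL itself is the `SELECT ... GROUP BY` part; Python is only the wrapper that sends it to the database and collects the answer.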
Figure: Data science components to learn
Also, there's the command-line interface or, if you're on a Mac, what people just call the terminal. The most common language there is Bash, which stands for Bourne-again shell. Searching is important for data scientists too, and regex, or regular expressions, is critical. There's not a huge amount to learn there; it's a small little field. It's sort of like super-powered wildcard searching that makes it possible for you to both find the data and reformat it in ways that are going to be helpful for your analysis.
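Here is a small sketch of the "find it and reformat it" idea, using Python's built-in `re` module. The date format, the pattern `DATE_PATTERN`, and the sample text are assumptions chosen for illustration; real data will likely need a different pattern.

```python
import re

# Assumed input format: dates written as M/D/YYYY or MM/DD/YYYY
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def find_dates(text):
    """Return every substring of text that looks like M/D/YYYY."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


def to_iso_dates(text):
    """Rewrite every M/D/YYYY date in text as YYYY-MM-DD."""
    def reformat(match):
        month, day, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"

    return DATE_PATTERN.sub(reformat, text)


# Made-up sample text
raw = "Visit on 3/7/2021, follow-up on 11/15/2021."
print(find_dates(raw))
print(to_iso_dates(raw))
```

The same pattern does both jobs: `finditer` locates the dates, and `sub` rewrites them into a format that sorts and parses more easily. Note that the pattern only checks the shape of the text, not whether the date is valid.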
Now I'll say a few things about the math. You're going to need things like a little bit of probability, some algebra, and regression, a very common statistical procedure. The reason you need the math is that it helps you choose the appropriate procedures to answer the question with the data that you have and, probably even more importantly, it helps you diagnose problems when things don't go as expected.
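As a rough illustration of how a little algebra sits behind regression, here is a sketch of simple least-squares regression with one predictor, written in plain Python. The function names `fit_line` and `residuals` and the sample numbers are made up for this example; in practice you would likely reach for a statistics library instead.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-products of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Return observed minus predicted y for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up data that lies exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(residuals(xs, ys, slope, intercept))
```

The residuals are where the diagnosing happens: if they are large or show a clear pattern, that is a hint that a straight line may not be the appropriate procedure for the data.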