How to get started with Data Science using R

R is a lingua franca of data science and one of the most popular languages for learning it. Once the choice is made, beginners often get lost looking for a learning path and end up facing a signpost like the one below.

signpost

Photo by George Chadwik (some rights reserved)

In this blog post I would like to lay out a clear, structured approach to learning R for data science. It will help you get started quickly on your data science journey with R.

The Wrong Turn

"OK, I am going to learn and master R by learning all the packages. Then I will get to the data science theory and start doing my projects."

Never head this way! Learning R is similar to learning APIs: focus on incremental learning instead of mastery.

First Things First

  • Download & Installation: Download a suitable binary distribution of R for your operating system.
  • Get RStudio: RStudio is a leading IDE for R development. It helps you code more productively, with plots, package management and the editor all in one place.

Become a LearneR

Take small steps by understanding the syntax, data structures and libraries in R.

In the post 5 Steps to Get Started With Data Science I have provided a list of resources to learn R. These resources are a good starting point and will help you learn incrementally. R has a strong user community with an ever-growing list of packages and support. Once you are comfortable with the basics, start exploring the packages for different data science tasks. Learn how to import data sets into R using packages like readr and data.table.

Understand the R community and seek help actively from Stack Overflow. Sign up and follow R-bloggers for new snippets to try out.

Data Science with R

Now that you are familiar with R, the next step is using R to solve Data Science problems. Below is a list of common data science tasks and how you could use R to achieve them.

Data Loading

Getting the data into R is the first step of the data science process. R has a wide range of options for reading data in all kinds of formats. Below is a list of packages commonly used for data loading.

  • readr
  • data.table
  • XLConnect
  • rjson
  • XML
  • foreign

Data Analysis & Visualization

After getting the data into the R environment, the next step in the data science workflow is to do simple exploratory analysis. Below is a list of wonderful R packages that help simplify data analysis and preparation.

  • dplyr – Simple and elegant data manipulation
  • data.table – Handles big data with ease. Great package for faster data manipulation/analysis
  • ggplot2/ggvis – Awesome packages for data visualization

Data Preparation

Data preparation is an important step in the data science workflow. Clean data is really hard to find; often the data needs to be transformed and molded into a form on which we can run models.

Modelling & Evaluation

Now the data is ready to hit the machine learning workbench. Below is a set of resources and packages which could help you through the model building process.

Communicate Results

Now that you have some insights from the data, remember that they are lost without effective communication. R Markdown is a great tool for reporting your insights and sharing them with fellow data scientists.

Start Small .. Build Big

Understanding algorithms and building your first recipes for common data science tasks is the small step. This is where most tutorials, courses and blogs stop. You could achieve this small step in a weekend and then focus on the next big step: building your own repository of small projects. This is how you build up your skills in R and data science.

snowball

Picture Credit – Kamyar Adl

Deliberate practice on more data sets and different kinds of challenges will take you to the next step: mastery. Go for it!

How to smartly choose your Data Science toolkit – R vs Python

I often get questions from readers who are caught in the tool conundrum:

Should I choose R or Python to start learning data science?

tools-image

If you are new to the world of data science and have not tried either of these languages, it is easy to run into this question. In this post we shall carefully examine both with the needs of data science in mind.

R

R is built by data scientists for data scientists, so doing data analysis, building models and communicating results are its core strengths.

The major power of R is its user community, which offers extensive support and has built up the CRAN package repository.

A few great packages for you to start exploring in R would be:

  • ggplot2/ggvis – Data Visualization
  • dplyr – Data Munging and Wrangling
  • data.table – Data Wrangling
  • caret – Machine Learning workbench
  • reshape2 – Data Shaping

R has a steep learning curve and is generally built for standalone systems, although there are several packages to speed up the process.

If you are a beginner, I would strongly recommend downloading RStudio, which is the de facto IDE for R.

Python

Python is a great programming language and is very easy to start with. You can easily perform most data science tasks, like data wrangling, munging and visualization, and of course it has a great machine learning library: scikit-learn. If you are already familiar with Java/C++, it is straightforward to get started with Python.
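To give a feel for how little code a first wrangling step takes, here is a minimal sketch that uses only Python's standard library. The data, the column names and the helper `mean_by_group` are made up for illustration; in a real project a library such as pandas would usually handle this for you.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample data standing in for a real CSV file (assumption).
RAW_CSV = """city,temperature
Pune,31.0
Pune,29.0
Oslo,4.0
Oslo,
Oslo,6.0
"""


def mean_by_group(csv_text, group_col, value_col):
    """Return {group: mean of value_col}, skipping rows with a blank value."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[value_col].strip()
        if value:  # basic cleaning: drop missing measurements
            groups[row[group_col]].append(float(value))
    return {name: mean(values) for name, values in groups.items()}


result = mean_by_group(RAW_CSV, "city", "temperature")
print(result)
```

The sketch reads the rows, drops the one with a missing temperature, and averages the rest per city, which is the load, clean and summarize loop you will repeat in almost every project.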

According to the data science survey conducted by O’Reilly, almost 40% of data scientists use Python to solve their problems. Python also has a great community of open source packages.

Below is a list of packages which are great for data science applications:

  • seaborn – Data Visualization
  • pandas – Data Munging and Wrangling
  • NumPy/SciPy – Data Wrangling/Representation
  • scikit-learn – Machine Learning library

Clash of the Titans: Python vs R

injustice-1

Photo Credit

It is indeed a clash of the titans of the data science world. Here are a few guidelines you could use to choose a language.

Popularity:

Python is one of the top programming languages. Let us get down to the numbers in the data science community. I got the data from here. The graph below shows that the popularity of Python is increasing in the data science community. (The plot was done in R 😛)

DS StackExchange_R_vs_Python

Personal Choice

Coming from an engineering background, I chose Python as it felt more natural to me. Later I explored R to understand its strengths and support. The best way is to start with one and then learn the other to take advantage of its strengths.

Learning Curve

R has a steeper learning curve than Python, but deliberate practice can help you climb the ladder faster. To learn R, I deliberately chose to use it for my projects, thereby gaining knowledge and experience with it.

Type of Problem

Often the type of problem you are solving has a bearing on the choice of language. If the problem at hand calls for thorough data analysis, I choose R; but if I need to write quick scripts to get things done or scrape the web, it is simpler to use Python.
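As an illustration of the "quick script" side, here is a small sketch that pulls the links out of a piece of HTML using only Python's standard library. The sample page, the URLs and the names `LinkCollector` and `extract_links` are hypothetical; a real scraper would first download the page and would often use a dedicated parsing library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
  <a href="https://example.com/r">R notes</a>
  <a name="top">no link here</a>
  <a href="https://example.com/python">Python notes</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE)
print(links)
```

The whole thing fits on one screen and needs no installation, which is the kind of convenience that makes Python attractive for small utility tasks.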

Communication

An often overlooked but important data science activity is communicating results and exchanging ideas. IPython notebooks are a beauty in themselves, providing the best interface to communicate, closely followed by R Markdown.

Verdict

As a data scientist it is always best to be open to learning more tools. Preferring one over the other may be a good way to start, but it is always better to know the tools and use each to its strengths.