How to get started with Data Science using R

R is the lingua franca of data science and one of the most popular languages for learning it. Once the choice is made, beginners often get lost looking for a learning path and end up in front of a signboard like the one below.

signpost

Photo by George Chadwik (some rights reserved)

In this blog post I would like to lay out a clear, structured approach to learning R for data science. This will help you get started quickly on your data science journey with R.

The Wrong Turn

Ok, I am going to learn and master R by learning all the packages. Then get to the data science theory and start doing my projects.

Never head this way! Learning R is similar to learning APIs – Focus on incremental learning instead of mastery.

First Things First

  • Download & Installation: Download a suitable binary distribution of R for your operating system.
  • Get RStudio: RStudio is a leading IDE for R development. It will help you code more productively, with plots, package management and the editor all in one place.

Become a LearneR

Take your little steps by understanding the syntax, data structures and libraries in R

In the post 5 Steps to Get Started With Data Science I have provided a list of resources to learn R. These resources are a good starting point and will help you with incremental learning. R has a strong user community with an ever-growing list of packages and support. Once you are comfortable with the basics, start exploring the packages for different data science tasks. Learn how to import data sets into R using packages like readr and data.table.

Understand the R community and seek help actively from Stack Overflow. Sign up and follow R-bloggers for new snippets to try out.

Data Science with R

Now that you are familiar with R, the next step is using R to solve Data Science problems. Below is a list of common data science tasks and how you could use R to achieve them.

Data Loading

Getting the data into R is the first step of the data science process. R has a wide range of options for reading data in all kinds of formats. Below is a list of common packages well suited for data loading.

  • readr
  • data.table
  • XLConnect
  • rjson
  • XML
  • foreign

Data Analysis & Visualization

After getting the data into the R environment, the next step in the data science workflow is simple exploratory analysis. Below is a list of wonderful R packages that help simplify data analysis and preparation.

  • dplyr – Helps you do simple and elegant data manipulation
  • data.table  – Handles big data with ease. Great package for faster data manipulation/analysis
  • ggplot2/ggvis – Awesome packages for data visualization

Data Preparation

Data preparation is an important step in the data science workflow. Clean data is really hard to find; often the data needs to be transformed and molded into a form on which we can run models.

Modelling & Evaluation

Now the data is ready to hit the machine learning workbench. Below is a set of resources and packages which could help you through the model building process.

Communicate Results

Now that you have some insights from the data, remember that they are lost without effective communication. R Markdown is a great tool for reporting your insights and sharing them with fellow data scientists.

Start Small .. Build Big

Understanding algorithms and building your first recipes for common data science tasks is the small step. This is where most tutorials, courses and blogs stop. You could achieve this small step in a weekend and then focus on the next big step: building your repository of small projects. This is how you build up your skill in R and data science.

snowball

Picture Credit – Kamyar Adl

Deliberate practice on more data sets and different kinds of challenges will take you to the next step – Mastery. Go for it!!

How to smartly choose your Data Science toolkit – R vs Python

Often I get questions from readers who are caught in the tool conundrum:

Should I choose R or Python to start learning data science?

tools-image

If you are new to the world of data science and have not tried either of these languages, it is easy to land on this question. In this post we shall carefully examine both with the needs of data science in mind.

R

R is built by data scientists for data scientists, so doing data analysis, building models and communicating results are its core strengths.

The major power of R is its user community, which offers extensive support and has developed the package base on CRAN.

A few great packages for you to start exploring in R would be:

  • ggplot2/ggvis – Data Visualization
  • dplyr – Data Munging and Wrangling
  • data.table – Data Wrangling
  • caret – Machine learning workbench
  • reshape2 – Data Shaping

R has a steep learning curve and is generally built for stand-alone systems, although there are several packages to speed up the process.

If you are a beginner, I would strongly recommend downloading RStudio, which is the de facto IDE for R.

Python

Python is a great programming language and is very easy to start with. You can easily perform most data science tasks like data wrangling, munging and visualization, and of course it has a great machine learning library: scikit-learn. If you are already familiar with Java/C++, it is straightforward to get started with Python.

According to the data science survey conducted by O’Reilly almost 40% of the data scientists use Python to solve their problems. Python also has a great community of open source packages.

Below is a list of packages which are great for data science applications:

  • Seaborn – Data Visualization
  • Pandas – Data Munging and Wrangling
  • Numpy/Scipy – Data Wrangling/ Representation
  • Scikit-learn – Machine Learning library

Clash of the Titans: Python vs R

injustice-1

Photo Credit

It is indeed a clash of the titans of the data science world. Here are a few guidelines which you could use to choose the language.

Popularity:

Python is one of the top programming languages. Let us get down to the numbers in the data science community. I got the data from here. The graph below shows that the popularity of Python is increasing in the data science community. (The plot was done in R 😛)

DS StackExchange_R_vs_Python

Personal Choice

Coming from an engineering background, I chose Python as it felt more natural to me. Later I explored R to understand its strengths and support. The best way is to start with one and learn the other to make use of its strengths.

Learning Curve

R has a steep learning curve compared to Python, but deliberate practice can help you climb the ladder faster. In order to learn R, I deliberately chose to use R for my projects, thereby gaining knowledge and experience with it.

Type of Problem

Often the type of problem you are solving has a bearing on the choice of language. If the problem at hand calls for thorough data analysis, I choose R; but if I need to write quick scripts to get things done or scrape the web, it is simpler to use Python.

Communication

An often overlooked but important data science activity is communicating results and exchanging ideas. IPython notebooks are a beauty in themselves, providing the best interface to communicate, closely followed by R Markdown.

Verdict

As a data scientist it is always best to be open to learning more tools. Preferring one over the other may be good to start with, but it is always better to know the tools and use them to their strengths.

5 Steps to Get Started With Data Science

As a beginner it is easy to get lost in the details and the sheer overwhelming nature of learning machine learning.

To_study

Cross the beginner’s block

I get a lot of mails from readers asking:

  • How do I get started with Machine Learning?
  • I do not have a background in math, how can I learn data science?

Materials in blog posts and courses are often targeted at intermediate learners. But remember, it is easier to get started without the math. You will still need the math, but it can come later. Below is a step-by-step guide to get started, but remember..

Screen Shot 2016-08-22 at 9.38.33 AM

Be Curious

When I first started with machine learning, I read anything that had data science or machine learning in the title. I often did not understand most of it, but slowly I built up chunks of knowledge which I later assembled. The important skill here is to be curious and to believe.

Learn a tool

Never get overwhelmed by the choice of tool. Just pick one! Beginners are often divided between R and Python. Here is a list of resources to get started with the tools.

Python

R

Get your hands dirty

The best place for a good data source would be the UCI Machine Learning Repository. The repository is an inventory of many small real-world examples. Start with the simple Iris Data Set.

Learn to explore the data and try the following with the tool of your choice. Preparing data for data science problems is an art in its own right. Below is the list of techniques you should try your hand at.

  1. Wrangle

    Start by dicing the data into subsets. Understand the variables and their types. Take a look at the variables that might impact the machine learning problem at hand.

  2. Transform

    Try simple data transformations like aggregation, decomposition (splitting the variables) and log transforms.

  3. Visualize

    A key part of solving data problems is understanding the data at hand. Visualization is a wonderful way to understand the data and the hidden gold in it.

  4. Question

    The majority of data science problems are about looking for answers. Practice asking questions and looking for the answers in the data.
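The four techniques above can be tried on even a tiny table. Here is a minimal sketch in plain Python, assuming a handful of hand-typed rows shaped like the Iris data set. The column names, the values and the text bar chart are illustrative choices, not something loaded from the UCI repository.

```python
import math
from collections import defaultdict
from statistics import mean

# A few hand-typed rows shaped like the Iris data set (illustrative values).
rows = [
    {"sepal_length": 5.1, "petal_length": 1.4, "species": "setosa"},
    {"sepal_length": 4.9, "petal_length": 1.4, "species": "setosa"},
    {"sepal_length": 7.0, "petal_length": 4.7, "species": "versicolor"},
    {"sepal_length": 6.4, "petal_length": 4.5, "species": "versicolor"},
    {"sepal_length": 6.3, "petal_length": 6.0, "species": "virginica"},
    {"sepal_length": 5.8, "petal_length": 5.1, "species": "virginica"},
]

# 1. Wrangle: dice the data into subsets, one per species.
by_species = defaultdict(list)
for row in rows:
    by_species[row["species"]].append(row)

# 2. Transform: aggregate (mean per species) and apply a log transform.
mean_petal = {
    species: mean(r["petal_length"] for r in group)
    for species, group in by_species.items()
}
log_petal = [math.log(r["petal_length"]) for r in rows]

# 3. Visualize: a crude text bar chart, one '#' per 0.5 units of mean petal length.
for species, value in mean_petal.items():
    print(f"{species:<12}{'#' * round(value / 0.5)} {value:.2f}")

# 4. Question: which species has the longest petals on average?
longest = max(mean_petal, key=mean_petal.get)
print("Longest petals on average:", longest)
```

Once this feels comfortable, the same four steps carry over to the full data set in the tool of your choice.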

Applied Data Science Process

Understand the process behind solutions to data science problems. The most common approach to solving data science problems is as follows.

  1. Define the problem: Understand the problem that is being solved.
  2. Analyze data: Analyze the data for patterns and information that could be used to develop a model.
  3. Data preparation:  Prepare the data for modelling.
  4. Model: Start applying machine learning algorithms and validate.
  5. Evaluate:  Evaluate the performance of the model and choose the best performing model.
  6. Deploy: Implement the model in production.

Practice, Practice, Practice

Once you have started learning the tools, got your hands on the data and practiced the applied data science process, it is important to rinse and repeat this process on different datasets across different domains.

Diving Deep

As you start learning the tricks of the trade, it is important to get down into the details. The next step is to dive deeper into the algorithms and to understand why and how they work. Understand when one is better than another, and under what circumstances each performs better.

Summary

In this post you learned a step-by-step approach to learning data science, along with simple ways to learn and get better at applied data science.

Deep Learning – Simplified

The buzzword has been around for a few years in the analytics world, with companies investing heavily to fund the research. From understanding human perception to building self-driving cars, deep learning comes with a package of great promises. I was wondering how I could put these concepts in simple terms, which led to this blog post. So let's get started.

What is deep learning?

The foundations of deep learning take their inspiration from the ability of human beings to perceive things as they appear to them. Human perception is a miracle of nature. Well, let's try answering this: which of the following pictures has a bicycle in it?

dl_first_pic

Well, I quite liked the banana! But I am sure you would still go for pictures 2 and 3. It would have taken you a second to look at the images and make the classification unconsciously. But underlying this decision, we have involved an entire series of visual cortices (V1 – V5) in the brain; V1 alone contains around 140 million neurons, with billions of connections between them forming a network. Yes, we are living supercomputers walking on this planet, yet we fail to recognize the complexity of the problem involved. Computer scientists who undertook research on computer vision and natural language processing marveled at this feat of the human brain and set out to help computers mimic it. This brought renewed interest in the idea of Artificial Neural Networks (ANN) in the 1980s.

Why not write an algorithm?

Before deep learning, computer scientists wrote machine learning algorithms that take in features from images or text. The major drawback of such approaches is that programmers have to constantly tell computers which features to use from the raw data. This puts the burden of feature engineering on the programmer, and it is no wonder that the algorithms often do not perform well. It also becomes increasingly difficult for traditional machine learning algorithms to learn complex patterns.

Unlike traditional machine learning algorithms, deep learning (based on ANN) has gone past this barrier, as it uses training examples for the system to learn the features on its own.

What!! But how?

To get started on how neural networks work, it is good to understand how an artificial neuron called a perceptron functions. A perceptron (schematic representation below) takes in inputs (Xi) and has specific weights (Wi) which assign importance to each of these inputs. The neuron computes a function based on the weighted inputs.

ANN_Percentron

For the sake of simplicity, let us assume that the computed function is linear in nature. If the output of the computed function exceeds a specific threshold (or bias), the neuron outputs 1; otherwise it outputs 0. Let's look at the example below.

Let us assume we are on a weekly exercise plan to burn 1400 calories (threshold) per week. Let us assume the inputs x1, x2, x3 (walking, running, swimming) burn 100, 150 and 200 calories respectively. It is up to us to plan the weekly schedule of activities to maintain a healthy lifestyle, including the number of days (weights) we choose to do each exercise. Let us model this problem with a linear perceptron. As you can see, by choosing the weights (number of days we do an activity) and the threshold (calories to burn this week) we decide the output (healthy week).

ann_example
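The exercise plan can also be written as a few lines of Python. This is a minimal sketch, assuming the neuron outputs 1 when the weighted sum exceeds the threshold and 0 otherwise. The function name perceptron and the two weekly plans are made up for illustration.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Calories burned per session: walking, running, swimming (x1, x2, x3).
calories = [100, 150, 200]
threshold = 1400  # weekly calorie target

# Weights are the number of days we do each activity (hypothetical plans).
busy_week = [2, 1, 1]    # 2*100 + 1*150 + 1*200 = 550
active_week = [3, 4, 3]  # 3*100 + 4*150 + 3*200 = 1500

print(perceptron(calories, busy_week, threshold))    # 0: not a healthy week
print(perceptron(calories, active_week, threshold))  # 1: a healthy week
```

Changing only the weights flips the output, which is exactly the knob a learning algorithm turns.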

While the idea of a perceptron is inspired by the human brain, we are still far from understanding and replicating how the brain works. A neural network is formed by hooking these neurons to each other as shown below, which helps in making complex decisions. The structure of a neural network can be divided into 3 layers, as shown below. The leftmost layer is called the input layer and contains the input neurons; the rightmost layer is called the output layer and contains the output neurons. Unlike in the figure below, each of these layers can have multiple nodes. The inner layer of neurons is called the hidden layer. The first vanilla version of artificial neural networks is called the Multi Layer Perceptron (MLP).

neural_network

A Multi Layer Perceptron (MLP) is a type of neural network referred to as a supervised network because it requires a label, or desired output, in order to learn. As research in machine learning matured, each of these nodes was replaced by more sophisticated algorithms than the perceptron. The neural network has to be trained initially to learn the abstract model. In a neural network, the output of one layer is fed forward as input to the next layer. This approach is called feed-forward propagation.
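Feeding outputs forward from layer to layer can be sketched in plain Python. This is only an illustration under stated assumptions: the sigmoid activation, the 2-3-1 layer sizes and all the weights are hand-picked here and do not come from any trained network.

```python
import math

def sigmoid(z):
    """Squash a number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Compute one layer: each node takes a weighted sum of all inputs."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, node_weights)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Pass the inputs through every layer, feeding each output forward."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical, hand-picked weights: 2 inputs -> 3 hidden nodes -> 1 output.
hidden = ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.2])

prediction = feed_forward([1.0, 0.0], [hidden, output])
print(prediction)  # a list holding a single value between 0 and 1
```

Each layer is just a list of nodes, and each node is a list of weights plus a bias, so adding another hidden layer means appending one more (weights, biases) pair.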

 

Each of these nodes learns a function and feeds its output forward to the next layer in the network. A node does not always learn its function correctly. The difference between the expected output and the current output of a node is called the error (shown below). These errors prevent neural networks from learning effectively. Research and usage of neural networks seemed to have hit a trough until the advent of the back-propagation algorithm.

backprop

The back-propagation algorithm acts as an error-correcting mechanism at the level of each neuron, thereby helping the network to learn effectively. The derivation of how back-propagation helps to solve the problem is beyond the scope of this post.
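Without going into the derivation, the error-correcting idea can be sketched for a single neuron. This is a simplified illustration (a gradient-descent update for one sigmoid neuron), not full back-propagation through hidden layers. The learning rate, starting weights and training example are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, inputs, expected, learning_rate=0.5):
    """Nudge the weights and bias to shrink the error on one example."""
    output = sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)
    error = expected - output
    # Negative gradient of the squared error with respect to the weighted sum.
    delta = error * output * (1.0 - output)
    new_weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias, abs(error)

weights, bias = [0.1, -0.2], 0.0    # made-up starting point
inputs, expected = [1.0, 1.0], 1.0  # one made-up training example

errors = []
for _ in range(50):
    weights, bias, err = train_step(weights, bias, inputs, expected)
    errors.append(err)

print(f"error before: {errors[0]:.3f}, after: {errors[-1]:.3f}")
```

Back-propagation applies the same kind of correction to every neuron in the network by passing the error backwards from the output layer.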

Deep Neural Network

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers of units between the input and output layers. Each node learns an effective function and transfers the knowledge forward. Let us consider the working of one of the deep network architectures used for image classification. Pixels from the input image are converted to feature maps (smaller or more abstract representations). These abstract representations are then used by the next layer to form simpler forms, and finally a more generic representation.

feature_map_layers

Let us take a look at an example of image classification of human faces, as shown below. The input layer learns to recognize lines or edges of the face (initial feature map). This information is fed forward to a layer of nodes in the hidden layer, which in turn learns more abstract concepts like the eyes and nose. The final layer then assimilates this learning to form the object model of the face.

learning_by_nn

Research [4] on the topic of deep learning is constantly growing, and there are constant updates to the architectures and algorithms used in deep neural networks. Deep learning has been extensively applied in the fields of image recognition, natural language processing, audio recognition and information retrieval. Details of all the architectures and algorithms are beyond the scope of this post and will be handled in a different post.

Hype or Real?

The concept of neural networks that learn has actually existed for decades, but there were major problems building larger networks. Deep learning is partly hyped and could be seen as a re-branded version of neural networks. But the recent excitement about deep learning and neural networks comes from their ability to perform better than kernel algorithms on standard data sets. Thanks to high-performance computation units like GPUs and better parallel algorithms, neural networks have grown bigger and deeper.

References

  1. Neural networks and deep learning
  2. Deep learning – Wikipedia, the free encyclopedia
  3. Deng, Li et al.; Tutorial Survey of Architectures, Algorithms and Applications of Deep Learning
  4. Deep Learning, Self-Taught Learning and Unsupervised Feature Learning

Amazing list of Data Science Courses

Data science and analytics are becoming more popular with companies, colleges and people. Many organizations and universities are starting to offer data science courses to help people learn data science. I have put together a list of data science courses, in no particular order, that would be of interest to you. This is not an exhaustive list and it will keep growing, so please feel free to comment if your favorite courses are missing from the list. Happy learning!!

  1. Machine Learning specialization
    • Machine Learning Foundations: A Case Study Approach
    • Machine Learning: Regression
    • Machine Learning: Classification
    • Machine Learning: Clustering & Retrieval
    • Machine Learning: Recommender Systems & Dimensionality Reduction
    • Machine Learning Capstone: An Intelligent Application with Deep Learning
  2. Machine Learning for Data Analysis
  3. Practical Data Science
  4. Data Science A-Z: Real-Life Data Science Exercises Included
  5. Machine Learning on Coursera
  6. Analytics Edge
  7. Harvard Data Science Course
  8. Mining Massive Datasets
  9. Making Sense with Data
  10. Data Science Specialization: Introduced by Coursera and Johns Hopkins University, this is a comprehensive yet gentle introduction to the world of data science. The course offers the below topics.
  11. Genomic Data Science specialization: The course covers the concepts and tools to understand, analyze, and interpret data from next generation sequencing experiments.
    • Introduction to Genomic Technologies
    • Genomic Data Science with Galaxy
    • Python for Genomic Data Science
    • Algorithms for DNA Sequencing
    • Command Line Tools for Genomic Data Science
    • Bioconductor for Genomic Data Science
    • Statistics for Genomic Data Science
  12. Intro to Data Science
  13. Introduction to Computational Thinking and Data Science (MITx)
  14. Data Analysis and Interpretation Specialization
  15. Executive Data Science Specialization
  16. Applied Data Science with Python
  17. Applied Data Science with R
  18. Data Analysis in Python with Pandas
  19. Introduction to Python for Data Science
  20. Big Data applications and Analytics
  21. Statistics and Data Science in R from Beginner to Advanced
  22. Apache Hadoop – Machine Learning and Hadoop Eco System
  23. Data Analysis and Statistical Inference
  24. Driving Business Results with Big Data
  25. Data Mining specialization
    • Pattern Discovery in Data Mining
    • Text Retrieval and Search Engines
    • Cluster Analysis in Data Mining
    • Text Mining and Analytics
    • Data Visualization
    • Data Mining Capstone
  26. Intro to Hadoop and MapReduce

Crack Your Next Data Science Interview

Preparing for a data science interview might seem like a huge mountain to climb, with a huge variety of topics piled up in front of you. But it isn’t as hard as it seems.

Mountain

The time is now!!

With such a wide range of topics to cover, you need to set aside time and prepare meticulously. Interviews can range from explaining logistic regression to a 5-year-old to tuning the parameters of a model. Set aside time every day and religiously sit down to prepare for the interview topics. With consistent effort it is easier to get to the top of the mountain. From experience, the topics below are the ones you should cover to ace your next data science interview.

With such a wide variety of topics it is entirely possible to get sucked into one of these holes. This makes it necessary to fix SMART goals and prepare towards these goals.

get_sucked

Below are the steps which I personally followed to prepare for my interviews.

  1. Review your background and prepare a list of topics you may want to cover. As data scientists come from different backgrounds such as political science, statistics and software engineering, it is important to understand your weak links and to work on strengthening them.
  2. Write down your goals and prepare a schedule to work on the small weak links. By writing down your goals you create a subconscious wiring to work towards them.
  3. Make a commitment by setting time aside every day to religiously study the topics on your weak links list.
  4. Attend interviews: Attending interviews is another way to get feedback, understand your weak links and iterate over them.
  5. Review your goals: Set weekly review meetings with yourself to review your current preparation.

While these steps are important, below are the topics which are essential for a data scientist to know.

Basic Mathematics

To become a good data scientist one must have the ability to deliver insights from the data. You will be able to deliver insights with a decent understanding of mathematical concepts. Go through a refresher on linear algebra, probability and statistics.

Asking the right questions

This is learned more by practice than by being taught. Many employers look for curiosity and the ability of the candidate to ask questions that can extract insights from the data. Take up a totally unknown data set, practice asking questions and look for the answers to them. With this approach you will improve your questions and strengthen your ability to find the answers.

Applied machine learning

It is important to understand the basic algorithms in machine learning. Interviewers focus on how the candidate formulates the problem and on their ability to translate a business problem into an analytical one. If you are new to machine learning, a good place to start understanding these concepts would be to enroll in a course or learn from the web. Do check out the data science specialization at Coursera and the Nanodegrees at Udacity. These are great places to start.

Learn white board coding

This is similar to a software engineering position, where the interviewers test the candidate’s ability to define, analyze, solve and test the problem at hand. It is important to brush up on algorithms and data structures. This has been a part of many product-oriented data science interviews, where data scientists are expected to be good programmers. There are tons of websites and books to get you started here.

Get the right tools

Though there is a wide range of tools to express analytics, the top choices of many data scientists have been Python and R. Both languages have great machine learning libraries. These tools would be good to know and have in your toolbox.

Be a data hacker

Learn data wrangling and munging techniques in the language of your choice. This helps you get up to speed with any given data set.

Understand databases

Relational databases are a part of every industry and it is important to learn the basics of databases and how to write efficient queries.

Learn Data Visualization

The best way to start understanding the data is to visualize it. Choose and learn visualization techniques in a tool of your choice. Though it may not be asked about during an interview, it is a must-have skill for a good data scientist.

Practice

Practicing the theoretical concepts you learn will help you develop a better understanding of them and also reveal your weaknesses quickly.

Research about the role

Along with preparing for the interview, it is essential to align your skills with the type of data science role you are looking for. Think about what kind of data scientist you want to be and which type of team you would like to be a part of. Ask appropriate questions to understand the requirements of the role and tailor your preparation to them. Take time to understand the job description, and look up the profiles of the people who will be interviewing you and of people performing similar roles at the company to understand their backgrounds. This will help you understand the type of questions you can expect during the interview. It is important to identify the type of role the employer is looking to fill and focus your preparation in that direction. Remember to work on your weaknesses for the chosen type of role. Below are the simplified types of data scientist employers commonly look for.

Business Savvy Data Scientist

The business savvy data scientist focuses on building analytic solutions to help business users and final decision makers. They help to understand the underlying problems of a company’s marketing campaign, to understand churn, or to find out what interests the customers. Communication and storytelling play a major role in these types of roles, as they involve communicating the value to non-technical people. They do not have to build complex models, but must unearth the value from the data to answer the questions of why and how.

Product Savvy Data Scientist

The other type of data scientist focuses on building products to help businesses. They build highly complex models using sophisticated statistical and machine learning algorithms. They are very focused on improving the performance of the models where it has a direct impact on the company’s product. They are required to possess good statistical and solid computer science skills.

Hope the above steps help you crack your next data science interview. Don’t wait to make your next leap.

5-handling-success-e1369159215250

 

Resources to get Started