Comprehensive Learning Path: Data Science in Python
Journey from a Python noob to a Kaggler in Python
So, you want to become a data scientist, or maybe you already are one and want to expand your tool repository. You have landed at the right place. The aim of this page is to provide a comprehensive learning path to people new to Python for data analysis. This path provides a comprehensive overview of the steps you need to learn to use Python for data analysis. If you already have some background, or don't need all of the components, feel free to adapt your own path and let us know how you made changes to the path.
Step 0: Warming up
Before starting your journey, the first question to answer is:
Why use Python?
Or
How would Python be useful?
Watch the first minutes of this talk from Jeremy, founder of DataRobot, at PyCon Ukraine to get an idea of how useful Python could be.
Step 1: Setting up your machine
Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download Anaconda from Continuum.io (http://www.continuum.io/downloads). It comes packaged with most of the things you will ever need. The major downside of taking this route is that you will need to wait for Continuum to update their packages, even when there might be an update available to the underlying libraries. If you are a starter, that should hardly matter.
If you face any challenges in installing, you can find more detailed instructions for various operating systems here.
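Once the installation finishes, a quick check that the main libraries can be imported can save some confusion later. The snippet below is only a minimal sketch: the `check_libraries` helper and the `LIBRARIES` list are names made up for this illustration, not part of Anaconda.

```python
import importlib

# Libraries used later in this path; this list is an assumption for illustration.
LIBRARIES = ["numpy", "scipy", "matplotlib", "pandas", "sklearn"]


def check_libraries(names):
    """Return a dict mapping each module name to its version, or None if missing."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The module is not installed in this environment.
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = getattr(module, "__version__", "unknown")
    return report
```

Calling `check_libraries(LIBRARIES)` returns a dictionary you can print; any entry that comes back as `None` is a library that still needs installing.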
Step 2: Learn the basics of the Python language
You should start by understanding the basics of the language, libraries and data structures. The Python track from Codecademy is one of the best places to start your journey. By the end of this course, you should be comfortable writing small scripts in Python, and also understand classes and objects.
Specifically learn: lists, tuples, dictionaries, list comprehensions, dictionary comprehensions
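To give a feel for what these look like, here is a small sketch touching each item in that list. All the data is made up for illustration.

```python
# Hypothetical sample data, made up for illustration.
scores = [72, 88, 95, 61]          # list: ordered and mutable
point = (3, 4)                     # tuple: ordered and immutable
ages = {"ann": 31, "bob": 25}      # dictionary: key -> value mapping

# List comprehension: keep only the scores of 70 or more, doubled.
doubled_high = [s * 2 for s in scores if s >= 70]

# Dictionary comprehension: map each name to the age next year.
next_year = {name: age + 1 for name, age in ages.items()}

# Tuples unpack directly into names.
x, y = point
```

Comprehensions are worth practicing early, since they replace many short loops you would otherwise write by hand.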
Assignment: Solve the Python tutorial questions on HackerRank. These should get your brain thinking about Python scripting.
Alternate Resources: If interactive coding isn't your style of learning, you can also look at the Google Class for Python. It is a 2-day class series and also covers some of the parts discussed later.
Step 3: Learn Regular Expressions in Python
You will need to use them a lot for data cleansing, especially if you are working on text data. The best way to learn regular expressions is to go through the Google class and keep this cheat sheet handy.
Assignment: Do the baby names exercise.
If you still need more practice, follow this tutorial for text cleaning. It will challenge you on various steps involved in data wrangling.
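As a taste of regular expressions in data cleansing, here is a minimal sketch using Python's built-in `re` module. The messy input lines and the pattern are made up for illustration and are not taken from the exercises above.

```python
import re

# Hypothetical messy records, made up for illustration.
raw_lines = [
    "  Rank: 1   Name:  Michael ",
    "Rank:2 Name: Jessica",
    "no match on this line",
]

# Named groups capture the rank (digits) and the name (letters only).
pattern = re.compile(r"Rank:\s*(?P<rank>\d+)\s+Name:\s*(?P<name>[A-Za-z]+)")

records = []
for line in raw_lines:
    match = pattern.search(line)
    if match:  # skip lines that do not fit the expected shape
        records.append((int(match.group("rank")), match.group("name")))

# re.sub collapses runs of whitespace into single spaces.
cleaned = re.sub(r"\s+", " ", raw_lines[0]).strip()
```

Here `records` ends up holding only the two lines that fit the pattern, which is the usual first step in cleaning text: extract what matches and set the rest aside.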
Step 4: Learn scientific libraries in Python: NumPy, SciPy, Matplotlib and Pandas
This is where the fun begins! Here is a brief introduction to various libraries. Let's start practicing some common operations.
- Practice the NumPy tutorial thoroughly, especially NumPy arrays. This will form a good foundation for things to come.
- Next, look at the SciPy tutorials. Go through the introduction and the basics and do the remaining ones based on your needs.
- If you guessed the matplotlib tutorials next, you are wrong! They are too comprehensive for what we need here. Instead, look at this IPython notebook up to the animations section.
- Finally, let us look at Pandas. Pandas provides DataFrame functionality (like R) for Python. This is also where you should spend good time practicing. Pandas will become the most effective tool for all mid-size data analysis. Start with a short introduction, 10 Minutes to pandas. Then move on to a more detailed tutorial on pandas.
You can also look at exploratory data analysis with Pandas and data munging with Pandas.
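To make "common operations" a little more concrete, here is a minimal sketch with a NumPy array and a pandas DataFrame. It assumes both libraries are installed (they ship with Anaconda), and the numbers and city names are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: vectorized arithmetic on a whole array, no explicit loop.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
heights_m = heights_cm / 100.0
mean_height = heights_m.mean()

# pandas: a small DataFrame built from hypothetical, made-up data.
df = pd.DataFrame(
    {
        "city": ["Pune", "Delhi", "Pune", "Delhi"],
        "sales": [10, 20, 30, 40],
    }
)

# Boolean filtering keeps the rows where the condition holds.
big_sales = df[df["sales"] > 15]

# groupby + sum aggregates sales per city.
sales_by_city = df.groupby("city")["sales"].sum()
```

Filtering and grouping like this cover a large share of day-to-day data munging, so they are good operations to practice first.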
Additional Resources:
- If you need a book on Pandas and NumPy, see "Python for Data Analysis" by Wes McKinney.
- There are a lot of tutorials as part of the Pandas documentation. You can have a look at them here.
Assignment: Solve this assignment from the CS109 course from Harvard.
Step 5: Effective Data Visualization
Go through this lecture from CS109. You can ignore the initial 2 minutes, but what follows after that is awesome! Follow this lecture up with this assignment.
Step 6: Learn Scikit-learn and machine learning
Now, we come to the meat of this entire process. Scikit-learn is the most useful library in Python for machine learning. This is a brief overview of the library. Go through the machine learning lectures of the CS109 course from Harvard. You will go through an overview of machine learning, supervised learning algorithms like regressions, decision trees and ensemble modeling, and unsupervised learning algorithms like clustering. Follow individual lectures with the assignments from those lectures.
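To show the difference between the two families mentioned above, here is a minimal sketch with scikit-learn: one supervised model (linear regression) and one unsupervised model (k-means clustering). It assumes scikit-learn and NumPy are installed, and the tiny datasets are made up for illustration, so treat it as a first look rather than a realistic workflow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised learning: fit a regression on made-up points from y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
regressor = LinearRegression().fit(X, y)
prediction = regressor.predict(np.array([[4.0]]))[0]

# Unsupervised learning: cluster made-up points into two groups, no labels given.
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_
```

Both models follow the same fit-then-use pattern, which is a large part of what makes the library pleasant to learn.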
Additional Resources:
- If there is one book you must read, it is Programming Collective Intelligence: a classic, but still one of the best books on the subject.
- Additionally, you can also follow one of the best courses on machine learning, from Yaser Abu-Mostafa. If you need a more lucid explanation of the techniques, you can opt for the machine learning course from Andrew Ng and follow the exercises in Python.
- Tutorials on Scikit Learn
Assignment: Try out this challenge on Kaggle
Step 7: Practice, practice and practice
Congratulations, you made it!
You now have all the technical skills you need. It is a matter of practice, and what better place to practice than competing with fellow data scientists on Kaggle? Go, dive into one of the live competitions currently running on Kaggle and give all that you have learnt a try!
Step 8: Deep Learning
Now that you have learnt most of the machine learning techniques, it is time to give deep learning a shot. There is a good chance that you already know what deep learning is, but if you still need a brief intro, here it is.
I am myself new to deep learning, so please take these suggestions with a pinch of salt. The most comprehensive resource is deeplearning.net. You will find everything here: lectures, datasets, challenges, tutorials. You can also give the course from Geoff Hinton a try in a bid to understand the basics of neural networks.
P.S. In case you need to use Big Data libraries, give Pydoop and PyMongo a try. They are not included here, as the Big Data learning path is an entire topic in itself.
Learning Path on R: Step by Step Guide to Learn Data Science on R
One of the common problems people face in learning R is the lack of a structured path. They don't know where to start, how to proceed, or which track to choose. Though there is an overload of good free resources available on the Internet, this can be overwhelming as well as confusing at the same time.
After digging through endless resources and archives, here is a comprehensive learning path on R to help you learn R from scratch. This will help you learn R quickly and efficiently. Time to have fun while lea-R-ning!
Step 0: Warming up
Before starting your journey, the first question to answer is: Why use R? Or how would R be useful?
Watch this short video from Revolution Analytics to get an idea of how useful R could be. Incidentally, Revolution Analytics just got acquired by Microsoft.
Step 1: Setting up your machine
Now that you have made up your mind, it is time to set up your machine. The easiest way to proceed is to just download the basic version of R and the detailed installation instructions from CRAN (Comprehensive R Archive Network).
You can then install various other packages. There are 9000 packages in R, so this can get confusing. Accordingly, we will guide you to install just the basic R packages first. Here is a link to understand packages, called CRAN Views. You can accordingly select the subtype of packages that you are interested in.
How to install a package: http://www.r-bloggers.com/installing-r-packages/
Some important packages to learn about: http://blog.yhathq.com/posts/10-r-packages-i-wish-i-knew-about-earlier.html
You should install these three GUIs with all dependent packages.
- Rattle for data mining [Link] or install.packages("rattle", dep=c("Suggests"))
- R Commander for basic statistics [Link] or install.packages("Rcmdr")
- Deducer (with JGR) for data visualization [Link]
You should also install RStudio. It makes R coding much easier and faster, as it allows you to type multiple lines of code, handle plots, install and maintain packages, and navigate your programming environment much more productively.
Assignment:
- Install R and RStudio.
- Install the packages Rcmdr, rattle, and Deducer. Install all suggested packages or dependencies, including the GUI.
- Load these packages using the library command and open these GUIs one by one.
Step 2: Learn the basics of the R language
You should start by understanding the basics of the language, libraries and data structures. The R track from DataCamp is one of the best places to start your journey. Especially see the free Introduction to R course at https://www.datacamp.com/courses/introduction-to-r. By the end of this course, you should be comfortable writing small scripts in R, and also understand data analysis. Alternatively, you can also see Code School for R at http://tryr.codeschool.com/
If you want to learn R offline in your own time, you can use the interactive package swirl from http://swirlstats.com
Specifically learn: read.table, data frames, table, summary, describe, loading and installing packages, data visualization using the plot command
Assignment:
- Subscribe to http://r-bloggers.com for the daily newsletter about the R project.
- Create a GitHub account at http://github.com
- Learn to troubleshoot the package installation above by Googling for help.
- Install the swirl package and learn R programming (see above).
- Learn from http://datacamp.com
Alternate Resources: If interactive coding isn't your style of learning, you can also look at the two-minute tutorials on R at http://www.twotorials.com/. It is a video series and also covers some of the parts discussed here. You can also read a comprehensive blog post on the functions to help you clear a job interview in R here.
Step 3: Learn data manipulation
You will need these techniques a lot for data cleansing, especially if you are working on text data. The best way is to go through the text manipulation and numerical manipulation exercises. You can learn about connecting to databases through the RODBC package and writing SQL queries against data frames through the sqldf package.
Assignment:
- Read about the split, apply, combine approach for data analysis from the Journal of Statistical Software.
- Try learning about the tidy data approach for data analysis.
- Practice connecting to an RDBMS (for example, a MySQL database) through R.
- You really should do a data quality exercise.
- Bored with analyzing numbers alone? Try sports analysis with a cricket analysis using R.
If you still need more practice, you can sign up for a $25/month subscription at DataCamp, which gives you access to all tutorials. Please go through the slides here for plyr.
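The split, apply, combine approach mentioned in the assignment is language-independent, so here is a minimal sketch of the idea in plain Python rather than R. The cricket-style records are made up for illustration; in R, packages such as plyr wrap these three steps for you.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (team, runs), made up for illustration.
records = [
    ("India", 250),
    ("Australia", 300),
    ("India", 310),
    ("Australia", 280),
]

# Split: group the runs by team.
groups = defaultdict(list)
for team, runs in records:
    groups[team].append(runs)

# Apply: compute a summary for each group independently.
# Combine: collect the per-group results into one dictionary.
average_runs = {team: mean(values) for team, values in groups.items()}
```

Once the three steps are clear in this bare form, the R tutorials on the same approach should be easier to follow.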
Step 4: Learn specific packages in R: data.table and dplyr
This is where the fun begins! Here is a brief introduction to various libraries. Let's start practicing some common operations.
- Practice the data.table tutorial thoroughly here. Print and study the cheat sheet for data.table.
- Next, you can have a look at the dplyr tutorial here.
- For text mining, start with creating a word cloud in R and then learn through this series of tutorials: Part 1 and Part 2.
- For social network analysis, read through these pages.
- Do sentiment analysis using Twitter data. Check out this and this analysis.
- For optimization through R, read here and here.
Additional Resources:
- If you need a book on R for business analytics, see "R for Business Analytics" by Ajay Ohri.
- If you need a book on learning R quickly, see http://statmethods.net
Step 5: Effective data visualization through ggplot2
- Read about Edward Tufte and his principles on how to make (and not make) data visualizations here. Especially read about data-ink, lie factor and data density.
- Read about the common pitfalls in dashboard design by Stephen Few.
- For learning the grammar of graphics and a practical way to do it in R, go through this link from Dr. Hadley Wickham, creator of ggplot2 and one of the most brilliant R package creators in the world today. You can download the data and slides as well.
- Interested in visualizing data for spatial analysis? Go through the amazing ggmap package.
- Interested in making animations through R? Look through these examples. The animate package would help you here.
- Slidify will help you supercharge your graphics with HTML5.
Step 6: Learn Data Mining and machine learning
Now, we come to the most valuable skill for a data scientist, which is data mining and machine learning. You can see a very comprehensive set of resources on data mining in R at http://www.rdatamining.com/. The rattle package really helps here with an easy-to-use graphical user interface (GUI). You can see a free, open source, easy to understand book here at http://togaware.com/datamining/survivor/index.html
You'll go through an overview of algorithms like regressions, decision trees, ensemble modeling and clustering. You can also see the various machine learning options available on R by seeing the relevant CRAN view here.
Additional Resources:
- If there is one book on data mining using R that you want, it is the one on Rattle.
- You can learn about time series forecasting from the booklet "A little book for time series in R".
- Some machine learning resources in R are here. You can enroll in a free course here.
Step 7: Practice, practice and practice
Congratulations, you made it!
You now have all the technical skills you need.
- It is a matter of practice, and what better place to practice than competing with fellow data scientists on Kaggle? This practice contest will help you start: https://www.kaggle.com/c/titanic-gettingStarted
- Read about a more advanced Kaggle analysis here http://0xdata.com/blog/2014/09/r-h2o-domino/
- Stay in touch with what your fellow R coders are doing by subscribing to http://www.r-bloggers.com/
- Interact with them on Twitter using the #rstats hashtag.
- Stuck somewhere? This website is great for learning R quickly, as it gives you just the right amount of information.
Step 8: Advanced Topics
Now that you have learnt most of data analytics using R, it is time to give some advanced topics a shot. There is a good chance that you already know many of these, but have a look at these tutorials too.
- For using R with Hadoop, see this tutorial on using RHadoop.
- A tutorial on using R with MongoDB.
- Another nice tutorial on Big Data analysis using R in the NoSQL era.
- You can make interactive web applications using Shiny from RStudio.
- Interested in learning how R and Python syntax relate? Read through this guide.
P.S. In case you need to use Big Data a lot, also have a look at the RevoScaleR package from Revolution Analytics. It is commercial, but academic usage is free. An example project is given here.