# Reposted from the WeChat public account: Python Developer
# Question/answer source: Quora
Answer in English by: Roman Trusov
Bole Online column author: Xiaoxiaoli
Link: http://python.jobbole.com/85704/
Bole Online's note: A Quora user asked, "I have 10 free days and want to spend 10 hours a day learning data science. What should I learn?" Bole Online has excerpted Roman Trusov's reply, which is well worth a beginner's attention.
Man, I envy you so much; not everyone gets a chance like this.
Ten days and 100 hours of study time should be spread across the widest possible range of topics. It is also a large investment, so take it seriously; a reasonable goal is to come out of it able to land an internship offer or something similar.
To be honest, I don't think online courses would do much for you, though they can give you a sense of accomplishment. If you already have a solid foundation in math and programming, it is far more interesting to write code, run it on real data, and see the results than to sit through online courses. And you have 100 hours! If you study as hard as you can over the next 10 days and 100 hours, you can acquire perhaps 1% of the knowledge of a world-class expert. The time is short and the task is heavy. Are you ready?
There will be time later to learn the internal implementations of libraries and algorithms; you can't make use of them now. The workload I list below demands your full commitment, and the goal is to give you broad exposure to some of the key tools in this field.
Days 1 and 2
Download the Stack Exchange public data dump: download link (https://archive.org/details/stackexchange) (a VPN may be required in some regions).
Working with this data requires a relational database management system, so Day 1 looks like this:
Install and configure MySQL, and import the data above into the database.
Read up on SQL basics. Spend some time on small exercises to get comfortable manipulating the data. For example, write a script that extracts every question meeting the following criteria: tagged with both Python and SQL, more than three answers, and an accepted answer with more than 10 upvotes. You may find that the script's performance is a problem.
Read about SQL indexes, and learn about hashing, sorting, and so on. Then modify the script above so that it returns results almost instantly.
Write a Python class that runs queries like the one above. This requires learning a Python MySQL driver. You need a tool that pulls data out of the database and presents it in a more convenient form.
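A minimal sketch of such a class, using the standard-library sqlite3 module as a stand-in for a MySQL driver (the table name, columns, and sample rows are invented for illustration; swap in your MySQL connection and real schema):

```python
import sqlite3

class QuestionStore:
    """Thin wrapper that runs a query and returns rows as dictionaries."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.row_factory = sqlite3.Row  # rows behave like dicts

    def top_questions(self, tag, min_answers):
        # Parameterized query: never interpolate user input into SQL.
        rows = self.conn.execute(
            "SELECT title, tags, answer_count, score FROM questions "
            "WHERE tags LIKE ? AND answer_count > ? "
            "ORDER BY score DESC",
            (f"%{tag}%", min_answers),
        ).fetchall()
        return [dict(r) for r in rows]

# Tiny demo schema so the class can be exercised end to end.
store = QuestionStore()
store.conn.execute(
    "CREATE TABLE questions (title TEXT, tags TEXT, answer_count INT, score INT)")
store.conn.executemany(
    "INSERT INTO questions VALUES (?, ?, ?, ?)",
    [("How do list comprehensions work?", "<python>", 5, 42),
     ("JOIN vs subquery?", "<sql>", 4, 17),
     ("What is a metaclass?", "<python>", 2, 99)],
)
print(store.top_questions("python", 3))
```

The same interface works with a MySQL driver; only the connection setup and placeholder style (`%s` instead of `?`) change.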
I'm not sure what counts as mastering the basics, but I think even a novice can complete the tasks above. Some basic Python knowledge is all you need.
Day 2 can be spent learning to read data with pandas and to manipulate numeric data with NumPy. The documentation for these libraries looks intimidating, but you don't need to read all of it. Just learn how to import a CSV file, add columns, extract data, and merge two tables.
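Those three pandas skills fit in a few lines. A sketch with two invented toy CSVs standing in for the StackOverflow dumps:

```python
import io
import pandas as pd

# Two small CSV "tables" standing in for the real data dumps.
questions_csv = "id,title,score\n1,How to merge?,10\n2,What is NumPy?,5\n"
answers_csv = "question_id,answer_count\n1,3\n2,7\n"

questions = pd.read_csv(io.StringIO(questions_csv))
answers = pd.read_csv(io.StringIO(answers_csv))

# Add a derived column ...
questions["popular"] = questions["score"] > 6

# ... and merge the two tables on the question id.
merged = questions.merge(answers, left_on="id", right_on="question_id")
print(merged[["title", "score", "answer_count", "popular"]])
```

With the real dumps, `pd.read_csv` takes a file path instead of a `StringIO` buffer; everything else is identical.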
Day 3
Although it is common practice to run queries against the entire database, it is important to learn how to work with a small sample of the data and still get meaningful results. For example, try randomly sampling some rows from the full dataset and comparing the distribution of their scores with the score distribution over the whole dataset.
Now it's time to go further. You have the entire Stack Exchange dump, but because I know the StackOverflow data best, the rest of this plan uses only StackOverflow. One interesting exercise I can suggest: track how the popularity of programming languages changes over time.
Why is this exercise useful?
How do you extract questions about a programming language itself rather than about a specific technology? (For example, I want questions about Python syntax, not questions about how to use MongoDB with Django.)
The number of tags is very large, so producing a readable visualization requires careful selection of the input data.
You will learn at least one visualization framework.
You will generate a lot of beautiful charts.
Working through this example will naturally lead you to explore more interesting properties of the data. Learning to ask questions is a crucial skill.
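The core of the popularity-over-time exercise is just grouping tag counts by year. A minimal sketch, with invented (year, tags) rows standing in for the query results:

```python
from collections import Counter

# Hypothetical (question_year, tags) rows as they might come out of the DB.
rows = [
    (2014, ["python", "django"]),
    (2014, ["java"]),
    (2015, ["python"]),
    (2015, ["python", "pandas"]),
    (2015, ["java"]),
]

# Count how often each language tag appears per year; filtering against a
# fixed set of languages is one crude answer to the "language vs technology"
# question above.
languages = {"python", "java"}
per_year = {}
for year, tags in rows:
    counts = per_year.setdefault(year, Counter())
    counts.update(t for t in tags if t in languages)

for year in sorted(per_year):
    print(year, dict(per_year[year]))
```

These per-year counts are exactly what you would feed into a plotting library as one line per language.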
Days 4 and 5
Data scientist has been called "the sexiest job of the 21st century". You know what else is sexy? Graph theory!
How do question tags relate to each other? Could you build a graph of related technologies using only the StackOverflow answers? What criterion should you use to measure how similar two tags are? How would you visualize the graph? Have you tried Gephi?
When you're done, add a description to the graph you generated. A picture by itself provides limited value; you need to stare at it until you understand the meaning behind it.
Learn the clustering algorithms (at least K-means and DBSCAN) and the K-nearest-neighbors algorithm. If you want to dig deeper, you can also try various graph algorithms and compute metrics on the graph. I recommend the NetworkX library, plus the relevant parts of scikit-learn, which greatly simplify these tasks.
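One simple answer to "how similar are two tags?" is the Jaccard index over the questions they co-occur on. A self-contained sketch (the tag sets are invented for illustration), producing a weighted edge list of the kind NetworkX or Gephi can ingest:

```python
# Each question contributes the set of tags it carries; two tags are
# "similar" when they often appear on the same questions (Jaccard index).
questions = [
    {"python", "pandas"},
    {"python", "numpy"},
    {"python", "pandas", "numpy"},
    {"java", "spring"},
]

def jaccard(tag_a, tag_b, questions):
    with_a = {i for i, q in enumerate(questions) if tag_a in q}
    with_b = {i for i, q in enumerate(questions) if tag_b in q}
    return len(with_a & with_b) / len(with_a | with_b)

# Weighted edge list: one (tag, tag, similarity) triple per related pair.
tags = sorted({t for q in questions for t in q})
edges = [(a, b, jaccard(a, b, questions))
         for i, a in enumerate(tags) for b in tags[i + 1:]
         if jaccard(a, b, questions) > 0]
for a, b, w in edges:
    print(f"{a} -- {b}: {w:.2f}")
```

With NetworkX you would load these edges via `nx.Graph()` and `add_weighted_edges_from(edges)`, then export to a Gephi-readable format.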
What's the use of doing this?
You will work with data in different formats, such as CSV, Gephi files, edge lists, and so on.
K-means is a useful algorithm; learning it now will pay off later.
When exploring data, discovering meaningful clusters is one of the most important tasks.
Split these two days however suits your situation. I would suggest spending the first day playing with NetworkX and Gephi, and the second day on cluster analysis, because clustering raises some challenging questions. For example, you need to think about: what vector representation should the question tags have so that the distances between them remain meaningful?
Day 6
By Day 6 you should have a basic grasp of the database side. But you haven't touched the text yet; you have only counted words (a bad pun).
Today is for a quick introduction to text analysis. Learning latent semantic indexing is enough; everything you need is in the scikit-learn library, plus a bit of SQL. The general process:
Select the data you want to use
Extract text features with scikit-learn (its TF-IDF vectorizer is recommended).
Label the text. A simple exercise: predict an answer's score from its text. The score is the label, and the TF-IDF vector is the feature vector.
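To demystify the feature-extraction step, here is plain TF-IDF computed by hand on three invented toy "answers"; scikit-learn's TfidfVectorizer automates a smoothed, normalized variant of the same calculation:

```python
import math
from collections import Counter

# Three toy "answers"; in the real exercise these come from the database.
docs = [
    "python list comprehension syntax",
    "python dict syntax",
    "sql join performance",
]

def tf_idf(docs):
    """Plain TF-IDF: tf(t, d) * log(N / df(t)). Rare terms get high
    weights; terms appearing in every document get weight zero."""
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: tf[t] / len(doc) * math.log(n / df[t]) for t in tf})
    return vectors

vectors = tf_idf(docs)
print(vectors[0])
```

Note how "list" (which appears in only one document) outweighs "python" (which appears in two); this is the whole point of the IDF factor.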
It is best to prepare a NumPy-format dataset for each hypothesis. For example:
Predicting the score of an answer.
Classifying answers by topic (you can sample answers from, say, 20 programming languages).
Make absolutely sure the datasets you assemble are clean, and that you know exactly what is in them. That is easier said than done.
Days 7, 8 and 9
You obtained clean datasets the day before: say one for classification and one for prediction (you learned the difference on the fifth day; translator's note: "the fifth day" in the original appears to be a slip). These days are for studying regression models. The scikit-learn library provides a comprehensive set of tools. You should try at least three of the methods below:
Linear models. There are many kinds of linear models. First compare their performance, then read about the best ones and understand how the different models differ. Hint: learn ElasticNet regression. If you have the mathematical background, read Bishop's "Pattern Recognition and Machine Learning", which gives a good explanation of why ElasticNet regression works well. If you are short on time, you can skip it.
Regression trees.
KNN regression. KNN is often surprisingly useful; don't look down on such simple methods.
Ensemble models such as random forests and AdaBoost.
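To make the first and third items concrete, here is a sketch in plain NumPy on a synthetic dataset: ordinary least squares (the simplest linear model) next to a hand-rolled KNN regressor. In practice you would use scikit-learn's LinearRegression and KNeighborsRegressor instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: y is roughly linear in x, plus noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# Linear model: ordinary least squares via a least-squares solve.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# KNN regression: predict the mean of the k nearest training targets.
def knn_predict(x_train, y_train, x_new, k=5):
    idx = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[idx].mean()

print(f"OLS fit: y = {slope:.2f} * x + {intercept:.2f}")
print(f"KNN prediction at x=5: {knn_predict(x, y, 5.0):.2f}")
```

On truly linear data the linear model wins; KNN shines when the relationship is not linear, which is exactly the comparison worth running on your real features.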
The main goal is not to become an expert in these algorithms right away, but to run the code first, use the methods, and only then ask questions.
The same approach applies to classification problems. Think about what metrics you should use to measure the quality of the results. Suppose you were building an intelligent news-ranking platform: how would you evaluate its quality?
Cross-validating every model is essential. Read about K-fold cross-validation, learn how to do it with scikit-learn, and then cross-validate all the models you built earlier.
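The splitting logic behind K-fold is worth seeing once in the open. A minimal sketch of the idea behind scikit-learn's KFold (each sample lands in the validation set exactly once):

```python
# Minimal k-fold index split: partition n samples into k contiguous
# validation folds; the remaining indices form each fold's training set.
def k_fold_indices(n_samples, k):
    folds = []
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

for train, val in k_fold_indices(10, 3):
    print("train:", train, "val:", val)
```

In practice you would train on each `train` index set, score on the matching `val` set, and average the k scores; scikit-learn wraps this whole loop in `cross_val_score`.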
Day 10
Since you want to be a data scientist, this experience cannot skip its most interesting part: presenting your results.
It doesn't matter what form you choose to present them in (you rarely get to choose in the real world): a semi-academic paper, a slide deck, a blog post, even a mobile app; any of these will do. Share your story. Write up what you found in the dataset, the hypotheses you made and the reasons behind them, a brief description of the algorithms you used, the cross-validation results in a clear and concise form, and lots of charts.
You can hardly put too much effort into this part. I promise: if you can really produce a good write-up and show it to someone in a position to hire you, an entry-level offer is within reach.
"Filling": Roman Trusov This answer has 760+ top, and there are a lot of very high reviews. Bo Xiao le excerpts several:
Muhammed Hussain: I didn't ask the question, but after seeing this answer I decided to take 10 days off work to study.
Sethu Iyer: Wow! I never thought 10 days could be so productive!
Tejas Sarma: Now I know what to do during the first 10 days of my summer vacation.
Parisa Rai: After reading this answer, how I wish I were a beginner again. I think every beginner, not just those who happen to have 10 free days, can follow the answerer's advice.
Anastasia Kukanova: You know what else is sexy? This answer!
10 Days and 100 Hours to Learn Data Science