PyCon 2014: Machine learning applications take up nearly half of the Python tutorials

From http://www.infoq.com/cn/news/2014/07/pycon-2014

This year's PyCon opened on April 9 in Montreal, Canada. Python is widely used in academia thanks to its rapid prototyping capabilities, and the conference's official website has recently released videos and slides from the tutorial track, nearly half of which concern data mining and machine learning. This article summarizes those tutorials.
How to formalize a scientific problem and then use Python to analyze it
Python has many very powerful data mining tools, such as the interactive development environment IPython, the machine learning library scikit-learn, and the network analysis library NetworkX. But few tutorials show how to cast one's own problem in a good form so that the data mining process can be carried out step by step in a scientifically methodical way. The author of this tutorial went through that painful process personally and wants to share the experience with more enthusiasts. The tutorial is aimed at people who are interested in data analysis but do not know where to start.
Getting started with machine learning
A very introductory lecture that presents the basic concepts of machine learning, such as what a model is, and the basic steps of machine learning: setting goals and evaluation criteria, collecting and cleaning data, exploring and analyzing it, training a model, and testing the model. Using a linear model as the running example, the presenter shows how to do machine learning in Python with the scikit-learn library. Finally, the presenter surveys applications of machine learning such as handwriting recognition, search engines, Facebook friend recommendation, fraud detection, weather forecasting, and face recognition.
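For readers who want a concrete picture of those steps, here is a minimal sketch of the fit/predict workflow with scikit-learn's LinearRegression; the synthetic dataset is invented for illustration and is not from the tutorial.

```python
# A minimal sketch of the train/test workflow on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x + noise (invented for illustration)
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X.ravel() + 0.1 * rng.randn(100)

# Split into train and test sets, matching the "train model, test model" steps
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)                              # train the model
print("R^2 on test set:", model.score(X_test, y_test))   # test the model
```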
An introduction to Bayesian statistical learning
Bayesian statistical models are becoming more common and important, but beginners still lack good introductory material. This tutorial aims to give Python developers an interactive primer. It starts with a few simple programs that demonstrate the ideas of Bayesian statistical learning, then applies them to several concrete examples. The material comes from the O'Reilly book Think Bayes.
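In that spirit, here is a minimal sketch of a Bayesian update written in plain Python (toy code in the style of, but not taken from, Think Bayes): estimating a coin's bias from observed flips.

```python
# Hypotheses: possible values of P(heads), on a coarse grid
hypotheses = [i / 10 for i in range(11)]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform prior

def update(belief, outcome):
    """Multiply each hypothesis by its likelihood, then renormalize."""
    posterior = {}
    for h, p in belief.items():
        likelihood = h if outcome == "H" else 1 - h
        posterior[h] = p * likelihood
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

belief = prior
for outcome in "HHTH":                    # observed coin flips
    belief = update(belief, outcome)
print(max(belief, key=belief.get))        # most probable bias
```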
An introduction to information retrieval
In this era of information overload, how to obtain useful information efficiently is a question many people ponder. This tutorial teaches how to implement a search engine from scratch to get the data we need. Organized as a project, it introduces some simple theory of search, then walks through using the Whoosh library to write code that indexes and retrieves Wikipedia articles, so that we learn how to find the data we want amid noisy data.
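As a taste of what the project involves, here is a minimal indexing-and-search sketch with Whoosh; the two toy documents are placeholders for the tutorial's Wikipedia corpus.

```python
# A minimal sketch of building and querying a Whoosh index.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

schema = Schema(title=TEXT(stored=True), content=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Index a couple of toy documents (placeholders, not real Wikipedia data)
writer = ix.writer()
writer.add_document(title="Python", content="Python is a programming language.")
writer.add_document(title="Montreal", content="PyCon 2014 was held in Montreal.")
writer.commit()

# Search the index
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("Montreal")
    for hit in searcher.search(query):
        print(hit["title"])
```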
Exploring machine learning with the scikit-learn library
Machine learning is an important branch of computer science that focuses on using previously observed data to make predictions about new data. Machine learning techniques are applied widely and deeply in many fields, from search engine optimization and stock price prediction to research on the universe itself. This tutorial introduces the core concepts of machine learning, starting from the broad categories of supervised and unsupervised learning, stepping into core techniques such as classification, regression, clustering, and dimensionality reduction, and then covering the more common classic algorithms as well as advanced topics such as feature selection and model validation. The entire process uses the scikit-learn API, with applied examples on real data. scikit-learn's strengths are its clean, uniform, well-documented programming interface and its implementations of a large number of classic, practical machine learning algorithms. After completing the tutorial, participants will have a clearer understanding of both machine learning itself and the scikit-learn library.
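To complement the supervised example earlier in this article, here is a minimal sketch of the unsupervised techniques named above (dimensionality reduction and clustering), using the classic iris dataset purely for illustration.

```python
# Dimensionality reduction with PCA, then clustering with k-means.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = load_iris().data

# Reduce 4 features to 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

# Group the samples into 3 clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])
```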
Mining social network APIs in the IPython Notebook
Social networking sites such as Twitter, Facebook, and LinkedIn have enormous research value beyond serving the everyday needs of ordinary users. This tutorial, which grew out of a book on social network data mining, explains how to explore and tap the high-value data behind social networking sites.
The tutorial divides the entire mining process into four steps, as follows:
- Hypothesis: the first step of a data science experiment is setting a goal, namely answering a question or validating a hypothesis;
- Acquisition: acquiring and storing the data needed during validation;
- Analysis: analyzing the data with basic data mining techniques;
- Summary: presenting the mining results in a simple, clear form.
The entire tutorial runs on a Vagrant virtual machine preloaded with the required third-party software, so participants can get started with almost no setup effort and stay focused on the data mining itself.
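As one possible illustration of the acquisition step, here is a minimal sketch using the third-party `twitter` package; the OAuth credentials are placeholders you would obtain from Twitter's developer site, and the hashtag is arbitrary.

```python
# Acquisition: fetch recent tweets through the Twitter search API.
import twitter

# Placeholder credentials -- replace with your own keys
auth = twitter.OAuth("OAUTH_TOKEN", "OAUTH_SECRET",
                     "CONSUMER_KEY", "CONSUMER_SECRET")
api = twitter.Twitter(auth=auth)

results = api.search.tweets(q="#pycon", count=100)

# A trivial first look at the acquired data
for status in results["statuses"][:5]:
    print(status["user"]["screen_name"], "-", status["text"])
```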
Participating in Kaggle data mining contests with Python
This tutorial is designed to let data mining enthusiasts understand and take part in data mining competitions. It starts by quickly teaching some classical algorithms through a few simple practice questions and datasets, then analyzes Kaggle contests in depth: choosing the right features, writing a correct algorithm, and finally submitting results. After this roughly three-hour tutorial, enthusiasts will have a good grasp of five top mining algorithms and will have applied one or two of them to Kaggle contests such as Facebook's recruiting competition, the GE flight optimization competition, and the StumbleUpon classification contest.
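For orientation, here is a minimal sketch of the standard Kaggle workflow: load the training and test CSVs, fit a model, and write a submission file. The file names and the "id"/"target" columns are placeholders for a real contest's data.

```python
# Load data, fit a classifier, and produce a submission CSV.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")   # placeholder file names
test = pd.read_csv("test.csv")

features = [c for c in train.columns if c not in ("id", "target")]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["target"])

submission = pd.DataFrame({"id": test["id"],
                           "target": model.predict(test[features])})
submission.to_csv("submission.csv", index=False)
```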
Using Python to build a data crawler system
This tutorial aims to teach Python developers some introductory data crawling techniques, surveying three major crawler systems and then showing how to use them interactively. After this tutorial we can crawl several different kinds of content sites and even submit form data automatically; it also covers crawling APIs and data in CSV and XML formats. The tutorial closes with current industry best practices for crawler systems.
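Since the article does not name the three systems, here is a minimal crawling sketch with the generic requests and lxml libraries; the URL and the extracted elements are placeholders.

```python
# Fetch a page and extract structured data from its HTML.
import requests
from lxml import html

resp = requests.get("https://example.com/")   # placeholder URL
resp.raise_for_status()

tree = html.fromstring(resp.content)
# Extract the text and target of every link on the page
for link in tree.xpath("//a"):
    print(link.text_content(), link.get("href"))
```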
Python applications for sociologists
With the arrival of the big data era, more and more traditionally thorny sociological questions can be tested against large datasets. This tutorial takes a World Bank dataset as an example and explains the process in detail: first it shows how to import data from a CSV file, then how to visualize the data with the matplotlib plotting library and present it as a time series.
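Here is a minimal sketch of that CSV-to-time-series workflow using pandas and matplotlib; the file name and column names are placeholders for whatever World Bank indicator the tutorial actually uses.

```python
# Import a CSV, index it by date, and plot a time series.
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names
df = pd.read_csv("worldbank.csv", parse_dates=["year"])
series = df.set_index("year")["gdp_per_capita"]

series.plot()                        # draw the time series
plt.ylabel("GDP per capita (USD)")
plt.title("World Bank indicator over time")
plt.show()
```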
How to build a lightweight recommender system using PyData
A recommender system is a piece of software that analyzes large amounts of transaction or user data to recommend relevant products, information, and content to users; such systems are widely used in daily life. This tutorial introduces the concepts and definitions of recommender systems, then builds a lightweight recommender interactively. Along the way we get to know NumPy and pandas, Python's scientific computing libraries.
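One common lightweight approach, sketched below under the assumption of a small explicit-ratings matrix (the numbers are invented toy data), is item-based collaborative filtering with cosine similarity, using only NumPy and pandas.

```python
# Item-based collaborative filtering via cosine similarity.
import numpy as np
import pandas as pd

# Rows are users, columns are items, values are ratings (0 = unrated)
ratings = pd.DataFrame(
    [[5, 3, 0, 1],
     [4, 0, 0, 1],
     [1, 1, 0, 5],
     [0, 0, 5, 4]],
    columns=["A", "B", "C", "D"])

# Cosine similarity between item columns: dot products over norm products
M = ratings.values.astype(float)
norms = np.linalg.norm(M, axis=0)
item_sim = pd.DataFrame(M.T @ M / np.outer(norms, norms),
                        index=ratings.columns, columns=ratings.columns)

# Items most similar to item "A" -- candidates to recommend to its fans
print(item_sim["A"].drop("A").sort_values(ascending=False))
```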
Plotting and visualization with matplotlib
When many people first hear of matplotlib, they assume the images it produces are too plain and need to be polished afterward with tools such as Photoshop. This tutorial aims to correct that misconception and shows how to polish visualizations step by step with matplotlib's color, ticker, cm, axes, and other modules. Starting from a real geographic information example, by drawing points and polygons, the tutorial teaches how to configure each part of a matplotlib chart, with particular emphasis on its drawing model, including subplots and layout, and then explains how to set markers, lines, label fonts, positions, and so on.
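Here is a minimal sketch of the kind of styling the tutorial covers, touching the color, ticker, cm, and axes machinery on synthetic data invented for illustration.

```python
# A two-panel figure with customized colors, tick formatting, and layout.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm, ticker

x = np.linspace(0, 4 * np.pi, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: line styling and tick label formatting
ax1.plot(x, np.sin(x), color=cm.viridis(0.3), linewidth=2)
ax1.xaxis.set_major_formatter(ticker.FormatStrFormatter("%.1f"))
ax1.set_title("Styled line", fontsize=10)

# Right panel: points colored through a colormap
pts = np.random.RandomState(0).rand(100, 2)
ax2.scatter(pts[:, 0], pts[:, 1], c=pts[:, 0], cmap=cm.plasma)
ax2.set_title("Colormapped points", fontsize=10)

fig.tight_layout()
plt.show()
```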
Building a personalized Hacker News reader with machine learning
Hacker News, the startup news site run by the famous incubator Y Combinator, is loved by programmers. Still, the volume of information on the site is a lot for any individual, so the author used the scikit-learn machine learning library to build himself a personalized article selector, with the goal of seeing only the articles he likes. The author divides the machine learning process into four parts: acquiring data, processing data, training and debugging the model, and using the model. First he fetches the site's pages locally via HTTP requests and lxml, then extracts textual features such as title, author, rank, vote count, and comments, along with labels marking unwanted articles. Next, simple natural language processing techniques such as bag of words, n-grams, and stop word removal turn the text into input features for the model. Finally, scikit-learn's support vector machine classifier learns his preferences, and the resulting model predicts which new articles he will like.
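Here is a minimal sketch of the bag-of-words-plus-SVM step described above; the titles and like/dislike labels are invented stand-ins for the author's scraped Hacker News features.

```python
# Bag-of-words features (with bigrams and stop word removal) feeding a
# linear support vector machine.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

titles = ["Show HN: my Python profiler", "Celebrity gossip roundup",
          "Scaling PostgreSQL at work", "Top 10 vacation spots"]
liked = [1, 0, 1, 0]   # 1 = an article the user liked (toy labels)

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    LinearSVC())
clf.fit(titles, liked)

# Predict whether a new article would be liked
print(clf.predict(["A new Python web framework"]))
```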
IPython in depth: efficient interaction and parallelization
The IPython project began in 2001 as nothing more than a friendlier Python command line. Over more than ten years it has evolved into an interactive development environment with many powerful features. Today's IPython consists of a kernel that executes user code plus a communication protocol based on ZeroMQ message queues, which makes it possible to serve multiple clients at the same time: the enhanced terminal command line, a Qt-based graphical console with built-in image display, and a web-based notebook system that can present rich text, charts, and even mathematical formulas. This tutorial starts from IPython's design ideas and architecture and explains its high-performance, low-latency parallel computing environment, in which compute processes communicate over ZeroMQ message queues and the copying of large data such as NumPy arrays is optimized. The environment can be driven interactively or run in batch mode.
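Here is a minimal sketch of driving that parallel environment. In 2014 this lived in IPython.parallel; the sketch uses ipyparallel, its present-day successor, and assumes a cluster has already been started (e.g. with `ipcluster start -n 4`).

```python
# Connect to a running cluster and distribute work across its engines.
import ipyparallel as ipp

rc = ipp.Client()                  # connect to the running cluster
view = rc.load_balanced_view()     # schedule tasks across engines

# Map a function over the engines; ZeroMQ carries the messages
squares = view.map_sync(lambda x: x ** 2, range(10))
print(squares)
```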