Python + machine learning + crawler

Source: Internet
Author: User
Tags: python web crawler nltk
Python's packages in this area are very complete:

Web crawler: Scrapy (I am not very familiar with it)
Data mining: NumPy, SciPy, Matplotlib, pandas (the first three are industry standards; the fourth is analogous to R)
Machine learning: scikit-learn, LIBSVM (excellent)
Natural language processing: NLTK (excellent)


Python emphasizes programmer productivity and lets you focus on the logic rather than on the language itself.
Can you imagine building a simple search engine from scratch in one afternoon? C++ is obviously a poor fit: most of your time would go into implementing basic data structures and debugging language-level errors. With Python, all you have to do is truly understand the search algorithm; the implementation is then really simple.
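To make the "search engine in an afternoon" claim concrete, here is a minimal sketch of the core of such an engine: an inverted index with AND-style keyword search. The corpus, the function names and the tokenization rule are all assumptions made up for illustration; a real engine would add ranking, stemming and persistent storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical toy corpus, for illustration only.
docs = {
    1: "Python makes web crawlers easy to write",
    2: "C++ is fast but verbose",
    3: "Write the crawler in Python, then optimize",
}
index = build_index(docs)
print(search(index, "python write"))
print(search(index, "crawler"))
```

The whole data structure is a dictionary of sets, which is the point being made: nothing has to be implemented before the search logic itself.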

I think Python is good for algorithmic research in general, not just data mining. Rapid development lets you validate your ideas quickly instead of wasting time on the program itself (imagine spending a week writing C++ and fixing a pile of pointer errors, only to find that the idea itself was wrong). Once you know the algorithm is correct and want it to run faster, you only need to rewrite the performance bottleneck in C++ and embed it in the Python program.
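Before rewriting anything in C++, the bottleneck has to be located. A minimal sketch of doing that with the standard library profiler follows; the function names and the workload are hypothetical, chosen only so that one function clearly dominates the run time.

```python
import cProfile
import pstats


def slow_distance_sum(points):
    """Deliberately naive O(n^2) loop: the hot spot in this toy example."""
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total


def run_experiment():
    """Hypothetical workload: build some points and sum pairwise distances."""
    points = [(i % 17, i % 23) for i in range(300)]
    return slow_distance_sum(points)


profiler = cProfile.Profile()
profiler.enable()
result = run_experiment()
profiler.disable()

# Print the five entries with the largest cumulative time.
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)
```

In the printed table, the function with the largest cumulative time is the candidate for a C++ rewrite; the rest of the program can stay in Python.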

Links:
SciPy Stack (http://www.scipy.org/getting-started.html) - General tasks
Spark (http://spark.apache.org/docs/latest/index.html) - Going large
TensorFlow (http://www.tensorflow.org/) - Going deep
GitHub - jupyter/docker-stacks: Opinionated stacks of ready-to-run Jupyter applications in Docker.


Python is popular because it lets you iterate quickly.

1. Crawling: Scrapy is simple and easy to use. Combined with rq-queue, it makes building a distributed crawler easy. I once used it to crawl the entire Douban friend graph.
2. The algorithms commonly used in data mining all have Python implementations. scikit-learn, which Shao Zhibo mentioned, is the standout: its documentation is clear, and it implements almost all of the common algorithms. We built an event detection system with scikit-learn. The whole system is written in Python, including the machine learning part.
3. I don't know the NLP side particularly well. NLTK is widely used in many university courses.
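The friend-graph crawl in item 1 boils down to a breadth-first traversal driven by a job queue. The sketch below shows that loop with the standard library only. It is not Scrapy or rq code: `FRIENDS` and `fetch_friends` are hypothetical stand-ins for real HTTP requests, and the in-process `deque` plays the role that a shared queue would play in a distributed setup.

```python
from collections import deque

# Hypothetical friend lists standing in for pages fetched from a real site.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def fetch_friends(user):
    """Stand-in for an HTTP request plus parsing; returns a user's friends."""
    return FRIENDS.get(user, [])


def crawl_friend_graph(seed):
    """Breadth-first crawl: return {user: [friends]} for every reachable user."""
    graph = {}
    seen = {seed}
    frontier = deque([seed])  # a shared job queue would take this role
    while frontier:
        user = frontier.popleft()
        friends = fetch_friends(user)
        graph[user] = friends
        for friend in friends:
            if friend not in seen:  # never enqueue the same user twice
                seen.add(friend)
                frontier.append(friend)
    return graph


graph = crawl_friend_graph("alice")
print(sorted(graph))
```

Distributing the crawl mostly means moving `frontier` and `seen` into storage that several workers can share; the loop itself stays the same.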

In industry:
As far as I know, this is how companies use it:

Google: crawler in C++, data mining in C++, NLP in C++. Python is used to process data.
Twitter: All services use Java and Scala; Python is used to write tools and to iterate quickly. For example, colleagues working on search engine algorithms wrote a Python client to test search quality internally. I once wrote a search term recommendation system in Python, including the interfaces and the algorithm, and rewrote it in Java only after it passed testing.

Web crawling:
Python is very powerful for web crawling: it has a large number of libraries, it is simple and easy to get started with, and there are many ways to get the job done.
Introductory notes on web crawling and parsing with Python
I have tried three approaches: urllib2 + BeautifulSoup, XPath, and Scrapy. The first two crawl and parse pages well, and I saved the results in a MongoDB database. Scrapy can output JSON files. PS: Chinese character encoding is quite a headache.
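The parse, fix-encoding, export-JSON pipeline described above can be sketched as follows. This is a Python 3, standard-library-only illustration: `html.parser` stands in for BeautifulSoup or XPath, and the page bytes, URLs and the `LinkExtractor` class are invented for the example. It also shows one common form of the encoding headache: the bytes must be decoded with the page's actual encoding (assumed here to be GBK) before parsing.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href and text of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append({"url": self._href, "title": title})
            self._href = None


# Hypothetical page body, served as GBK bytes rather than UTF-8.
page = '<html><body><a href="/movie/1">电影</a> <a href="/book/2">图书</a></body></html>'
raw_bytes = page.encode("gbk")

# Decode with the page's real encoding before parsing.
html = raw_bytes.decode("gbk")

parser = LinkExtractor()
parser.feed(html)
parser.close()

# ensure_ascii=False keeps Chinese characters readable in the JSON output.
output = json.dumps(parser.links, ensure_ascii=False)
print(output)
```

Decoding the same bytes as UTF-8 would raise an error or produce garbage, which is why the encoding should be taken from the HTTP headers or the page's meta tag rather than assumed.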
Data mining:
Python is a language suitable for data mining.
Machine learning:
I recommend the book "Machine Learning in Action"; it is very good. Python code is simple and elegant, easy to get started with, and backed by many scientific computing packages. If you want efficient, reusable code, Python is a good choice.
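As a small illustration of how compact a learning algorithm can be in Python, here is a k-nearest-neighbors classifier in plain Python. The choice of algorithm, the function name and the toy data are assumptions for this sketch, not something taken from the book or from scikit-learn; it needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train = [
    ((1.0, 1.1), "A"),
    ((1.0, 1.0), "A"),
    ((0.9, 1.2), "A"),
    ((0.0, 0.0), "B"),
    ((0.1, 0.2), "B"),
    ((0.2, 0.1), "B"),
]

print(knn_predict(train, (0.9, 0.9)))
print(knn_predict(train, (0.1, 0.1)))
```

For real work, a library implementation with vectorized distance computation would be the better choice; the sketch only shows that the idea fits in a few lines.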
Natural Language Processing:
NLTK has been in development for more than 10 years and has more than 100,000 lines of code. However, its support for Chinese processing is still incomplete. I recommend the book "Natural Language Processing with Python". Python provides more tools and techniques. Chinese-language material is really scarce. I hope more people will share their own work for others to refer to, unlike now, when a search turns up nothing but content from the "Natural Language Processing with Python" book.
Usage notes and learning materials for NLTK with Chinese text, to help you get started
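One reason Chinese needs extra work is that the text has no spaces between words, so whitespace tokenization does not apply. The sketch below contrasts English word counts with a crude Chinese baseline, overlapping character bigrams. It uses only the standard library as a stand-in for NLTK's tokenizers and frequency tools; the sample sentences and function names are made up, and bigrams are a rough approximation, not real word segmentation.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in space-delimited text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def char_bigrams(text):
    """Count overlapping two-character sequences within runs of CJK characters."""
    counts = Counter()
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        counts.update(run[i:i + 2] for i in range(len(run) - 1))
    return counts


# Hypothetical sample sentences.
english = "The crawler fetches the page, and the parser reads the page."
chinese = "自然语言处理很有趣，自然语言很复杂"

print(word_frequencies(english).most_common(2))
print(char_bigrams(chinese).most_common(3))
```

Bigrams that recur, such as 自然 and 语言 here, often line up with real words, which makes this a cheap first look at a Chinese corpus before bringing in a proper segmenter.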
