Exploring Python, machine learning, and the NLTK library


Challenge: Use machine learning to categorize RSS feeds

Recently, I was assigned the task of creating an RSS feed categorization subsystem for a customer. The goal was to read dozens or even hundreds of RSS feeds and automatically categorize their many articles into dozens of predefined subject areas. The content, navigation, and search capabilities of the customer's website would be driven by the results of this daily automated feed retrieval and categorization.

The customer suggested using machine learning, perhaps with Apache Mahout and Hadoop, because they had recently read articles about those technologies. However, both the customer's development team and ours are more familiar with Ruby than with Java technology. This article describes the technical journey, the learning process, and the final implementation of the solution.

What is machine learning?

My first question was, "What exactly is machine learning?" I had heard the term and was vaguely aware that IBM's Watson supercomputer had recently used it to defeat human competitors in a game of Jeopardy!. As a shopper and social network participant, I also knew that Amazon.com and Facebook do a good job of making recommendations, such as products and people, based on data about their users. In short, machine learning lies at the intersection of IT, mathematics, and natural language. It is primarily concerned with the following three topics, although the customer's solution ultimately involved only the first two:

Classification. Assigning items to predefined categories based on a set of training data made up of similar items.

Recommendation. Suggesting items based on observations of similar items.

Clustering. Identifying subgroups within a set of data.
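To make the classification topic concrete, here is a minimal sketch that assigns a short text to one of two predefined categories using a handful of labeled examples. It uses a tiny naive Bayes word-count model written with the standard library only; that technique, the function names, and the training sentences are all assumptions chosen for illustration, not the solution described in this article.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count how often each word appears under each category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest naive Bayes log score."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in sorted(doc_counts):
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Log prior: how common the category is in the training data.
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up training data: (article text, category) pairs.
TRAINING = [
    ("new smartphone chip speeds up laptops", "technology"),
    ("software update fixes browser security bug", "technology"),
    ("team wins championship game in overtime", "sports"),
    ("striker scores twice as team wins cup", "sports"),
]

word_counts, doc_counts = train(TRAINING)
print(classify("browser bug hits smartphone users", word_counts, doc_counts))
print(classify("team scores late to win game", word_counts, doc_counts))
```

With real feeds, the training set would be much larger and the categories would be the customer's predefined subject areas; the structure of "train on labeled items, then score a new item against each category" stays the same.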

The choice of Mahout and Ruby

Once I understood what machine learning is, the next step was to determine how to implement it. Following the customer's suggestion, Mahout seemed a suitable starting point. I downloaded the code from Apache and set out to learn how to implement machine learning with Mahout and its sibling, Hadoop. Unfortunately, I found that the Mahout learning curve was steep, even for an experienced Java developer, and that working sample code was not available. Equally unfortunate was the lack of Ruby-based frameworks or gems for machine learning.

Discover Python and NLTK

I continued to search for a solution and kept encountering "Python" in the result set. As a Ruby developer, I knew that Python is a similarly object-oriented, text-based, interpreted, and dynamic programming language, even though I had not yet learned it. Despite the similarities between the two languages, I had neglected for years to learn Python, regarding it as a redundant skill. Python was my "blind spot," and I suspect the same is true of many of my fellow Ruby developers.

Searching for books on machine learning and digging deeper into their tables of contents, I found that a fairly high percentage of these systems use Python as their implementation language, together with a library called the Natural Language Toolkit (NLTK). With further searching, I found that Python is more widely used than I had realized, for example in Google App Engine, YouTube, and websites built with the Django framework. It even comes pre-installed on the Mac OS X workstations I use every day! In addition, Python offers well-regarded libraries for math, science, and engineering (for example, NumPy and SciPy).

I decided to pursue a Python solution after finding some very concise coding examples. For example, once the third-party feedparser library has been imported, the following line is all the code needed to read an RSS feed over HTTP and print its contents:

print(feedparser.parse("http://feeds.nytimes.com/nyt/rss/Technology"))
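For readers who want to see what reading a feed involves without installing anything, the following is a rough sketch that uses only the standard library rather than feedparser. The sample feed, the function names parse_items and fetch_items, and the choice to keep only titles and descriptions are all assumptions made for illustration; the sample is inlined so that the code runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Hypothetical sample feed, used so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Technology Feed</title>
    <item>
      <title>First headline</title>
      <description>Summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <description>Summary of the second article.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return (title, description) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("description", default=""))
        for item in root.iter("item")
    ]


def fetch_items(url):
    """Download a feed over HTTP and parse it (requires network access)."""
    with urlopen(url) as response:
        return parse_items(response.read())


for title, description in parse_items(SAMPLE_RSS):
    print(title, "-", description)
```

Calling fetch_items with a real feed URL would return the same kind of list for live data. A dedicated library such as feedparser handles many more feed formats and malformed feeds than this sketch does.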

Quickly mastering Python

The easiest part of learning a new programming language is often the language itself. The harder part is understanding its ecosystem: how to install it, add libraries, write code, structure the code files, execute it, debug it, and write unit tests. This section briefly introduces these topics; be sure to refer to Resources for links to more information.
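As a small taste of the unit-testing part of that ecosystem, here is a minimal sketch that uses the standard unittest module. The function normalize_title and the test names are hypothetical, invented only to show the shape of a test case.

```python
import unittest


def normalize_title(title):
    """Collapse runs of whitespace and lowercase a feed title."""
    return " ".join(title.split()).lower()


class NormalizeTitleTest(unittest.TestCase):
    def test_collapses_whitespace_and_lowercases(self):
        self.assertEqual(normalize_title("  Python   and NLTK "), "python and nltk")

    def test_empty_title_stays_empty(self):
        self.assertEqual(normalize_title(""), "")
```

Assuming the code is saved in a file named test_titles.py (a hypothetical name), the tests can be run from the command line with python -m unittest test_titles.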
