Notes on open-source Python technology development

Source: Internet
Author: User

Open-source Python is well worth getting involved in. Opening the source lets more people learn the relevant technologies and fill the gaps in their knowledge, and the steady innovation in open-source Python development exposes you to more technologies through the projects themselves.

1. Writing a spider (crawler) program in Python to fetch web pages. With the urllib library this is really easy, and the companion library sgmllib can be used to parse the pages. (sgmllib exists only in Python 2; in Python 3 the standard-library html.parser module fills that role.) However, I still don't know whether sgmllib can produce standardized HTML the way JTidy does, or whether another library does this.
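As a rough illustration of the fetch-and-parse idea, here is a minimal sketch for Python 3. It assumes urllib.request for downloading and html.parser in place of sgmllib. The names LinkExtractor, extract_links, and fetch are made up for this example and do not come from any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse an HTML string and return the absolute URLs it links to."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page and decode it as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A crawl step would then look like extract_links(fetch(url), url), with the returned links queued for later visits. This sketch only collects links; it does not clean up malformed HTML the way JTidy does.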

A well-known project:

HarvestMan: http://code.google.com/p/harvestman-crawler/

HarvestMan is a modular, extensible and flexible web crawler program cum framework written in pure Python. HarvestMan can be used to download files from websites according to a number of customized rules and constraints. It can be used to find information from websites matching keywords or regular expressions.
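To make "rules and constraints" and "matching regular expressions" concrete, here is a small sketch of the general idea. It is not HarvestMan's API; the functions allowed and find_matches, and the particular rules (permitted hosts, blocked file extensions), are hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlparse


def allowed(url, allowed_hosts, blocked_extensions=(".zip", ".exe")):
    """Return True if the URL passes these example crawl rules."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname not in allowed_hosts:
        return False
    # Reject files whose path ends in a blocked extension.
    return not parts.path.lower().endswith(tuple(blocked_extensions))


def find_matches(text, pattern):
    """Return every non-overlapping, case-insensitive regex match in text."""
    return re.findall(pattern, text, flags=re.IGNORECASE)
```

A crawler built this way would call allowed() before downloading each URL and find_matches() on each downloaded page.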

The final goal of the project is to develop a full-fledged semantic personal data mining platform which can be used to retrieve information from the Internet in a highly customizable manner, so that one can fetch information from the web the way he wants it, when he wants it. For this, the HarvestMan project will provide support for Web 2.0 and 3.0 technologies such as RSS, RDF, OWL etc. This goal is really big. That's awesome :)

In addition, there are some smaller projects that can be found by searching Google Code or sourceforge.net.

2. Operations on PDF files. C++, C#, and Java all have open-source class libraries available for this. For example, PDFlib, iText, PDF Clown, and PDFBox can parse PDF files and convert between formats such as PDF, RTF, HTML, and XML.

Today I found a Python library that can work with PDF files: pdfminer.

I don't know whether there are other libraries. If you know of any, please add them.

3. With a PDF library such as pdfminer, you can easily extract the content of a PDF file.
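A minimal sketch of such an extraction follows. It assumes pdfminer.six, the maintained Python 3 fork of pdfminer, installed separately (for example with pip install pdfminer.six), and uses its pdfminer.high_level.extract_text function. The wrapper extract_pdf_text and the helper tidy_text are names invented for this example.

```python
import re


def extract_pdf_text(path):
    """Return the text of the PDF at path using pdfminer.six."""
    # Imported here so tidy_text below works even without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def tidy_text(text):
    """Collapse runs of spaces/tabs and runs of blank lines."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```

Typical use would be tidy_text(extract_pdf_text("report.pdf")), where report.pdf is a placeholder file name. Extracted text often needs some cleanup like this, because the layout of a PDF does not map directly onto paragraphs.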

The open-source cooperation organization for Python introduces technical knowledge and usage of the Python language, supports the application, promotion, and learning of open-source Python in China ..., and shares Python knowledge, experience, and skills.
