These days I have been thinking about an application to practice Python with. To summarize its function in one sentence: automatically collect the newly released information from the websites of the province's various public cultural institutions and present it by category. "Public cultural institutions" here refers to public libraries, cultural centers, and museums. The "newly released information" mainly means the daily news published on each website. The headlines, links, and release times of these news items are extracted automatically and displayed together on my own site.
The idea is as follows:
(1) Establish a list of public cultural institutions' websites;
(2) For each website, determine the URL of the page from which the information is to be extracted;
(3) Analyze the source code of each page and establish rules for extracting the corresponding information;
(4) Following the rules, extract the required information from the corresponding pages;
(5) Save the extracted information in some form;
(6) Organize and publish the saved information.
To summarize: extract specified content from specified pages. A piece of software called the "Eight Claw Fish Collector" already implements these functions in a very user-friendly way. But our goal is to learn Python, so I plan to try it myself and see what level I can reach.
The first and second steps pose no problem; they can be done through a portal site or a search engine.
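The result of these two steps can be kept as a simple data structure. Here is a minimal sketch; the institution names and URLs are placeholders I made up, not real addresses:

```python
# Site list for steps (1) and (2).
# Names and URLs below are placeholders, not real addresses.
SITES = [
    {
        "name": "Provincial Library",
        "list_url": "http://example-library.example/news/",
    },
    {
        "name": "Provincial Museum",
        "list_url": "http://example-museum.example/announcements/",
    },
]
```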
The biggest part of the workload should be the third step. Here you need to analyze each page identified in the second step and describe it with a regular expression, which yields a list of regular expressions, each corresponding to a specified page of a website. Since most sites have a page listing the published information, our goal is to get the content behind the link of each heading in that list. The work therefore divides into two tasks: first, extract the headings in the information list and their corresponding links; second, follow the links and extract the contents.
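As a rough illustration of what one per-site rule might look like, the sketch below uses a regular expression to pull titles, links, and dates out of a hypothetical news-list page. The HTML structure assumed in the pattern and the page URL are invented for the example; a real site would need its own pattern:

```python
import re
import urllib.request

# Hypothetical rule for one site: each news item is assumed to look like
# <li><a href="/news/123.html">Title</a><span>2016-01-01</span></li>
ITEM_PATTERN = re.compile(
    r'<li>\s*<a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>'
    r'\s*<span>(?P<date>\d{4}-\d{2}-\d{2})</span>\s*</li>'
)

def extract_items(list_url):
    """Download a news-list page and return (title, link, date) tuples."""
    html = urllib.request.urlopen(list_url).read().decode("utf-8")
    return [(m.group("title"), m.group("link"), m.group("date"))
            for m in ITEM_PATTERN.finditer(html)]
```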
The fourth step can be handled automatically without manual intervention. What needs to be decided is which data structures to use to hold the extracted information.
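One possible choice is a small record per news item; the field names here are just one guess at what might be needed:

```python
from collections import namedtuple

# One record per extracted news item; "source" records which
# institution's site the item came from.
NewsItem = namedtuple("NewsItem", ["source", "title", "link", "published"])

# Example record; all values are placeholders.
item = NewsItem(
    source="Provincial Library",
    title="Reading-room hours adjusted",
    link="http://example-library.example/news/123.html",
    published="2016-01-01",
)
```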
The fifth step requires deciding the file format for the saved information: a text file? An XLS spreadsheet? Or a database file?
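If the database option were chosen, a local SQLite file would be the simplest. The sketch below assumes the record layout from the previous step and a file name I picked arbitrarily:

```python
import sqlite3

def save_items(items, db_path="news.db"):
    """Store (source, title, link, published) records in a SQLite file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news "
        "(source TEXT, title TEXT, link TEXT UNIQUE, published TEXT)"
    )
    # Ignore items whose link is already stored, so re-runs only add new news.
    conn.executemany("INSERT OR IGNORE INTO news VALUES (?, ?, ?, ?)", items)
    conn.commit()
    conn.close()
```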
The sixth step is another major task, focused on deciding on what platform, and with what kind of interface, to display the contents of the data file: a traditional web page, a waterfall-flow layout, or a public platform?
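The simplest of those options would be a static web page generated from the saved data. The sketch below just reads the SQLite file from the previous step and writes a bare HTML list; the file names follow the earlier sketches:

```python
import sqlite3

def render_page(db_path="news.db", out_path="index.html"):
    """Render saved news records as a plain static HTML page."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT source, title, link, published FROM news ORDER BY published DESC"
    ).fetchall()
    conn.close()
    lines = ["<html><body><ul>"]
    for source, title, link, published in rows:
        lines.append('<li>[{0}] <a href="{1}">{2}</a> ({3})</li>'
                     .format(source, link, title, published))
    lines.append("</ul></body></html>")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```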
----------------------------
Python Study (II)