Requirements: crawl every topic in a Douban group (topic title, content, author, post time) together with its replies (best replies, normal replies, replies to replies, paged replies, and topics with 0 replies).
Solution:
1. First crawl all the topic links under the group, following the next-page link on each list page until the full set of 700+ topics is collected;
2. Visit each of the 700+ links; on the ?start=0 page of a topic, extract the four topic fields (title, content, author, post time) plus the best reply and the normal replies;
3. Building on step 2, check whether the topic has any replies; if it does, check whether the replies are paginated, and follow the next-page link to fetch start=100, start=200, and so on;
4. Hand off to the next crawl callback and keep writing the crawled replies into the file from step 2 (a minimal sketch of this whole flow follows the list).
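Since the post does not include the spider code itself, here is a minimal sketch of steps 1 to 4, assuming Scrapy; the group URL, CSS/XPath selectors, and field names are illustrative guesses rather than the author's actual ones.

import scrapy


class GroupTopicSpider(scrapy.Spider):
    name = 'group_topics'
    # hypothetical group URL; the real group id is not given in the post
    start_urls = ['https://www.douban.com/group/example/discussion?start=0']

    def parse(self, response):
        # step 1: collect every topic link on the list page, then follow the next page
        for link in response.css('td.title a::attr(href)').getall():
            yield scrapy.Request(link, callback=self.parse_topic)
        next_page = response.css('span.next a::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_topic(self, response):
        # step 2: title, content, author, post time, plus the first page of replies
        item = {
            'title': response.css('h1::text').get(default='').strip(),
            'author': response.css('.topic-doc .from a::text').get(),
            'pub_time': response.css('.topic-doc .create-time::text').get(),
            'content': ' '.join(response.xpath('//div[@class="topic-content"]//text()').getall()),
            'replies': response.css('ul#comments li p::text').getall(),
        }
        # step 3: if the replies run past one page, follow start=100, start=200, ...
        next_page = response.css('div.paginator span.next a::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse_more_replies,
                                 meta={'item': item})
        else:
            yield item

    def parse_more_replies(self, response):
        # step 4: keep appending replies from later pages to the same item
        item = response.meta['item']
        item['replies'].extend(response.css('ul#comments li p::text').getall())
        next_page = response.css('div.paginator span.next a::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                                 callback=self.parse_more_replies,
                                 meta={'item': item})
        else:
            yield item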
File-output design:
Before: two kinds of files were created — Article.txt stored all the topic-related content (the 700+ topics and their author information), and a separate reply file named after each topic title was created alongside it;
After: a single file named after the topic title is created for each topic; the topic content is written first and the replies are appended afterwards, which makes the result much easier to read (a small sketch follows).
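A small sketch of this "one file per topic" layout, assuming the crawled item carries title, content, and replies fields (the names here are chosen for illustration):

def save_topic(item):
    # make the title safe to use as a filename
    title_end = item['title'].replace('/', '_')
    with open('%s.txt' % title_end, 'a') as f:
        f.write(item['content'] + '\n\n')      # topic body first
        for reply in item['replies']:
            f.write(reply + '\n')              # replies appended afterwards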
Pitfalls encountered:
1. Getting the text that sits directly under a div as well as the text under its child elements (div span, div h*, etc.):
- There are two workarounds, both sketched below:
A. Use the XPath expression //text(), which returns every text node under the div;
B. Use CSS selectors joined with commas, so several elements can be selected at once:
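Sketched against a hypothetical Scrapy response; the class name topic-content is an assumption, not taken from the post:

# A. XPath //text() grabs every text node anywhere under the div
texts = response.xpath('//div[@class="topic-content"]//text()').getall()

# B. CSS selectors can be comma-separated to pick several child elements at once
texts = response.css('div.topic-content::text, div.topic-content span::text, '
                     'div.topic-content a::text').getall()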
2. Passing parameters between different callback functions via meta:
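A minimal sketch of the meta hand-off, assuming Scrapy callbacks (the field names, placeholder URL, and start=100 suffix are illustrative):

import scrapy


class MetaDemoSpider(scrapy.Spider):
    name = 'meta_demo'
    start_urls = ['https://www.douban.com/group/topic/000000/']  # placeholder topic URL

    def parse(self, response):
        item = {'title': response.css('h1::text').get()}
        # attach the partially built item to the next request via meta
        yield scrapy.Request(response.urljoin('?start=100'),
                             callback=self.parse_replies,
                             meta={'item': item})

    def parse_replies(self, response):
        item = response.meta['item']   # the same dict, carried over from parse()
        item['replies'] = response.css('ul#comments li p::text').getall()
        yield item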
3. Opening a file in Python with a variable in its name:
f = open('%s.txt' % title_end, 'a')
'a': append mode, so new writes keep adding to the end of the file
4. Removing spaces, line breaks, and other stray characters from a str:
# remove the whitespace, \t, \n and \r characters around x
x1 = x.strip(' \t\n\r')
5. Using replace to remove the \r characters in the data, and ''.join to turn the list back into a string:
# first remove the \r from each piece of the article; pieces that were a lone '\r' become empty strings, which the if check then drops
artical_end = []
for x in article:
    x1 = x.replace('\r', '')
    if x1 != '':
        artical_end.append(x1)
# convert the artical_end list into a string
ar = ''.join(artical_end)
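Worth noting why replace() is used here rather than strip(): strip() only removes characters at the two ends of a string, while replace() removes them everywhere. For example:

'\rab\rcd\r'.strip('\r')        # -> 'ab\rcd'  (only the ends are cleaned)
'\rab\rcd\r'.replace('\r', '')  # -> 'abcd'    (every \r is removed)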