A5 Marketing Assistant after-sales group sharing, September 3: article collection and URL crawling

Source: Internet
Author: User


The A5 Worm Marketing Assistant after-sales groups include many experts. So that everyone can learn more, we regularly organize exchange and sharing sessions to encourage a sharing atmosphere, draw out the experts' experience, help you build connections, and help you progress faster. We aim to turn the Worm after-sales groups into a community of website-building and Internet marketing experts. What you learn here is not just about the Worm software.

We hold a sharing session every Saturday at 8:30 and welcome you to join on time. You are also welcome to contact me to share your own experience with everyone (for now each presenter is rewarded with a T-shirt; the prizes will grow later). There are more than a hundred after-sales groups, so if everyone shares even a little valuable experience, the result is significant. Sharing creates value. Today's presenters are Le Xiaoyao and 22, who bring you some common methods and tips on the theme "article collection and URL crawling".

Le Xiaoyao -- article collection:

Part 1 of article collection is the address of the list page. It is really just a pattern (the Worm software collects by regular expressions) and it is very simple: the [page] variable stands for the page number.
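The [page] substitution can be sketched in a few lines of Python. This is only an illustration of the idea, not the Worm software's own code: the function name expand_list_urls and the page range 1 to 3 are assumptions, and the template is the chongseo.com list address used as the example later in this talk.

```python
def expand_list_urls(template: str, first_page: int, last_page: int) -> list[str]:
    """Replace the [page] variable with each page number in the range."""
    return [
        template.replace("[page]", str(number))
        for number in range(first_page, last_page + 1)
    ]


# Template taken from the talk; the page range is an assumption.
list_urls = expand_list_urls(
    "http://www.chongseo.com/news/list_2_[page].html", 1, 3
)

for url in list_urls:
    print(url)
```

Each generated address is one list page from which article links are then extracted.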

  

Most people already know Part 1, and it is easy to configure. Part 2 is the article link rule. It is usually one line of link code in which the href value, the article's address on the site, is replaced with (.*?). The Worm software's collection rules are all standard regular expressions, and the parentheses mark the value to be extracted. It does not have to be (.*?); other forms such as ([^"]*) also work.

  

In other words, (.*?) takes the place of the address. It is only a stand-in for the address, so it is simple and easy to understand.
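As a rough illustration of the link rule, the sketch below applies both forms, (.*?) and ([^"]*), to a small list page with Python's re module. The HTML fragment and the variable names are made up for the example; a real list page will need its own surrounding markup in the rule.

```python
import re

# Hypothetical list-page fragment; real pages will differ.
sample_html = """<ul>
<li><a href="/news/101.html">First article</a></li>
<li><a href="/news/102.html">Second article</a></li>
</ul>"""

# The address inside href is replaced by a parenthesized (extracted) group.
lazy_rule = re.compile(r'<li><a href="(.*?)">')
negated_rule = re.compile(r'<li><a href="([^"]*)">')

lazy_links = lazy_rule.findall(sample_html)
negated_links = negated_rule.findall(sample_html)

print(lazy_links)
print(negated_links)
```

Both rules extract the same addresses here. ([^"]*) stops at the closing quote by construction, while (.*?) stops at the first place where the rest of the rule matches.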

Part 3, extracting the article title and body, is the hardest section. In fact it is not that hard: find the distinctive code before and after the title and before and after the body, then combine them. The title can generally be replaced with (.*?), which matches any characters on the same line (line breaks excluded) up to, but not including, the string that follows it. The body can generally be replaced with ([\s\S]*?), which matches any characters including line breaks (the body may contain newlines) up to, but not including, the string that follows it. A regular expression in parentheses is a value that will be extracted. If the title comes before the body in the page source, select "title before"; otherwise select "title in the back". Only these 2 expressions go in parentheses. You can use regular expressions in other parts of the rule, but because they are not extracted they must not be put in parentheses. Between the title and the body there is usually a lot of irrelevant code, which can all be replaced with [\s\S]*, without parentheses. Once you are clear on these 2 points, [\s\S]* for irrelevant code and ([\s\S]*?) for the body, you are basically set.
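A minimal sketch of such a title-and-body rule, written with Python's re module: the page markup, class names, and variable names are invented for illustration, and a real rule has to use the markers that actually surround the title and body on the target site.

```python
import re

# Hypothetical article page; the tags and class names are assumptions.
sample_page = """<html><body>
<h1>Ten tips for more traffic</h1>
<div class="meta">posted by admin</div>
<div class="content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
<div class="footer">footer</div>
</body></html>"""

# (.*?)      -> title: stays on one line
# [\s\S]*    -> irrelevant code between title and body, not extracted
# ([\s\S]*?) -> body: may span several lines
article_rule = re.compile(
    r'<h1>(.*?)</h1>[\s\S]*<div class="content">([\s\S]*?)</div>'
)

match = article_rule.search(sample_page)
title = match.group(1)
body = match.group(2).strip()

# (.*?) alone cannot capture the multi-line body, because "." stops at newlines.
single_line_attempt = re.search(r'<div class="content">(.*?)</div>', sample_page)

print(title)
print(body)
print(single_line_attempt)
```

The title appears before the body in this sample, so it corresponds to the "title before" case: the first group is the title and the second is the body.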

For example, to collect the articles in the http://www.chongseo.com/news/ column, write Part 1 as http://www.chongseo.com/news/list_2_[page].html. Then find the article link in the list page source (for instance the link to the article "chongseo teaches you 10 tips for increasing site traffic") and replace its address with (.*?). The body can generally be replaced with ([\s\S]*?). Then run a test collection: OK, success.

22--Basic parameters and URL crawl:

1. Basic parameters:

Item A, threads. Everyone should understand this one: faster is not always better, it depends on the situation. For registration you can choose 30-50 threads, but for mass posting to blogs and question-and-answer sites use 1 thread.

Item B. The key point is the custom mailbox settings. Many people report problems here, so a few words: a newly registered mailbox does not necessarily have POP enabled. Log into the mailbox first to check whether POP is turned on, enable it if needed, and only then fill in the settings. That way nothing goes wrong.

Item C needs attention: remember that the registered username must be 8 to 12 characters long. Today someone in the group posted a screenshot asking what had gone wrong. The username was too long, and they had not noticed.
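A quick way to catch this before registering is to check candidate usernames against the 8 to 12 rule. The sketch below assumes the limits are inclusive and counted in characters; the function name and the sample usernames are made up.

```python
def username_length_ok(name: str, shortest: int = 8, longest: int = 12) -> bool:
    """Return True if the username length is within the allowed range (inclusive)."""
    return shortest <= len(name) <= longest


# Hypothetical candidates: 10, 16, and 5 characters long.
candidates = ["seoworm888", "averyverylonname", "short"]
accepted = [name for name in candidates if username_length_ok(name)]

print(accepted)
```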

Item D. There is not much to say here. If you have questions about it, ask in the group. I will not go into it and will move straight on to crawling.

2. URL crawling. The second setting, the verification program, is the type of target site you want to crawl. The Worm software now has automatic verification, and choosing that is usually fine. For the server type there are three options: GG (Google), BD (Baidu), and YH (Yahoo).

BD and YH return relatively few results, while GG generally returns a very large number: in the usual case, 40 rules will crawl more than 10,000 URLs without trouble. Crawling is driven by search instructions. The Worm software ships with many of them, and you can also analyze the current mainstream CMS programs and write your own rules. For the DZ (Discuz!) forum program, for example, there is intitle:powered by discuz!, a rule that works on both Baidu and GG for crawling DZ forums. When you bind the verification program, choose automatic verification, so that the choice between DZ NT and DZ 1.5-2.0 is made for you.

So how do you crawl a lot? One rule is certainly not enough. Take this one: Beijing powered by discuz! X1.5 inurl:forum.php. This search instruction lists the local DZ 1.5 forums in Beijing. Baidu only lets you crawl the first 7 pages, while GG lets you crawl any number of pages, but crawling GG requires a foreign IP (everyone here understands why). So if you want to collect a lot of URLs, I suggest spending a little over 10 dollars a month on a VPN. Over a month, millions of URLs are not a problem. Where do you find keywords like "Beijing" in the instruction above? Here is one method: download thesauruses from the major input method sites. Of course, you cannot add the downloaded words one by one; you need to bulk import the instructions. First, copy the downloaded words into Excel for processing: put the keywords in column A and the rule in column B. Then copy the two columns into a TXT file and do a find-and-replace on the gap between them: the gap is about 5 spaces wide, and you replace it with a single space. The final result is one search instruction per line, a keyword followed by the rule.
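The Excel and find-and-replace steps above can also be done with a short script. This is a sketch under assumptions, not a feature of the Worm software: the keyword list is a made-up stand-in for a downloaded thesaurus, the function name build_instructions is hypothetical, and the tab character imitates the gap that appears when two Excel columns are pasted as text.

```python
import re

# Rule taken from the talk; the keyword list is a made-up stand-in
# for words downloaded from an input-method thesaurus.
rule = "powered by discuz! X1.5 inurl:forum.php"
keywords = ["Beijing", "Shanghai", "Guangzhou"]


def build_instructions(words: list[str], search_rule: str) -> list[str]:
    """Combine each keyword with the rule, one search instruction per entry."""
    instructions = []
    for word in words:
        pasted = f"{word}\t{search_rule}"  # column A, gap, column B
        # Collapse every run of whitespace (tab or several spaces) to one space.
        instructions.append(re.sub(r"\s+", " ", pasted).strip())
    return instructions


instructions = build_instructions(keywords, rule)
import_text = "\n".join(instructions)  # one instruction per line, ready to save

print(import_text)
```

Saving import_text to a plain text file gives the same kind of file that the manual Excel route produces.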

  

Then save the file and import it into the Worm software. The next step is to start crawling. In the usual case, with 100-plus rules imported for GG, the CAPTCHA prompt will come up about 10 times, provided you are using a foreign IP. You can try it yourselves afterwards. 100 rules can crawl more than 13,000 URLs. Of course this depends on the rules you write: if a rule is wrong, it will not crawl a single one. That is all on crawling. To get the most out of it, spend more time with the Worm software and practice. Clicking around in the software will not break anything. Read the instructions first, then read the group sharing. Do not ask the group about every small problem. First check what you did wrong yourself: whether you followed the instructions and whether the parameters are right. Then try again.

After the sharing was over, we also interacted and took questions.

Free group: So the keyword field can be ignored?

A: You can ignore it. With a bulk import the keywords have already been added to the instructions, and there are far more of them than you could enter here. If you want to search on a single keyword, you can enter it.

Free group: For example, if I am looking for cosmetics sites, would the rule be: Cosmetics powered by discuz! X1.5 inurl:forum.php?

A: Yes. Sites related to cosmetics will come up, though of course some unrelated ones will too.

﹎ Ordinary: Can you talk about crawling and collecting English-language sites?

A: I have not worked with English sites, only with Baidu, so I can only suggest an approach. The Worm software comes with English URL crawling. Analyze what URL patterns English forums use and crawl on those. The easiest way is to check your competitors' backlinks. The Worm software has a rule for this too, and it is very useful and practical.
