Sesame HTTP: How to Find an Entry Point for Your Crawler

Source: Internet
Author: User


Searching for Crawler Entry Points
1. The search-engine entry point. A good entry point for this kind of task is an ordinary search engine. Although there are many search engines, they all do the same thing: index web pages, process them, and provide search services. In everyday use we simply type in keywords, but there are many search operators as well. For example, for this task we can search like this to find the data we want:

site:zybang.com

Now let's try it on Baidu, Google, Sogou, 360, and Bing respectively:

From the figures above, we can see that the number of returned results is in the millions or even tens of millions.

Search-engine results are therefore clearly a good entry point for this task. As for the sites' anti-crawler measures, those will test your individual skills.
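Instead of typing the operator by hand, you can build such a query URL programmatically. The sketch below uses only Python's standard library; the Bing endpoint and `q` parameter are real, but treat this as a minimal illustration of constructing the query, not a scraping client (search engines rate-limit and block automated requests).

```python
from urllib.parse import urlencode

def build_site_query(engine_url, domain, keywords=""):
    """Build a search URL restricted to one domain via the site: operator."""
    query = f"site:{domain} {keywords}".strip()
    return engine_url + "?" + urlencode({"q": query})

# Restrict results to zybang.com on Bing:
url = build_site_query("https://www.bing.com/search", "zybang.com")
print(url)  # https://www.bing.com/search?q=site%3Azybang.com
```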

2. Other entry points. (1) The mobile portal. Obtaining data through a website's mobile portal often lets you get the data more quickly.

The simplest way to find a mobile portal is to use the developer tools in Google's Chrome browser: click the device-toolbar icon (the one shaped like a phone) and refresh the page.

This method is not foolproof. Sometimes you can instead send the URL to your phone and open it in a mobile browser to check whether the page displayed on the phone differs from the one on the computer; if it does, copy the URL from your mobile browser and send it back to your computer.
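Many sites decide between their desktop and mobile versions based on the User-Agent header, so a crawler can request the mobile page directly. A minimal standard-library sketch (the User-Agent string below is just an example value; any current phone UA works, and whether the site actually serves a mobile page is up to the site):

```python
import urllib.request

# Example mobile User-Agent string (an assumption for illustration).
MOBILE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
             "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148")

def mobile_request(url):
    """Prepare a GET request that asks the server for its mobile page."""
    return urllib.request.Request(url, headers={"User-Agent": MOBILE_UA})

req = mobile_request("https://www.zybang.com/")
# urllib.request.urlopen(req) would then fetch the page; if the site
# sniffs the User-Agent, you get the mobile version.
print(req.get_header("User-agent"))
```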

(2) Sitemaps. A sitemap is a file that a website administrator publishes to tell search engines which pages can be crawled. Using these sitemaps, you can efficiently and conveniently collect URLs to serve as further entry points.

(3) Modifying values in the URL. First, a disclaimer: this technique does not always work. It is mainly used to obtain all the required data in a single request by changing the values of certain fields in the URL. This reduces the number of requests, lowers the risk of being banned by the website, and so improves crawler efficiency. The following example captures all the songs of a singer on QQ Music; the request URL has the following format:

https://xxxxxxxxx&singermid=xxxx&order=listen&begin={begin}&num={num}&songstatus=1

The returned data packet is as follows:

Some of the field values are replaced with xxx. Note the begin and num fields: when a singer has many songs, the results are paginated, so begin is the offset of the first entry on each page and num is the number of records on that page. Normally we fetch the data one page at a time, and QQ Music's default page size is 30. So, for this singer's 96 songs, do we have to make at least four requests to get the complete data?

Of course not. We can try changing some values in the URL and see whether the returned result changes. Here we change num and begin: set num to the singer's total number of songs and begin to 0. Requesting the modified URL again returns the following data:

As shown above, 96 data records are returned.

In this way we can get all the data with just two requests: the first request obtains the total count, then we modify the URL and request again to fetch everything at once. Fields such as pagesize often behave the same way.
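The two-request strategy can be sketched as follows. The URL template and the begin/num fields follow the QQ Music example above (the xxxx parts are elided placeholders from the original request); the `fetch` callable is a stand-in for your HTTP call, assumed here to return a parsed response containing a 'total' count and a 'list' of songs.

```python
# Template matching the example above; xxxx parts are elided placeholders.
URL_TEMPLATE = ("https://xxxxxxxxx&singermid=xxxx&order=listen"
                "&begin={begin}&num={num}&songstatus=1")

def build_url(begin, num):
    """Fill the pagination fields of the request URL."""
    return URL_TEMPLATE.format(begin=begin, num=num)

def fetch_all_songs(fetch, page_size=30):
    """Two-request strategy: learn the total first, then grab everything.

    `fetch` is a stand-in for your HTTP call; the response shape
    ('total' and 'list' keys) is an assumption for illustration.
    """
    first = fetch(build_url(0, page_size))   # request 1: default page
    total = first["total"]
    if total <= page_size:
        return first["list"]
    full = fetch(build_url(0, total))        # request 2: everything at once
    return full["list"]

# Fake fetch simulating a singer with 96 songs, for demonstration:
def fake_fetch(url):
    num = int(url.split("num=")[1].split("&")[0])
    return {"total": 96, "list": list(range(min(num, 96)))}

print(len(fetch_all_songs(fake_fetch)))  # 96
```

With 96 songs and a page size of 30, naive paging needs four requests; this approach needs only two.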

To summarize: the tips above for finding crawler entry points can help you get twice the result with half the effort, sometimes obtaining the data at minimal cost.
