Plagued by crawlers these past two days

Source: Internet
Author: User

Crawlers have been plaguing the site for the past two days. The IIS logs are written to a database and queried in real time with SQL. It turns out that even with the IP addresses in hand, deciding which client is a crawler takes some analysis; it cannot be judged at a glance.

1. I used SQL to list the top 10 clients by request count in descending order, and found that the clients with the most requests for .aspx pages are not necessarily the ones that "should be blocked". Some crawlers use many IP addresses, so each address makes only a small number of requests on average and may not even appear in the top 30.
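One way to make such spread-out crawlers visible is to count requests per network rather than per address. The following is only a minimal sketch: it assumes the crawler's addresses share a /24 network (which is not guaranteed), and the function name `hits_by_subnet` and the sample addresses are made up for illustration.

```python
from collections import Counter
import ipaddress


def hits_by_subnet(client_hosts, prefix_len=24):
    """Count requests per IPv4 network of the given prefix length."""
    counts = Counter()
    for host in client_hosts:
        # strict=False accepts a host address and returns its enclosing network.
        net = ipaddress.ip_network(f"{host}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts


# Made-up sample: one crawler spread over four addresses, one ordinary visitor.
sample_hosts = [
    "203.0.113.5", "203.0.113.6", "203.0.113.7", "203.0.113.8",
    "198.51.100.20", "198.51.100.20",
]

per_ip = Counter(sample_hosts)
per_subnet = hits_by_subnet(sample_hosts)

print(per_ip.most_common(1))      # busiest single address
print(per_subnet.most_common(1))  # busiest /24 network
```

In this sample the ordinary visitor tops the per-address ranking, while the per-network ranking brings the four crawler addresses together.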

2. Some crawlers behave well and should not be blocked. They adjust their crawling to the state of the site: for example, if IIS has been crawled to death, they detect that the site is unreachable and stop crawling. Suppose you only run a simple query like this:

SELECT TOP 100 clienthost, COUNT(clienthost) AS Count
FROM iis_log
WHERE Target LIKE '%.aspx'
GROUP BY clienthost
ORDER BY Count DESC, clienthost DESC

In fact, the top few clients in such a result may be well-behaved crawlers: they make many requests, but only while IIS is still healthy. Other crawlers keep almost the same request rate every minute whether IIS is dead or not, and those are the truly nasty ones. So the GROUP BY clienthost has to be combined with a time grouping that is accurate at least to the minute.
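A per-minute grouping could look roughly like the sketch below. It runs against an in-memory SQLite database so it can be tried anywhere; the `logtime` column and the sample rows are assumptions (the post does not show the table's time column), and on SQL Server the minute-bucket expression would have to be written differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# logtime is a hypothetical column name for the request timestamp.
conn.execute("CREATE TABLE iis_log (clienthost TEXT, target TEXT, logtime TEXT)")
rows = [
    ("203.0.113.5", "/a.aspx", "2024-01-01 10:00:05"),
    ("203.0.113.5", "/b.aspx", "2024-01-01 10:00:40"),
    ("203.0.113.5", "/c.aspx", "2024-01-01 10:01:10"),
    ("198.51.100.20", "/a.aspx", "2024-01-01 10:00:30"),
    ("198.51.100.20", "/logo.gif", "2024-01-01 10:00:31"),
]
conn.executemany("INSERT INTO iis_log VALUES (?, ?, ?)", rows)

# Count .aspx requests per client and per minute instead of per client only.
per_minute = conn.execute(
    """
    SELECT clienthost,
           strftime('%Y-%m-%d %H:%M', logtime) AS minute,
           COUNT(*) AS hits
    FROM iis_log
    WHERE target LIKE '%.aspx'
    GROUP BY clienthost, minute
    ORDER BY clienthost, minute
    """
).fetchall()

for row in per_minute:
    print(row)
```

With one row per client and minute, a client whose count stays flat through the minutes in which IIS was down stands out from one that backs off.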

3. What about checking the connection state with netstat on the front-end server?

For example:

netstat -na | grep tcp | gawk '{print $5}' | sed 's/::ffff://g' | grep -v "::" | grep -v ":\*" | gawk -F: '{print $1}' | grep -v "127.0.0.1" | sort | uniq -c | sort -nr

This command sorts the remote addresses in descending order by number of connections. The output looks something like this:

69 218.213.241.149
65 116.23.209.15
62 121.32.51.166
57 218.240.137.162
52 123.113.33.243
45 221.238.245.116
45 220.180.129.102
44 222.243.5.91
42 60.209.42.134
39 221.212.195.202
......................
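The same counting can be sketched in Python, which makes each filtering step easier to read and adjust. This is only an illustration: it assumes the column layout of Linux `netstat -na` (remote address in the fifth column), and the function name `count_remote_ips` and the sample text are made up.

```python
from collections import Counter


def count_remote_ips(netstat_text):
    """Count TCP connections per remote IPv4 address."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Keep only tcp/tcp6 rows that have a remote-address column.
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        remote = fields[4]
        if remote.endswith(":*"):
            continue  # listening socket, no remote peer
        ip = remote.rsplit(":", 1)[0]  # drop the port
        if ip.startswith("::ffff:"):
            ip = ip[len("::ffff:"):]   # IPv4-mapped IPv6 address
        if ":" in ip or ip == "127.0.0.1":
            continue  # skip native IPv6 and loopback
        counts[ip] += 1
    return counts


sample = """\
tcp   0   0 0.0.0.0:80          0.0.0.0:*                   LISTEN
tcp   0   0 10.0.0.1:80         203.0.113.5:51000           ESTABLISHED
tcp   0   0 10.0.0.1:80         203.0.113.5:51001           TIME_WAIT
tcp6  0   0 ::ffff:10.0.0.1:80  ::ffff:198.51.100.20:40000  ESTABLISHED
tcp   0   0 127.0.0.1:3306      127.0.0.1:45000             ESTABLISHED
udp   0   0 0.0.0.0:68          0.0.0.0:*
"""

counts = count_remote_ips(sample)
for ip, n in counts.most_common():
    print(n, ip)
```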

Some people suggest blocking all of the top few: write a script that automatically adds iptables DROP rules for them and removes the rules again after a while, rather than blocking permanently, so that the offender gets a chance to mend its ways.
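Such a script might be organized along the following lines. This is a rough sketch, not a tested firewall tool: the class name `TempBlockList` and the 600-second lifetime are made up, and the code only builds the iptables command lines (`-I INPUT ... -j DROP` to add a rule, `-D INPUT ... -j DROP` to delete it) without executing them.

```python
import time

BLOCK_SECONDS = 600  # hypothetical block lifetime


class TempBlockList:
    """Track temporary blocks and build the matching iptables commands."""

    def __init__(self, ttl=BLOCK_SECONDS):
        self.ttl = ttl
        self.expires = {}  # ip -> unix time at which the block ends

    def block(self, ip, now=None):
        """Return the command that adds a DROP rule, or None if already blocked."""
        now = time.time() if now is None else now
        if ip in self.expires:
            return None
        self.expires[ip] = now + self.ttl
        return ["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"]

    def expire(self, now=None):
        """Return the commands that delete the rules whose lifetime is over."""
        now = time.time() if now is None else now
        commands = []
        for ip, end in list(self.expires.items()):
            if end <= now:
                del self.expires[ip]
                commands.append(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])
        return commands


blocks = TempBlockList(ttl=600)
print(blocks.block("203.0.113.5", now=1000))  # rule to add
print(blocks.expire(now=1300))                # nothing has expired yet
print(blocks.expire(now=1600))                # rule to delete
```

The returned lists could be handed to `subprocess.run` by a script running as root; they are only printed here so the sketch has no side effects.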

I don't think this is a good solution, because it is easy to block legitimate clients by mistake. The reasons are:

1. What should the threshold be? 30? 40? 50? The right value differs between day and night.

2. In fact, when I open a single page, my client IP already shows up more than once in the output. That makes sense, since the page contains images; I saw a count of three. But if I open several pages at the same time, the count reaches dozens. Would I then end up blocking myself?

 

So how can the netstat connection state be combined with the detailed access logs from IIS for analysis?
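One possible starting point, offered only as a sketch, is to put the two per-IP counts side by side: the connection count from netstat and the .aspx request count from the IIS log table. The function name `join_by_ip` and the numbers below are made up for illustration.

```python
def join_by_ip(connections, aspx_hits):
    """Return (ip, connection_count, aspx_hit_count) rows for every IP seen."""
    rows = []
    for ip in sorted(set(connections) | set(aspx_hits)):
        rows.append((ip, connections.get(ip, 0), aspx_hits.get(ip, 0)))
    return rows


# Made-up inputs: per-IP connection counts and per-IP .aspx request counts.
connections = {"203.0.113.5": 40, "198.51.100.20": 35}
aspx_hits = {"203.0.113.5": 300, "192.0.2.9": 12}

table = join_by_ip(connections, aspx_hits)
for ip, conns, hits in table:
    print(ip, conns, hits)
```

An address with many connections but few page requests may just be a person loading images, as described above, while one that is high in both columns is a more plausible crawler candidate.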

Still learning...
