A summary of crawling the information of millions of users

The first big mistake: failing to release unmanaged resources in time, so that after running for a long while the program throws an OutOfMemoryException.

In this small demo, the main unmanaged resources are the HttpWebResponse and Stream of each HTTP request; the other is the Redis client. The problem arose not because I did not know that unmanaged resources must be released, but because the code was careless. I had probably been writing code this way for a long time; since my earlier programs never ran for long, the problem was never exposed.

At first, it was written like this.

using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
{
    return reader.ReadToEnd();
}

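The return executes inside the using statement, which I took to mean that the object was not released before the method returned. Strictly speaking, a using block still calls Dispose when control leaves it through a return; what this version lacks is a using around the response stream itself.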

Improved:

string source = string.Empty;
using (Stream stream = response.GetResponseStream()) // raw
{
    using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
    {
        source = reader.ReadToEnd();
    }
}
...
return source;
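To make the release behaviour concrete, here is a minimal sketch, not the crawler's actual code. `TrackedResource` is a hypothetical stand-in for a response stream that simply counts how often `Dispose` is called. It suggests that a `using` block disposes its object even when the method returns from inside it, and that a hand-written `try`/`finally` behaves the same way.

```csharp
using System;

// Hypothetical stand-in for an unmanaged resource such as a response stream.
sealed class TrackedResource : IDisposable
{
    public static int DisposeCount;
    public string Read() => "payload";
    public void Dispose() => DisposeCount++;
}

static class DisposeDemo
{
    // Style 1: using. Dispose runs when control leaves the block, even via return.
    public static string ReadWithUsing()
    {
        using (var resource = new TrackedResource())
        {
            return resource.Read();
        }
    }

    // Style 2: try/finally with an explicit Dispose call in the finally block.
    public static string ReadWithFinally()
    {
        var resource = new TrackedResource();
        try
        {
            return resource.Read();
        }
        finally
        {
            resource.Dispose();
        }
    }

    static void Main()
    {
        ReadWithUsing();
        ReadWithFinally();
        Console.WriteLine(TrackedResource.DisposeCount); // prints 2: both paths disposed
    }
}
```

Either style works; `using` is shorter and harder to get wrong, while `try`/`finally` is useful when extra cleanup has to happen alongside the `Dispose` call.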

Two recommended ways to handle unmanaged resources

    1. The using statement: the object is released as soon as execution leaves the using block.
    2. try{}catch{} with a finally block: do not forget the finally, and call the Dispose method explicitly inside it.

In fact, the two compile to the same result: the compiler turns a using statement into a try/finally.

The second big mistake: treating asynchrony as equivalent to multithreading

Async is not the same as multithreading

Use async for I/O-intensive work and multithreading for CPU-intensive work. In a crawler, fetching network resources is I/O intensive, while parsing HTML is CPU intensive. Because the crawler has to fetch a resource before it can parse it, I did not use async.
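The difference between I/O-intensive and CPU-intensive work discussed above can be sketched as follows. This is an illustration only, not the approach taken in this crawler: `FetchAsync` and `Parse` are hypothetical helpers, and the fetch is simulated with `Task.Delay` rather than a real HTTP request.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class WorkloadDemo
{
    // Stand-in for a network fetch: awaiting frees the thread while "waiting".
    static async Task<string> FetchAsync(int id)
    {
        await Task.Delay(50); // simulated I/O latency, not a real request
        return "<html>page " + id + "</html>";
    }

    // Stand-in for CPU-bound HTML parsing: counts opening angle brackets.
    static int Parse(string html) => html.Count(c => c == '<');

    static async Task Main()
    {
        // I/O-bound: start all fetches, then await them together.
        // No extra threads are needed while the requests are pending.
        string[] pages = await Task.WhenAll(
            Enumerable.Range(1, 4).Select(id => FetchAsync(id)));

        // CPU-bound: spread the parsing across worker threads.
        int[] tagCounts = pages.AsParallel().AsOrdered()
                               .Select(html => Parse(html))
                               .ToArray();

        Console.WriteLine(string.Join(",", tagCounts)); // prints 2,2,2,2
    }
}
```

The awaits overlap the waiting time without tying up threads, whereas the parallel query uses several threads to do the computation.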

The first pitfall I hit was HttpWebRequest's default connection limit: no matter how many threads I opened, the speed did not go up.

The following needs to be added to app.config:

<system.net>
    <connectionManagement>
        <add address="*" maxconnection="100000"></add>
    </connectionManagement>
</system.net>
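As an alternative sketch, the same limit can be raised in code through `ServicePointManager.DefaultConnectionLimit`, which needs to be set before the first request is created. On the .NET Framework, the default for client applications is 2 concurrent connections per host, which is why adding threads alone does not help. The value 100 below is only an illustrative assumption, not a figure from the original program.

```csharp
using System;
using System.Net;

static class ConnectionLimitDemo
{
    static void Main()
    {
        // Raise the per-host connection limit used by HttpWebRequest.
        // 100 is an illustrative value; tune it to what the target site tolerates.
        ServicePointManager.DefaultConnectionLimit = 100;

        Console.WriteLine(ServicePointManager.DefaultConnectionLimit); // prints 100
    }
}
```

Setting it in code keeps the limit next to the crawler logic, while the app.config entry lets it be changed without recompiling.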

Shortcomings:

Many of the tables in SQL Server are in one-to-many relationships; because the data was not properly normalized, these relationships lead to redundant data.

Data display

The charts are drawn with ECharts; for more about ECharts, please see my blog post: http://www.cnblogs.com/zuin/p/6122818.html

Male/female ratio

Distribution of follow counts

Top 10 schools by number of alumni

Top 10 companies by number of employees

Top 10 majors

Top 10 by number of likes
