Key Technologies for implementing website content picking on the. NET platform

Source: Internet
Author: User

These days, my boss handed me a task to download all the articles on a website and save them by category.

Although I have heard of such requirements before, I have never done such an application. It is not because you don't want to do it, but because you don't want to do it. First, I think there is no advanced technology. Secondly, I always think that it is useless to pick content from other websites. Excellent content is never copied from others. I think these are all my ignorance. In fact, when there is demand, there will be value. The technology is neither advanced nor advanced. If you do not understand the technology, you cannot understand it.

Well, let's get started.

I first used PHP and wrote web scripts, because my computer will configure the HTTP service and PHP environment, and PHP is basically my most familiar technical platform. I have never thought about a complete tool. I just want a computer to help me do some repetitive work. Even if I have done so, I didn't have a basic analysis design in the software development process. I thought about several things to do in the program, as follows: 1. Download the HTML file. 2. Extract the content of an article from the HTML code. 3. Download the image file in the document content.

First, download the HTML file. For PHP, the key lies in the function file_get_contents (). This function can read the content of the HTTP file through an HTTP address, just like entering an address in the browser, the browser will obtain the returned HTML text, which you can think of as a mini browser. In addition, the rest is to create a text file on the hard disk to save the HTML code. This can be done through the PHP file system function. This incident went smoothly.

Second, extract the content of the document from the downloaded HTML file. In addition to the content of the article, there will be a lot of additional content on the HTML page, such as the navigation menu, sidebar, advertisement, Javascript script, and so on of the website, I only need the title and body of the article. For PHP, the first thing I think of is to use regular expressions for matching, but soon I think this method is too primitive and too low-level. Today, when programming technology is developed, there must be a higher level of processing method. As a result, I flipped through the PHP manual. there is a chapter in the function library for XML processing. Check that there are a large number of functions and classes in it, and I have implemented the XML document as an object for processing, that is, the xml dom. Through these functions and classes, PHP can process HTML documents like JavaScript processing HTML elements in a browser. However, writing a PHP script makes me feel a little tired. I keep changing the script and press F5, and it is not fun to control the page Jump in the code, I suddenly felt that these things should be done by writing a desktop program. So I naturally thought of. net. Wow, I finally got into the question.

. NET is really not a platform with this error. I am not here to help Microsoft advertise how well it works. The first thing above can also be used. NET platform, the principle is the same, but the specific code is different, in. net is generally completed through several classes ,. net is a fully object-oriented programming platform, so everything is an object. to download an HTTP file, we mainly use system. in the. NET namespace, there are two classes: httpwebrequest and httpwebresponse. Httpwebrequest can wrap a URL address and get an httpwebresponse object through the getresponse method. This httpwebresponse object contains the data returned by the HTTP server. In addition, I/O operations on the file system will not be mentioned. After all, it is only about key technologies.

Now let's talk about the second thing in. net, that is, intercepting the content of the article from the downloaded HTML file. As mentioned above, regular expressions are a very low-level method. For simple operations, it is a good choice to use regular expressions, but if you want to find elements, modifying node content and attribute values makes it a disaster to use regular expressions. In. net, xml dom is also supported. These classes are placed in the system. xml namespace.

When using the. NET xmldocument class, one thing is worth mentioning. At the beginning, it was found that the loadxml method of the xmldocument object was very slow. The function of this method is to load an XML string as an xmldocument object. I found it takes about one minute to load an XML document. I think this efficiency is definitely not normal. So I went online to find information and found that it was originally a problem with doctype. I ran loadxml to the W3C website to download the DTD file, which was so strange. The solution is to download the DTD file to the local HTTP service, create the corresponding directory, and resolve the W3C domain name to the local machine in the host file of the Windows system, in this way, the speed becomes faster, and it is almost a flash. If you think it is difficult to get the local HTTP service, you can also directly put it in the file system, then batch modify the HTML doctype, and change the DTD file address to the local relative path.

Okay, the second thing is done in this way.

The third thing is to download the image in the article. This is actually the combination of the first and second. First, you need to use xml dom to read the src attribute of the IMG element and use this attribute value to download the image file. The files downloaded here are a little different from the files in the first case. Because image files are not text files, they must be read and written using binary stream objects, mainly binaryreader and binarywriter objects. During this operation, it is important to note that the Length attribute of the stream object cannot be used to obtain the byte length of the response stream. This will prompt "this stream does not support the search operation ", it may be because of the network, and. net implementation. So how can we get the response stream length? Use the contentlength attribute of the httpwebresponse object. After saving the downloaded image file to a local hard disk, you need to use xml dom to modify the src attribute of the corresponding IMG element to the local relative path so that the webpage can correctly display the local image.

So far, the related technologies have been basically introduced, and the technical points beyond the question are worth mentioning, that is, multithreading technology. A. Net cainiao like me has never written multithreading. Therefore, when writing a program that has been working for a long time, you will find that the window of the program is stuck when it is running. This is because the main thread has been processing data internally and does not respond to the graphic interface until the program finishes processing. In this case, it is difficult to update the window information to display the process processing plan during the process of data processing. The solution is to use multiple threads. When multithreading is used at the beginning, thread security issues will occur, and exceptions that cannot be called across threads will be received. If you want to spend more time completing thread synchronization knowledge, you can first put it. to remove the thread security check of the control in. net. windows. forms. control. the checkforillegalcrossthreadcils value is set to false.

Okay, basically. If you are interested, try it later. It is mainly used to take notes. please correct me if it is not easy to say.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.