I have already written four articles in this series. If you missed the previous articles, you can find them through the following links:
Knowledge management system data solution R&D diary 1: Scenario design and requirements list
Knowledge management system data solution R&D diary 2: Application series
Knowledge management system data solution R&D diary 3: Document solution
Knowledge management system data solution R&D diary 4: Segment data solution
This article focuses on the download, conversion, and import of webpage data from the Internet to the SQL Server database.
First, let's look at the four applications that Data Loader provides for webpage download, followed by Site rebuild.
Userid blog obtains all the articles of a registered user according to the user ID.
URL blog follows the provided URL and downloads the documents and knowledge found at that link.
Text blog finds the links in the provided HTML text and downloads them to the local computer.
The fourth application starts from the provided default homepage, finds the desired articles, and downloads them.
Site rebuild downloads webpages from the provided website, converts them to the database format, and imports them into the database.
Site rebuild provides a one-stop download, conversion, and import service, unlike the first four applications, which are only responsible for downloading.
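The link-finding step that Text blog performs can be sketched roughly as follows. This is only a minimal illustration, not the actual Data Loader code: the class name LinkExtractor, the method ExtractLinks, and the regular expression are assumptions made for this example, and a regular expression will not handle every kind of HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Data Loader: collects the http/https
// links found in anchor tags of an HTML string.
public static class LinkExtractor
{
    // Matches href="..." or href='...' inside an <a ...> tag.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links against the page address.
            if (!Uri.TryCreate(baseUri, m.Groups[1].Value, out absolute))
                continue;
            // Keep only web links; skip mailto:, javascript: and so on.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // Skip duplicates so that each page is downloaded once.
            if (!links.Contains(absolute.AbsoluteUri))
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }
}
```

Each returned address could then be handed to whatever download routine is in use, for example the CDO-based MHT saving discussed later in this article.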
Site rebuild does not search and crawl the whole web like a search engine crawler. What I need is what is useful: documentation, materials, and data that help my current study and work, rather than all the data. That is why there are only five tabs above for finding web resources; the sites behind them are widely recognized as good sources. Some important websites and webpages may be missing, and you are welcome to add them. In addition, these websites focus only on the IT programming industry, not on, say, the automotive industry. Recently I have been looking for knowledge about the auto parts industry, but I have no foundation in it and do not know where to start. As the saying goes, love the trade you work in. It really is like that: without professional knowledge and a foundation you are like a blind man, unable to tell useful materials from useless ones, and it is difficult to start from scratch.
The Internet is a rich source of knowledge. With the document solution from diary 3 of this series, all the document resources on the local computer can be converted into SQL Server database documents. Based on the knowledge points and applications described in this article, you can also capture a wealth of Internet data and build a large database system for knowledge management.
Downloading webpages is a common practice. When we encounter good materials, we either copy them into Word with Ctrl + C or save them directly as htm/MHT files. Both habits scatter the data into every corner of the hard disk, unless you have the patience to organize it into doc/docx files or use professional note-taking software such as Microsoft OneNote or Evernote. Otherwise, as time goes on, the hard disk fills up with files that you want to delete but cannot bring yourself to delete. This happens often; I say so from my own data collection experience.
The key code snippet for saving a webpage as a local MHT file is shown below.
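// Requires COM references to the CDO and ADODB libraries.
// str is the address of the webpage, filename is the target .mht path.
CDO.Message message = new CDO.MessageClass();
message.MimeFormatted = true;
ADODB.Stream stream = null;
try
{
    // Download the page with all its resources into a MIME message.
    message.CreateMHTMLBody(str, CDO.CdoMHTMLFlags.cdoSuppressNone, "", "");
    stream = message.GetStream();
}
// "exception 0xc0000005 was generated at address 0x0000000076fe1c30 \r\n"
catch { }
// Save only if the download succeeded; otherwise stream is still null.
if (stream != null)
{
    stream.SaveToFile(filename, ADODB.SaveOptionsEnum.adSaveCreateOverWrite);
    stream.Close();
}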
Search for the keyword CDO to find many other ways to save a webpage as a local MHT file.
MHT is the best format for downloaded webpages. It stores all images and text in a single file, so images and CSS do not have to be placed in a separate folder. MHT is also the standard format of e-mail: you can change the MHT extension to EML and open the file with Outlook Express or Live Mail as an e-mail message. The key issue is how to read files in the MHT format and convert them to the RTF format. CodeProject has many articles on this topic; please search for them on that website.
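Because an MHT file is a MIME message, its parts can be seen by looking at the Content-Type headers. The following is a rough sketch that lists them; the class name MhtInspector and the method ListContentTypes are hypothetical names used only for this example. It is a simple line scan, not a real MIME parser: it ignores folded headers and could also pick up a matching line inside an unencoded text part.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: lists the Content-Type values found in MHT text.
public static class MhtInspector
{
    private const string Header = "Content-Type:";

    public static List<string> ListContentTypes(string mhtText)
    {
        var types = new List<string>();
        using (var reader = new StringReader(mhtText))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!line.StartsWith(Header, StringComparison.OrdinalIgnoreCase))
                    continue;
                // Drop parameters such as boundary=... or charset=...
                string value = line.Substring(Header.Length);
                int semicolon = value.IndexOf(';');
                if (semicolon >= 0)
                    value = value.Substring(0, semicolon);
                types.Add(value.Trim());
            }
        }
        return types;
    }
}
```

For a saved page, calling MhtInspector.ListContentTypes(File.ReadAllText(path)) would typically show one multipart entry followed by the HTML part and one entry per embedded image or style sheet.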
The last step is reading and editing the webpage data stored in the database: find the desired file in the document browser and double-click it to open it in the read-only viewer.
If you are not satisfied with the content, you can use the editor to open the file and edit it.
A web-based document browsing system is still missing. You could use ASP.NET web development technology to browse the documents directly on the database server. Because the HTML tags of the web are very different from the RTF format, there is no web document system that supports online editing yet.
MHT files can be browsed directly in IE. However, the more content a file contains, the slower it opens, and the browser may even appear to hang. I originally planned to write an MhtReader to read and display MHT files, using the WebBrowser control to open them. The code is as follows:
using System;
using System.IO;
using System.Windows.Forms;

public partial class MhtReader : Form
{
    public MhtReader(string filename)
    {
        InitializeComponent();
        _filename = filename;
    }

    // Path of the MHT file to display.
    private string _filename;

    private void MhtReader_Load(object sender, EventArgs e)
    {
        if (!string.IsNullOrEmpty(_filename))
        {
            // Let the embedded WebBrowser control (IE) render the file.
            webBrowser1.Navigate(_filename);
            this.Text = Path.GetFileNameWithoutExtension(_filename);
        }
    }
}
I also gave up this plan because of poor efficiency. Opening MHT files with IE is really slow, and sometimes IE keeps asking whether script code should be allowed to run, which is unacceptable. Therefore, the MHT format is not converted to the RTF format in the background.
XPS documents are common on the system. In the printing options of Word or other word processing software, set the printer to Microsoft XPS Document Writer to convert the current document to an XPS document. This format is often seen in Microsoft documents.
When a webpage is saved and its images reference network resources, the images cannot be displayed without an Internet connection. This often happens when you come across good information: you press Ctrl + C to paste it into Word, save it as an htm/MHT file, copy it to a USB flash drive, and move to another computer, where the images do not appear because there is no network connection. My solution is to print the document to PDF with Adobe PDF, so that pictures and text are saved in the PDF file, or to print it to an XPS file with Microsoft XPS Document Writer. In both cases the images are saved into the resulting file.
Converting a PDF file to doc/docx should not be a problem. There are many PDF to doc/docx tools on the Internet, free and portable, too many to list. I recommend Adobe Acrobat Professional and Nitro PDF Professional. The Adobe SDK supports calling all Adobe Acrobat functions programmatically, including document format conversion and PDF data extraction. Nitro PDF provides excellent PDF editing functions and is also recommended. I have heard that the license fee of the Adobe SDK is very expensive. PDF Focus .NET has an excellent API dedicated to .NET, but the versions that can be found only convert the first few pages of a document or add a watermark. For this data solution I have to find another way; no breakthrough is possible with the products of these two companies.
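Before copying a saved page to another computer, it can help to check whether its images still point to the network. The sketch below lists such images; the class name RemoteImageFinder and the method FindRemoteImages are hypothetical, and the regular expression is an assumption that only covers ordinary img tags with quoted src attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: finds images in saved HTML that are still loaded
// from the network and will be missing when the page is opened offline.
public static class RemoteImageFinder
{
    // Matches src="..." or src='...' inside an <img ...> tag.
    private static readonly Regex ImgSrcPattern = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> FindRemoteImages(string html)
    {
        var remote = new List<string>();
        foreach (Match m in ImgSrcPattern.Matches(html))
        {
            string src = m.Groups[1].Value;
            // Only absolute http/https addresses need a connection;
            // relative paths point to files saved next to the page.
            if (src.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                src.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            {
                remote.Add(src);
            }
        }
        return remote;
    }
}
```

If the returned list is not empty, the page is a candidate for printing to PDF or XPS as described above, so that the images travel with the document.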
Living in Tianchao, the biggest benefit is that knowledge is priceless, in the sense that it has no price. You can get a lot of knowledge, technology, productivity, and tools without any cost, and if you make use of these advantages, the productivity you can create is also huge. Of course, you have to pay for it in another way. If you ask 30 thousand RMB for an ERP package, the customer will say that it is too expensive: "I can find a system on a pirated disc for less than five RMB." Others simply download one online, which only costs some electricity. Yes, God is fair.