Using HtmlAgilityPack to crawl website data and store it in MySQL
I plan to build a feature for querying the prices of medicinal herbs. Without any data to start from, collecting everything by hand would be very tedious, so the first step is to scrape the data from another website and store it.
HtmlAgilityPack is probably familiar to many C# developers. It is simple and powerful. I have used it for several small crawling tools before, but that was a long time ago, the source code is gone, and I had forgotten how to use it. This time I looked the information up again and worked through it slowly!
(The most troublesome part turned out to be storing the data in MySQL. I have always used MSSQL as the database for .NET, so I ran into a lot of problems when connecting to MySQL for the first time.)
1. Use HtmlAgilityPack
- Download the HtmlAgilityPack class library and reference it in your project.
I use a console project here.
Add the reference to the project.
Add the reference in code with a using directive: using HtmlAgilityPack;
2. Analyze Web pages
- Webpage: http://www.zyctd.com/jiage/1-0-0-0.html
First, look at how the URL changes from page to page. The pattern turns out to be simple:
The first page is 1-0-0 or 1-0-0-1.
Page 2 and the following pages continue in the same way.
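The URL pattern can be sketched as a small helper. This is only a minimal sketch: it assumes that the last number in the path is the page number (so page 2 would be 1-0-0-2.html), which the text above suggests but does not spell out. PageUrls, BuildPageUrl and BuildPageUrls are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PageUrls
{
    // Assumption: the final number in the path is the page index.
    private const string Template = "http://www.zyctd.com/jiage/1-0-0-{0}.html";

    // Builds the URL for a single page.
    public static string BuildPageUrl(int page)
    {
        if (page < 0) throw new ArgumentOutOfRangeException(nameof(page));
        return string.Format(Template, page);
    }

    // Builds the URLs for an inclusive range of pages.
    public static IEnumerable<string> BuildPageUrls(int start, int end)
    {
        for (int page = start; page <= end; page++)
            yield return BuildPageUrl(page);
    }
}
```

Calling PageUrls.BuildPageUrls(1, 10) would yield the ten URLs for the default page range used later in the program.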
- Then analyze the page source.
The data on this page is clearly stored in a ul tag that has a class name: <ul class="priceTableRows">.
Each li tag under the ul has the same HTML structure, so we can start writing code to capture it.
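The structure described above maps directly onto an XPath query. The following is a minimal sketch, assuming the rows really sit in li tags directly under the ul with class priceTableRows; PriceListParser and ExtractRows are hypothetical names, and the HTML passed in can be any fragment with that shape.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PriceListParser
{
    // Returns the inner HTML of every <li> under <ul class="priceTableRows">.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//ul[@class='priceTableRows']/li");

        var result = new List<string>();
        if (rows == null) return result;

        foreach (HtmlNode li in rows)
            result.Add(li.InnerHtml);
        return result;
    }
}
```

Returning an empty list when nothing matches keeps the caller free of null checks, which is a design choice of this sketch rather than something the library requires.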
3. Capture information
- First, create a class to hold the captured information. I saved the data directly to the database using the ADO.NET Entity Data Model, so there are two versions of this class.
- The following is the class generated by the ADO.NET Entity Data Model:
//------------------------------------------------------------------------------
// <auto-generated>
//     This code was generated from a template.
//
//     Manual changes to this file may cause unexpected behavior in your application.
//     Manual changes to this file will be overwritten if the code is regenerated.
// </auto-generated>
//------------------------------------------------------------------------------

namespace TestProject1
{
    using System;
    using System.Collections.Generic;

    public partial class C33hao_price
    {
        public long ID { get; set; }
        public string Name { get; set; }
        public string Guige { get; set; }
        public string Shichang { get; set; }
        public decimal Price { get; set; }
        public string Zoushi { get; set; }
        public decimal Zhouzd { get; set; }
        public decimal Yuezd { get; set; }
        public decimal Nianzd { get; set; }
        public int editDate { get; set; }
        public string other { get; set; }
    }
}
- The following is the class I wrote for testing local storage:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace TestProject1
{
    public class Product
    {
        /// <summary>
        /// Product name
        /// </summary>
        public string Name { get; set; }

        /// <summary>
        /// Specification
        /// </summary>
        public string Guige { get; set; }

        /// <summary>
        /// Market
        /// </summary>
        public string Shichang { get; set; }

        /// <summary>
        /// Latest price
        /// </summary>
        public string Price { get; set; }

        /// <summary>
        /// Trend
        /// </summary>
        public string Zoushi { get; set; }

        /// <summary>
        /// Weekly rise/fall
        /// </summary>
        public string Zhouzd { get; set; }

        /// <summary>
        /// Monthly rise/fall
        /// </summary>
        public string Yuezd { get; set; }

        /// <summary>
        /// Annual rise/fall
        /// </summary>
        public string Nianzt { get; set; }
    }
}
- The following is the main processing code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Converters;

namespace TestProject1
{
    public class Program
    {
        /// <summary>
        /// Items of the local test class
        /// </summary>
        static List<Product> ProductList = new List<Product>();

        /// <summary>
        /// Items of the class generated from the database
        /// </summary>
        static List<C33hao_price> PriceList = new List<C33hao_price>();

        public static void Main(string[] args)
        {
            int start = 1;  // start page
            int end = 10;   // end page
            Console.WriteLine("Enter the start and end pages, for example, 1-100. The default value is 1-10");
            string index = Console.ReadLine(); // page range entered by the user
            if (index != "")
            {
                // Split the page range into start and end
                string[] stt = index.Split('-');
                start = Int32.Parse(stt[0]);
                end = Int32.Parse(stt[1]);
            }

            // Crawl each page in turn
            for (int i = start; i <= end; i++)
            {
                string url = string.Format("http://www.zyctd.com/jiage/1-0-0-{0}.html", i);
                HtmlWeb web = new HtmlWeb();
                HtmlDocument doc = web.Load(url); // download and parse the page
                HtmlNode node = doc.DocumentNode;
                string xpathstring = "//ul[@class='priceTableRows']/li"; // XPath of the rows
                HtmlNodeCollection aa = node.SelectNodes(xpathstring); // all li tags on this page
                if (aa == null)
                {
                    Console.WriteLine("error: Current page is {0}", i.ToString());
                    continue;
                }
                foreach (var item in aa)
                {
                    // Process the HTML of each li tag and add it to the collection
                    string cc = item.InnerHtml;
                    test(cc);
                }
            }

            // Write JSON to the local disk
            //string path = "json/test.json";
            //using (StreamWriter sw = new StreamWriter(path))
            //{
            //    try
            //    {
            //        JsonSerializer serializer = new JsonSerializer();
            //        serializer.Converters.Add(new JavaScriptDateTimeConverter());
            //        serializer.NullValueHandling = NullValueHandling.Ignore;
            //        // Construct the Json.NET writer
            //        JsonWriter writer = new JsonTextWriter(sw);
            //        // Serialize the model data into the Json.NET JsonWriter stream
            //        serializer.Serialize(writer, ProductList);
            //        //ser.Serialize(writer, ht);
            //        writer.Close();
            //        sw.Close();
            //    }
            //    catch (Exception ex)
            //    {
            //        string error = ex.Message.ToString();
            //        Console.WriteLine(error);
            //    }
            //}

            int count = PriceList.Count(); // number of captured items
            Console.WriteLine("Get information {0}", count);
            Console.WriteLine("start adding to Database");
            Insert(); // insert into the database
            Console.WriteLine("data added");
            Console.ReadLine();
        }

        /// <summary>
        /// Process the information and add it to the collection
        /// </summary>
        /// <param name="str">HTML content of the li tag</param>
        static void test(string str)
        {
            //Product product = new Product();
            C33hao_price Price = new C33hao_price();
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(str);
            HtmlNode node = doc.DocumentNode;

            // Medicinal herb name
            string namepath = "//span[@class='w1']/a[1]"; // XPath of the name
            HtmlNodeCollection DomNode = node.SelectNodes(namepath); // select by XPath
            //product.Name = DomNode[0].InnerText;
            Price.Name = DomNode[0].InnerText; // store the content in the object

            // Specification
            string GuigePath = "//span[@class='w2']/a[1]";
            DomNode = node.SelectNodes(GuigePath);
            Price.Guige = DomNode[0].InnerText;

            // Market name
            string adsPath = "//span[@class='w9']";
            DomNode = node.SelectNodes(adsPath);
            Price.Shichang = DomNode[0].InnerText;

            // Latest price
            string pricePath = "//span[@class='w3']";
            DomNode = node.SelectNodes(pricePath);
            Price.Price = decimal.Parse(DomNode[0].InnerText);

            // Trend
            string zoushiPath = "//span[@class='w4']";
            DomNode = node.SelectNodes(zoushiPath);
            Price.Zoushi = DomNode[0].InnerText;

            // Weekly rise/fall
            string zhouzdPath = "//span[@class='w5']/em[1]";
            DomNode = node.SelectNodes(zhouzdPath);
            Price.Zhouzd = decimal.Parse(GetZD(DomNode[0].InnerText));

            // Monthly rise/fall
            string yuezdPath = "//span[@class='w6']/em[1]";
            DomNode = node.SelectNodes(yuezdPath);
            Price.Yuezd = decimal.Parse(GetZD(DomNode[0].InnerText));

            // Annual rise/fall
            string nianzdPath = "//span[@class='w7']/em[1]";
            DomNode = node.SelectNodes(nianzdPath);
            Price.Nianzd = decimal.Parse(GetZD(DomNode[0].InnerText));

            // Capture time, stored as a Unix timestamp so that PHP can use it
            Price.editDate = Int32.Parse(GetTimeStamp());

            //ProductList.Add(product);
            PriceList.Add(Price); // add to the collection
        }

        // Query
        static void Query()
        {
            var context = new mallyobo360Entities();
            var member = from e in context.C33hao_member select e;
            foreach (var u in member)
            {
                Console.WriteLine(u.member_name);
                Console.WriteLine(u.member_mobile);
            }
            Console.ReadLine();
        }

        // Insert
        static void Insert()
        {
            var context = new mallyobo360Entities();
            int i = 0;
            foreach (C33hao_price item in PriceList)
            {
                context.C33hao_price.Add(item);
                context.SaveChanges();
                i++;
                Console.WriteLine("{0}/{1}", i, PriceList.Count);
            }
        }

        /// <summary>
        /// Get the current Unix timestamp in seconds
        /// </summary>
        /// <returns></returns>
        public static string GetTimeStamp()
        {
            TimeSpan ts = DateTime.UtcNow - new DateTime(1970, 1, 1, 0, 0, 0);
            return Convert.ToInt64(ts.TotalSeconds).ToString();
        }

        /// <summary>
        /// Remove the trailing percent sign from the string
        /// </summary>
        /// <param name="str">String to process</param>
        /// <returns></returns>
        public static string GetZD(string str)
        {
            string st = str.Substring(0, str.Length - 1);
            return st;
        }
    }
}
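GetZD in the listing above simply cuts off the last character and the result is passed to decimal.Parse, so an empty cell or a value without a trailing percent sign would throw or lose a digit. One possible alternative is sketched below. It is not part of the original program: PercentParser and TryParsePercent are hypothetical names, and the sketch assumes the values use a dot as the decimal separator.

```csharp
using System.Globalization;

public static class PercentParser
{
    // Tries to parse text such as "1.25%" or "-0.5" into a decimal.
    // Returns false instead of throwing when the text is empty or not numeric.
    public static bool TryParsePercent(string text, out decimal value)
    {
        value = 0m;
        if (string.IsNullOrWhiteSpace(text)) return false;

        // Strip surrounding whitespace and an optional trailing percent sign.
        string cleaned = text.Trim().TrimEnd('%').Trim();

        // Assumption: a dot is the decimal separator, so parse culture-independently.
        return decimal.TryParse(cleaned, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out value);
    }
}
```

A caller could then skip or log a row when TryParsePercent returns false instead of letting the whole crawl stop on one malformed cell.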
- The main program above stores the data in the database. The following describes how to save it locally instead.
4. Store it locally
To store the data locally, change the Price object in the test method to the Product type and add it to the ProductList collection. Then uncomment the commented-out block in Main that writes the JSON to the local disk.
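For comparison, the same local save can be written more compactly with JsonConvert. This is a minimal sketch rather than the code used in this post: JsonStore and Save are hypothetical names, and creating the target folder first is an added assumption, since writing to a path such as json/test.json fails when the json folder does not exist.

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

public static class JsonStore
{
    // Serializes the items to indented JSON and writes them to the given path.
    public static void Save<T>(IEnumerable<T> items, string path)
    {
        // Create the target folder if the path contains one.
        string dir = Path.GetDirectoryName(path);
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);

        var settings = new JsonSerializerSettings
        {
            NullValueHandling = NullValueHandling.Ignore // skip null properties
        };
        string json = JsonConvert.SerializeObject(items, Formatting.Indented, settings);
        File.WriteAllText(path, json);
    }
}
```

With this helper, the local variant would end with a call such as JsonStore.Save(ProductList, "json/test.json").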
To be continued...