Parsing HTML data with regular expressions and storing the extracted values in a database

Source: Internet
Author: User

For example, suppose we want the contents of the <dt></dt> and <dd></dd> elements inside each <dl>. Because the <dt></dt> element contains an <a> tag, both cases are explained together. This first version is not optimized; it is meant for learning.

<!DOCTYPE html>

<title></title>

<body>

<dl class="Hello">

<dt><a href="sdf">Xiao Ha No. 1's diary, 2017-2-26</a></dt>
<dd>Record Person: Xiao Ha No. 1</dd>
<dd>Weather: Sunny</dd>
<dd>Mood: <a href="FDG">It is sunny outside today but still very cold, and I am in a good mood!</a></dd>
</dl>

<dl class="Hello">

<dt><a href="sdf">Xiao Ha No. 2's diary, 2017-2-26</a></dt>
<dd>Record Person: Xiao Ha No. 2</dd>
<dd>Weather: Sunny</dd>
<dd>Mood: <a href="FDG">It is sunny and I am in a very good mood!</a></dd>
</dl>

</body>

Step 1: Store the HTML above as a string in the database (this is optional; you can also analyze the whole file after it has been uploaded)

string json = "the HTML above";

Step 2: Use a regular expression to get the contents of each <dl class="Hello"> element

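// Needs "using System.Collections.Generic;" and "using System.Text.RegularExpressions;" at the top of the file
MatchCollection medl = Regex.Matches(json, @"<dl\s+class\s*=\s*""Hello""\s*>([\s\S]*?)</dl>"); // json is the string to parse

List<string> mclist = new List<string>(); // holds the "key:value" entries of the entity currently being processed
// Loop over each <dl>
for (int i = 0; i < medl.Count; i++)
{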

  Step 3: Get the contents of the <a> tags under <dl>
// Get the text of every <a> tag in this <dl>: the first sits inside <dt>, the second inside the "Mood" <dd>
MatchCollection dedt = Regex.Matches(medl[i].Value, @"(?<=<a\b[^>]*>)[^<]*(?=</a>)");

List<string> titlelist = new List<string>();
// The <a> inside <dt> holds two values, the journal title and the date, separated by a comma
foreach (var item in dedt[0].Value.Split(','))
{
    titlelist.Add(item.Trim());
}
for (int b = 0; b < titlelist.Count; b++)
{
    if (b == 0)
    {
        mclist.Insert(0, "Journal title:" + titlelist[b]);
        mclist.Insert(1, "Mood:" + dedt[1].Value.Trim()); // text of the second <a> tag
    }
    else if (b == 1)
    {
        mclist.Insert(2, "Date:" + titlelist[b]);
    }
}

  Step 4: Get the contents of the <dd> tags under <dl>
// Get the <dd> tags in this <dl> that contain plain text only
MatchCollection mcdd = Regex.Matches(medl[i].Value, @"(?<=<dd>)([^<]*)(?=</dd>)");
// Loop over the <dd> tags
for (int j = 0; j < mcdd.Count; j++)
{
    // Store each <dd> value in mclist. If a value contains unwanted escape characters,
    // strip them with Value.Replace("text to replace", "replacement").
    mclist.Add(mcdd[j].Value.Trim());
}
Hellobll.Add(GetModels(mclist)); // save the data collected in mclist to the database
// Clear mclist once its data has been saved, so the next <dl> does not repeat it
mclist.Clear();

}

The GetModels method maps the collected entries to the fields of the entity. Each value is looked up by its key in the GetValueByKey method.

private Model.Hello GetModels(List<string> data)
{
    Model.Hello model = new Model.Hello();
    model.Title = this.GetValueByKey(data, "Journal title"); // journal title
    model.Date = this.GetValueByKey(data, "Date"); // date
    model.Name = this.GetValueByKey(data, "Record Person"); // record person
    model.Weather = this.GetValueByKey(data, "Weather"); // weather
    model.Mood = this.GetValueByKey(data, "Mood"); // mood
    return model;
}


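// Needs "using System;" for StringComparison
private string GetValueByKey(List<string> data, string key)
{
    // Find the first entry that starts with the key, ignoring case
    string result = data.Find(x => x.StartsWith(key, StringComparison.OrdinalIgnoreCase));
    if (!string.IsNullOrEmpty(result))
    {
        result = result.Substring(key.Length); // drop the key itself
        result = result.Replace(":", "");      // drop the separator
        result = result.Trim();
    }
    return result; // null when no entry starts with the key
}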

Entity class

public class Diary
{
    public int id { get; set; }

    public string Title { get; set; }

    public string Name { get; set; }

    public string Date { get; set; } // the date is stored as a string here

    public string Weather { get; set; }

    public string Mood { get; set; }
}

The regular expressions can still be optimized. The Mood value is not obtained through the <dd> tag but through the <a> tag, because the <dd> pattern uses [^<]* and therefore skips any <dd> that contains a nested tag.
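One possible direction for that optimization, shown only as a minimal sketch: capture the full inner content of every <dd>, nested tags included, and strip the tags afterwards, so that Mood can be read from its <dd> like the other fields. The names DdExtractor and ExtractDdValues are hypothetical and do not appear in the original code, and the sketch assumes the <dd> elements are not nested inside each other.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original article's code.
public static class DdExtractor
{
    // Lazily matches the inner content of each <dd>, including nested tags.
    private static readonly Regex DdPattern =
        new Regex(@"<dd\b[^>]*>([\s\S]*?)</dd>", RegexOptions.IgnoreCase);

    // Matches any single tag so it can be removed from the captured content.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Returns the text of every <dd> in the given <dl> block, with tags removed
    // and runs of whitespace collapsed to a single space.
    public static List<string> ExtractDdValues(string dlHtml)
    {
        var values = new List<string>();
        foreach (Match m in DdPattern.Matches(dlHtml))
        {
            string text = TagPattern.Replace(m.Groups[1].Value, "");
            values.Add(Regex.Replace(text, @"\s+", " ").Trim());
        }
        return values;
    }
}
```

With this approach, a <dd> such as <dd>Mood: <a href="FDG">Very good!</a></dd> yields the entry "Mood: Very good!", so the separate lookup of the second <a> tag would no longer be needed. For irregular or deeply nested markup, a real HTML parser is usually a safer choice than regular expressions.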
