Analysis of the core techniques behind an ASP.NET video collection site (plus a cheap trick for dealing with search engine spiders)

Source: Internet
Author: User

Many webmasters got their start with a "garbage station". What is a "garbage station"? To put it bluntly, it means collecting other people's data into your own database, then aggregating, sorting, classifying, or lightly reworking it, wrapping a program around it, and calling the result your own website. Most "garbage stations" are article sites, because articles are easy to collect and come in large volumes, which is good for search engines. In the past two or three years some people have started building video collection sites, and some very well-developed video collection systems have appeared (such as Marx CMS, which is quite professional); they have made real money for some sites. However, as more and more people use Marx, the later you start such a site, the harder it is to get off the ground. The solution is actually simple: before video collection stations flood a niche, write your own collector instead of using a general-purpose system, do a little SEO, and the search engines will take care of you. O(∩_∩)O

The following uses Tudou (tudou.com) as an example to describe how to collect videos in the simplest, crudest way.

Success case: http://www.kangxiyoulaile.com/ ("Kangxi Is Here Again")

Since the advent of YouTube, a video collection site no longer needs to collect the video files at all; it only needs to collect the Flash player parameters.

For example, for the video below we only need to collect its parameter "K1hf2uocE1Y". Of course, to look more professional we also collect the video's related information, such as the video name, duration, view count, user comments, and content description, ^_^, and store all of it in our own database!

Since it is a garbage station, it must have its own categories. Let's start from there, using Tudou's search function!

Search "Kangxi" + date, you can get a certain date in the "Kangxi" program, such as "Kangxi to 20090720", we came to the http://so.tudou.com/isearch.do? Kw = % BF % B5 % CE % F5 % C0 % B4 % C1 % CB20090720

See the idea? If we have a program regularly open http://so.tudou.com/isearch.do?kw=%BF%B5%CE%F5%C0%B4%C1%CB + 'current date', we get fully automatic collection.
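As a minimal sketch of that idea (the isearch.do?kw= format above and the GBK encoding are taken as given; the helper name is mine), building the daily search URL looks like this:

// Sketch: build the Tudou search URL for "康熙来了" + a given date.
// Requires: using System; using System.Text; using System.Web;
public static class SearchUrlBuilder
{
    public static string BuildDailyUrl(string keyword, DateTime date)
    {
        // URL-encode the keyword in GBK so it matches %BF%B5%CE%F5%C0%B4%C1%CB above
        string kw = HttpUtility.UrlEncode(keyword, Encoding.GetEncoding("GBK"));
        return "http://so.tudou.com/isearch.do?kw=" + kw + date.ToString("yyyyMMdd");
    }
}

// Usage: string url = SearchUrlBuilder.BuildDailyUrl("康熙来了", DateTime.Today);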

 

How do we use ASP.NET to fetch a page's HTML? That is hardly a technical problem, so here is the code directly.

/// <Summary> /// obtain the webpage content /// </summary> /// <param name = "url"> </param> /// <returns> </returns> public static string GetHtml (string url) {string result = ""; try {WebRequest request = WebRequest. create (url); WebResponse response = request. getResponse (); StreamReader reader = new StreamReader (response. getResponseStream (), Encoding. getEncoding ("GBK"); result = reader. readToEnd ();} catch {result = "";} return result ;}

Next we analyze the HTML. In this day and age everyone builds pages with div + css, which makes collection easy. Think about it: when Tudou lays out its interface with div + css, it is bound to give each "program" its own css class, right? Exactly! After analyzing the page source, we find that in the program list every program uses the css class "pack pack_video_card".

So what do we do? Treat the entire source file as one string, then use "<div class=\"pack pack_video_card\">" as the separator to cut it into a string array. Apart from the first element, which is not a video, every other element contains the information we need for one video!

The code is as follows:

string[] list = html.Split(new string[] { "<div class=\"pack pack_video_card\">" }, StringSplitOptions.RemoveEmptyEntries);

Add a few simple checks and extract the information in each string segment into a Video object.

For example, to collect video thumbnails:

foreach (string s in list)
{
    // pull the thumbnail URL out of the <img src="..."> in this segment
    begin = s.IndexOf("src") + 5;
    end = s.IndexOf("</a>") - 4;
    v.ImgUrl = s.Substring(begin, end - begin + 1);

    ............
}
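The Video class itself is never shown; a minimal sketch of what it probably needs to hold (the property names here are my assumptions) would be:

// Hypothetical container for one collected video; the original only implies
// its shape (ImgUrl above, List<Video> in the function below).
public class Video
{
    public string Code { get; set; }        // Flash player parameter, e.g. "K1hf2uocE1Y"
    public string Title { get; set; }       // video name
    public string ImgUrl { get; set; }      // thumbnail URL
    public string Duration { get; set; }    // video length
    public int PlayCount { get; set; }      // viewing times
    public string Description { get; set; } // content description
}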

With this foundation we can encapsulate the steps into a few functions for quick collection. For example:

/// <Summary> /// obtain all the video objects between two date segments /// </summary> /// <param name = "beginDate"> </param> /// <param name = "endDate"> </param> /// <param name = "everydayMax"> maximum number of videos per day </param> /// <returns> </returns> public static List <Video> GetVideoByDate (DateTime beginDate, dateTime endDate, int everydayMax) {ByDateVideoList = new List <Video> (); DateTime dt = beginDate; while (dt <= endDate) {ByDateVideoList. addRange (GetTopVideo (GetTudouString (dt. toString ("yyyyMMdd"), everydayMax); dt = dt. addDays (1);} return ByDateVideoList ;}

There is one more small detail. Tudou uses GBK encoding; if we also serve GBK, search engines will see too much duplicated data, so we must change the encoding. Suppose our site uses UTF-8: how do we convert the collected GBK data to UTF-8 for display? See the following function:

public static string ConvertEncoding(Encoding oldEncoding, Encoding newEncoding, string oldString)
{
    byte[] oldBytes = oldEncoding.GetBytes(oldString);
    byte[] newBytes = Encoding.Convert(oldEncoding, newEncoding, oldBytes);
    char[] newChars = new char[newEncoding.GetCharCount(newBytes, 0, newBytes.Length)];
    newEncoding.GetChars(newBytes, 0, newBytes.Length, newChars, 0);
    string newString = new string(newChars);
    return newString;
}
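A one-line usage sketch, assuming gbkTitle is a title string collected from a GBK page:

// Convert a collected GBK string for display on our UTF-8 pages
string utf8Title = ConvertEncoding(Encoding.GetEncoding("GBK"), Encoding.UTF8, gbkTitle);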
Finally, something very important: do a simple URL rewrite to make life easier for search engines. Under Google's PageRank rules, addresses that are closer to the root directory, shorter, and carry fewer GET parameters are more likely to be indexed and ranked well.
We can do this by writing the following in Global.asax.cs.
        protected void Application_BeginRequest(object sender, EventArgs e)
        {
            robot();

            // e.g. "3IPFQqeKtKc.aspx": an 11-character video code plus ".aspx" is 16 characters
            string Id = Request.Path.Substring(Request.Path.LastIndexOf('/') + 1);
            if (Id.Length == 16)
            {
                Server.Transfer("~/V.aspx?Id=" + Id.Substring(0, 11));
            }
        }
So the page originally at http://www.kangxiyoulaile.com/v.aspx?Id=3IPFQqeKtKc can now be reached at http://www.kangxiyoulaile.com/3IPFQqeKtKc.aspx. Change every parameterized internal link on the site to the latter form, and the search engine only ever sees the clean addresses.
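As a tiny sketch (the helper name is mine), internal links can be generated directly in the rewritten form so the parameterized URL never shows up in the markup:

// Hypothetical helper: turn a video code into the rewritten URL
// that Application_BeginRequest above translates back to V.aspx?Id=...
public static string GetVideoLink(string code)
{
    // "3IPFQqeKtKc" -> "/3IPFQqeKtKc.aspx"
    return "/" + code + ".aspx";
}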
We can also do some optimization aimed specifically at search engines. For example, the following code recognizes a search engine spider; once we know the visitor is a spider, we can make some changes to the page data. Hey... the trick is rather cheap, so the details are left out; modify the code below yourself.

/// <Summary>
/// Determine whether a spider has been there
/// </Summary>
/// <Returns> </returns>
Protected bool robot ()
{
Bool brtn = false;
String king_robots = "mailto: Baiduspider + @ Baidu % 7CGooglebot @ Google % 7Cia_archiver @ Alexa % login @ Alexa % 7CASPSeek @ ASPSeek % login @ Yahoo % 7Csohu-search @ Sohu % login ";
String ls_spr;

Ls_spr = Request. ServerVariables ["http_user_agent"]. ToString ();
Char [] delimiterChars = {'| '};
Char [] x = {'@'};
String [] I1 = king_robots.Split (delimiterChars );

For (int I = 0; I <I1.Length; I ++)
{
String [] spider = I1 [I]. Split (x );
If (ls_spr.IndexOf (spider [0]. ToString ()>-1)
{
Brtn = true;
Logrobots (spider [1]. ToString () + "|" + Request. Path + "| ");
Break;
}
}
Return brtn;
}
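Logrobots is never shown in the original; under the assumption that it simply appends one line per spider visit to a text log, a minimal sketch could be:

// Hypothetical logging helper assumed by robot() above:
// appends the spider name, requested path and time to a plain text log.
protected void Logrobots(string info)
{
    string logPath = Server.MapPath("~/App_Data/robots.log"); // assumed log location
    System.IO.File.AppendAllText(logPath, info + DateTime.Now.ToString() + Environment.NewLine);
}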

Okay! All the key techniques have been analyzed; we'll stop here, and the rest is up to you! O(∩_∩)O