A simple web crawler in Java: downloading Silverlight videos

Source: Internet
Author: User
Tags website performance visual studio 2010

Recently, while browsing CSDN, I noticed a WP section on the homepage. Clicking through, I found all sorts of development material for Microsoft's Windows Phone 7. On a whim I opened the "four-day video tutorial for Windows Phone 7 Development" link and clicked a video, only to find that Silverlight must be installed to watch it. I decided to download the videos instead, but they are played through the Silverlight framework and offer no download link. Still, we are programmers, and that should not stop us. Right-click the page, view its source, and you will see:

Initparams: "SVP _ {0} = profiles, videouri = http://download.microsoft.com/download/7/1/0/710733A5-5BE6-436E-AC7D-A265170CBB4B/Series_Introduction_Day_1_Part_1_subtitle.wmv,ThumbUri=http://msdn.microsoft.com/zh-cn/windowsphone/hh395103.1_1b profiles, Title = series introduction, hidefullbrowser = true, embedenabled = true, embedcsswidth = 400, embedcssheight = 320, embedxapuri = http://msdn.microsoft.com/objectforward/default.aspx,Author=,Brand=,Locale=,StartMode=AutoLoad,Persistence=None,MSNVideoUUID=,PTID=,HeaderColor=#06a4de,HighlightColor=#06a4de,MoreLinkColor=#0066dd,LinkColor=#0066dd,LoadingColor=#06a4de,GetUri=http://msdn.microsoft.com/areas/sto/services/labrador.asmx,FontsToLoad=http://i3.msdn.microsoft.com/areas/sto/content/silverlight/Microsoft.Mtps.Silverlight.Fonts.SegoeUI.xap;segoeui.ttf
Look at the videouri = value near the start: it is the address of the video file. However, the site hosts 70 or 80 videos. Are we really going to open each page, view its source, and copy every URL ending in .wmv into Thunder (Xunlei) to download it? Of course not. We are programmers, after all. You probably know how a search engine gathers its information: with a web crawler, a program or script that automatically fetches pages from the World Wide Web according to certain rules. Plenty of powerful open-source crawlers already exist, but for the sake of learning we will write a simple one ourselves. We take this address as the start node: http://msdn.microsoft.com/zh-cn/windowsphone/hh182984 (the start URL used in the code below). Its page source contains links such as:
<a href="http://msdn.microsoft.com/windowsphone/hh768215">(1) From xNa to slxna</a>
<a href="http://msdn.microsoft.com/windowsphone/hh768217">(2) add FAS to xNa</a>
<a href="http://msdn.microsoft.com/windowsphone/hh768227">(3) add FAS to a typical Silverlight application</a>
<a href="http://msdn.microsoft.com/windowsphone/hh768228">(4) add Times Square</a>
<a href="http://msdn.microsoft.com/windowsphone/hh768230">(6) use real-time Camera Raw Data</a>
<a href="http://msdn.microsoft.com/windowsphone/hh768231">(7) use push notifications for Times Square and depth toast</a>
<a href="http://msdn.microsoft.com/windowsphone/hh968968">(8) use code to prioritize local databases</a>

The addresses above are the ones we need, but how do we collect them all? First we download the full content of the start page. For simplicity, the fetched page content is kept in memory rather than written to a file.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

// Downloads the page at starturl and returns its content as one string.
// Example: starturl = "http://msdn.microsoft.com/zh-cn/windowsphone/hh182984"
public static String getNetPage(String starturl) {
    StringBuilder page = new StringBuilder("");
    try {
        URL url = new URL(starturl);
        URLConnection connection = url.openConnection();
        // try-with-resources closes the stream and readers even if reading fails
        try (InputStream in = connection.getInputStream();
             InputStreamReader reader = new InputStreamReader(in, "UTF-8");
             BufferedReader bufferedReader = new BufferedReader(reader)) {
            String line = null;
            while ((line = bufferedReader.readLine()) != null) {
                page.append(line).append("\n");
            }
        }
        System.out.println(page);
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return page.toString();
}

Now that we have the page content, the next step is to extract the addresses we want, using a regular expression. Here is the code:

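// Needs java.util.ArrayList, java.util.regex.Matcher and java.util.regex.Pattern.
// Collects every address of the form .../windowsphone/hhNNNNNN found in the page.
public static ArrayList<String> getUrlFromPage(String page) {
    ArrayList<String> list = new ArrayList<String>();
    String regex = "http://msdn.microsoft.com/windowsphone/hh[0-9]{6}";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(page);
    while (matcher.find()) {
        String url = matcher.group();
        System.out.println(url);
        list.add(url);
    }
    return list;
}
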
Careful readers may notice that this also captures many addresses on the page that are not video links, for example:
<a href="http://msdn.microsoft.com/windowsphone/hh395103"> </a>
That is certainly not what we want. Each real video link, however, is followed by the video's title, so we can capture the title together with the address and use it to filter the matches:

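// Needs java.util.ArrayList, java.util.HashMap, java.util.regex.Matcher and java.util.regex.Pattern.
// Returns one map per video link: "url" is the page address, "index" is the numbered title text.
public static ArrayList<HashMap<String, String>> getUrlFromPage(String page) {
    ArrayList<HashMap<String, String>> list = new ArrayList<HashMap<String, String>>();
    // [^x00-xff] is reproduced as published; it was probably meant as [^\\x00-\\xff],
    // i.e. a non-Latin-1 character such as a full-width parenthesis around the index.
    String regex = "http://msdn.microsoft.com/windowsphone/hh[0-9]{6}\">[^x00-xff][0-9]+[^x00-xff][^6]*</a>";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(page);
    while (matcher.find()) {
        String url = matcher.group();
        System.out.println(url);
        HashMap<String, String> map = new HashMap<String, String>();
        // Split the match into the address (before "\">") and the title (after it).
        String[] s = url.split("\">");
        map.put("index", s[1]); // title text, still ending in </a>
        map.put("url", s[0]);
        list.add(map);
    }
    return list;
}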
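
Splitting the match on "> works, but it leaves the closing </a> inside the title. As a rough alternative sketch (not the original author's code), capture groups can return the address, the index, and the title separately. The class name LinkExtractor, the map keys, and the sample HTML are assumptions made for illustration, and the pattern assumes the index is wrapped in either ASCII or full-width parentheses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Group 1: page address, group 2: index inside the parentheses, group 3: title text.
    // Both "(" and the full-width U+FF08 are accepted (an assumption about the page markup).
    private static final Pattern LINK = Pattern.compile(
        "(http://msdn\\.microsoft\\.com/windowsphone/hh[0-9]{6})\">"
        + "\\s*[(\\uFF08]([0-9]+)[)\\uFF09]\\s*([^<]*?)\\s*</a>");

    public static List<Map<String, String>> extract(String page) {
        List<Map<String, String>> list = new ArrayList<>();
        Matcher m = LINK.matcher(page);
        while (m.find()) {
            Map<String, String> map = new LinkedHashMap<>();
            map.put("url", m.group(1));
            map.put("index", m.group(2));
            map.put("title", m.group(3));
            list.add(map);
        }
        return list;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one numbered video link and one empty link that must be skipped.
        String sample =
            "<a href=\"http://msdn.microsoft.com/windowsphone/hh768215\">(1) From XNA to SLXNA</a>"
            + "<a href=\"http://msdn.microsoft.com/windowsphone/hh395103\"> </a>";
        System.out.println(extract(sample));
    }
}
```

Because the title group is [^<]*?, a match can never run past the end of its own link, and the empty link is skipped because it has no numbered title.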

The following is the printed result.

http://msdn.microsoft.com/windowsphone/hh768215">(1) From xNa to slxna</a>
http://msdn.microsoft.com/windowsphone/hh768217">(2) Add FAS to xNa</a>
http://msdn.microsoft.com/windowsphone/hh768227">(3) add FAS to a typical Silverlight application</a>
http://msdn.microsoft.com/windowsphone/hh768228">(4) Add a square</a>
http://msdn.microsoft.com/windowsphone/hh768229">(5) add background agents</a>
http://msdn.microsoft.com/windowsphone/hh768230">(6) use real-time Camera Raw Data</a>
http://msdn.microsoft.com/windowsphone/hh768231">(7) use push notifications for blocks and deep toast</a>
http://msdn.microsoft.com/windowsphone/hh968968">(8) use code to prioritize local databases</a>
http://msdn.microsoft.com/windowsphone/hh968969">(9) background audio</a>
http://msdn.microsoft.com/windowsphone/hh395103">(1) series introduction</a>
http://msdn.microsoft.com/windowsphone/hh398252">(2) install Visual Studio 2010 express for Windows Phone</a>
http://msdn.microsoft.com/windowsphone/hh398439">(3) compile your first Windows Phone 7 Application</a>
http://msdn.microsoft.com/windowsphone/hh417688">(4) windows Phone 7 simulator overview</a>
http://msdn.microsoft.com/windowsphone/hh417718">(5) Explain your first application</a>
http://msdn.microsoft.com/windowsphone/hh417729">(6) manage project files and understand compilation and deployment</a>
http://msdn.microsoft.com/windowsphone/hh417730">(7) Visual Studio 2010 express for Windows Phone ide overview</a>
http://msdn.microsoft.com/windowsphone/hh417732">(8) use the project</a>
http://msdn.microsoft.com/windowsphone/hh417735">(9) Declare variables and assignments</a>
http://msdn.microsoft.com/windowsphone/hh417890">(10) accept input and assign values from the text box</a>
http://msdn.microsoft.com/windowsphone/hh417892">(11) If judgment statement</a>
http://msdn.microsoft.com/windowsphone/hh417893">(12) operator, expression, and statement</a>
http://msdn.microsoft.com/windowsphone/hh417894">(13) Switch judgment statement</a>
http://msdn.microsoft.com/windowsphone/hh417895">(14) for iteration statement</a>
http://msdn.microsoft.com/windowsphone/hh417896">(15) Create and call a simple helper method</a>
http://msdn.microsoft.com/windowsphone/hh417897">(16) homework</a>
http://msdn.microsoft.com/windowsphone/hh417898">(17) homework solution</a>
http://msdn.microsoft.com/windowsphone/hh417899">(1) Processing strings</a>
http://msdn.microsoft.com/windowsphone/hh417901">(2) use datetime</a>
http://msdn.microsoft.com/windowsphone/hh417902">(3) understand and create classes</a>
http://msdn.microsoft.com/windowsphone/hh417903">(4) use classes in .NET Framework class libraries</a>
http://msdn.microsoft.com/windowsphone/hh417904">(5) Understanding namespaces</a>
http://msdn.microsoft.com/windowsphone/hh417905">(6) using collections</a>
http://msdn.microsoft.com/windowsphone/hh417906">(7) object and set initializeditem</a>
http://msdn.microsoft.com/windowsphone/hh417908">(8) work in The XAML designer and code window</a>
http://msdn.microsoft.com/windowsphone/hh417909">(9) understand XAML syntax</a>
http://msdn.microsoft.com/windowsphone/hh417910">(10) Silverlight layout controls</a>
http://msdn.microsoft.com/windowsphone/hh417911">(11) process Silverlight events</a>
http://msdn.microsoft.com/windowsphone/hh417912">(12) silverlight input controls</a>
http://msdn.microsoft.com/windowsphone/hh417913">(13) homework</a>
http://msdn.microsoft.com/windowsphone/hh417914">(14) homework solution-Part 1</a>
http://msdn.microsoft.com/windowsphone/hh417916">(15) homework solution-Part 2</a>
http://msdn.microsoft.com/windowsphone/hh417917">(1) use Image controls</a>
http://msdn.microsoft.com/windowsphone/hh417918">(2) Process resources and styles</a>
http://msdn.microsoft.com/windowsphone/hh417923">(3) browse and pass data between XAML pages</a>
http://msdn.microsoft.com/windowsphone/hh417924">(4) use the application bar</a>
http://msdn.microsoft.com/windowsphone/hh417925">(5) using canvas as a dialog box</a>
http://msdn.microsoft.com/windowsphone/hh417926">(6) Understanding standalone storage</a>
http://msdn.microsoft.com/windowsphone/hh417927">(7) standalone storage, ListBox, and data templates</a>
http://msdn.microsoft.com/windowsphone/hh417928">(8) logical deletion and task switching</a>
http://msdn.microsoft.com/windowsphone/hh418008">(9) add different input value ranges</a>
http://msdn.microsoft.com/windowsphone/hh418009">(10) GPS, location API, and call Web Services</a>
http://msdn.microsoft.com/windowsphone/hh418010">(11) image background, direction changes, and control visibility</a>
http://msdn.microsoft.com/windowsphone/hh418011">(12) homework</a>
http://msdn.microsoft.com/windowsphone/hh418012">(13) homework solutions</a>
http://msdn.microsoft.com/windowsphone/hh418013">(1) Introduction</a>
http://msdn.microsoft.com/windowsphone/hh418014">(2) Start activities</a>
http://msdn.microsoft.com/windowsphone/hh418015">(3) mainpage initial Settings</a>
http://msdn.microsoft.com/windowsphone/hh418016">(4) Create annotation naming conventions</a>
http://msdn.microsoft.com/windowsphone/hh418017">(5) bind note class to ListBox datatemplate</a>
http://msdn.microsoft.com/windowsphone/hh418018">(6) add annotation page initial Settings</a>
http://msdn.microsoft.com/windowsphone/hh418019">(7) call terraservice Web Services</a>
http://msdn.microsoft.com/windowsphone/hh418020">(8) Save New comments</a>
http://msdn.microsoft.com/windowsphone/hh418021">(9) viewedit page initial Settings</a>
http://msdn.microsoft.com/windowsphone/hh418022">(10) navigation between mainpage and viewedit page</a>
http://msdn.microsoft.com/windowsphone/hh418023">(11) switch to edit mode on the viewedit page and save changes</a>
http://msdn.microsoft.com/windowsphone/hh418024">(12) delete comments feature for the viewedit page</a>
http://msdn.microsoft.com/windowsphone/hh418025">(13) add help screens on mainpage</a>
http://msdn.microsoft.com/windowsphone/hh418026">(14) storage application state Part 1-mainpage</a>
http://msdn.microsoft.com/windowsphone/hh418027">(15) storage Application Status Part 1-add page</a>
http://msdn.microsoft.com/windowsphone/hh418028">(16) storage application status Part 2-viewedit page</a>
http://msdn.microsoft.com/windowsphone/hh418029">(17) debug blank file name issues</a>
http://msdn.microsoft.com/windowsphone/hh418030">(18) code cleanup, exception handling, and market preparation</a>
http://msdn.microsoft.com/windowsphone/hh418031">(19) related content</a>

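These are still the addresses of the video pages, not of the video files themselves. No matter, let's keep going: each of those pages carries a videouri value ending in .wmv, which another regular expression can pull out.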
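// Needs java.util.regex.Matcher and java.util.regex.Pattern.
// Returns the last .wmv download address found in the page, or null if there is none.
public static String getWMVFromURL(String page) {
    String suburl = null;
    // [^\u4e00-\u9fa5]+ matches any run of characters that contains no Chinese characters
    String regex = "http://download.microsoft.com/download[^\\u4e00-\\u9fa5]+\\.(wmv)";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(page);
    while (matcher.find()) {
        suburl = matcher.group();
    }
    return suburl;
}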
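
To see what this pattern picks out, here is a small self-contained check that runs the same regular expression against a shortened version of the Initparams string shown earlier. The class name WmvRegexDemo, the method name findWmv, and the trimmed sample string are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WmvRegexDemo {
    // Same pattern as getWMVFromURL: from the download host up to ".wmv",
    // provided no character in the range U+4E00..U+9FA5 appears in between.
    static String findWmv(String page) {
        Pattern pattern = Pattern.compile(
            "http://download.microsoft.com/download[^\\u4e00-\\u9fa5]+\\.(wmv)");
        Matcher matcher = pattern.matcher(page);
        String found = null;
        while (matcher.find()) {
            found = matcher.group(); // keeps the last match, like the original method
        }
        return found;
    }

    public static void main(String[] args) {
        // Shortened sample of the Initparams value (an assumption for this demo).
        String initParams = "videouri=http://download.microsoft.com/download/7/1/0/"
            + "710733A5-5BE6-436E-AC7D-A265170CBB4B/Series_Introduction_Day_1_Part_1_subtitle.wmv,"
            + "ThumbUri=http://msdn.microsoft.com/zh-cn/windowsphone/hh395103,Title=series introduction";
        System.out.println(findWmv(initParams));
    }
}
```

Note that the middle part of the pattern is greedy: if a page held several .wmv addresses with no Chinese text between them, a single match would stretch from the first address to the last .wmv.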

Below is the main function:

// Fetches the start page, visits every video page found on it, and prints each .wmv address.
public static void main(String[] args) {
    String url = "http://msdn.microsoft.com/zh-cn/windowsphone/hh182984";
    String page = getNetPage(url);
    ArrayList<HashMap<String, String>> list;
    list = getUrlFromPage(page);
    for (int i = 0; i < list.size(); i++) {
        String suburl = list.get(i).get("url");
        String subpage = getNetPage(suburl);
        String wmvurl = getWMVFromURL(subpage);
        System.out.println("wmvurl=" + i + "=" + wmvurl);
    }
}
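
The main function only prints each .wmv address. To actually save a video, one possible sketch (an addition, not part of the original post) streams the address to a local file with Files.copy. The class name VideoSaver, the method name save, and the way the file name is derived from the address are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class VideoSaver {
    // Streams the resource at fileUrl into target and returns the number of bytes written.
    public static long save(String fileUrl, Path target) throws IOException {
        try (InputStream in = new URL(fileUrl).openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java VideoSaver <wmv-url>");
            return;
        }
        // Hypothetical usage: args[0] is a .wmv address printed by the crawler.
        String wmvUrl = args[0];
        String name = wmvUrl.substring(wmvUrl.lastIndexOf('/') + 1);
        System.out.println(save(wmvUrl, Paths.get(name)) + " bytes saved to " + name);
    }
}
```

Inside the loop of main, a call such as save(wmvurl, Paths.get("video" + i + ".wmv")) would write each file as soon as its address is found.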

You can speed this up with multiple threads. Also note that if the approach above fails to fetch the video addresses, you may need to add request headers such as User-Agent (the browser and its version), Referer (the address of the page the request came from), or Cookie (used to identify the user and store user data). Websites check these to keep crawlers from hitting them too often and hurting site performance. I wrote this post to share one way of capturing web URLs. If you find any errors or omissions, please point them out, and forgive the rough edges.
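
Both suggestions can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned crawler: the class name PoliteCrawler, the method names open and fetchAll, the User-Agent string, and the timeout values are all placeholders. The fetch step is passed in as a function so that getNetPage, or a variant of it that reads from the connection returned by open, can be plugged in.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class PoliteCrawler {
    // Opens a connection with browser-like headers; nothing is sent until the stream is read.
    public static URLConnection open(String pageUrl, String referer) throws IOException {
        URLConnection connection = new URL(pageUrl).openConnection();
        // Placeholder User-Agent value; substitute whatever the target site expects.
        connection.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleCrawler/0.1)");
        connection.setRequestProperty("Referer", referer);
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(10_000);
        return connection;
    }

    // Applies fetch to every URL on a fixed-size thread pool; results keep the input order.
    public static List<String> fetchAll(List<String> urls, Function<String, String> fetch, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String u : urls) {
                Callable<String> task = () -> fetch.apply(u);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that page has been fetched
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stub fetcher so the demo runs without network access.
        List<String> pages = fetchAll(Arrays.asList("a", "b", "c"), u -> "page of " + u, 2);
        System.out.println(pages);
    }
}
```

A small pool (two to four threads) keeps the load on the site modest, which matters for the same reason the site checks these headers in the first place.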

