Using jsoup on the Java platform to capture youku video playback addresses, image addresses, and other information

Source: Internet
Author: User

/*******************************************************************************************
* Author: conowen @ Dazhong
* E-mail: conowen@hotmail.com
* Site: http://www.idealpwr.com/
* Shenzhen power thinking Technology Development Co., Ltd.
* http://blog.csdn.net/conowen
* Note: This article is original and is intended only for learning and exchange. If you repost it, please credit the author and the source.
********************************************************************************************/

I. Project Purpose

A recent project required collecting and aggregating online videos, so I wrote a small program that crawls video information from the Internet. Taking the youku online video website as an example, this article implements a Java application that dynamically captures Internet video information and saves it to a local xml file, in order to build a multimedia playback source center.


II. Third-party libraries:

1. jsoup (HTML code parser)


jsoup is a Java HTML parser that can directly parse a URL or a piece of HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

The main functions of jsoup are as follows:

· Parse HTML from a URL, file, or string;

· Find and extract data using DOM traversal or CSS selectors;

· Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license and can be safely used in commercial projects.

Official Address: http://jsoup.org/


2. jdom (XML building and parsing tool)

With jdom you can easily build well-formed xml files, and jdom can also parse xml files quickly.

Official Address: http://jdom.org/


III. General Development Process:

The youku online video website already aggregates Internet videos through its own search site, soku, so soku is used as the example below.


For example, the url for TV series is: http://www.soku.com/channel/teleplaylist_0_0_0_1_1.html

You can view and analyze the HTML code of this page in a browser.

 
 
  • Opening scene
  • Fleeing
  • Starring:Chen zhanpeng/Wu zhuoxi/Chen yinzhi
  • As an international metropolis, Hong Kong may be attacked at any time. In order to prevent possible terrorist activities in the territory, the anti-terrorism team was at 2009...
  • 9.2 points
    • Episodes (collapsed list): 1, 2, 3, 4, ..., 21, 22, 23, 24, Show all
    • Episodes (expanded list): 1 to 24
    Playback Source:

You can obtain the video name, cover image address, update status, and playback source address.


Get video information:

1. jsoup parsing demo:

To fetch the content of a page, use the following call:

Document doc = Jsoup.connect("http://example.com/").get();

After obtaining the doc, you can perform the parsing operation.

Of course, you can also set connection parameters such as the browser userAgent, the timeout, and the maximum body size.
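
As a rough sketch of how such parameters can be set, the example below builds a configured connection. The class name ConnectionDemo, the method configure, and the user agent string are illustrative assumptions; the timeout and body size simply mirror the values used later in this article.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ConnectionDemo {
    /** Builds a configured connection without sending the request yet. */
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")        // assumed user agent string
                .timeout(6000)                   // timeout in milliseconds
                .maxBodySize(10 * 1024 * 1024);  // maximum body size in bytes (10 MB)
    }

    public static void main(String[] args) throws IOException {
        // get() sends the request and parses the response into a Document.
        Document doc = configure("http://example.com/").get();
        System.out.println(doc.title());
    }
}
```

Keeping the configuration in one method means every page request uses the same settings.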

Parsing a page is as simple as parsing xml: you can look up the content you need by tag name or by class name.
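
To make the class-based lookup concrete, here is a minimal sketch that parses an inline HTML fragment instead of a live page. The fragment, the base URI http://www.example.com/, and the class name SelectorDemo are made up for illustration; only the class names p_link and p_thumb and the "original" attribute come from the crawler code in this article, and the real soku markup may differ.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorDemo {
    // Hypothetical fragment, not copied from soku.
    static final String HTML =
          "<div class=\"p_thumb\"><img original=\"/img/1.jpg\"></div>"
        + "<div class=\"p_link\"><a href=\"/detail/1\" title=\"Sample title\">link</a></div>";

    /** Returns the link titles and absolute link urls, followed by the absolute image urls. */
    static List<String> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        // Look up by class name, then narrow down with a CSS selector.
        for (Element a : doc.getElementsByClass("p_link").select("a[href]")) {
            out.add(a.attr("title"));
            out.add(a.attr("abs:href")); // the abs: prefix resolves against baseUri
        }
        for (Element img : doc.getElementsByClass("p_thumb").select("img[original]")) {
            out.add(img.attr("abs:original"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : extract(HTML, "http://www.example.com/")) {
            System.out.println(s);
        }
    }
}
```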


2. jsoup captures youku video information


import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// The fields used below (doc, divs_info, urls, videoTitles, videoSourceIdList,
// and so on) are declared elsewhere in the class.
public void getVideoInfo(String pageUrl) { // call once per page
    try {
        // maxBodySize sets a maximum response body size, to prevent running out of
        // memory when reading extremely large documents (1 MB by default at the time of writing).
        doc = Jsoup.connect(pageUrl).maxBodySize(1024 * 1024 * 10).timeout(6000).get();
    } catch (IOException e) {
        System.out.println("connect error");
        e.printStackTrace();
        getVideoInfo(pageUrl); // retry the same page (there is no retry limit)
        return; // the retry handles this page, so do not parse it a second time
    }

    divs_info = doc.getElementsByClass("p_link"); // url of the video album
    if (divs_info != null) {
        if (divs_info.size() <= 0) {
            divs_info = doc.getElementsByClass("v_link"); // video playback url
        }
        urls = divs_info.select("a[href]");
        if (null != urls) {
            for (Element urlElement : urls) {
                videoTitles.add(urlElement.attr("title"));
                videoUrl.add(urlElement.attr("abs:href"));
            }
        }
    }

    divs_thumbs = doc.getElementsByClass("p_thumb"); // obtain the album image
    if (divs_thumbs != null) {
        thumbs = divs_thumbs.select("img[original]");
        if (thumbs.size() <= 0) {
            divs_thumbs = doc.getElementsByClass("v_thumb");
            thumbs = divs_thumbs.select("img[original]");
        }
        if (null != thumbs) {
            for (Element thumb : thumbs) {
                videoThumbUrls.add(thumb.attr("abs:original"));
            }
        }
    }

    divs_pgm_source = doc.getElementsByClass("pgm-source"); // obtain the update information
    if (divs_pgm_source != null) {
        for (Element thumb1 : divs_pgm_source) {
            sourceId = thumb1.select("span");
            sourceUrl = thumb1.select("a");
            // Save the obtained data for building the xml file
            if (null != sourceId) {
                List<String> videoSourceId = new ArrayList<String>();
                for (Element thumb2 : sourceId) {
                    videoSourceId.add(thumb2.attr("id"));
                }
                videoSourceIdList.add(videoSourceId);
            }
            if (null != sourceUrl) {
                List<String> videoSourceStatus = new ArrayList<String>();
                List<String> videoSourceUrl = new ArrayList<String>();
                for (Element thumb2 : sourceUrl) {
                    videoSourceStatus.add(thumb2.attr("status"));
                    videoSourceUrl.add(thumb2.attr("href"));
                }
                videoSourceStatusList.add(videoSourceStatus);
                videoSourceUrlList.add(videoSourceUrl);
            }
        }
    }

    try {
        Thread.sleep(2000); // pause between pages
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
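
The catch block in getVideoInfo retries a failed page by calling the method again, with no limit on the number of attempts. One possible alternative, sketched here with only the standard library, is a bounded retry loop. The class name Retry, the helper withRetries, and the attempt count and delay are illustrative assumptions and not part of the original program.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class Retry {
    /**
     * Runs the task up to maxAttempts times, sleeping delayMillis between
     * failed attempts, and rethrows the last IOException if every attempt fails.
     */
    static <T> T withRetries(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                System.out.println("connect error, attempt " + attempt);
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated task: fails twice with an IOException, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With jsoup, the task could be something like () -> Jsoup.connect(pageUrl).timeout(6000).get(), so that a page which stays unreachable is given up on after a fixed number of attempts instead of recursing indefinitely.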
      
     
    
   
  
 

IV. Construct a video xml file:


1. Create an xml file

XmlHelper.createXml(str, videoTitles, videoUrl, videoThumbUrls, videoSourceIdList, videoSourceStatusList, videoSourceUrlList, pageNum); // create the xml file

2. xml file creation process

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

// JDOM 2 package names are assumed here; JDOM 1 uses org.jdom.* instead.
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public void createXml(String fileName, List<String> videoTitles, List<String> videoUrl,
        List<String> videoThumbUrls, List<List<String>> videoSourceIdList,
        List<List<String>> videoSourceStatusList, List<List<String>> videoSourceUrlList,
        int pageNum) {
    // create the root node
    Element root = new Element("videoInfo");
    // create the node for each page
    Element pageElement = new Element("page");
    // set the page number
    pageElement.setAttribute("page", "" + pageNum);
    Document doc = new Document(root);

    for (int i = 0; i < videoTitles.size(); i++) {
        // create a videoId node
        Element videoIdElement = new Element("videoId");
        // add an id attribute to the videoId node
        videoIdElement.setAttribute("id", "" + (i + 1 + (pageNum - 1) * videoTitles.size()));
        // fill in the video information
        videoIdElement.addContent(new Element("videoTitle").setText(videoTitles.get(i)));
        videoIdElement.addContent(new Element("videoUrl").setText(videoUrl.get(i)));
        videoIdElement.addContent(new Element("videoThumbUrls").setText(videoThumbUrls.get(i)));

        for (int j = 0; j < videoSourceIdList.get(i).size(); j++) {
            Element sourceElement = new Element("source");
            sourceElement.setAttribute("id", "" + videoSourceIdList.get(i).get(j));
            sourceElement.setAttribute("status", "" + videoSourceStatusList.get(i).get(j));
            sourceElement.setAttribute("url", "" + videoSourceUrlList.get(i).get(j));
            videoIdElement.addContent(sourceElement);
        }
        // add each video to the page node
        pageElement.addContent(videoIdElement);
    }
    // add the page node to the root node
    root.addContent(pageElement);

    Format format = Format.getCompactFormat();
    format.setEncoding("UTF-8"); // set the output encoding
    format.setIndent("");        // an empty indent string still produces line breaks
    XMLOutputter xmlOut = new XMLOutputter(format);
    // try-with-resources closes the output stream even if writing fails
    try (FileOutputStream out = new FileOutputStream(fileName)) {
        xmlOut.output(doc, out);
    } catch (IOException e) { // also covers FileNotFoundException
        e.printStackTrace();
    }
}
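
The id attribute in createXml is computed as i + 1 + (pageNum - 1) * videoTitles.size(), which gives every video a running number across pages. The small worked sketch below shows the arithmetic; the class name VideoIds, the method name globalId, and the page size of 20 are hypothetical.

```java
class VideoIds {
    /** index is zero-based within the page; pageNum starts at 1. */
    static int globalId(int index, int pageNum, int pageSize) {
        return index + 1 + (pageNum - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(globalId(0, 1, 20));  // first video on page 1  -> 1
        System.out.println(globalId(19, 1, 20)); // last video on page 1   -> 20
        System.out.println(globalId(0, 2, 20));  // first video on page 2  -> 21
    }
}
```

Note that the formula only yields unique ids when every page holds the same number of videos. If a final page were shorter, say 5 videos on page 3, its first id would be 1 + 2 * 5 = 11, which collides with an id already used on page 1.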
      
     
    
   
  
 

Note: When building xml with jdom, be aware that if no output format is set, the generated xml is written without line breaks, which makes the file cluttered and hard to read. The following settings make the output wrap onto separate lines.

Format format = Format.getCompactFormat(); format.setEncoding("utf-8"); format.setIndent("");

Parsing each page takes roughly 2 to 3 s, because the pages are large; the exact time depends on network conditions.





The generated xml file is as follows (shown here as the video title, video url, and thumbnail url of each entry):

You from the stars
http://www.soku.com/detail/show/XMTEyNDE0NA==
http://g3.ykimg.com/0516000052AD289A675839358A07B6AA

Food for slaves
http://www.soku.com/detail/show/XMTA5MTQ1Mg==
http://g4.ykimg.com/0516000052F4A2C56758390A8D0C4E55

Diaosi men
http://www.soku.com/detail/show/XMTA4MzkwNA==
http://g1.ykimg.com/05160000519310F4670C4A1AE002FEB1

Diaosi men's Season 3
http://www.soku.com/detail/show/XMTE0NzU2OA==
http://g4.ykimg.com/051600005305D18E6758397D8206CC34
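
The layout written by createXml (a videoInfo root, a page node, and videoId nodes with videoTitle, videoUrl, and videoThumbUrls children) can be read back later by a player. The sketch below does this with the JDK's built-in DOM parser rather than jdom, purely to stay dependency-free. The class name VideoXmlReader and the attribute values in the sample string are assumptions; the element names and the sample title and urls come from this article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class VideoXmlReader {
    // Assumed sample in the layout createXml writes; the page and id values are illustrative.
    static final String SAMPLE =
          "<videoInfo><page page=\"1\"><videoId id=\"1\">"
        + "<videoTitle>You from the stars</videoTitle>"
        + "<videoUrl>http://www.soku.com/detail/show/XMTEyNDE0NA==</videoUrl>"
        + "<videoThumbUrls>http://g3.ykimg.com/0516000052AD289A675839358A07B6AA</videoThumbUrls>"
        + "</videoId></page></videoInfo>";

    /** Returns the text of every videoTitle element in document order. */
    static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("videoTitle");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titles(SAMPLE));
    }
}
```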


