Spider-web is the web version of the crawler. It is configured through XML, supports crawling most pages, and can save or download the crawled content.
The configuration file format is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<content>
    <url type="simple"><!-- simple/complex -->
        <url_head>http://www.oschina.net/tweets</url_head>
        <url_start></url_start>
        <url_end></url_end>
        <url_suffix></url_suffix>
    </url>
    <analysis type="list"><!-- single/list -->
        <elem name="title">
            <attr type="key" num="1"><!-- tag/class/key -->
                <name>link</name><!-- $http://my.oschina.net/(.)* -->
                <pro>http://my.oschina.net/(.)*/[0-9]*</pro>
            </attr>
            <attr type="class" num="2"><!-- tag/class/key -->
                <name>tweet</name>
                <pro>a</pro>
            </attr>
            <attr type="class" num="3">
                <name>txt</name>
                <pro>a</pro>
            </attr>
            <attr type="tag" num="4">
                <name>a</name>
                <pro>a</pro>
            </attr>
        </elem>
        <elem name="content">
            <attr type="key" num="1"><!-- tag/class/key -->
                <name>link</name><!-- $http://my.oschina.net/(.)* -->
                <pro>http://my.oschina.net/(.)*/[0-9]*</pro>
            </attr>
            <attr type="class" num="2"><!-- tag/class/key -->
                <name>tweet</name>
                <pro>a</pro>
            </attr>
            <attr type="class" num="3">
                <name>txt</name>
                <pro>a</pro>
            </attr>
        </elem>
    </analysis>
    <target type="download"><!-- download/text -->
    </target>
</content>
By adjusting these settings to match a given page, the crawler can be made to handle most popular sites.
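To make the structure of the configuration easier to follow, here is a minimal sketch that reads such a file with Python's standard library. It is not Spider-web's own loader, whose implementation is not shown here. The names `SAMPLE_CONFIG` and `load_config` are hypothetical, the sample is a shortened copy of the configuration above, and treating `num` as the order in which the `attr` rules apply is an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration shown above (hypothetical sample).
SAMPLE_CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<content>
    <url type="simple">
        <url_head>http://www.oschina.net/tweets</url_head>
        <url_start></url_start>
        <url_end></url_end>
        <url_suffix></url_suffix>
    </url>
    <analysis type="list">
        <elem name="title">
            <attr type="key" num="1">
                <name>link</name>
                <pro>http://my.oschina.net/(.)*/[0-9]*</pro>
            </attr>
            <attr type="class" num="2">
                <name>tweet</name>
                <pro>a</pro>
            </attr>
        </elem>
    </analysis>
    <target type="download"></target>
</content>
"""


def load_config(xml_bytes):
    """Parse a Spider-web style config into plain dicts and lists."""
    root = ET.fromstring(xml_bytes)
    url = root.find("url")
    analysis = root.find("analysis")

    elems = {}
    for elem in analysis.findall("elem"):
        attrs = [
            {
                "type": attr.get("type"),          # tag / class / key
                "num": int(attr.get("num")),
                "name": attr.findtext("name", default=""),
                "pro": attr.findtext("pro", default=""),
            }
            for attr in elem.findall("attr")
        ]
        # Assumption: "num" gives the order in which the rules are applied.
        attrs.sort(key=lambda a: a["num"])
        elems[elem.get("name")] = attrs

    return {
        "url_type": url.get("type"),               # simple / complex
        "url_head": url.findtext("url_head", default=""),
        "analysis_type": analysis.get("type"),     # single / list
        "elems": elems,
        "target_type": root.find("target").get("type"),  # download / text
    }


config = load_config(SAMPLE_CONFIG)
print(config["url_head"])
for rule in config["elems"]["title"]:
    print(rule["num"], rule["type"], rule["name"], rule["pro"])
```

Passing the document as bytes lets the parser honor the `encoding` given in the XML declaration.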
Gllfeixiang/spider-web (Stars: 1, Forks: 3): crawler, web version
Issues: none.
Recent commits:
- 7092aa088 Basic Molding, Gllfeixiang, 8 months ago
- B3953d9de Crawler Web version, Gllfeixiang, 9 months ago
- 8D5EDE1DC Initial commit, Gllfeixiang, 9 months ago
Master branch code last updated: 2014-12-02