CTS flow tips

Source: Internet
Author: User
# Run CTS Flow
# Try the sample flow readdocumentbasic.flow
# Develop a CTS flow using ESP Crawler
Enable callback for both the nullwriter and the espwriter.

# Configure a pipeline to collaborate with the CTS flow
Get the CTS stages ctsannotationsimporter and ctsparser from the feedingoverlay package.
Don't use ctsannotationsimporter and the related scope-processing stages (scopifier and xmlifier), especially for CJK content.
They do not seem to work correctly right now; you will get "fixml has illegal UTF-8 byte sequences".
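If you hit that error, a quick way to locate the offending bytes is to validate the FIXML file as strict UTF-8. This is a generic sketch using only the Python standard library, not part of the ESP toolchain; the sample input is illustrative:

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte) pairs where strict UTF-8 decoding fails."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer decodes cleanly
        except UnicodeDecodeError as e:
            bad.append((pos + e.start, data[pos + e.start]))
            pos += e.end  # skip past the bad sequence and keep scanning
    return bad

# Example: a CJK string with one stray 0xFF byte injected
sample = "科学世界".encode("utf-8") + b"\xff" + b"ok"
print(find_invalid_utf8(sample))  # one hit, at the injected byte
```

Running this over the FIXML file's raw bytes tells you whether the error comes from the document content itself or from the stage's handling of it.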

Create a pipeline named crawlercts, using the sitesearch pipeline as a template:
1. Instantiate a stage named ctsparsercrawler based on ctsparser, with
   generatescopesfromannotations = 0

2. Put ctsparsercrawler after docinit.
3. Remove the following stages:
   documentretriever
   urlprocessor
   decompressor
   formatdetector
   simpleconverter
   flashconverter
   pdfconverter
   xpsconverter
   searchexportconverter
   fasthtmlparser                  # for CJK, don't remove
   languageandencodingdetector
   encodingnormalizer

   # If you need the webanalyzer, don't remove these:
   waattributelookup
   walinkrankanchortextformatter
   wacrawlerlinkfilter
   warankdocument
4. The ctsannotationsimporter stage is not necessary if you don't need scope search.
 
TIP: Define your collection name in the Mapper operator.
 
# Using ESP crawler with CTS
Edit C:/ESP/etc/crawlerglobaldefaults.xml:
...
<Section name="CDE">
  <Attrib name="contentdistributors" type="list-string">
    <member>localhost:17078</member>
  </Attrib>
</Section>
...
nctrl stop crawler
nctrl start crawler
Configure the crawler's feeding destinations parameter in the admin GUI.
(What the FSIS documentation says will not work: if no feeding destination is defined,
the exported config file will be empty for this group of parameters.)
Name: CDE
Target collection: cntv1;fsistraining.crawlingvideo
Destination: CDE
Pause ESP feeding: No
Primary: Yes
 

crawleradmin.exe -G cntv1 > crawler_cntv1.xml
notepad ./crawler_cntv1.xml
# Confirm the feeding destination parameter
<Section name="feeding">
  <Section name="CDE">
    <Attrib name="collection" type="string">cntv1;fsistraining.crawlingvideo</Attrib>
    <Attrib name="destination" type="string">CDE</Attrib>
    <Attrib name="paused" type="boolean">no</Attrib>
    <Attrib name="primary" type="boolean">yes</Attrib>
  </Section>
</Section>
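To confirm the feeding section without eyeballing the XML, you can parse the exported file with a short script. This is a generic sketch using Python's standard library; the element and attribute names follow the exported fragment above, and the inline sample stands in for crawler_cntv1.xml:

```python
import xml.etree.ElementTree as ET

def feeding_destinations(xml_text: str):
    """Collect the feeding-destination attribs from an exported crawler config."""
    result = {}
    root = ET.fromstring(xml_text)
    for section in root.iter("Section"):
        if section.get("name") != "feeding":
            continue
        # each child Section is one feeding destination (e.g. CDE)
        for dest in section.findall("Section"):
            result[dest.get("name")] = {
                a.get("name"): (a.text or "").strip()
                for a in dest.findall("Attrib")
            }
    return result

# Inline sample mirroring the exported fragment
sample = """
<CrawlerConfig>
  <Section name="feeding">
    <Section name="CDE">
      <Attrib name="collection" type="string">cntv1;fsistraining.crawlingvideo</Attrib>
      <Attrib name="destination" type="string">CDE</Attrib>
    </Section>
  </Section>
</CrawlerConfig>
"""
print(feeding_destinations(sample)["CDE"]["destination"])  # CDE
```

If `feeding_destinations` returns an empty dict, the feeding destination was not defined before the export, which is exactly the failure mode described above.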
# Change start_uris and include_uris to define what you are going to crawl
<Attrib name="start_uris" type="list-string">
  <member>http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml</member>
</Attrib>
<Section name="include_uris">
  <Attrib name="exact" type="list-string">
    <member>http://kejiao.cntv.cn/nature/kexueshijie/classpage/video/20100812/101064.shtml</member>
  </Attrib>
</Section>
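An exact filter restricts the crawl to that single URL. If you want a whole site section instead, the crawler's include filters also support prefix matching (an assumption based on the standard Enterprise Crawler config schema; verify the attribute name against your own exported config):

```xml
<Section name="include_uris">
  <!-- prefix match: crawl everything under this path (hypothetical example) -->
  <Attrib name="prefix" type="list-string">
    <member>http://kejiao.cntv.cn/nature/kexueshijie/</member>
  </Attrib>
</Section>
```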
Run the flow from Visual Studio or the FSIS admin GUI.

Remove the crawler datasource definition from the collection cntv1, then load the edited config:
crawleradmin -f ./crawler_cntv1.xml
Enterprise Crawler 6.7.8 - Admin Client
Copyright (c) 2008 FAST, a Microsoft(R) Subsidiary
Added collection config(s): scheduled collection for crawling
# Watching the crawler from a command window
crawleradmin --status
Enterprise Crawler 6.7.8 - Admin Client
Copyright (c) 2008 FAST, a Microsoft(R) Subsidiary
Collection   Status   Feed status   Active sites   Stored docs   Doc rate
-------------------------------------------------------------------------------
cntv10       idle     feeding       0              1             N/A
cntv11       idle     feeding       0              1             N/A
cntv12       idle     feeding       0              1             N/A
cntv13       idle     feeding       0              1             N/A
cntv14       idle     feeding       0              1             N/A
cntv8        idle     feeding       0              2             N/A
cntv9        idle     feeding       0              5             N/A
                                    0              12            0.0 DPS
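If you want to watch stored-document counts programmatically rather than rereading the table, the whitespace-separated status output is easy to parse. A sketch, assuming data rows are marked by the `idle` status keyword as in the output above:

```python
def parse_status(table: str):
    """Parse crawler status rows into per-collection dicts."""
    rows = {}
    for line in table.splitlines():
        parts = line.split()
        # data rows: <collection> <status> <feed status> <sites> <docs> <rate>
        if len(parts) >= 5 and parts[1] == "idle":
            rows[parts[0]] = {
                "feed_status": parts[2],
                "active_sites": int(parts[3]),
                "stored_docs": int(parts[4]),
            }
    return rows

sample = """\
cntv8        idle     feeding       0              2             N/A
cntv9        idle     feeding       0              5             N/A
"""
total = sum(r["stored_docs"] for r in parse_status(sample).values())
print(total)  # 7
```

Feeding the full `crawleradmin --status` output through this in a loop gives a crude progress monitor for the crawl.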



# Watching the doclog
doclog -l
doclog -a http://xxx/xxxx/

# Watching the CTS flow log from
C:/Users/FSIS Service/AppData/Local/FSIS/Nodes/FSIS/ContentEngineNode1/Logs/ContentProcessing
# Watching the crawler log from
C:/ESP/var/log/crawler
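When watching these directories it helps to tail whichever log file is newest. A generic sketch using only the standard library; the directory path in the usage comment is illustrative, not an ESP-specific API:

```python
import os

def newest_file(directory: str) -> str:
    """Return the most recently modified file in a directory."""
    paths = [os.path.join(directory, f) for f in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime)

def tail(path: str, n: int = 20):
    """Return the last n lines of a (reasonably sized) text log file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.readlines()[-n:]

# Usage (path is illustrative):
# for line in tail(newest_file("C:/ESP/var/log/crawler")):
#     print(line, end="")
```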
# Adding a spy stage into the ESP pipeline to monitor document processing
