Crawling the Douban Movie Top250
Crawlers are a staple, and it is interesting to look at a snapshot of the data at a given moment. Let's start with the simplest, most basic crawler!
Project address: https://github.com/go-crawler ...
Goal
Our target site is the Douban Movie Top250, which will probably look familiar to everyone.
This crawl collects 8 fields per movie for a simple summary analysis. The specific fields are as follows:
A quick analysis of the target site
- Each page lists 25 entries
- The list is paginated (10 pages in total), and the pagination follows a regular pattern
- The fields of each entry appear in a fixed, consistent order
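Because the pagination is regular, the page URLs can be derived from the counts above. The following is only a minimal sketch: the `start` offset parameter, the `pageURLs` helper, and the `example.com` base URL are assumptions made for illustration and do not come from the project. The project itself reads the links from the paginator instead.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"strconv"
)

// pageURLs builds one URL per page by stepping an offset query parameter.
// ASSUMPTION: the site paginates with a "start" offset; the article does
// not state the actual parameter name.
func pageURLs(base string, perPage, pages int) ([]string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	urls := make([]string, 0, pages)
	for p := 0; p < pages; p++ {
		q := u.Query()
		q.Set("start", strconv.Itoa(p*perPage))
		u.RawQuery = q.Encode()
		urls = append(urls, u.String())
	}
	return urls, nil
}

func main() {
	// Placeholder base URL: 25 entries per page, 10 pages, as analyzed above.
	urls, err := pageURLs("https://example.com/top250", 25, 10)
	if err != nil {
		log.Fatal(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Reading the links from the page, as the project does, is more robust if the site ever changes its page size. Computing them is mainly useful as a sanity check on the page count.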
Begin
Because the amount of data is small, the crawl takes only three steps:
- Parse the first page to collect all pagination links
- Parse each page and loop over the movie entries on it
- Save the crawled movie information to storage
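The three steps above could be wired together roughly as follows. This is a hedged sketch and not the project's code: the `crawl` function, the trimmed-down `Movie` type, and the `fetch`, `parse`, and `store` hooks are all hypothetical names. They are stubbed out here so the example runs without network access or a database.

```go
package main

import (
	"fmt"
	"log"
)

// Movie is a reduced, hypothetical record used only in this sketch.
type Movie struct {
	Title string
	Year  string
}

// crawl visits every page URL, parses the movies on it, and passes them to
// store. It returns the number of movies stored before any error occurred.
func crawl(
	urls []string,
	fetch func(url string) (string, error),
	parse func(html string) []Movie,
	store func(movies []Movie) error,
) (int, error) {
	total := 0
	for _, u := range urls {
		html, err := fetch(u)
		if err != nil {
			return total, fmt.Errorf("fetch %s: %w", u, err)
		}
		movies := parse(html)
		if err := store(movies); err != nil {
			return total, fmt.Errorf("store %s: %w", u, err)
		}
		total += len(movies)
	}
	return total, nil
}

func main() {
	// Stub hooks: a real crawler would do an HTTP GET, run the goquery
	// selectors, and write to a database here.
	fetch := func(url string) (string, error) { return "<html>" + url + "</html>", nil }
	parse := func(html string) []Movie { return []Movie{{Title: "placeholder", Year: "0000"}} }
	var saved []Movie
	store := func(movies []Movie) error { saved = append(saved, movies...); return nil }

	n, err := crawl([]string{"page-1", "page-2"}, fetch, parse, store)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stored %d movies (%d held in memory)\n", n, len(saved))
}
```

Passing the three stages in as functions keeps the loop itself free of network and storage details, so it can be exercised with stubs like the ones in `main`.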
Installation
$ go get -u github.com/PuerkitoBio/goquery
Run
$ go run main.go
Code Snippets
1. Get all pages
func ParsePages(doc *goquery.Document) (pages []Page) {
    // Seed the list with page 1 (empty URL), then add every page linked from the paginator.
    pages = append(pages, Page{Page: 1, Url: ""})
    doc.Find("#content > div > div.article > div.paginator > a").Each(func(i int, s *goquery.Selection) {
        page, _ := strconv.Atoi(s.Text())
        url, _ := s.Attr("href")
        pages = append(pages, Page{
            Page: page,
            Url:  url,
        })
    })
    return pages
}
2. Parse the Douban movie information
func ParseMovies(doc *goquery.Document) (movies []Movie) {
    doc.Find("#content > div > div.article > ol > li").Each(func(i int, s *goquery.Selection) {
        title := s.Find(".hd a span").Eq(0).Text()
        ...
        // DescInfo is built in the elided code above; its second element is split on "/" into year, area, and tag.
        movieDesc := strings.Split(DescInfo[1], "/")
        year := strings.TrimSpace(movieDesc[0])
        area := strings.TrimSpace(movieDesc[1])
        tag := strings.TrimSpace(movieDesc[2])
        star := s.Find(".bd .star .rating_num").Text()
        // Keep only the digits of the comment-count text.
        comment := strings.TrimSpace(s.Find(".bd .star span").Eq(3).Text())
        compile := regexp.MustCompile("[0-9]")
        comment = strings.Join(compile.FindAllString(comment, -1), "")
        quote := s.Find(".quote .inq").Text()
        ...
        log.Printf("i: %d, movie: %v", i, movie)
        movies = append(movies, movie)
    })
    return movies
}
Data
Looking at the crawled data, what do you make of it? I'm really curious =)