Finding pictures related to Web content (II): Reddit's approach


As mentioned earlier, content aggregation sites such as Sina Weibo, Twitter, and Facebook all need Web page thumbnails: to make shared content attractive, a picture thumbnail of the page is essential. The social news site Reddit, a gathering place for young people, is one such website, and because its source code is on GitHub, we can easily study its approach.

The algorithm for finding a Web page's thumbnail image can be found here: https://github.com/reddit/reddit/blob/0fbea80d45c4ce35e50ae6f8b42e5e60d79743ca/r2/r2/lib/media.py

The _find_thumbnail_image(self) method is the one that implements this feature.

Beautiful Soup is a Python library that extracts data from HTML or XML files. It lets you use your favorite parser to navigate, search, and modify the document tree in the usual way.

    content_type, content = _fetch_url(self.url)

    # if it's an image, it's pretty easy to guess what we should thumbnail
    if content_type and "image" in content_type and content:
        return self.url

    if content_type and "html" in content_type and content:
        soup = BeautifulSoup.BeautifulSoup(content)
    else:
        return None

_fetch_url requests the URL and returns the linked file's type along with its content. As you can see from the _fetch_url function, the file type is taken from the headers of the HTTP response, where it is specified as a MIME type (Multipurpose Internet Mail Extensions).
If the URL points directly to an image, the URL itself is returned; if it points to an HTML (Hypertext Markup Language) document, the HTML source is parsed with the BeautifulSoup package; any other file type returns None.
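The branching just described can be sketched as a small pure function. Note that classify is a hypothetical helper for illustration, not part of Reddit's code:

```python
def classify(content_type):
    """Decide what to do with a fetched URL based on its MIME type:
    image URLs are their own thumbnail, HTML gets parsed for candidate
    images, and anything else yields no thumbnail."""
    if content_type and "image" in content_type:
        return "use-url-directly"
    if content_type and "html" in content_type:
        return "parse-html"
    return None

print(classify("image/png"))                  # use-url-directly
print(classify("text/html; charset=utf-8"))   # parse-html
print(classify("application/pdf"))            # None
```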

    # allow the content author to specify the thumbnail:
    # <meta property="og:image" content="http://...">
    og_image = (soup.find('meta', property='og:image') or
                soup.find('meta', attrs={'name': 'og:image'}))
    if og_image and og_image['content']:
        return og_image['content']

    # <link rel="image_src" href="http://...">
    thumbnail_spec = soup.find('link', rel='image_src')
    if thumbnail_spec and thumbnail_spec['href']:
        return thumbnail_spec['href']

Next, the code checks whether the user (the author of the Web page) has specified a thumbnail. The mechanism used is the Open Graph protocol, which was described earlier:

<meta property="og:image" content="http://...">

<link rel="image_src" href="http://...">

Either the meta tag or the link tag can specify the page's thumbnail. If the page contains one of these tags, the job is done: just return the image's source address. This is convenient, but it has an obvious shortcoming: the code does not verify that the image is actually relevant. Some sites cut corners and specify not a thumbnail related to the page but the site logo; Stack Overflow is a typical example. Then again, the probability of this special case is quite small.
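Reddit performs this lookup with BeautifulSoup. As an illustration of the same idea using only the standard library, here is a minimal sketch; the MetaImageFinder class and the sample HTML are hypothetical, not Reddit's code:

```python
from html.parser import HTMLParser

class MetaImageFinder(HTMLParser):
    """Collect an author-specified thumbnail from either
    <meta property="og:image" content="..."> or
    <link rel="image_src" href="...">, mirroring the two checks above."""

    def __init__(self):
        super().__init__()
        self.thumbnail = None

    def handle_starttag(self, tag, attrs):
        if self.thumbnail:
            return  # first match wins
        attrs = dict(attrs)
        if tag == 'meta' and attrs.get('property') == 'og:image':
            self.thumbnail = attrs.get('content')
        elif tag == 'link' and attrs.get('rel') == 'image_src':
            self.thumbnail = attrs.get('href')

finder = MetaImageFinder()
finder.feed('<head><meta property="og:image" content="http://example.com/a.png"></head>')
print(finder.thumbnail)  # http://example.com/a.png
```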

    # ok, we have no guidance from the author. look for the largest
    # image in the page with a few caveats. (see below)
    max_area = 0
    max_url = None
    for image_url in self._extract_image_urls(soup):
        # When isolated from the context of a webpage, protocol-relative
        # URLs are ambiguous, so let's absolutify them now.
        if image_url.startswith('//'):
            image_url = coerce_url_to_protocol(image_url, self.protocol)
        size = _fetch_image_size(image_url, referer=self.url)
        if not size:
            continue
        area = size[0] * size[1]

Next comes a loop: _extract_image_urls finds all the images on the page, and the loop iterates over them to find the largest one.

Specifically, a few restrictions are applied.

        # ignore little images
        if area < 5000:
            g.log.debug('ignore little %s' % image_url)
            continue

        # ignore excessively long/wide images
        if max(size) / min(size) > 1.5:
            g.log.debug('ignore dimensions %s' % image_url)
            continue

        # penalize images with 'sprite' in their name
        if 'sprite' in image_url.lower():
            g.log.debug('penalizing sprite %s' % image_url)
            area /= 10

The image's area must be at least 5000 pixels, the ratio of its longer side to its shorter side must be no more than 1.5, and if the URL contains "sprite" the image is penalized by dividing its area by 10.

_fetch_image_size(image_url, referer=self.url) is the tricky part: to learn each image's size, it would seem you must download the image. The trick is that an image's dimensions are part of the file format and are usually written in the file's header, so downloading only a small prefix of the file is enough to obtain the size. If you want to know more, you can study that function.
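As a concrete illustration of the header trick: a PNG file stores its width and height in the IHDR chunk immediately after the 8-byte signature, so the first 24 bytes are enough. The png_size helper below is a hypothetical sketch of the idea; Reddit's _fetch_image_size handles more formats and does the partial HTTP download itself:

```python
import struct

def png_size(header_bytes):
    """Read (width, height) from the first 24 bytes of a PNG file.
    Layout: 8-byte signature, 4-byte chunk length, 4-byte 'IHDR' tag,
    then width and height as big-endian 32-bit integers."""
    if len(header_bytes) < 24 or header_bytes[:8] != b'\x89PNG\r\n\x1a\n':
        return None
    width, height = struct.unpack('>II', header_bytes[16:24])
    return width, height

# build a minimal fake PNG header for demonstration
fake = (b'\x89PNG\r\n\x1a\n'            # PNG signature
        + struct.pack('>I', 13)          # IHDR chunk length
        + b'IHDR'
        + struct.pack('>II', 640, 480))  # width, height
print(png_size(fake))  # (640, 480)
```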

        if area > max_area:
            max_area = area
            max_url = image_url

That's it. Reddit's method in one sentence: trust the thumbnail specified by the page author; failing that, find the largest image on the page, subject to a minimum area and a maximum aspect ratio.
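The fallback search condenses to a few lines. pick_thumbnail below is a hypothetical stand-in for Reddit's loop, operating on already-fetched (url, (width, height)) pairs rather than doing network requests:

```python
def pick_thumbnail(candidates):
    """Return the URL of the largest acceptable image among
    (url, (width, height)) pairs, applying the same caveats as above:
    minimum area 5000 px, aspect ratio at most 1.5, sprite penalty."""
    max_area, max_url = 0, None
    for url, size in candidates:
        area = size[0] * size[1]
        if area < 5000:                      # ignore little images
            continue
        if max(size) / min(size) > 1.5:      # ignore long/wide images
            continue
        if 'sprite' in url.lower():          # penalize sprite sheets
            area //= 10
        if area > max_area:
            max_area, max_url = area, url
    return max_url

print(pick_thumbnail([
    ('logo.png',   (60, 60)),    # too small: 3600 < 5000
    ('banner.png', (900, 100)),  # too wide: ratio 9 > 1.5
    ('sprite.png', (500, 400)),  # 200000 penalized to 20000
    ('photo.jpg',  (400, 300)),  # 120000, the winner
]))  # photo.jpg
```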

(Screenshot in the original post: the thumbnails Reddit actually produces.)
