Using machine learning algorithms to find thumbnails of web pages

Source: Internet
Author: User

The articles on this blog are original work by Meelo; when reposting, please cite the source address as a link.

Describing a web page

We now live in an era of information explosion. On microblogs, news sites, and the like, people sift through a sea of information, like looking for a needle in a haystack, to pick out what interests them. How do we decide which items might be interesting? Think back and you will find three things: a title, a summary, and a thumbnail image. With these three you can make a good guess at the content of a web page. Open the Baidu search engine and search for any keyword: each search result is also made up of these three elements.

A natural question, then, is how a search engine finds the title, summary, and thumbnail of a page.

Finding the title of a web page is actually a very simple problem. To see why, start with how a web page works. A web page is a special kind of file, similar to the familiar Word document. A Word document has a title and a body, and the text can have different fonts and colors; a web page is much the same. But the design of a web page is freer: it is not laid out through a visual interface but written in a special language, HTML (HyperText Markup Language). HTML is basically a series of tags, and each element is marked by a start tag and an end tag. For example, <title> marks the beginning of the title and </title> marks its end; the end tag differs from the start tag only by one extra forward slash. Each page must contain a title, which is very reasonable. This title may not be the heading you see on the page, but it will be related to the content of the page.
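As a minimal sketch of how simple title extraction can be, the following uses Python's standard html.parser module. The class name TitleParser and the function extract_title are hypothetical names chosen for this illustration, not part of the original system.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside the title element.
        if self.in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title as a stripped string ('' if there is none)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


if __name__ == "__main__":
    page = "<html><head><title> Hello page </title></head><body><p>Text</p></body></html>"
    print(extract_title(page))
```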

So how do we find the thumbnail? The web standard does not require a page to include a thumbnail, and that is where all the trouble begins. Different places have come up with different solutions.

Facebook's solution is to define a new standard: a new set of tags in the web page that mark the page's thumbnail and keywords, called the Open Graph protocol. Facebook announced it at its F8 developer conference, which was a big day. But a standard defined by a single company is hard for everyone to accept, and in fact the adoption rate of the Open Graph protocol in China is very low.

Solving the problem at its root would still require a universally accepted standard, which is almost impossible, so another way has to be found.
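For pages that do adopt Facebook's scheme (known in English as the Open Graph protocol), the thumbnail is declared in a meta tag whose property attribute is og:image. The sketch below reads it with Python's standard html.parser; the class and function names and the example.com URL are assumptions made for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> declarations."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        if prop and prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def open_graph_image(html):
    """Return the declared og:image URL, or None if the page has none."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties.get("og:image")


if __name__ == "__main__":
    page = (
        '<html><head><title>News</title>'
        '<meta property="og:image" content="https://example.com/cover.jpg" />'
        '</head><body></body></html>'
    )
    print(open_graph_image(page))
```

When the function returns None, the page does not declare a thumbnail this way, which is the case the rest of this article tries to handle.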

This is where we turn to machine learning. If you pay even a little attention to technology, you will know that machine learning has recently become very hot. How hot? Even the aunties doing square dancing are talking about AlphaGo, which defeated a former Go world champion. The mysterious machine learning algorithm behind AlphaGo is called deep reinforcement learning. That is not our focus here; it took a long process for computers to grow out of infancy and overcome human experience. But the basic principle is very simple. One of the basic problems that machine learning needs to solve is predicting the future.

Machine Learning Decryption

Suppose we want to predict the price of a house in Beijing. For machine learning, the first step is to collect a large number of known house prices, together with data describing each house, such as its size, the number of bedrooms, the distance from the city center, and the age of the building. To predict as accurately as possible, the description of the house should be as comprehensive as possible and cover the factors that actually affect the price. In machine learning terminology, the data describing the house are called features, the house price to be predicted is called the target, and the features collected in advance together with their targets are called training data. Mathematically, machine learning produces a function whose input is the features and whose output is the target; this function is sometimes called the model. Different machine learning algorithms produce different functions, and no single algorithm is always best; different situations call for different algorithms, which is not the focus of this article. One of the simplest machine learning algorithms is least squares. You may have heard of least squares elsewhere without realizing how powerful it is.

The function obtained by a machine learning algorithm makes the error on the training data as low as possible. A basic principle of machine learning is generalization: a model that fits the training data well should also give low-error predictions on new data.
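To make the idea of features, targets, and a model concrete, here is a minimal least squares sketch with a single feature (house size). The numbers are invented purely for illustration and are not real Beijing prices; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution that minimizes the sum of squared errors.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted model (the learned function) to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


# Invented training data: feature = size in square meters, target = price.
SIZES = [50, 80, 100, 120]
PRICES = [110, 170, 210, 250]

if __name__ == "__main__":
    model = fit_line(SIZES, PRICES)
    print(model)               # the learned slope and intercept
    print(predict(model, 90))  # prediction for a house not in the training data
```

A real price model would use many features at once; the single-feature case is shown only because its closed form fits in a few lines.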

So finding thumbnails can be turned into a machine learning problem. First we need to be clear about what counts as a thumbnail. A page contains dozens of images on average, and they fall into a few categories: illustrations in a news story (which can serve as thumbnails), website logos, QR codes, advertisements, and so on. The problem differs somewhat from predicting a house price, because the target is no longer a continuous value but a discrete, binary one: either a picture is related to the content of the page and can serve as a thumbnail, or it is unrelated.

The difficulty of the problem

There is a joke that the main job of a machine learning engineer is "feature engineering". If you don't know the term, feature engineering simply means looking for features that can predict the target. If a house price prediction leaves out the size of the house, the result is sure to be very inaccurate.

The same is true for this problem, and it is the key to success or failure. However, finding features is not easy. First, the features must be extracted automatically by a program rather than collected by hand. Second, very few features can be extracted directly; the most obvious ones for a picture are its width and height. Finally, the HTML standard is very flexible: a picture may or may not carry annotations.

There is no shortcut in feature engineering. You can only use the most direct method: analyze the HTML code around relevant pictures. For example, one very good feature I found is the area of the image. Page logos and QR codes are relatively small compared with news illustrations, so an image with a larger area is more likely to be a relevant picture.
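The area feature can be sketched as follows, under the assumption that the width and height are written as attributes on the img tag. In practice many pages omit them, in which case the image would have to be downloaded or rendered to measure it; this sketch simply reports None. All names are hypothetical.

```python
from html.parser import HTMLParser


def parse_pixels(value):
    """Return an int for values like "600" or "600px", otherwise None."""
    if value is None:
        return None
    text = value.strip().lower()
    if text.endswith("px"):
        text = text[:-2]
    return int(text) if text.isdigit() else None


class ImageAreaParser(HTMLParser):
    """Records (src, area) for every <img> tag; area is None if unknown."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        width = parse_pixels(attributes.get("width"))
        height = parse_pixels(attributes.get("height"))
        known = width is not None and height is not None
        self.images.append((attributes.get("src"), width * height if known else None))


def image_areas(html):
    parser = ImageAreaParser()
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    page = (
        '<img src="logo.png" width="120" height="40">'
        '<img src="photo.jpg" width="640px" height="480px">'
    )
    for src, area in image_areas(page):
        print(src, area)
```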

Another insight from the search for features is that the features of an image are determined not only by the image itself but also by its surroundings, which solves the problem of the image itself offering very few features. Another good feature I found is whether the picture sits inside a long stretch of text. A picture sandwiched between long paragraphs is in the body of the page, so it is very likely a relevant picture. In the HTML code, being inside a long stretch of text shows up as many paragraph tags <p> near the picture.

HTML's flexibility has no perfect answer either. I can only analyze the major websites so that the feature extraction code covers as many cases as possible. Take the same example of judging whether a picture is inside a long stretch of text. This feature is actually a number: the count of paragraph tags <p> around the picture. The picture is sometimes at the same level as the <p> tags, sometimes a direct child of a <p> tag, and sometimes deeper in the subtree of a <p> tag.

Once each feature has been described, writing the program is not a big problem. One technique used is the Document Object Model (DOM). HTML is actually a text document, which is very difficult to manipulate directly, and there are specialized programs that convert HTML into a DOM. The DOM makes it easy to navigate to each picture and to get the parent tag and the child tags of the picture.
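The paragraph-count feature can be sketched with a DOM as follows. This uses xml.dom.minidom from the Python standard library, which only accepts well-formed markup, so a real crawler would need a more lenient HTML-to-DOM converter. The counting rule (climb up to three ancestors and take the largest number of direct <p> children) is an assumption chosen to cover the three cases described above, not necessarily the author's exact rule.

```python
from xml.dom import Node, minidom


def count_nearby_paragraphs(img, max_levels=3):
    """Largest number of direct <p> children among the nearest ancestors."""
    best = 0
    node = img.parentNode
    for _ in range(max_levels):
        if node is None or node.nodeType != Node.ELEMENT_NODE:
            break
        count = sum(
            1
            for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE and child.tagName == "p"
        )
        best = max(best, count)
        node = node.parentNode  # move one level up the tree
    return best


def paragraph_features(xhtml):
    """Map each image src to its nearby-paragraph count."""
    document = minidom.parseString(xhtml)
    return {
        img.getAttribute("src"): count_nearby_paragraphs(img)
        for img in document.getElementsByTagName("img")
    }


# Invented, well-formed sample page.
SAMPLE = """<html><body>
<div id="header"><img src="logo.png"/></div>
<div id="article">
<p>First paragraph.</p>
<p>Second paragraph with <img src="photo.jpg"/> inline.</p>
<p>Third paragraph.</p>
</div>
</body></html>"""

if __name__ == "__main__":
    print(paragraph_features(SAMPLE))
```

In the sample, the logo has no paragraphs nearby, while the inline photo is found two levels below a container holding three paragraphs.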

Try it.

In the end we have a complete system: enter a link to a web page, and it returns the pictures predicted to be most relevant to that page. There are a few options to choose from. The first is which machine learning algorithm to use; this article does not go into detail, but the two available algorithms are logistic regression and decision tree. The second is whether to return all relevant pictures or only the most relevant one. Every returned picture comes with a predicted probability of being relevant. The system can be tried out on the web.
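As a rough sketch of the final step, the following trains a tiny logistic regression by gradient descent on two features (image area and nearby paragraph count) and ranks candidate images by predicted probability. The training data are invented, the area is expressed in units of 100,000 pixels to keep the numbers small, and none of the names come from the original system.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_probability(model, x):
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def rank_images(model, candidates):
    """Return (name, probability) pairs, most likely thumbnail first."""
    scored = [(name, predict_probability(model, feats)) for name, feats in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented training data: (area in 100,000 px, nearby <p> count) -> relevant?
TRAIN_X = [(2.4, 5), (3.0, 8), (1.8, 4), (0.1, 0), (0.05, 0), (0.3, 1)]
TRAIN_Y = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    model = train_logistic(TRAIN_X, TRAIN_Y)
    page_images = [("logo.png", (0.05, 0)), ("photo.jpg", (3.0, 6))]
    for name, probability in rank_images(model, page_images):
        print(name, round(probability, 3))
```

Returning only the first entry of the ranked list corresponds to the "most relevant picture" option, while keeping every entry above a probability threshold corresponds to "all relevant pictures".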
