Crawling Web pages with Python

Introduction
The need to extract information from web pages keeps growing, and its importance is becoming ever more apparent. Every few weeks I find myself wanting to pull some data from a web page. Last week, for example, we were thinking about building an index of popularity and opinion for various online data science courses. We would not only need to find new courses, but also grab their reviews, summarize them, and build a few metrics. The effectiveness of such a problem or product depends more on web crawling and information extraction (building the dataset) than on the summarization techniques applied afterwards.
How to extract Web page information
There are several ways to extract information from a web page. Using an API is probably the best one. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs that expose their data in a more structured way. If you can get the information you need directly from an API, this approach is almost always better than web crawling: if the data provider already hands you structured data, why build an engine to extract the same data yourself?
Unfortunately, not all websites offer APIs. Some are reluctant to let readers pull large amounts of information in a structured way, while others lack the technical expertise to provide one. What should be done in such cases? Well, we need to get the data through web crawling.
Of course there are other ways, such as RSS feeds, but because of the restrictions on their usage I will not discuss them here.
What is Web crawling?
Web crawling is a software technique for obtaining information from websites. It focuses on transforming unstructured data on the web (HTML format) into structured data (a database or a spreadsheet).
Web crawling can be implemented in many ways, from Google Docs to almost any programming language. Thanks to Python's ease of use and rich ecosystem, I chose Python. Its BeautifulSoup library can help with this task. In this article, I'll use Python to show you the easiest way to learn web crawling.
For readers who prefer a non-programmatic way to extract web page data, have a look at import.io. It provides a GUI-driven way to run basic web crawling operations. Those who like to code can keep reading this article!
The libraries needed to crawl the web
We all know that Python is an open-source programming language, and you can find many libraries that implement a given feature, so it is worth finding the best one. I tend to use BeautifulSoup (a Python library) because it is simple and intuitive to use. To be precise, I'll use two Python modules to crawl the data:
• urllib2: a Python module for fetching URLs. It defines functions and classes that help with URL actions (basic and digest authentication, redirections, cookies, and so on). For more information, please refer to its documentation page.
• BeautifulSoup: a magical tool for extracting information from web pages. You can use it to pull out tables, lists, and paragraphs, or to apply filters to a web page.
In this article we will use the latest version, BeautifulSoup 4. You can view the installation guide on its documentation page.
BeautifulSoup does not fetch the page for us, which is why I use urllib2 and the BeautifulSoup library together (a minimal sketch of the two working in tandem follows the list below). Besides BeautifulSoup, Python has other options for fetching HTML, such as:
• mechanize
• Scrapemark
• scrapy
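As a minimal sketch of how the two modules work in tandem (the URL here is only a placeholder, and urllib2 assumes Python 2; on Python 3 the equivalent module is urllib.request):

```python
import urllib2                  # Python 2 module for fetching URLs
from bs4 import BeautifulSoup   # BeautifulSoup 4 lives in the bs4 package

# Fetch the raw HTML of a page (placeholder URL for illustration)
html = urllib2.urlopen("http://www.example.com").read()

# Parse it so we can query tags instead of searching raw text
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)        # the text inside the page's <title> tag
```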
Basics: Familiarity with HTML (tags)
We need to handle HTML tags when crawling a web page, so we must first get familiar with them. If you already know the basics of HTML, you can skip this section. The basic syntax of an HTML document is built from the following tags (a small worked example follows the lists below):
1. <!DOCTYPE html>: an HTML document must start with a document type declaration.
2. The HTML document itself is written between the <html> and </html> tags.
3. The visible part of the HTML document is written between the <body> and </body> tags.
4. HTML headings are defined with the <h1> to <h6> tags.
5. HTML paragraphs are defined with the <p> tag.
Other useful HTML tags are:
1. HTML links are defined with the <a> tag, for example: <a href="http://www.test.com">This is a link to test.com</a>
2. HTML tables are defined with the <table> tag; rows are represented by <tr>, and each row is divided into data cells by <td>.
3. HTML lists start with <ul> (unordered) or <ol> (ordered), and each element of the list starts with <li>.
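To make these tags concrete, here is a small sketch: a toy HTML document (invented purely for illustration) that uses the tags listed above, parsed with BeautifulSoup as a preview of the next section:

```python
from bs4 import BeautifulSoup

# A toy HTML document built from the tags described above
html_doc = """
<!DOCTYPE html>
<html>
  <body>
    <h1>My first heading</h1>
    <p>My first paragraph.</p>
    <a href="http://www.test.com">This is a link to test.com</a>
    <table>
      <tr><td>Row 1, cell 1</td><td>Row 1, cell 2</td></tr>
    </table>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.h1.string)       # My first heading
print(soup.a.get("href"))   # http://www.test.com
```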
If you are unfamiliar with these HTML tags, I recommend the HTML tutorial on W3Schools; it will give you a clear understanding of HTML tags.
Crawling Web pages with BeautifulSoup
Here, I will crawl data from a Wikipedia page. Our ultimate goal is to grab the list of Indian state and union territory capitals, along with some basic details such as the year of establishment and the former capital, all of which make up this Wikipedia page. Let's work through this project step by step:
1. Import the necessary libraries
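A sketch of this step, assuming the page being crawled is the Wikipedia article listing Indian state and union territory capitals (the URL below is my assumption of that page; substitute whichever page you actually want):

```python
import urllib2                  # Python 2 module; use urllib.request on Python 3
from bs4 import BeautifulSoup

# Wikipedia page with the list of Indian state and union territory capitals
# (assumed URL; replace it with the page you want to crawl)
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"

# Query the website and return the HTML of the page
page = urllib2.urlopen(wiki)

# Parse the HTML into a BeautifulSoup object we can query
soup = BeautifulSoup(page, "html.parser")
```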
2. Use the "prettify" function to look at the nested structure of the HTML page
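Continuing with the soup object from step 1, this is as simple as:

```python
# Print the indented, nested structure of the HTML page
print(soup.prettify())
```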
As shown above, you can see the structure of the HTML tags. This helps you understand which tags are available and how to use them to crawl information.
3. Working with HTML tags
A. soup.<tag>: returns the content between the opening and closing tag, including the tags themselves.
B. soup.<tag>.string: returns the string inside the given tag.
C. Finding the links inside the <a> tags: we know that links are marked up with the <a> tag, so we can use the soup.a option, which should return a link found in the web page. Let's try it.
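A short sketch of options A, B, and C, continuing with the soup object from step 1 (the exact output depends on the page, so the comments are only indicative):

```python
# A. soup.<tag>: the first matching tag, including the tag markup itself
print(soup.title)

# B. soup.<tag>.string: only the text inside that tag
print(soup.title.string)

# C. soup.a: only the FIRST <a> tag found in the page
print(soup.a)
```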
As shown above, you can see that there is only one result. Now we'll use find_all() to crawl all of the links inside <a> tags.
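For example:

```python
# find_all() returns every <a> tag in the page as a list
all_links = soup.find_all("a")
print(len(all_links))   # how many links the page contains
print(all_links[0])     # the first link, tag markup included
```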
This displays all the links, including titles, link targets, and other information. Now, to show only the links, we need to iterate over each tag and return the link using the "href" attribute with get().
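A sketch of that loop:

```python
# Iterate over every <a> tag and print only the value of its href attribute
for link in soup.find_all("a"):
    print(link.get("href"))   # may be None for anchors without an href
```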
4. Find the right table: since we are looking for a table with information about the state capitals, we should first identify the correct one. Let's write a command to crawl the information contained in all the table tags.
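For example:

```python
# Grab every table on the page; the one we want is somewhere in this list
all_tables = soup.find_all("table")
print(len(all_tables))
```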
Now, to locate the right table, we use the table's "class" attribute and filter for the correct one. In Chrome, you can find the class name of the desired table by right-clicking on it, choosing Inspect Element, and copying the class name, or you can read it from the output of the command above.
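A sketch of that filtering step; the class name "wikitable sortable plainrowheaders" is what I found when inspecting the page, so treat it as an assumption and copy the class you actually see:

```python
# Filter down to the one table whose class matches what Inspect Element showed
right_table = soup.find("table", class_="wikitable sortable plainrowheaders")
print(right_table.get("class"))
```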
5. Extract the information into a DataFrame: here we will iterate over each row (<tr>) of the table, then assign each element of the row (<td>) to a variable and append it to a list. Let's first look at the HTML structure of the table (I don't want to grab the table header information in <th>).
As shown above, you will notice that the second element of each <tr> is inside a <th> tag rather than a <td> tag, so we need to be careful about this. To access the value of each element, we will use the find(text=True) option on each element. Let's take a look at the code:
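A sketch of that loop, continuing from the right_table found in step 4. It assumes the table has the state/union territory name in a <th> cell and six further values in <td> cells (adjust the column count and labels to the table you actually see); pandas is used for the DataFrame, although the article does not name it explicitly:

```python
import pandas as pd

# One list per column of the table
numbers = []
names = []
admin_capitals = []
legislative_capitals = []
judicial_capitals = []
years = []
former_capitals = []

for row in right_table.find_all("tr"):
    cells = row.find_all("td")          # ordinary data cells
    header_cells = row.find_all("th")   # the state/UT name sits in a <th>
    if len(cells) == 6:                 # skip the header row, which has no <td> cells
        numbers.append(cells[0].find(text=True))
        names.append(header_cells[0].find(text=True))
        admin_capitals.append(cells[1].find(text=True))
        legislative_capitals.append(cells[2].find(text=True))
        judicial_capitals.append(cells[3].find(text=True))
        years.append(cells[4].find(text=True))
        former_capitals.append(cells[5].find(text=True))

# Build the DataFrame column by column (column names are assumptions)
df = pd.DataFrame(numbers, columns=["Number"])
df["State/UT"] = names
df["Administrative capital"] = admin_capitals
df["Legislative capital"] = legislative_capitals
df["Judicial capital"] = judicial_capitals
df["Year of establishment"] = years
df["Former capital"] = former_capitals
print(df.head())
```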
Finally, our data in the DataFrame looks like this:
Similarly, various other kinds of web crawling can be implemented with BeautifulSoup, which will reduce the effort of collecting data from the web by hand. You can also look at other attributes such as .parent, .contents, .descendants, .next_sibling, and .prev_sibling, as well as the various attributes for navigating by tag name. These will help you crawl web pages effectively.
But why can't I just use regular expressions?
Now, if you know regular expressions, you might think you could write code with them to do the same thing. Naturally, I wondered about this too, so I did the same task with both BeautifulSoup and regular expressions and found that:
Code written with BeautifulSoup is more robust than code written with regular expressions: regex-based code has to change whenever the page changes. Even though BeautifulSoup needs adjusting in some cases, it holds up comparatively better.
Regular expressions are much faster than BeautifulSoup; for the same result, a regular expression can be around 100 times faster than BeautifulSoup.
So it comes down to a trade-off between speed and code robustness, and there is no all-round winner. If the information you are looking for can be captured with simple regular expressions, you should use them. For almost any complex work, I usually recommend BeautifulSoup over regular expressions.
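As a rough illustration of the trade-off, here is the same simple task (collecting link targets) done both ways; the URL is a placeholder and nothing is timed here, this is only a sketch:

```python
import re
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.example.com").read()

# Regular-expression approach: fast, but tied to the exact markup style
links_regex = re.findall(r'href="(.*?)"', html)

# BeautifulSoup approach: slower, but tolerant of attribute order, quoting and whitespace
soup = BeautifulSoup(html, "html.parser")
links_soup = [a.get("href") for a in soup.find_all("a")]

print(links_regex)
print(links_soup)
```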
Conclusion
In this article we used two Python libraries, BeautifulSoup and urllib2. We also covered the basics of HTML and implemented a web crawl by solving a problem step by step. I suggest you practice this and use it to collect data from the web.