Extracting unstructured and structured data: regular expressions and the re module

Source: Internet
Author: User
Tags: processing text

Page parsing and data extraction

Generally speaking, we crawl the content of a website or an application in order to extract the useful parts. That content falls into two categories: unstructured data and structured data.

    • Unstructured data: the data comes first, then the structure.
    • Structured data: the structure comes first, then the data.
    • Different types of data need to be handled in different ways.

Unstructured data processing

Text, phone numbers, e-mail addresses
    • Regular expressions
HTML files
    • Regular expressions
    • XPath
    • CSS selectors

Structured data processing

JSON files
    • JSONPath
    • Convert to Python types for processing (json module)
XML files
    • Convert to Python types (xmltodict)
    • XPath
    • CSS selectors
    • Regular expressions
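
As a rough illustration of the two families above, the sketch below pulls e-mail addresses out of free text with a regular expression, then reads a field from JSON by converting it to Python types with the standard json module. The sample text, the JSON payload, and the simplified e-mail pattern are all assumptions made up for this example; real addresses need a more careful pattern.

```python
import json
import re

# Unstructured text: there is no schema, so we describe what we want with a pattern.
text = "Contact alice@example.com or bob@example.org for details."
# Simplified e-mail pattern, for illustration only (an assumption of this sketch).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Structured data: the structure is known, so we convert it and index into it.
payload = '{"user": {"name": "alice", "tags": ["admin", "dev"]}}'
data = json.loads(payload)
first_tag = data["user"]["tags"][0]

print(emails)
print(first_tag)
```

The regular expression has to guess at boundaries inside the text, while the JSON lookup simply follows keys that the structure already defines.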

Why you should learn regular expressions

A crawler involves four main steps:

    1. Define the goal (know which site or which range of pages you are going to search)
    2. Crawl (download all of the site's content)
    3. Extract (discard the data that is of no use to us)
    4. Process the data (store and use it in the way we want)

In yesterday's case we actually skipped step 3, the "extract" step. The data we download is the entire web page; it is large and messy, and most of it is of no interest to us, so we need to filter it and match out the parts we need.

For filtering text or matching it against rules, the regular expression is the most powerful tool, and it is indispensable in the Python crawler world.

What is a regular expression

A regular expression (often shortened to regex) is typically used to find and replace text that conforms to a pattern (rule).

A regular expression is a logical formula for operating on strings: a "rule string", built from predefined special characters and combinations of them, that expresses filtering logic for strings.

Given a regular expression and another string, we can do the following:

  • Determine whether the given string conforms to the filtering logic of the regular expression (a "match");
  • Use the regular expression to get the specific part we want out of the text string (a "filter").
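
The two purposes can be sketched with Python's re module, which the rest of this article covers. The sample sentence and the phone-number pattern are assumptions chosen only for illustration.

```python
import re

text = "Call 555-1234 or 555-9876 after 6pm."
phone = r"\d{3}-\d{4}"  # hypothetical rule: three digits, a dash, four digits

# "Match": does the text conform to the rule at all?
has_phone = re.search(phone, text) is not None

# "Filter": pull out the specific parts we want.
numbers = re.findall(phone, text)

print(has_phone)
print(numbers)
```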

Regular expression matching rules

Python's re module

In Python, we can use the built-in re module to use regular expressions.

Note that regular expressions use the backslash to escape special characters, and so do Python string literals. To keep a pattern exactly as written, use a raw string by adding the r prefix, for example:

r'chuanzhiboke\t\.\tpython'
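
To make the effect of the r prefix concrete, here is a small sketch comparing an ordinary string literal with a raw one. The sample pattern and input text are assumptions for illustration.

```python
import re

normal = "\\d+\\.\\d+"   # ordinary string: every backslash must be doubled
raw = r"\d+\.\d+"        # raw string: backslashes are passed through unchanged

same_pattern = (normal == raw)
price = re.search(raw, "total: 12.50 USD").group()

# In an ordinary string "\t" is one tab character; in a raw string it is
# two characters, a backslash followed by "t".
lengths = (len("\t"), len(r"\t"))

print(same_pattern, price, lengths)
```

Both spellings produce the same pattern text; the raw form is simply easier to read and harder to get wrong.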
The general steps for using the re module are as follows:
    1. Use the compile() function to compile the string form of a regular expression into a Pattern object.

    2. Use the methods provided by the Pattern object to search the text; a successful search returns a Match object.

    3. Finally, use the properties and methods of the Match object to get information and perform other operations as needed.
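
The three steps can be sketched end to end as follows. The date-like pattern and the sample sentence are assumptions made up for this example.

```python
import re

# Step 1: compile the pattern string into a Pattern object.
pattern = re.compile(r"(\d{4})-(\d{2})")

# Step 2: call a Pattern method on the text to obtain a Match object (or None).
m = pattern.search("released 2017-09, updated 2018-01")

# Step 3: read information from the Match object.
if m:
    whole = m.group()      # the entire matched text
    year = m.group(1)      # text captured by the first group
    position = m.span()    # (start, end) of the match
    print(whole, year, position)
```

Checking the result before using it matters, because the search returns None when nothing matches.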
Compile function

The compile function compiles regular expressions and generates a Pattern object, which is typically used in the following form:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

Now that the regular expression has been compiled into a Pattern object, we can use that object's methods to search text for matches.

Some common methods of Pattern objects include:

  • match: matches from the start position; returns a single match
  • search: matches from any position; returns a single match
  • findall: finds all matches; returns a list
  • finditer: finds all matches; returns an iterator
  • split: splits the string; returns a list
  • sub: replaces matches
Match method

The match method looks for a match at the head of a string (you can also specify a starting position). It matches only once: it returns as soon as one match is found rather than finding all matches. Its general form is:

match(string[, pos[, endpos]])

Here string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string; their default values are 0 and len(string) respectively. Therefore, when pos and endpos are not specified, the match method matches at the head of the string.

When the match succeeds, a Match object is returned; if there is no match, None is returned.

>>> import re
>>> pattern = re.compile(r'\d+')  # match one or more digits
>>> m = pattern.match('one12twothree34four')  # look at the head: no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10)  # start matching at 'e': no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10)  # start matching at '1': matches
>>> print(m)  # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)  # the 0 can be omitted
'12'
>>> m.start(0)  # the 0 can be omitted
3
>>> m.end(0)  # the 0 can be omitted
5
>>> m.span(0)  # the 0 can be omitted
(3, 5)

As shown above, a Match object is returned when the match succeeds, where:

    • The group([group1, ...]) method returns the string matched by one or more groups; to get the entire matched substring, use group() or group(0);

    • The start([group]) method returns the start position of the group's matched substring within the whole string (the index of the substring's first character); the parameter defaults to 0;

    • The end([group]) method returns the end position of the group's matched substring within the whole string (the index of the substring's last character + 1); the parameter defaults to 0;
    • The span([group]) method returns (start(group), end(group)).
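
Because end() is one past the last matched character, start() and end() can be used directly as slice bounds. The short sketch below checks that relationship on the same sample string; it is only an illustration of the definitions above.

```python
import re

text = "one12twothree34four"
m = re.compile(r"\d+").match(text, 3)  # start matching at index 3, the '1'

# Slicing the original string with start() and end() reproduces group().
assert text[m.start():m.end()] == m.group()
# span() is just the two positions packed into a tuple.
assert m.span() == (m.start(), m.end())

print(m.start(), m.end(), text[m.start():m.end()])
```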

Take a look at one more example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)  # match succeeded, a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)  # the entire matched substring
'Hello World'
>>> m.span(0)  # index range of the entire matched substring
(0, 11)
>>> m.group(1)  # substring matched by the first group
'Hello'
>>> m.span(1)  # index range of the first group's substring
(0, 5)
>>> m.group(2)  # substring matched by the second group
'World'
>>> m.span(2)  # index range of the second group's substring
(6, 11)
>>> m.groups()  # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)  # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
------------------------------------------------------------------------------------------------------
Search method

The search method looks for a match anywhere in the string. It also matches only once: it returns as soon as one match is found rather than finding all matches. Its general form is:

search(string[, pos[, endpos]])

Here string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string; their default values are 0 and len(string) respectively.

When the match succeeds, a Match object is returned; if there is no match, None is returned.

Let's take a look at the example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('one12twothree34four')  # the match method would not match here
>>> m
<re.Match object; span=(3, 5), match='12'>
>>> m.group()
'12'
>>> m = pattern.search('one12twothree34four', 10, 30)  # specify the search range
>>> m
<re.Match object; span=(13, 15), match='34'>
>>> m.group()
'34'
>>> m.span()
(13, 15)

Let's look at one more example:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

# Use search() to find a matching substring; it returns None when there is none.
# match() would not succeed here, because the string does not start with a digit.
m = pattern.search('hello 123456 789')
if m:
    # Use the Match object to get the matched text
    print('matching string:', m.group())
    # Start and end positions
    print('position:', m.span())

Execution result:

matching string: 123456
position: (6, 12)
------------------------------------------------------------------------------------------------------
Findall method

The match and search methods above both match only once: they return as soon as one match is found. Most of the time, however, we need to search the entire string and get all of the matches.

The findall method is used in the following form:

findall(string[, pos[, endpos]])

Here string is the string to be matched, and pos and endpos are optional parameters that specify the start and end positions within the string; their default values are 0 and len(string) respectively.

findall returns all matched substrings as a list, and returns an empty list if there is no match.

Take a look at the example:

import re

pattern = re.compile(r'\d+')  # find digits
result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)
print(result1)
print(result2)

Execution result:

['123456', '789']
['1', '2']

Here is another example:

# re_test.py
import re

# re.compile() takes a matching rule and returns a Pattern instance,
# which we then use to match strings.
pattern = re.compile(r'\d+\.\d*')

# pattern.findall() finds every substring that matches the rule
result = pattern.findall("123.141593, 'bigcat', 232312, 3.15")

# findall returns all matching substrings as a list
for item in result:
    print(item)

Execution result:

123.141593
3.15
------------------------------------------------------------------------------------------------------
Finditer method

The finditer method behaves much like findall: it also searches the entire string for all matches. However, it returns an iterator that yields each match result (a Match object) in order.

Take a look at the example:

import re

pattern = re.compile(r'\d+')

result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

print(type(result_iter1))
print(type(result_iter2))

print('result1...')
for m1 in result_iter1:  # m1 is a Match object
    print('matching string: {}, position: {}'.format(m1.group(), m1.span()))

print('result2...')
for m2 in result_iter2:
    print('matching string: {}, position: {}'.format(m2.group(), m2.span()))

Execution result:

<class 'callable_iterator'>
<class 'callable_iterator'>
result1...
matching string: 123456, position: (6, 12)
matching string: 789, position: (13, 16)
result2...
matching string: 1, position: (3, 4)
matching string: 2, position: (7, 8)
------------------------------------------------------------------------------------------------------
Split method

The split method splits the string at every substring the pattern matches and returns the pieces as a list. It is used in the following form:

split(string[, maxsplit])

Here maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.

Take a look at the example:

import re

p = re.compile(r'[\s\,\;]+')
print(p.split('a,b;; c   d'))

Execution result:

['a', 'b', 'c', 'd']
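
The example above does not use maxsplit, so here is a small sketch of its effect on the same input. The variable names are chosen for this illustration only.

```python
import re

p = re.compile(r"[\s,;]+")
text = "a,b;; c   d"

split_all = p.split(text)               # no limit: split at every match
split_one = p.split(text, maxsplit=1)   # at most one split; the rest is kept intact
split_two = p.split(text, maxsplit=2)   # at most two splits

print(split_all)
print(split_one)
print(split_two)
```

Once the limit is reached, the remainder of the string is returned as the last list element, separators included.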
------------------------------------------------------------------------------------------------------
Sub method

The sub method is used for substitution. It is used in the following form:

sub(repl, string[, count])

Here repl can be a string or a function:

    • If repl is a string, each matched substring is replaced with repl and the resulting string is returned. repl can also reference groups in the form \id (for example \1), but the number 0 cannot be used this way;

    • If repl is a function, it should accept exactly one argument (the Match object) and return the replacement string (group references in the returned string are not expanded);

    • count specifies the maximum number of replacements; if it is not specified, all matches are replaced.

Take a look at the example:

import re

p = re.compile(r'(\w+) (\w+)')  # \w matches a word character: letter, digit, or underscore
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))        # reference the groups

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))         # replace at most once

Execution result:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456
------------------------------------------------------------------------------------------------------
Matching Chinese

In some cases we want to match the Chinese characters in a text. Note that Chinese characters mostly fall in the Unicode range [\u4e00-\u9fa5]. We say "mostly" because this range is not complete (for example, it does not include full-width Chinese punctuation), but it is sufficient in most cases.

Suppose you want to extract the Chinese text from the string title = '你好,hello,世界'. You can do this:

import re

title = '你好,hello,世界'
pattern = re.compile(r'[\u4e00-\u9fa5]+')
result = pattern.findall(title)
print(result)

Note the r prefix on the pattern: it marks a raw string, so the \u escapes reach the re module unchanged and the re module interprets them itself. In Python 3 every str is already Unicode, so the separate u prefix that Python 2 required (written together with r as ur) is no longer needed, and ur is a syntax error.

Execution result:

['你好', '世界']
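
To illustrate the remark that the range does not cover full-width punctuation, the sketch below uses full-width commas (U+FF0C) as separators. The sample string is an assumption made up for this example.

```python
import re

# Sample text separated by full-width (Chinese) commas, U+FF0C.
title = "你好\uff0chello\uff0c世界"
han = re.compile(r"[\u4e00-\u9fa5]+")

words = han.findall(title)
# The full-width comma lies outside \u4e00-\u9fa5, so the pattern never matches it.
comma_in_range = han.fullmatch("\uff0c") is not None

print(words)
print(comma_in_range)
```

If punctuation also needs to be captured, the character class has to be extended explicitly.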
Note: greedy mode vs. non-greedy mode
    1. Greedy mode: match as much as possible, provided the entire expression still matches (for example *);
    2. Non-greedy mode: match as little as possible, provided the entire expression still matches (add ? after the quantifier);
    3. Quantifiers in Python are greedy by default.
Example one: source string: abbbc
    • With the greedy quantifier, the regular expression ab* matches: abbb.

      The * matches as many b as possible, so every b after the a is included.

    • With the non-greedy quantifier, the regular expression ab*? matches: a.

      Even though there is a * before it, the ? makes it match as few b as possible, so no b is included.
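
Example one can be checked directly with a short sketch; the variable names are chosen for this illustration.

```python
import re

source = "abbbc"
greedy = re.match(r"ab*", source).group()   # * takes as many b's as it can
lazy = re.match(r"ab*?", source).group()    # *? takes as few b's as it can

print(repr(greedy), repr(lazy))
```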

Example two: source string: aa<div>test1</div>bb<div>test2</div>cc
    • Regular expression with a greedy quantifier: <div>.*</div>

    • Matching result: <div>test1</div>bb<div>test2</div>

Greedy mode is used here. The entire expression can already match successfully when it reaches the first "</div>", but because the mode is greedy, matching continues to the right to see whether a longer substring can also match. After the second "</div>" there is nothing further to the right that can match, so matching ends and the result is "<div>test1</div>bb<div>test2</div>".

    • Regular expression with a non-greedy quantifier: <div>.*?</div>

    • Matching result: <div>test1</div>

The second regular expression uses non-greedy mode. The entire expression matches successfully when it reaches the first "</div>", and because the mode is non-greedy, matching ends there without trying further to the right. The result is "<div>test1</div>".
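
Example two can be reproduced with the sketch below. The findall call at the end is an addition for illustration: it shows that the non-greedy form picks up each block separately.

```python
import re

source = "aa<div>test1</div>bb<div>test2</div>cc"

greedy = re.search(r"<div>.*</div>", source).group()
lazy = re.search(r"<div>.*?</div>", source).group()
# With findall, the non-greedy pattern yields each <div> block on its own.
blocks = re.findall(r"<div>.*?</div>", source)

print(greedy)
print(lazy)
print(blocks)
```

This is why the non-greedy form is usually the one wanted when pulling repeated elements out of a crawled page.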
