HTML pages usually declare their encoding. Determining that encoding is the first step in processing an HTML page, because a wrong encoding will inevitably cause problems later on. Here is a regular expression I wrote in Python for this purpose:
import re

# Sample <meta> tags with different quoting styles and charsets
a = ['<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />',
     '<meta http-equiv=Content-Type content="text/html; charset=gb2312">',
     '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">',
     '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />',
     '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />',
     '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />',
     '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
     ]
# Quotes around attribute values are optional; group 1 captures the charset name
b = r"""<meta[ ]+http-equiv=["']?content-type["']?[ ]+content=["']?text/html;[ ]*charset=([0-9a-zA-Z-]+)["']?"""
b = re.compile(b, re.IGNORECASE)
for ax in a:
    r1 = b.search(ax)
    if r1:
        print(r1.group())                    # the matched part of the tag
        print(r1.group(1), len(r1.group()))  # charset name and match length
    else:
        print('not match')
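
As a rough sketch of how the detected charset might then be used, the snippet below applies the same kind of pattern to the raw bytes of a page and decodes them. The names `META_CHARSET` and `decode_html`, the `utf-8` fallback, and the sample page are assumptions made for illustration and are not part of the original code.

```python
import codecs
import re

# Same idea as the pattern above, compiled against bytes so it can run
# before the page has been decoded. (Hypothetical name.)
META_CHARSET = re.compile(
    rb"""<meta[ ]+http-equiv=["']?content-type["']?[ ]+content=["']?text/html;[ ]*charset=([0-9a-zA-Z-]+)""",
    re.IGNORECASE,
)


def decode_html(raw, default="utf-8"):
    """Decode raw HTML bytes with the declared charset, else `default`.

    Hypothetical helper; the fallback value is an assumption.
    """
    m = META_CHARSET.search(raw)
    name = m.group(1).decode("ascii") if m else default
    try:
        codecs.lookup(name)  # raises LookupError for unknown charset names
    except LookupError:
        name = default
    # errors="replace" keeps decoding from failing on stray bytes
    return name, raw.decode(name, errors="replace")


# Made-up sample page whose body is encoded in gb2312
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312"></head>'
        b'<body>\xc4\xe3\xba\xc3</body></html>')
charset, text = decode_html(page)
print(charset)
print(text)
```

Matching on bytes avoids the chicken-and-egg problem of having to decode the page before its encoding is known. This works here because the tag itself consists only of ASCII characters.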