1. Extract tags from HTML
<("[^"]*"|'[^']*'|[^'">])*>
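A minimal sketch of how this pattern might be used in Perl to list every tag in a string. The variable names and the sample HTML are made up for illustration, and the group is written as non-capturing so that a list-context match returns whole tags.

```perl
use strict;
use warnings;

# Hypothetical sample input; note the ">" inside a quoted attribute value.
my $html = q{<p class="a>b">Hello <b>world</b></p>};

# Quoted strings are consumed whole, so a ">" inside quotes
# does not end the tag early.
my $tag_re = qr{<(?:"[^"]*"|'[^']*'|[^'">])*>};

# In list context, /g returns every match of the pattern.
my @tags = $html =~ /$tag_re/g;
print "$_\n" for @tags;
```

For this sample the loop prints four tags, the first being the whole of <p class="a>b">, which a naive <[^>]*> would cut short at the quoted ">".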
2. Extract the URL and link text from an <a>...</a> tag
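# $html holds the HTML text to scan
while ($html =~ m{<a\b([^>]+)>(.*?)</a>}ig) {
    my $guts = $1;   # everything inside the opening <a ...> tag
    my $link = $2;   # the link text between <a ...> and </a>
    if ($guts =~ m{
            \b href          # the href attribute
            \s* = \s*        # "=" with optional whitespace on either side
            (?:              # the value is either
                "([^"]*)"    #   a double-quoted string,
              | '([^']*)'    #   a single-quoted string,
              | ([^'">\s]+)  #   or other unquoted text
            )
        }xi) {
        my $url = $+;        # $+ is the highest-numbered group that actually matched
        print "$url with link text: $link\n";
    }
}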
3. Validate an HTTP URL
An HTTP URL is split into two parts: the host name and the path.
The host name is whatever lies between "^http://" and the first "/" (if any). The path is everything from that "/" onward.
^http://([^/]+)(/.*)?$
# This version also allows an optional port number after the host name.
if ($url =~ m{^http://([^/:]+)(:(\d+))?(/.*)?$}i) {
    my $host = $1;
    my $port = $3 || 80;    # use $3 if present; otherwise default to 80
    my $path = $4 || "/";   # use $4 if present; otherwise default to "/"
    print "Host: $host\n";
    print "Port: $port\n";
    print "Path: $path\n";
} else {
    print "not an http url\n";
}
4. Find URLs in plain text. This is the skeleton of the pattern; for text with no leading protocol, it falls back on a more specific sub-expression that matches the host name.
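# Written for free-spacing (/x) mode
\b
# Leading part: protocol://hostname, or just a hostname
(
    (ftp|https?)://[-\w]+(\.\w[-\w]*)+
  |
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+    # subdomains
    (?-i: com\b | edu\b | biz\b | gov\b | in(?:t|fo)\b | mil\b | net\b | org\b
        | [a-z][a-z]\b )                            # lowercase TLD or two-letter country code
)
( : \d+ )?                                          # optional port number
(                                                   # optional path, starting with /
    /[^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
    (?: [.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+ )*
)?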
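As a rough illustration, the URL pattern might be applied to a piece of text in Perl as sketched below. The sample text and variable names are assumptions made for the example, and all groups are written as non-capturing so that a list-context match returns whole URLs. Note that the trailing path group is quantified with * so that a path with no internal punctuation, such as /inbox, is still picked up.

```perl
use strict;
use warnings;

# The URL pattern in free-spacing (/x) form, with non-capturing groups.
my $url_re = qr{
    \b
    (?:
        (?:ftp|https?)://[-\w]+(?:\.\w[-\w]*)+       # protocol://hostname
      |
        (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+  # or a bare hostname: subdomains
        (?-i: com\b | edu\b | biz\b | gov\b | in(?:t|fo)\b
            | mil\b | net\b | org\b | [a-z][a-z]\b ) # lowercase top-level domain
    )
    (?: : \d+ )?                                     # optional port
    (?:                                              # optional path
        /
        [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
        (?: [.!,?]+ [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+ )*
    )?
}x;

# Hypothetical sample text with punctuation right after each URL.
my $text = 'See http://www.example.com/docs/index.html, or mail.example.org:8080/inbox.';

# In list context, /g returns every match of the pattern.
my @urls = $text =~ /$url_re/g;
print "$_\n" for @urls;
```

The heuristic path part is what keeps the trailing comma and full stop out of the matches: punctuation is only accepted when more path characters follow it.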
From "Mastering Regular Expressions"