To extract the domain name from a URL, the first idea is usually a regular expression, and the second is to look for a suitable library. Pure regular-expression parsing has many gaps: the host part of a URL takes many forms, the set of domain suffixes keeps growing, and so on. A search turned up two kinds of approach: one combines Python's built-in modules with a regular expression, the other uses a third-party parsing module that resolves the domain name directly.
The URLs to resolve
urls = ["http://meiwen.me/src/index.html",
        "http://1000chi.com/game/index.html",
        "http://see.xidian.edu.cn/cpp/html/1429.html",
        "https://docs.python.org/2/howto/regex.html",
        """https://www.google.com.hk/search?client=aff-cs-360chromium&hs=TSj&q=url%E8%A7%A3%E6%9E%90%E5%9F%9F%E5%90%8Dre&oq=url%E8%A7%A3%E6%9E%90%E5%9F%9F%E5%90%8Dre&gs_l=serp.3...74418.86867.0.87673.28.25.2.0.0.0.541.2454.2-6j0j1j1.8.0.....0...1c.1j4.53.serp..26.2.547.IUHTJ4UOYHG""",
        "file:///D:/code/echarts-2.0.3/doc/example/tooltip.html",
        "http://api.mongodb.org/python/current/faq.html#is-pymongo-thread-safe",
        "https://pypi.python.org/pypi/publicsuffix/",
        "http://127.0.0.1:8000"
        ]
Using urlparse plus a regular expression
import re
from urlparse import urlparse  # Python 2; in Python 3 use: from urllib.parse import urlparse

# Known domain suffixes; any suffix missing from this tuple will not be matched.
topHostPostfix = (
    '.com', '.la', '.io', '.co', '.info', '.net', '.org', '.me', '.mobi',
    '.us', '.biz', '.xxx', '.ca', '.co.jp', '.com.cn', '.net.cn',
    '.org.cn', '.mx', '.tv', '.ws', '.ag', '.com.ag', '.net.ag',
    '.org.ag', '.am', '.asia', '.at', '.be', '.com.br', '.net.br',
    '.bz', '.com.bz', '.net.bz', '.cc', '.com.co', '.net.co',
    '.nom.co', '.de', '.es', '.com.es', '.nom.es', '.org.es',
    '.eu', '.fm', '.fr', '.gs', '.in', '.co.in', '.firm.in', '.gen.in',
    '.ind.in', '.net.in', '.org.in', '.it', '.jobs', '.jp', '.ms',
    '.com.mx', '.nl', '.nu', '.co.nz', '.net.nz', '.org.nz',
    '.se', '.tc', '.tk', '.tw', '.com.tw', '.idv.tw', '.org.tw',
    '.hk', '.co.uk', '.me.uk', '.org.uk', '.vg', '.com.hk')

# One label without dots, followed by a known suffix at the end of the host.
regx = r'[^\.]+(' + '|'.join([h.replace('.', r'\.') for h in topHostPostfix]) + ')$'
pattern = re.compile(regx, re.IGNORECASE)

print "--" * 40
for url in urls:
    parts = urlparse(url)
    host = parts.netloc
    m = pattern.search(host)
    # Fall back to the whole host when no known suffix matches.
    res = m.group() if m else host
    print "unknown" if not res else res
Running it produces the following output:
meiwen.me
1000chi.com
see.xidian.edu.cn
python.org
google.com.hk
unknown
mongodb.org
python.org
127.0.0.1:8000
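The results are mostly acceptable, although a host whose suffix is not in the list (see.xidian.edu.cn, for example) is returned unchanged, and the IP address keeps its port.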
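For readers on Python 3, the same suffix idea can be sketched without a regular expression. This is only an illustration under stated assumptions, not the code used above: it assumes Python 3 (`urllib.parse` in place of `urlparse`), the suffix set `SUFFIXES` is a deliberately tiny, hypothetical stand-in for a real list, and the helper name `registered_domain` is made up for this example. It tries the longest candidate suffix first and returns `None` rather than the raw host when nothing matches.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately short suffix set; a real list would be much longer.
SUFFIXES = {"com", "org", "me", "cn", "hk", "com.hk", "edu.cn"}


def registered_domain(url, suffixes=SUFFIXES):
    """Return the matched suffix plus one label to its left, or None."""
    host = urlparse(url).hostname  # lower-cased, port stripped; None if absent
    if not host:
        return None
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "com.hk" is preferred over "hk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


print(registered_domain("https://www.google.com.hk/search?q=x"))
print(registered_domain("http://see.xidian.edu.cn/cpp/html/1429.html"))
```

Because the lookup is a set membership test, extending the suffix set does not require rebuilding a pattern, and using `hostname` instead of `netloc` drops the port before matching.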
Using urllib to resolve domain names
import urllib  # Python 2 API

print "--" * 40
for url in urls:
    # Split off the scheme, then split the host from the rest of the URL.
    proto, rest = urllib.splittype(url)
    res, rest = urllib.splithost(rest)
    print "unknown" if not res else res
Running it produces the following output:
meiwen.me
1000chi.com
see.xidian.edu.cn
docs.python.org
www.google.com.hk
unknown
api.mongodb.org
pypi.python.org
127.0.0.1:8000
This returns the full host name, including www. and other subdomain prefixes, so further processing is needed before the result can be used as a domain name.
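As one possible form of that further processing, the sketch below removes the port and a single leading `www.` from the host. It is an assumption-laden illustration rather than part of the original script: it targets Python 3 (`urllib.parse`), and the helper name `host_without_www` is hypothetical.

```python
from urllib.parse import urlparse


def host_without_www(url):
    """Return the lower-cased host with the port and one leading "www." removed."""
    host = urlparse(url).hostname  # None when the URL has no network location
    if not host:
        return None
    if host.startswith("www."):
        host = host[len("www."):]
    return host


print(host_without_www("https://www.google.com.hk/search?q=x"))
print(host_without_www("http://127.0.0.1:8000"))
```

This only handles the `www.` prefix. Other subdomains such as `docs.` or `api.` are left in place, which is why a suffix list is still needed to find the registered domain.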
Using the third-party module tld
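# In newer tld releases, get_fld() returns the registered domain,
# while get_tld() returns only the suffix.
from tld import get_tld

print "--" * 40
for url in urls:
    try:
        print get_tld(url)
    except Exception as e:
        # Raised for URLs without a recognizable domain (local files, IP addresses).
        print "unknown"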
Running it produces the following output:
meiwen.me
1000chi.com
xidian.edu.cn
python.org
google.com.hk
unknown
mongodb.org
python.org
unknown
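These results are acceptable: subdomains are stripped, and both the local file URL and the bare IP address are reported as unknown.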
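The last line of that output corresponds to a URL whose host is an IP address rather than a domain name. If that case needs to be told apart from a URL with no host at all, a check along the following lines could run before any suffix-based parsing. This is a sketch assuming Python 3 and its standard `ipaddress` module; the helper name `is_ip_host` is hypothetical and not part of any library mentioned here.

```python
import ipaddress
from urllib.parse import urlparse


def is_ip_host(url):
    """Return True when the URL's host is an IPv4 or IPv6 literal."""
    host = urlparse(url).hostname  # port and IPv6 brackets are already removed
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Not an IP literal, so it is an ordinary host name.
        return False


print(is_ip_host("http://127.0.0.1:8000"))
print(is_ip_host("http://meiwen.me/src/index.html"))
```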
Other parsing modules that you can use:
tld
tldextract
publicsuffix