Several ways to extract domain names from URLs in Python

Source: Internet
Author: User

When you need to extract the domain name from a URL, the first idea is usually a regular expression, and the second is to look for a suitable library. A regular expression alone leaves many gaps: the host part of a URL can take several forms, and the list of domain suffixes keeps growing. A Google search turned up several approaches. One combines a Python standard-library module with a regular expression; another uses a third-party parsing module that extracts the domain name directly.

The URLs to parse

urls = ["http://meiwen.me/src/index.html",
        "http://1000chi.com/game/index.html",
        "http://see.xidian.edu.cn/cpp/html/1429.html",
        "https://docs.python.org/2/howto/regex.html",
        """https://www.google.com.hk/search?client=aff-cs-360chromium&hs=TSj&q=url%E8%A7%A3%E6%9E%90%E5%9F%9F%E5%90%8DRE&OQ=URL%E8%A7%A3%E6%9E%90%E5%9F%9F%E5%90%8DRE&GS_L=SERP.3...74418.86867.0.87673.28.25.2.0.0.0.541.2454.2-6j0j1j1.8.0.....0...1c.1j4.53.serp..26.2.547.IUHTJ4UOYHG""",
        "file:///D:/code/echarts-2.0.3/doc/example/tooltip.html",
        "http://api.mongodb.org/python/current/faq.html#is-pymongo-thread-safe",
        "https://pypi.python.org/pypi/publicsuffix/",
        "http://127.0.0.1:8000"
        ]

Using urlparse plus a regular expression

# Python 2 code; in Python 3 the urlparse module became urllib.parse.
import re
from urlparse import urlparse

# Known domain suffixes; any suffix missing from this tuple will not be matched.
topHostPostfix = (
    '.com', '.la', '.io', '.co', '.info', '.net', '.org', '.me', '.mobi',
    '.us', '.biz', '.xxx', '.ca', '.co.jp', '.com.cn', '.net.cn',
    '.org.cn', '.mx', '.tv', '.ws', '.ag', '.com.ag', '.net.ag',
    '.org.ag', '.am', '.asia', '.at', '.be', '.com.br', '.net.br',
    '.bz', '.com.bz', '.net.bz', '.cc', '.com.co', '.net.co',
    '.nom.co', '.de', '.es', '.com.es', '.nom.es', '.org.es',
    '.eu', '.fm', '.fr', '.gs', '.in', '.co.in', '.firm.in', '.gen.in',
    '.ind.in', '.net.in', '.org.in', '.it', '.jobs', '.jp', '.ms',
    '.com.mx', '.nl', '.nu', '.co.nz', '.net.nz', '.org.nz',
    '.se', '.tc', '.tk', '.tw', '.com.tw', '.idv.tw', '.org.tw',
    '.hk', '.co.uk', '.me.uk', '.org.uk', '.vg', '.com.hk')

# Match one label followed by a known suffix at the end of the host.
regx = r'[^\.]+(' + '|'.join([h.replace('.', r'\.') for h in topHostPostfix]) + ')$'
pattern = re.compile(regx, re.IGNORECASE)

print "--" * 40
for url in urls:
    parts = urlparse(url)
    host = parts.netloc
    m = pattern.search(host)
    # Fall back to the whole host when no known suffix matches.
    res = m.group() if m else host
    print "unknown" if not res else res

The output is as follows:

meiwen.me
1000chi.com
see.xidian.edu.cn
python.org
google.com.hk
unknown
mongodb.org
python.org
127.0.0.1:8000

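The results are mostly acceptable. Hosts whose suffix is not in the list, such as see.xidian.edu.cn, are returned unchanged, and an IP address is returned together with its port.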
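The listing above targets Python 2. As a minimal sketch of the same idea for Python 3, the following uses `urllib.parse.urlparse`. The short suffix tuple and the helper name `registered_domain` are assumptions made for illustration and are not from the original. It reads `hostname` rather than `netloc`, so the host is lowercased and the port is dropped, which differs slightly from the output shown above.

```python
import re
from urllib.parse import urlparse

# Deliberately short, illustrative suffix list (an assumption); extend as needed.
SUFFIXES = ('.com', '.org', '.me', '.hk', '.com.hk', '.co.uk')

# One label (no dots) followed by a known suffix at the end of the host.
PATTERN = re.compile(
    r'[^.]+(?:' + '|'.join(re.escape(s) for s in SUFFIXES) + r')$',
    re.IGNORECASE,
)


def registered_domain(url):
    """Return the matched domain, the bare host if no suffix matches, or 'unknown'."""
    host = urlparse(url).hostname or ''  # hostname is lowercased and has no port
    match = PATTERN.search(host)
    return (match.group() if match else host) or 'unknown'


for url in ('https://docs.python.org/2/howto/regex.html',
            'https://www.google.com.hk/search?q=url',
            'file:///D:/code/echarts-2.0.3/doc/example/tooltip.html',
            'http://127.0.0.1:8000'):
    print(registered_domain(url))
# python.org
# google.com.hk
# unknown
# 127.0.0.1
```

As with the original, any suffix missing from the tuple silently falls through to the full host name, so the list has to be maintained by hand.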

Using urllib to extract the host

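# Python 2: urllib.splittype and urllib.splithost split off the scheme and the host.
import urllib

print "--" * 40
for url in urls:
    proto, rest = urllib.splittype(url)   # e.g. 'http', '//meiwen.me/src/index.html'
    res, rest = urllib.splithost(rest)    # e.g. 'meiwen.me', '/src/index.html'
    print "unknown" if not res else res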

The output is as follows:

meiwen.me
1000chi.com
see.xidian.edu.cn
docs.python.org
www.google.com.hk
unknown
api.mongodb.org
pypi.python.org
127.0.0.1:8000

This approach returns the full host name, including prefixes such as www, so further processing is needed to obtain the domain itself.
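One simple form of that further processing is to strip a leading `www.` from the host. The sketch below does this in Python 3 with `urllib.parse.urlparse`; the helper name `host_without_www` is hypothetical. It only handles the `www` prefix, so other subdomains such as `api` or `docs` are left in place.

```python
from urllib.parse import urlparse


def host_without_www(url):
    """Return the lowercased host of url without a leading 'www.', or 'unknown'."""
    host = urlparse(url).hostname or ''  # None for URLs without a network location
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host or 'unknown'


print(host_without_www('https://www.google.com.hk/search?q=url'))          # google.com.hk
print(host_without_www('http://api.mongodb.org/python/current/faq.html'))  # api.mongodb.org
```

Reducing `api.mongodb.org` to `mongodb.org` still requires knowledge of the domain suffixes, which is what the other approaches in this article provide.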

Using the third-party module tld

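# Python 2 code using the third-party tld package.
# Note: in newer releases of tld, get_tld returns only the suffix;
# get_fld is the function that returns the registered domain.
from tld import get_tld

print "--" * 40
for url in urls:
    try:
        print get_tld(url)
    except Exception:
        # Raised for URLs without a recognizable suffix (local files, IP addresses).
        print "unknown"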

The output is as follows:

meiwen.me
1000chi.com
xidian.edu.cn
python.org
google.com.hk
unknown
mongodb.org
python.org
unknown

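These results are acceptable: xidian.edu.cn is now resolved correctly, and the local file URL and the bare IP address are both reported as unknown.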

Other parsing modules that you can use:

tld
tldextract
publicsuffix
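To give a rough idea of how suffix-list-based modules of this kind work, here is a minimal sketch of longest-suffix matching. It is not the implementation of any of the modules listed. The small `SUFFIXES` set and the function name `domain_from_suffix_list` are assumptions for illustration; the real Public Suffix List is far larger and includes wildcard and exception rules that this sketch ignores.

```python
# Tiny hand-written stand-in for the Public Suffix List (illustrative assumption).
SUFFIXES = {'me', 'com', 'org', 'cn', 'edu.cn', 'hk', 'com.hk'}


def domain_from_suffix_list(host, suffixes=SUFFIXES):
    """Return the longest known suffix plus one label, or None if nothing matches."""
    labels = host.lower().split('.')
    # Start from the leftmost label so the longest candidate suffix is tried first.
    for i in range(1, len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            return '.'.join(labels[i - 1:])
    return None


print(domain_from_suffix_list('see.xidian.edu.cn'))  # xidian.edu.cn
print(domain_from_suffix_list('www.google.com.hk'))  # google.com.hk
print(domain_from_suffix_list('127.0.0.1'))          # None
```

Because the longest suffix wins, `edu.cn` is preferred over `cn`, which is why `see.xidian.edu.cn` reduces to `xidian.edu.cn` here but not in the regular-expression approach with its incomplete suffix tuple.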
