Automatically crawl URLs with Python and submit them to Baidu
Yesterday a colleague mentioned that if you manually submit your URLs to Baidu, the site's index count will go up.
That got me thinking: why not write a Python script that collects the URLs and submits them automatically?
The Python code is as follows:
import os
import re
import shutil

# file types the crawler should not download
reject_filetype = 'rar,7z,css,js,jpg,jpeg,gif,bmp,png,swf,exe'

def getinfo(webaddress):  # webaddress is the bare domain, e.g. www.o2oxy.cn
    global reject_filetype
    url = 'http://' + webaddress + '/'
    print 'getting>>>>> ' + url
    # wget mirrors the site into a folder named after the domain, under the current directory
    websitefilepath = os.path.abspath('.') + '/' + webaddress
    if os.path.exists(websitefilepath):
        shutil.rmtree(websitefilepath)  # remove any leftover mirror so the crawl starts clean
    outputfilepath = os.path.abspath('.') + '/output.txt'  # temporary file for wget's log
    fobj = open(outputfilepath, 'w+')
    # crawl the whole site with wget; -o sends the log to output.txt
    command = 'wget -r -m -nv --reject=' + reject_filetype + ' -o ' + outputfilepath + ' ' + url
    tmp0 = os.popen(command).readlines()  # run the command and capture anything on stdout
    print >> fobj, tmp0
    fobj.seek(0)  # rewind before reading, otherwise read() starts mid-file
    allinfo = fobj.read()
    # pull out everything between double quotes in the log, which is where the URLs sit
    target_url = re.compile(r'\".*?\"', re.DOTALL).findall(allinfo)
    print target_url
    target_num = len(target_url)
    fobj1 = open('result.txt', 'w')  # result.txt collects the final URL list
    for i in range(target_num):
        if len(target_url[i][1:-1]) < 70:  # skip anything longer than 70 characters
            print >> fobj1, target_url[i][1:-1]  # strip the quotes and write the URL
        else:
            print "NO"
    fobj.close()
    fobj1.close()
    if os.path.exists(outputfilepath):
        os.remove(outputfilepath)  # delete the temporary output.txt

if __name__ == "__main__":
    webaddress = raw_input("Input the website address (without \"http:\") > ")
    getinfo(webaddress)
    print "Well done."
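If you'd rather run it non-interactively, the function can also be called directly; a trivial usage sketch (the domain is the one from the push script later in this post):

# usage sketch: crawl without the raw_input prompt
getinfo('www.o2oxy.cn')  # leaves the collected URLs in result.txt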
After the script runs, result.txt contains the collected URLs.
Then I needed an active-push script; I logged into the Baidu webmaster platform to find my submission address and token.
I wrote a quick-and-dirty shell script. I had originally wanted to fold the push into the Python script itself, but thought better of it (a sketch of what that would look like follows the output below).
[root@i9rkbm4yvq43laz script]# cat baiduurl.sh
cd /script && curl -H 'Content-Type:text/plain' --data-binary @result.txt "http://data.zz.baidu.com/urls?site=https://www.o2oxy.cn&token=P03781O3s6Ee" && curl -H 'Content-Type:text/plain' --data-binary @result.txt "http://data.zz.baidu.com/urls?site=https://www.o2oxy.cn&token=P03781O3s6E"
Running the script gives the following:
[root@i9rkbm4yvq43laz script]# sh baiduurl.sh
{"remain":4993750,"success":455}{"remain":4993295,"success":455}
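For completeness, here is a minimal sketch of what folding the push into the Python script might have looked like. It is Python 2 to match the crawler above, assumes the same result.txt, and reuses the endpoint from the shell script; push_to_baidu is a name made up for the sketch, not part of the original code:

import urllib2

# sketch only: the endpoint (site + token) is the one from baiduurl.sh above
BAIDU_API = 'http://data.zz.baidu.com/urls?site=https://www.o2oxy.cn&token=P03781O3s6Ee'

def push_to_baidu(result_file='result.txt'):  # hypothetical helper
    with open(result_file) as f:
        urls = f.read()  # Baidu expects one URL per line in a text/plain body
    req = urllib2.Request(BAIDU_API, data=urls,
                          headers={'Content-Type': 'text/plain'})
    print urllib2.urlopen(req).read()  # prints JSON like {"remain":...,"success":...}

Calling push_to_baidu() right after getinfo() in __main__ would make the whole thing a single step.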
Then I set it up as a scheduled task; a sketch of the cron entry is below.
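A minimal sketch of what that entry might look like, assuming the script sits in /script as in the transcript above; the nightly 02:30 time and the log path are illustrative:

# illustrative root crontab entry: push the URL list every night at 02:30
30 2 * * * cd /script && sh baiduurl.sh >> /script/baiduurl.log 2>&1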
One caveat: collecting the URLs is slow; a full crawl takes maybe ten minutes.