This program mainly exercises multi-process and multi-threaded programming. The final results are saved in CSV format; if necessary, the storage can be switched to a database back end.
For an introduction to the libraries used, please refer to the official documentation (a minimal usage sketch follows the list):
- xpinyin.Pinyin : converts Chinese characters into pinyin
- concurrent.futures.ProcessPoolExecutor : process pool for multi-process execution
- concurrent.futures.ThreadPoolExecutor : thread pool for multi-threaded execution
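As a quick orientation before the full spider, here is a minimal, self-contained sketch of the three building blocks. The `square` helper and the sample city are illustrative only, not part of the original script:

```python
from xpinyin import Pinyin
import concurrent.futures

def square(n):
    return n * n

if __name__ == '__main__':
    # xpinyin: city initials, used later to build the Ganji subdomain
    p = Pinyin()
    print(p.get_initials('北京', '').lower())  # -> 'bj'

    # ProcessPoolExecutor: separate worker processes (suits CPU-bound work)
    with concurrent.futures.ProcessPoolExecutor() as pool:
        print(list(pool.map(square, range(5))))

    # ThreadPoolExecutor: threads in one process (suits I/O-bound work such as HTTP requests)
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(square, range(5))))
```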
```python
# -*- coding: utf-8 -*-
# @Author: Studog
# @Date: 2017/5/24 9:27

import csv
import os
import concurrent.futures

import requests
import lxml.html as html
from xpinyin import Pinyin


class GanjiSpider(object):

    def __init__(self):
        self.city = input("Please enter the city name:\n")
        # Build the subdomain from the city's pinyin initials, e.g. 北京 -> 'bj'
        p = Pinyin()
        city_name = p.get_initials(self.city, '').lower()
        self.url = 'http://{0}.ganji.com/v/zhaopinxinxi/p1/'.format(city_name)
        self.save_path = r'E:\data\ganji.csv'
        file_dir = os.path.split(self.save_path)[0]
        if not os.path.isdir(file_dir):
            os.makedirs(file_dir)
        if not os.path.exists(self.save_path):
            # Windows-specific way to create the file; open(path, 'w').close() is portable
            os.system(r'echo > %s' % self.save_path)

    def get_job(self):
        # Write the CSV header once, then crawl page by page
        with open(self.save_path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['Position name', 'Monthly salary', 'Minimum education',
                             'Work experience', 'Age', 'Number of recruits', 'Work place'])
        flag = True
        while flag:
            page = html.fromstring(requests.get(self.url).text)
            content = page.xpath("//li[@class='fieldulli']/a/@href")
            next_page = page.xpath("//li/a[@class='next']/@href")
            # One worker process per job category link
            with concurrent.futures.ProcessPoolExecutor() as executor:
                executor.map(self.get_url, content)
            if next_page:
                self.url = next_page[0]
            else:
                flag = False

    def get_url(self, html_page):
        page = html.fromstring(requests.get(html_page).text)
        job_list = page.xpath("//dl[@class='job-list clearfix']/dt/a/@href")
        # One thread per job detail page within each category process
        with concurrent.futures.ThreadPoolExecutor() as executor:
            executor.map(self.get_info, job_list)

    def get_info(self, job_url):
        page = html.fromstring(requests.get(job_url).text)
        name = page.xpath("//li[@class='fl']/em/a/text()")
        info = page.xpath("//li[@class='fl']/em/text()")[1:]
        address = page.xpath("//li[@class='fl w-auto']/em//text()")
        if name and len(info) == 5 and address:
            info[2] = info[2].strip()
            address[2] = address[2].strip()
            address = ''.join(address)
            info.append(address)
            name.extend(info)
            print(name)
            # Append one row per job; opened in append mode so parallel workers can write
            with open(self.save_path, 'a', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(name)


if __name__ == '__main__':
    gj = GanjiSpider()
    gj.get_job()
```
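As noted at the top, the CSV output could be swapped for a database version. A minimal sketch using the standard-library `sqlite3` module might look like the following; the table name, column names, and `save_row` helper are assumptions for illustration, not part of the original script:

```python
# Hypothetical sketch: replacing the CSV writer with sqlite3 (standard library).
import sqlite3

def save_row(db_path, row):
    """Insert one scraped job record (7 fields); creates the table on first use."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "position TEXT, salary TEXT, education TEXT, "
            "experience TEXT, age TEXT, headcount TEXT, place TEXT)"
        )
        conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?, ?)", row)
        conn.commit()
    finally:
        conn.close()
```

Calling `save_row(r'E:\data\ganji.db', name)` in `get_info` would replace the CSV append; SQLite serializes concurrent writes on its own.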
The script above uses Python to crawl Ganji job information in a Windows environment. Note that the `if __name__ == '__main__':` guard is required on Windows, where multiprocessing starts each worker by spawning a fresh interpreter that re-imports the module.