Two Python programs take care of NCBI data search and save the results to Excel

Source: Internet
Author: User

I recently ended up with a large amount of mass spectrometry data, but many of the proteins in it have already been studied. To find the hypothetical proteins that are specific to the bait protein, I decided to write Python programs that filter out the unwanted entries and keep only the ones of interest.

Scheme:

1. Among the proteins specific to the bait in all the mass spectrometry data, identify the hypothetical proteins and sort them by score from high to low.

2. Using each protein's serial number, look up the domains the hypothetical protein may contain and write them to an Excel file.

3. Just do it.

The first step uses the uniqueness property of Python sets to de-duplicate the entries, regular expressions from the re module to find the serial numbers, and openpyxl to write the results to Excel, where they can be sorted by score.
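As a small illustration of the idea, the sketch below extracts serial numbers from a few invented lines and uses set operations to keep those found only in the bait sample. The sample text, the helper name `extract_ids`, and the "MGG_ plus five digits" pattern are assumptions for illustration, not real data.

```python
import re

# Assumed ID pattern: "MGG_" followed by five digits.
ID_PATTERN = re.compile(r'MGG_\d{5}')

def extract_ids(text):
    """Return the set of unique serial numbers found in text."""
    return set(ID_PATTERN.findall(text))

# Invented sample data, standing in for the contents of three CSV files.
bait = "1,MGG_00001,hypothetical protein\n2,MGG_00002,actin\n3,MGG_00001,hypothetical protein\n"
control_a = "1,MGG_00002,actin\n"
control_b = "1,MGG_00003,tubulin\n"

# Difference (-) keeps IDs in the bait sample that appear in neither control;
# union (|) merges the two control sets first.
unique_ids = extract_ids(bait) - (extract_ids(control_a) | extract_ids(control_b))
print(unique_ids)
```

Because a set stores each value once, the repeated ID in the bait sample collapses to a single entry without any extra bookkeeping.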

# Mass spectrometry protein de-duplication
import re
import openpyxl

reg = re.compile(r'MGG_\d{5}')

def read_csv(name):
    # Return the set of unique serial numbers found in the file
    with open(name, 'r') as f:
        csv_data = f.read()
    csv_num = re.findall(reg, csv_data)
    return set(csv_num)

def write_excel(file, filename):
    # Uses the module-level set `unique` defined in the main block below
    wb = openpyxl.Workbook()
    ws = wb.active
    ws['A1'], ws['B1'], ws['C1'], ws['D1'], ws['E1'] = 'Serial number', 'Protein Information', 'score', 'Coverage Level', 'Molecular Weight'
    ws.freeze_panes = 'A2'
    for num in unique:
        for line in open(file):
            # print(line + '*****')
            if num in line and 'hypothetical protein' in line:
                # Drop the first column and the trailing newline
                ws.append(line.rstrip('\n').split(',')[1:])
    wb.save(filename)

if __name__ == '__main__':
    # Files to compare, given as absolute paths; each Excel sheet must be saved in .csv format
    atg3 = read_csv(r'C:\Users\zhuxueming\Desktop\ATG3.csv')
    vps9 = read_csv(r'C:\Users\zhuxueming\Desktop\vps9.csv')
    k3g4 = read_csv(r'C:\Users\zhuxueming\Desktop\K3G4.csv')
    # Data filtering: atg3 - (vps9 | k3g4) is the set of entries in ATG3 that are in neither VPS9 nor K3G4.
    # "-" is set difference, "|" is set union.
    unique = atg3 - (vps9 | k3g4)
    # unique_vps9 = vps9 - (k3g4 | atg3)
    # First argument: the file to be compared; second argument: the output Excel file and its path
    write_excel(r'C:\Users\zhuxueming\Desktop\ATG3.csv', r'C:\Users\zhuxueming\Desktop\unique_Atg31.xlsx')
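The script above appends rows in whatever order the set yields them, so the ordering by score has to happen separately. One possible way, sketched here with invented rows, is to sort the collected rows before they are written. The helper name `sort_by_score`, the sample values, and the assumption that the score is the third field and parses as a number are all illustrative.

```python
def sort_by_score(rows, score_index=2):
    """Return rows sorted by numeric score, highest first.

    Rows whose score cannot be parsed as a number are placed last.
    """
    def key(row):
        try:
            return (0, -float(row[score_index]))
        except (ValueError, IndexError):
            return (1, 0.0)
    return sorted(rows, key=key)

# Invented rows: serial number, description, score, coverage, molecular weight.
rows = [
    ['MGG_00001', 'hypothetical protein', '35.2', '12', '48.1'],
    ['MGG_00004', 'hypothetical protein', '120.7', '30', '22.5'],
    ['MGG_00007', 'hypothetical protein', 'n/a', '5', '60.0'],
]
for row in sort_by_score(rows):
    print(row)
```

Each sorted row could then be passed to `ws.append(row)` instead of appending directly inside the matching loop.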


The second step takes the serial numbers from that Excel file, looks up the domain information for each one on NCBI, and writes it to a new Excel file.

import requests
import re
import openpyxl
from bs4 import BeautifulSoup

head = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
url = r'https://www.ncbi.nlm.nih.gov/gene/?term='

wb = openpyxl.Workbook()
ws = wb.active
ws['A1'], ws['B1'], ws['C1'], ws['D1'], ws['E1'] = 'hypothetical protein information', 'score', 'Structural domain 1', 'Structural domain 2', 'Structural Domain 3'
ws.freeze_panes = 'A2'

def seq_data(file, filename):
    for line in open(file):
        if 'MGG_' in line:
            score = line.split(',')[2]
            mgg = re.search(r'MGG_\d{5}', line).group(0)
            full_url = url + mgg
            l = []
            l.append(mgg)
            l.append(score)
            try:
                res = requests.get(full_url, headers=head)
                res.encoding = 'utf-8'
                soup = BeautifulSoup(res.text, 'lxml')
                domain = soup.find_all("dd", class_='clearfix')  # get the tag contents
                for each in domain:
                    l.append(each.text)
            except Exception:
                # Skip the domain columns if the lookup fails
                pass
            ws.append(l)  # write the row to Excel
    wb.save(filename)  # save the workbook

if __name__ == '__main__':
    # Input CSV file, then the output Excel file and its path
    seq_data(r'C:\Users\zhuxueming\Desktop\unique_Atg31.csv', r'C:\Users\zhuxueming\Desktop\ATG3_special_hyp_protein_domain.xlsx')
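Both scripts split each line on commas, which would misalign the columns if a protein description itself contained a comma. A possible alternative is the standard library's csv module, which handles quoted fields. The sample line, the column layout (serial number, description, score), and the helper name `parse_row` are assumptions for illustration.

```python
import csv
import io
import re

ID_PATTERN = re.compile(r'MGG_\d{5}')  # assumed ID format

def parse_row(fields, score_index=2):
    """Return (serial number, score) for one parsed CSV row, or None if no ID is found."""
    match = ID_PATTERN.search(','.join(fields))
    if match is None or len(fields) <= score_index:
        return None
    return match.group(0), fields[score_index]

# Invented sample: the description contains a comma inside quotes.
sample = 'MGG_00001,"hypothetical protein, partial",35.2,12,48.1\nno id here,x,1\n'
results = [parse_row(fields) for fields in csv.reader(io.StringIO(sample))]
print(results)
```

With a plain `split(',')`, the quoted description in the first sample row would be cut in two and the score would shift one column to the right.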
