# coding: utf-8
import os
from bs4 import BeautifulSoup

# Folder containing the JSP files
folder_path = "e:/whm/google/src_jsp"

for dirpath, dirnames, filenames in os.walk(folder_path):
    for filename in filenames:
        if not filename.endswith(".jsp"):
            continue
        path = os.path.join(dirpath, filename)
        # Read as bytes so BeautifulSoup can detect the encoding itself
        with open(path, "rb") as file:
            soup = BeautifulSoup(file, "html.parser")
        if soup.header is not None:
            soup.header.extract()
        # Attribute selector: find() returns only the first element that matches the rule
        banner = soup.find(attrs={"role": "banner"})
        if banner is not None:
            banner.extract()
        column = soup.find(attrs={"class": "col-xs-3"})
        if column is not None:
            column.extract()
        # prettify() returns the document as a formatted HTML string;
        # the file is written back as UTF-8
        with open(path, "w", encoding="utf-8") as file:
            file.write(soup.prettify(formatter=None))
I ran into a bug when processing JSP pages this way. So do not use BeautifulSoup to handle script pages such as JSP and PHP; use regular expressions instead. This is the conclusion I reached after half a day of trial and error.
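The regular-expression approach recommended above could look roughly like the sketch below. It is not the author's code: the helper name `strip_header`, the pattern, and the sample text are assumptions made for illustration. The idea is to delete the first `<header>` element from the raw file text and leave everything else, including JSP tags such as `<%@ ... %>` and `<%= ... %>`, untouched.

```python
import re

# Hypothetical pattern (not from the original post): matches one
# <header ...>...</header> block, across line breaks, case-insensitively.
HEADER_RE = re.compile(r"<header\b[^>]*>.*?</header>", re.DOTALL | re.IGNORECASE)


def strip_header(jsp_text):
    """Return jsp_text with the first <header> element removed."""
    return HEADER_RE.sub("", jsp_text, count=1)


# Made-up sample page used only to exercise the helper.
sample = (
    '<%@ page contentType="text/html" %>\n'
    '<header class="top"><h1>Title</h1></header>\n'
    '<p><%= user.getName() %></p>\n'
)

cleaned = strip_header(sample)
print(cleaned)
```

Because the text is never parsed and re-serialized, the JSP directives and expressions pass through byte for byte. The trade-off is that a pattern like this does not handle nested `<header>` elements, so it only suits pages whose markup is regular enough to match reliably.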
Python BeautifulSoup basic usage