A Study of Garbled Characters in Invoke-WebRequest and Invoke-RestMethod







Keywords: PowerShell, Invoke-WebRequest, Invoke-RestMethod, garbled text, encoding, charset, CharacterSet



Original work, the only one of its kind: it explains why Invoke-WebRequest and Invoke-RestMethod return garbled text and gives a solution.



Original article by PowerShell Missionary, 2016-05-01. Reprinting is permitted, but the author's name and the source must be retained; otherwise legal action may be pursued.





------------"Chapter 1: Encoding basics"-----------------


An encoding type and an encoded value form an inseparable pair. All garbled text arises because only the encoded value is known while the encoding type is not! For example:



Only when the encoded value "4F 58" is put together with the encoding type "UTF16" can we know that the content above is "PS missionary".
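To make the point concrete, here is a minimal Python sketch (Python is used only so the example runs offline; the helper name `code_points` is made up for illustration). It decodes the same two bytes, 4F 58, under three different encoding types and shows that each one yields different characters.

```python
def code_points(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding and list the resulting code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in raw.decode(encoding))

raw = bytes([0x4F, 0x58])  # the encoded value from the text

for encoding in ("utf-16-le", "utf-16-be", "latin-1"):
    # Same bytes, different encoding type, different characters.
    print(f"{encoding:10} -> {code_points(raw, encoding)}")
```

Read as little-endian UTF-16 the bytes are one character (U+584F), read as big-endian UTF-16 they are a different one (U+4F58), and read as Latin-1 they are the two letters "OX". Without the encoding type there is no way to tell which reading was intended.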



This is also why Microsoft uses a "BOM header" (byte order mark) at the start of text files.
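As a rough illustration of how a BOM header carries the encoding type along with the bytes, the following Python sketch checks the first bytes of a buffer against the BOM constants in the standard `codecs` module. The function name `decode_with_bom` is hypothetical, and a real reader would also need a policy for files that have no BOM.

```python
import codecs

# Longer BOMs are checked first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data: bytes):
    """Return (encoding, text) if the data starts with a known BOM, else (None, None)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):].decode(name)
    return None, None  # no BOM: the encoding type is unknown

sample = codecs.BOM_UTF16_LE + "PS".encode("utf-16-le")
print(decode_with_bom(sample))
print(decode_with_bom(b"PS"))
```

With the BOM present the encoding type is read from the data itself; without it the function can only report that it does not know.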



I have met some annoying people, real oddballs, who refuse to use the "BOM header" that Microsoft uses in text files and insist on guessing the encoding with unorthodox tricks. That leads to:



1) Guessing inevitably has some probability of error. That is deliberately feeding yourself garbled text.



2) Some documents, such as HTML, may combine several encodings. A separate charset may be declared inside "<>". In this one-file, many-encodings scenario, guessing is even more likely to go wrong.



3) A .py file without a "BOM header" must use a "coding:" declaration or the like. The two are the same kind of thing: both identify the encoding type.



You could use neither a "BOM header" nor "coding:" and rely on pure guessing! Then the script's encoding is unknown, the Chinese comments fail to parse, and the script refuses to run, which serves you right. Without either a "BOM header" or "coding:", the .py script does not work.
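The two identifiers can be seen side by side with Python 3's standard `tokenize.detect_encoding`, which looks for a UTF-8 BOM or a "coding:" declaration in the first two lines of a source file. This is a small sketch with made-up sample sources; note that Python 3 falls back to UTF-8 when neither identifier is present, which may differ from the interpreter the text had in mind.

```python
import io
import tokenize

def declared_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would use to read this source file."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding

with_coding = b"# -*- coding: gbk -*-\nprint('hi')\n"   # explicit "coding:" declaration
with_bom = b"\xef\xbb\xbfprint('hi')\n"                 # UTF-8 BOM header
bare = b"print('hi')\n"                                 # neither identifier

for name, src in (("coding:", with_coding), ("BOM", with_bom), ("neither", bare)):
    print(f"{name:8} -> {declared_encoding(src)}")
```

In the first two cases the encoding type travels with the file. In the third the interpreter simply assumes a default, which is only right if the author happened to save the file in that encoding.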






The "BOM header" only solves garbled text in plain-text files. When strings are transferred, the encoding type must travel with them as well. Once the encoding type is missing or unknown, garbled characters result.


--------------------"Chapter 2: Preface"--------------------


(In PowerShell) Two crawlers, two crawlers, running fast, crawling the web so well ~~~



One is the COM version of IE, the other is built on the WebRequest class in .NET; both are old grannies, nothing strange about it ...



Old as they are, they still crawl fast ...






If your system is Windows 8 or later, or Windows 7 with PowerShell 4.0 or 5.0 installed, then PowerShell comes with two commands, "Invoke-WebRequest" and "Invoke-RestMethod". The first returns an object; the second returns the (entire page) string.



These two commands sometimes return garbled text. For a long time I thought the commands had a decoding bug, but later I found that when the result is written to a file with the built-in -OutFile parameter, the encoding is correct. In other words, the data is fine; we just did not know how to decode it, and the only option was the slow method of writing to disk.



Later I read these two posts by the cnblogs (Blog Park) user "Old Horse Talks Programming" and finally figured it out. Thanks to him! Please read these two crash-course articles on garbled text as well.



http://www.cnblogs.com/swiftma/p/5420145.html



http://www.cnblogs.com/swiftma/p/5430007.html





------------"Chapter 3: Main text"-----------------


Versions in which the commands garble text:



All versions of PowerShell.






Cause of the garbling: the following accounts for roughly 90% or more of cases.



The page is encoded as UTF-8, but after the command receives the page source, the encoding type is mistaken or lost. The UTF-8 page source is wrongly taken to be ISO-8859-1, converted from that to UTF-8 a second time, and then presented to us.
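The mechanism can be reproduced without any network access. The Python sketch below, using an arbitrary two-character sample string, takes UTF-8 bytes, misreads them as ISO-8859-1, and encodes the result as UTF-8 again, which is the chain of events described above.

```python
original = "中文"

utf8_bytes = original.encode("utf-8")        # what the server actually sent
misread = utf8_bytes.decode("iso-8859-1")    # wrong encoding type applied on receipt
reencoded = misread.encode("utf-8")          # the mistaken text converted to UTF-8 again

print(len(original), "characters in the original")
print(len(utf8_bytes), "bytes on the wire")
print(len(misread), "characters after the misread")
print(len(reencoded), "bytes after the second conversion")
```

Every byte of the original UTF-8 data becomes its own Latin-1 character, so two characters turn into six, and the second conversion doubles the byte count. No information is lost along the way, which is why the damage can be undone.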






PowerShell commands to reproduce the bug:


#(Invoke-WebRequest -Uri 'http://www.msn.com').BaseResponse.CharacterSet  # UTF-8 web page, but returns iso-8859-1
Invoke-RestMethod -Uri 'http://www.msn.com'








Fix: reverse the conversion described above.






PowerShell commands to work around the bug:


$utf8 = [System.Text.Encoding]::GetEncoding(65001)
$iso88591 = [System.Text.Encoding]::GetEncoding(28591)  # ISO 8859-1, Latin-1
$wrong_string = Invoke-RestMethod -Uri 'http://www.msn.com'
$wrong_bytes = $utf8.GetBytes($wrong_string)
$right_bytes = [System.Text.Encoding]::Convert($utf8, $iso88591, $wrong_bytes)  # Take a closer look here
$right_string = $utf8.GetString($right_bytes)  # Take a closer look here
Write-Host $right_string
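For readers who want to check the logic of the reverse conversion offline, here is a Python sketch that mirrors the same three steps on a locally garbled string. The function name `repair` is made up for illustration, and the sketch assumes the string really was garbled by a UTF-8 to ISO-8859-1 misread.

```python
def repair(wrong_string: str) -> str:
    """Undo a UTF-8 page that was misread as ISO-8859-1."""
    wrong_bytes = wrong_string.encode("utf-8")                       # the garbled string as UTF-8 bytes
    right_bytes = wrong_bytes.decode("utf-8").encode("iso-8859-1")   # convert UTF-8 -> ISO-8859-1: recovers the original bytes
    return right_bytes.decode("utf-8")                               # read those bytes as what they always were: UTF-8

garbled = "中文".encode("utf-8").decode("iso-8859-1")  # simulate the bug locally
print(repair(garbled) == "中文")
```

The key is the middle step: converting to ISO-8859-1 maps each garbled character back to the single byte it came from. If the input contains characters above U+00FF, it was not garbled this way and the ISO-8859-1 step raises an encoding error instead of silently producing wrong text.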





Conclusion:



At first I assumed, one-sidedly, that this might be related to Linux HTTP servers. Later I found that http://www.msn.com, which is an IIS site and an official Microsoft website, shows the same garbling, and finally concluded that this is a bug in the two commands Invoke-WebRequest and Invoke-RestMethod. I then submitted a bug report to Microsoft; getting rid of this garbling for good depends on Microsoft.



You are welcome to upvote this bug:



https://windowsserver.uservoice.com/forums/301869-powershell/suggestions/13685217-invoke-restmethod-and-invoke-webrequest-encoding-b



Q: How do I crawl data with PowerShell before this bug is fixed?



A: Please see my (reprinted) article "A PowerShell web spider that does not garble text": http://www.cnblogs.com/piapia/p/5093201.html


------------"Chapter 4: Postscript: common web page encoding types"-----------------


Windows code page    Name

936      GBK
54936    GB18030
932      Japanese
949      Korean
950      Big5
20127    us-ascii (US 7-bit)
1252     windows-1252 (often labeled iso-8859-1)
28591    iso-8859-1 (also known as Latin-1)
1200     utf-16
1201     utf-16 big-endian
12000    utf-32
12001    utf-32 big-endian
65001    utf-8

GB18030 uses variable-length encoding: some characters take two bytes and some take four (ASCII characters take one). In the two-byte form, the byte ranges are the same as in GBK. In the four-byte form, the first byte is 0x81 to 0xFE, the second is 0x30 to 0x39, the third is 0x81 to 0xFE, and the fourth is 0x30 to 0x39. When parsing the bytes, how do you know whether two bytes or four represent one character? Look at the second byte: if it falls in 0x30 to 0x39, the character is four bytes long, because the second byte of a two-byte character is always larger than that.
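The GB18030 rule for telling two-byte characters from four-byte ones (look at the second byte: 0x30 to 0x39 means four bytes) can be sketched in a few lines of Python. The function name `split_gb18030` is hypothetical, and the sketch assumes well-formed input; it does no validation of the remaining bytes.

```python
from typing import List

def split_gb18030(data: bytes) -> List[bytes]:
    """Split well-formed GB18030 bytes into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        if data[i] <= 0x7F:
            width = 1   # single byte: plain ASCII
        elif i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39:
            width = 4   # second byte is 0x30-0x39: four-byte form
        else:
            width = 2   # otherwise: two-byte form, same ranges as GBK
        chunks.append(data[i:i + width])
        i += width
    return chunks

text = "A中" + chr(0x20000)   # ASCII, a common Chinese character, a character outside the BMP
data = text.encode("gb18030")
for chunk in split_gb18030(data):
    print(chunk.hex(" "), "->", len(chunk), "byte(s)")
```

The three sample characters come out as chunks of one, two, and four bytes, and decoding each chunk separately gives back the original characters.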



GB2312, GBK, and GB18030 are compatible with one another. Since the pages in question are all Simplified Chinese, the three can be treated as the same encoding. So the commonly used web (!) encodings are just GBK, Big5, UTF-8, ISO 8859-1, and 1252, and the commonly used text (!) encodings are just GBK, Big5, UTF-8, ISO 8859-1, 1252, and UTF-16LE.
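The compatibility claim is easy to check for ordinary Simplified Chinese text. This Python sketch encodes one arbitrary sample string with all three codecs and compares the bytes; it only demonstrates the overlap for characters that exist in GB2312, not for the extra characters the larger sets add.

```python
text = "简体中文"
encodings = ("gb2312", "gbk", "gb18030")

# Encode the same text with each of the three encodings.
byte_forms = {name: text.encode(name) for name in encodings}

for name, raw in byte_forms.items():
    print(f"{name:8} {raw.hex(' ')}")

# Bytes written as GB2312 can be read back with either of the larger encodings.
print(byte_forms["gb2312"].decode("gbk") == text)
print(byte_forms["gb2312"].decode("gb18030") == text)
```

All three byte sequences are identical for this sample, which is why a Simplified Chinese page declared as any of the three can usually be decoded with GB18030.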



Excerpt from: https://msdn.microsoft.com/zh-cn/library/system.text.encodinginfo.codepage.aspx





------------"Chapter 5: Related questions"-----------------





Q: How do I get the page encoding?



A: Download the page and look for the charset keyword in it.


$url = 'http://...'
$page_charset_string = ... $url ... "content-type.*charset"  # for example this Baidu page; some pages have no "`n" line breaks
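As an offline illustration of looking for the charset keyword, the Python sketch below searches the raw bytes of a page for a charset declaration inside a meta tag. The regular expression, the 4096-byte window, and the sample pages are assumptions made for this example; real pages can declare their encoding in other ways, and the search runs on bytes precisely because the encoding is not yet known.

```python
import re

# Matches both <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
CHARSET_RE = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE)

def sniff_charset(page: bytes):
    """Return the charset declared in a meta tag near the top of the page, or None."""
    match = CHARSET_RE.search(page[:4096])
    return match.group(1).decode("ascii").lower() if match else None

old_style = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=GB2312"></head>'
new_style = b'<html><head><meta charset="utf-8"><title>t</title></head>'
no_decl = b"<html><head><title>t</title></head></html>"

for page in (old_style, new_style, no_decl):
    print(sniff_charset(page))
```

Because the pattern does not depend on line breaks, it also works on pages that are delivered as one long line.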





Q: How do "Invoke-WebRequest" and "Invoke-RestMethod" get the page encoding?



A:



Note from PowerShell Missionary: this way of getting the encoding is unreliable, and some of the results are wrong.


(Invoke-WebRequest -Uri ...)...  # returns utf-8
(Invoke-WebRequest -Uri ...)...  # returns GB2312
(Invoke-WebRequest -Uri ...)...  # text/html
(Invoke-WebRequest -Uri ...)...  # iso-8859-1
(Invoke-WebRequest -Uri http://www.scielo.br).BaseResponse.CharacterSet
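One plausible reason header-based detection misleads is that a Content-Type header does not have to carry a charset parameter at all, and a client then has to fall back on some default; the older HTTP/1.1 specification named ISO-8859-1 as that default for text types. The Python sketch below shows the idea with the standard `email.message` parser. The function name `header_charset` and the choice of default are assumptions for illustration, not a description of what the PowerShell commands do internally.

```python
from email.message import Message

def header_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset named in a Content-Type header value, or a fallback default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

print(header_charset("text/html; charset=UTF-8"))   # charset stated by the server
print(header_charset("text/html; charset=GB2312"))  # charset stated by the server
print(header_charset("text/html"))                  # nothing stated: the fallback is used
```

In the last case the answer says nothing about the page itself. A UTF-8 page served with a bare "text/html" header would be reported as ISO-8859-1 by this kind of logic.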








Q: How do I pass a value to a Web page?



A:


$text = ...
$postData = [System.Text.Encoding]::UTF8.GetBytes($text)
Invoke... $postData ... "text/plain; charset=utf-8"
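The principle behind the answer is the same pairing as in Chapter 1: send the encoded bytes and name the encoding type alongside them. Here is a minimal Python sketch using the standard `urllib.request` module; the URL and the function name `build_utf8_post` are placeholders invented for the example, and the request is only built, not sent.

```python
import urllib.request

def build_utf8_post(url: str, text: str) -> urllib.request.Request:
    """Build a POST request whose body is UTF-8 bytes and whose header says so."""
    body = text.encode("utf-8")  # encoded value
    headers = {"Content-Type": "text/plain; charset=utf-8"}  # encoding type
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_utf8_post("http://example.invalid/submit", "中文 text")
print(req.get_method())
print(req.get_header("Content-type"))
print(len(req.data), "bytes in the body")
```

Passing the body as bytes rather than as a string keeps the client from re-encoding it with some other default, and the charset in the header tells the server how to read those bytes.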







