English Wikipedia
https://dumps.wikimedia.org/enwiki/
Chinese Wikipedia
https://dumps.wikimedia.org/zhwiki/
List of all languages
https://dumps.wikimedia.org/backup-index.html
Extraction: WikiExtractor can be used to extract the article body. (Because the number of pages is very large and the markup is messy, the extracted text will have some defects and needs further cleaning afterwards.)
https://github.com/attardi/wikiextractor
Run command: python WikiExtractor.py -b 500M -o output_file_name input_file_name.xml
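WikiExtractor writes plain-text files made up of `<doc id=... url=... title=...> ... </doc>` blocks. A minimal sketch of post-processing that output, assuming that block format (the sample text and the `parse_docs` helper below are hypothetical, for illustration only):

```python
import re

# Hypothetical sample mimicking WikiExtractor output; real output is one or
# more files consisting of <doc ...> ... </doc> blocks.
sample = (
    '<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">\n'
    'Anarchism\n\nAnarchism is a political philosophy.\n'
    '</doc>\n'
)

DOC_RE = re.compile(r'<doc id="(\d+)"[^>]*title="([^"]*)">\n(.*?)\n</doc>', re.S)

def parse_docs(text):
    """Yield (id, title, body) triples from WikiExtractor-style output."""
    for m in DOC_RE.finditer(text):
        yield m.group(1), m.group(2), m.group(3).strip()

docs = list(parse_docs(sample))
print(docs[0][1])  # Anarchism
```

In practice you would iterate over the files in the `-o` output directory and feed each file's contents to a parser like this, then clean the bodies (drop empty lines, leftover markup, etc.).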
Notice:
1. It is recommended to post-process (clean) the extracted files further
2. If you are running under Windows, you need to make fileinput.FileInput() read with UTF-8 encoding via its openhook parameter, as follows:
input = fileinput.FileInput(input_file, openhook=fileinput.hook_encoded("utf-8"))
Note that this overrides the previous openhook setting (related to the compressed-file type handling?): fileinput.FileInput(openhook=fileinput.hook_compressed)
Reference: https://docs.python.org/3.5/library/fileinput.html
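The UTF-8 fix in note 2 can be sketched as follows (a minimal self-contained example; the sample file path and contents are made up for illustration):

```python
import fileinput
import os
import tempfile

# Write a small UTF-8 sample file (a stand-in for an extracted wiki file).
path = os.path.join(tempfile.gettempdir(), "wiki_sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("维基百科\nWikipedia\n")

# On Windows the default locale encoding may not be UTF-8, so force it via
# openhook; note this replaces any other openhook (e.g. hook_compressed).
lines = []
with fileinput.FileInput(path, openhook=fileinput.hook_encoded("utf-8")) as fi:
    for line in fi:
        lines.append(line.rstrip("\n"))

print(lines)  # ['维基百科', 'Wikipedia']
```

As the note above warns, only one openhook can be active at a time, so choosing hook_encoded means giving up hook_compressed's transparent handling of .gz/.bz2 inputs.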
Wikipedia corpus acquisition and extraction with Python 3.5