Under my directory:

Web Crawler by myself       # self-crawled data sets
    CN 5 depth              # crawled with nutch 0.7.2, restricted to the CN domain; includes links and text
    cn-2010-01-01           # crawled with nutch 0.7.2, restricted to the CN domain; includes links and text
    dlut.edu.cn 2010-01-01  # crawled with nutch 0.7.2, restricted to the dlut.edu.cn domain
    linkexchange 2010-09    # crawled outward starting from some link exchange directories; can be used to find a large number of link exchange sites (note: during this crawl the robots.txt handling was modified so that robots.txt rules were ignored, because these link exchange sites block search-engine indexing to protect themselves)
    ecml pkdd 2010 discovery challenge data set  # the data set from last year's ECML/PKDD Discovery Challenge; it targets web page quality and web page classification (broader than web spam), but it can also be used as a web spam data set
    law datasets            # a collection of Web data sets without spam labels
    web09-bst               # a large Web data set released in 2009; spam labels have been produced for it, so it can serve as a web spam data set
    webbspamcorpus          # web spam obtained by filtering links found in email spam; can be used for labeling
    WEBSPAM-LIP6-2006       # a relatively old data set built specifically for web spam research
    WEBSPAM-UK2006          # a relatively old data set built specifically for web spam research
    WEBSPAM-UK2007          # also built specifically for web spam research, but the number of labeled spam pages is rather small and may not reflect the real situation; do not use this data set alone for experiments, combine it with several of the others

Social
    bibsonomy dumps         # dumped from the bibsonomy database; all of the data is labeled, making it a good data set for social spam research. I signed an agreement with the provider for this data set, so please do not redistribute it; it may only be used inside the lab
    dataset for statistics and social network of YouTube videos  # mentioned in a paper, used to study YouTube
    delicious               # crawled from delicious, used to study social spam
    twitter                 # two Twitter data sets under this directory, consisting of the tweets and the Twitter graph
    # The first three of these were crawled by myself; the Twitter data was crawled by someone else.

Wiki                        # two Wikipedia data sets; in addition, the official Wikipedia website also provides dumps

splog                       # a spam blogs data set, old

news                        # a news data set crawled from some news sites

emails                      # an email data set

other
    AOL query clickthrough  # a few publicly released user clickthrough data sets, kept in one of the group directories
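For reference, the domain restriction mentioned for the nutch 0.7.2 crawls is normally configured through the regex URL filter in `conf/crawl-urlfilter.txt`. A minimal sketch of such a filter, assuming the stock nutch 0.7.x filter format (the exact patterns here are illustrative, not the ones used for these crawls):

```
# conf/crawl-urlfilter.txt (nutch 0.7.x style)

# skip non-http URLs
-^(file|ftp|mailto):

# skip common binary/image suffixes
-\.(gif|jpg|png|css|zip|gz|exe)$

# accept only hosts under the .cn top-level domain
+^http://([a-z0-9\-]*\.)*cn/

# reject everything else
-.
```

Replacing the accept line with `+^http://([a-z0-9\-]*\.)*dlut\.edu\.cn/` would correspondingly restrict a crawl to the dlut.edu.cn domain.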