I need to automate data collection from a website: the list of articles, plus the actual content behind each entry in the list. The list gives me the ID of each article, and every article is served through a uniform interface (pass the article ID as a parameter and get the corresponding JSON back). Part of that JSON is the data I need to collect and then analyze.
Is there a mature framework or ready-made "wheel" that meets my needs? It has to be multi-threaded and able to run 24/7, because the amount to collect is huge.
I'd also like to ask how to store the collected content (millions to tens of millions of records). Some of the data is numeric and needs statistical analysis. Would MySQL work, or is there a more mature and simpler tool I could use?
Reply content:
If it's data analysis:
MapReduce works for log analysis.
DPark can handle PV and UV analysis.
Spark is good, too.
For production data reports, pandas can do the analysis and the display.
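To make the PV/UV suggestion concrete, here is a minimal plain-Python sketch of a map step and a reduce step. It is not DPark or Spark code, only an illustration of the idea. The log format (one `user_id url` pair per line) and the sample records are assumptions.

```python
from collections import defaultdict

# Hypothetical access-log lines: "<user_id> <url>" (the format is an assumption).
LOG_LINES = [
    "u1 /article/1",
    "u2 /article/1",
    "u1 /article/1",
    "u3 /article/2",
]


def map_line(line):
    """Map step: turn one log line into a (url, user_id) pair."""
    user_id, url = line.split()
    return url, user_id


def reduce_pairs(pairs):
    """Reduce step: per URL, count page views (PV) and unique visitors (UV)."""
    views = defaultdict(int)
    visitors = defaultdict(set)
    for url, user_id in pairs:
        views[url] += 1
        visitors[url].add(user_id)
    return {url: (views[url], len(visitors[url])) for url in views}


stats = reduce_pairs(map_line(line) for line in LOG_LINES)
for url, (pv, uv) in sorted(stats.items()):
    print(f"{url}: PV={pv} UV={uv}")
```

On a single machine this is enough for small logs; the distributed tools mentioned above apply the same map and reduce steps across many workers.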
If it's data collection, there are a lot of tools.
Why do I get the feeling you're trying to build a search engine? The volume is fairly large, so I'd recommend something distributed.
Using MySQL is not very realistic ...
Isn't what you need just a crawler?
Crawler framework: Scrapy
Database choice: you can use MySQL; with proper indexes it will hold up at this scale for a long time to come.
You can also try MongoDB.
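A minimal sketch of the storage side, assuming a simple relational layout. It uses Python's built-in `sqlite3` as a stand-in so it runs without a database server; for MySQL the SQL would need small dialect changes. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption, for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE articles (
        article_id INTEGER PRIMARY KEY,  -- ID taken from the article list
        category   TEXT NOT NULL,        -- hypothetical field
        views      INTEGER NOT NULL      -- hypothetical numeric field
    )
    """
)
# Index the column that the statistics query groups by.
conn.execute("CREATE INDEX idx_articles_category ON articles (category)")

rows = [(1, "tech", 120), (2, "tech", 80), (3, "news", 50)]
# INSERT OR REPLACE keeps re-crawling the same ID from creating duplicates.
conn.executemany("INSERT OR REPLACE INTO articles VALUES (?, ?, ?)", rows)
conn.commit()

# Simple statistics: article count and average views per category.
summary = conn.execute(
    "SELECT category, COUNT(*), AVG(views) FROM articles "
    "GROUP BY category ORDER BY category"
).fetchall()
print(summary)
```

In MySQL the closest equivalents of `INSERT OR REPLACE` are `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`.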
You didn't say which language or environment you're using. For multi-threaded crawling, Node.js and Python are the usual choices these days. Both can store the data in something like MySQL. Millions of rows are never a problem.
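A rough sketch of the multi-threaded approach in Python, using only the standard library. The endpoint URL pattern, the `fetch_article` helper, and the JSON fields are all hypothetical, since the question does not name the site. The fetch function is passed in as a parameter so the example can run offline with a fake.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical endpoint; the real interface and parameter name are unknown.
API_URL = "https://example.com/api/article?id={article_id}"


def fetch_article(article_id):
    """Fetch one article's JSON over HTTP (assumed URL pattern)."""
    url = API_URL.format(article_id=article_id)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def crawl(article_ids, fetch=fetch_article, workers=8):
    """Fetch many articles concurrently; return results and failed IDs."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, aid): aid for aid in article_ids}
        for future in as_completed(futures):
            aid = futures[future]
            try:
                results[aid] = future.result()
            except Exception:
                failed.append(aid)  # keep the ID so it can be retried later
    return results, failed


def fake_fetch(article_id):
    """Offline stand-in for the real request; ID 3 simulates a failure."""
    if article_id == 3:
        raise IOError("simulated network error")
    return {"id": article_id, "views": article_id * 10}


results, failed = crawl([1, 2, 3, 4], fetch=fake_fetch)
print(sorted(results), failed)
```

For a job that runs around the clock, the failed IDs would typically be queued for retry, and some rate limiting added so the target site is not overloaded.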
Have you ever tried Python with Selenium + PhantomJS?
Scrapy, in Python, is still ...