Zheng @ playfun RT 20091124
Here are a few thoughts on social data mining, for reference:
When looking for new value from Social Data Mining in mainland China, we generally consider two points:
1. Is there enough data?
2. How can the data be shown to be valid or valuable, and how can it be cleaned?
Generally, most ideas fail as soon as they meet the first question.
OneRiot, or rather its Pulse Rank, is somewhat interesting, because no matter what you search for, there is enough data in English. When there is very little data, ranking and sorting have no meaning. This is why I once said that one characteristic of a vertical field that machine intelligence can enter is "Information sources: online information that is rich enough, fragmentary, and scattered". If there is little data, machine intelligence is not needed at all: hire one editor and the whole job gets done, and the data changes little anyway. And if your machine spends half a day processing data to produce a result, other websites can copy and paste it from you in the blink of an eye.
Suppose you get past the first point but have no feature to use as an entry point. First, this directly tests your machine's parallel processing and indexing capabilities. Second, you have to spend a lot of time processing junk data, which is a waste of energy that could have gone into something else. So machine intelligence needs shortcuts that narrow the computing scope within a massive collection. This is the basic solution.
That is, "when the data volume is massive, you must first use features and rules to filter and clean the data".
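As a rough illustration of filtering with features and rules before any heavy processing, here is a minimal Python sketch. The rule names (long_enough, not_mostly_links, mentions_topic), the thresholds, and the sample posts are all hypothetical and chosen only for this example; a real system would pick rules that fit its own vertical field.

```python
import re
from typing import Callable, Iterable, Iterator, List, Set

# A rule is any cheap check that says whether a post is worth keeping.
Rule = Callable[[str], bool]

URL_RE = re.compile(r"https?://\S+")


def long_enough(text: str) -> bool:
    # Hypothetical threshold: very short posts rarely carry usable signal.
    return len(text.strip()) >= 10


def not_mostly_links(text: str) -> bool:
    # Drop posts that are almost nothing but URLs (a common junk pattern).
    return len(URL_RE.sub("", text).strip()) >= 5


def mentions_topic(keywords: Set[str]) -> Rule:
    # Feature used as the entry point: keep only posts about the target topic.
    lowered = {k.lower() for k in keywords}

    def rule(text: str) -> bool:
        t = text.lower()
        return any(k in t for k in lowered)

    return rule


def prefilter(posts: Iterable[str], rules: List[Rule]) -> Iterator[str]:
    # Skip copy/paste duplicates, then keep posts that pass every rule.
    seen = set()
    for post in posts:
        key = " ".join(post.lower().split())
        if key in seen:
            continue
        seen.add(key)
        if all(rule(post) for rule in rules):
            yield post


if __name__ == "__main__":
    sample = [
        "New phone review: battery lasts two days",
        "new phone review:  battery lasts two days",
        "http://example.com/a http://example.com/b",
        "lol",
        "Weather is nice today in the city",
    ]
    rules = [long_enough, not_mostly_links, mentions_topic({"phone"})]
    for post in prefilter(sample, rules):
        print(post)
```

The rules are deliberately cheap string checks, so they can run over the whole collection, and only the posts that survive them need to reach the expensive indexing or ranking stage.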
Recommended reading:
1. Semantics and features