On September 23, IBM and CSDN jointly announced the official launch of the "POWER 8 Limit Performance Challenge". The competition is open to CSDN's registered developers. It provides a Power 8 development environment through the cloud, and developers use Power 8 features to build applications for different scenarios. The contest not only lets more developers make full use of Power 8, but also gives developers and technical people a stage to showcase themselves.
As Hou, IBM's vice president for Greater China, said at the contest's launch event, IBM supports the contest in order to attract more developers to build new algorithms that will activate the Power 8 engine.
"U can u Up" is the slogan of the challenge. Developers sign up, apply for resources, and complete challenges, and ultimately earn prizes from the organizer based on their cumulative score. During the competition, the organizer will regularly announce challenge topics and rank participants monthly.
The topic of the first round is "blog anti-spam". CSDN provides a large volume of blog data with spam articles mixed in at a specific ratio, and contestants need to develop a system that picks out the spam blogs. Note that entries are evaluated mainly on the accuracy and processing speed of the algorithm; there are no restrictions on programming language or development tools.
So far, hundreds of developers have signed up and taken part. To help more developers follow the contest's progress, we recently interviewed one of the contestants, Huang Wenshu, CEO/President of Yi Yun Computer Technology Co., Ltd. We hope his experience will attract more technical talent to the competition.
The interview follows:
1. Can you tell us about your development experience? Which areas of technology are you currently focused on?
Huang Wenshu: University: I graduated from the Department of Electronic Science and Technology at Zhengzhou University in 2009. While at school I took part in the mathematical modeling contest (national first prize, 2007) and ACM/ICPC (bronze award, Hefei regional, 2008).
Since I was not a computer science major, apart from algorithms and data structures, my programming experience at university was limited to basic use of C++ and MATLAB.
Work: After graduating in 2009 I joined a bank, where I mainly used .NET technologies to develop internal systems. I also did some other technical work, including network management and server maintenance.
In 2012 I taught myself PHP and switched to it, and completed more internal banking systems with it. The main projects I completed independently include the "Performance Appraisal System" and the "Automated Approval System".
During that period I also took on projects such as corporate website development, mostly implemented with WordPress, and became familiar with deep WordPress customization (templates, plugins) for building sites. So far I have built and maintained more than 10 sites with WordPress.
Start-up: In 2014 I started my own business and switched to Python, developing mainly with the Django framework. I have basic knowledge of and practical experience with front-end technologies such as HTML5, and I understand basic Linux server deployment.
My main focus today is web development, including front-end technology and the entire web development process. I also have a strong interest in algorithm design: every year I take part in algorithm competitions held by major companies (such as Google Code Jam, Baidu Star, and TopCoder Open) and have achieved good results (I reached the Baidu Star final in 2009).
2. What is the key to distinguishing spam blog IDs from normal ones? Can you walk us through the basic idea behind your algorithm?
Huang Wenshu: There are currently three main types of spam to detect:
- Missing or confused content: the blog post is too short, has no substance, or is filled with meaningless words;
- A large number of problematic outbound links: this follows from the spammer's purpose. Spam blogs use outbound links to deceive search engines, promote illegally, and cheat in similar ways, so the quantity, quality, and similarity of outbound links can identify this kind of spam blog;
- Irrelevant topic: on CSDN, the topics of normal blogs and spam blogs differ greatly, and the difference can be identified mainly through word-frequency statistics. This is the main algorithm used for identification. First, a small batch of normal and spam blogs is labeled by hand. All articles are then filtered and segmented into words, and the "inverse document frequency (IDF)" of each keyword is calculated. Feature vectors are extracted from the normal blogs and from the spam blogs. Then, for each blog, the cosine of the angle between its keyword feature vector and each of the two feature vectors is compared, and the blog is assigned to one of the two classes. This recognizes on-topic blogs quite well.
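The third approach can be sketched roughly as follows. This is a minimal illustration, not the contestant's actual code: the whitespace `tokenize` function stands in for a real Chinese word segmenter, the smoothed IDF formula is one common variant, and the sample texts and all function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: whitespace splitting stands in for real Chinese word segmentation.
    return text.lower().split()

def compute_idf(docs):
    """Smoothed inverse document frequency over a list of token lists."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unseen in training are dropped."""
    if not tokens:
        return {}
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}

def centroid(vectors):
    """Average a list of sparse vectors into one class feature vector."""
    acc = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            acc[term] += weight
    return {t: w / len(vectors) for t, w in acc.items()}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(normal_texts, spam_texts):
    """Build IDF weights and one feature vector per class from labeled samples."""
    normal = [tokenize(t) for t in normal_texts]
    spam = [tokenize(t) for t in spam_texts]
    idf = compute_idf(normal + spam)
    normal_vec = centroid([tfidf_vector(d, idf) for d in normal])
    spam_vec = centroid([tfidf_vector(d, idf) for d in spam])
    return idf, normal_vec, spam_vec

def classify(text, idf, normal_vec, spam_vec):
    """Assign the class whose feature vector has the larger cosine with the post."""
    vec = tfidf_vector(tokenize(text), idf)
    return "spam" if cosine(vec, spam_vec) > cosine(vec, normal_vec) else "normal"

# Hypothetical hand-labeled samples.
NORMAL_SAMPLES = ["python django web framework tutorial",
                  "linux server deployment with python",
                  "algorithm data structure python code"]
SPAM_SAMPLES = ["cheap watches buy now discount",
                "buy cheap pills discount online now",
                "casino bonus buy now cheap"]

if __name__ == "__main__":
    model = train(NORMAL_SAMPLES, SPAM_SAMPLES)
    print(classify("django deployment on linux server", *model))  # normal
    print(classify("buy cheap discount watches", *model))         # spam
```

A post that shares no terms with either class scores zero against both and falls back to "normal" here; a real system would probably route such posts to the content-length and link checks described in the first two bullets.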
3. What computational models does this algorithm use? Are there any unique innovations?
Huang Wenshu: The main computational models used are "Chinese word segmentation", "term frequency-inverse document frequency (TF-IDF)", and "cosine-similarity text classification".
I mainly drew on the algorithms introduced in "The Beauty of Mathematics" by Wu Jun of Google, and on descriptions of spam blog features in papers on splogs found through Google Scholar.
4. Why did you choose this model over other designs? Is there room for further optimization (considering the competition, IBM Power 8, design time, iterative updates, and so on)?
Huang Wenshu: Mainly because of the problem itself. This is a competition entry and the samples are CSDN technical blogs, so good blogs are strongly related by topic, and the cosine-similarity method can achieve good results. At the same time, once the feature vectors have been precomputed, the work can run as multiple processes on Power 8 for better efficiency.
5. Can an algorithm designed this way take full advantage of IBM POWER 8's concurrent computing? What gives you that confidence?
Huang Wenshu: Because the main time-consuming steps (page parsing, word segmentation, and vector angle calculation) are independent of one another, they can be run as multi-process computations to take advantage of IBM POWER 8's performance.
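As a rough sketch of that idea, each post can be parsed, segmented, and scored in a separate worker process using Python's standard `multiprocessing.Pool`. Everything specific here is an assumption for illustration: the two precomputed feature vectors, the regex-based tag stripping, the whitespace segmentation, and the function names are not from the contestant's implementation.

```python
import math
import re
from collections import Counter
from multiprocessing import Pool

# Hypothetical precomputed class feature vectors (term -> weight).
NORMAL_VEC = {"python": 1.0, "linux": 0.8, "algorithm": 0.6}
SPAM_VEC = {"buy": 1.0, "cheap": 0.9, "discount": 0.7}

TAG_RE = re.compile(r"<[^>]+>")

def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def process_post(item):
    """Parse, segment, and score one post; needs no data from any other post."""
    blog_id, html = item
    text = TAG_RE.sub(" ", html)      # crude page parsing: strip tags
    tokens = text.lower().split()     # stand-in for word segmentation
    vec = dict(Counter(tokens))       # raw term counts as the post's vector
    is_spam = cosine(vec, SPAM_VEC) > cosine(vec, NORMAL_VEC)
    return blog_id, is_spam

def find_spam(posts, processes=None):
    """Return sorted spam IDs; processes=1 runs in-process, None uses all CPUs."""
    if processes == 1:
        results = list(map(process_post, posts))
    else:
        with Pool(processes) as pool:
            results = pool.map(process_post, posts, chunksize=64)
    return sorted(blog_id for blog_id, is_spam in results if is_spam)

if __name__ == "__main__":
    sample_posts = [
        (1, "<p>Python on Linux</p>"),
        (2, "<a href='x'>buy cheap</a> discount"),
        (3, "<div>algorithm notes</div>"),
    ]
    print(find_spam(sample_posts))  # [2]
```

Because `process_post` reads only module-level constants and its own argument, the pool can hand posts to workers in any order. The `processes=1` path is there for debugging without spawning workers.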
6. Which technical aspects of IBM POWER 8 matter most to you? Can you talk about future technology trends in this area?
Huang Wenshu: I had not studied the Power 8 architecture in depth before, but since first working with it in this contest I have found that it has a unique advantage in computing performance. For my own work, I hope the POWER 8 platform will be offered on more cloud services, so that web developers like me have a better choice.
7. What do you think about the development of multithreading and concurrent programming? What could be improved?
Huang Wenshu: This is certainly the trend. There seems to be little room left to push clock frequency, so computing performance can only be improved through parallel computing and distributed algorithms. The rise of big data technology in recent years also supports this direction. In the future of computer applications, parallel algorithms and distributed computing will become the mainstream.
8. How do you feel about participating in this algorithm challenge? Do you have any suggestions for the event?
Huang Wenshu: The CSDN and IBM staff are very conscientious, patiently answering all kinds of questions and solving the problems that come up during deployment.
Special thanks to @Ouyang. My algorithm initially used MongoDB, and deploying MongoDB without external network access caused many problems. He still spent a lot of his off-hours helping me find packages and documents, and after much effort got the environment deployed. Thank you very much. I also appreciate the considerable, meaningful work he has done on communication and arrangements for this competition.
As the contest progresses, some necessary steps that were not yet polished are gradually being improved. Many thanks to CSDN and IBM for providing this platform.
Entry Guide
First, the format and process of participation are as follows:
- Spam and normal blog posts are mixed at a specific ratio, and contestants need to write algorithms that pick out the spam blog IDs;
- Contestants can use any programming language to complete the challenge;
- Data source location: the blog folder under the root directory.
Second, entries are judged mainly on four criteria:
- The lower the false negative rate, the better;
- The lower the false positive rate, the better;
- The higher the accuracy, the better;
- Program run time.
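The article does not give the contest's exact formulas, so the sketch below uses the standard definitions as an assumption: the false negative rate is the share of real spam that was missed, the false positive rate (our reading of the second criterion) is the share of normal posts wrongly flagged, and accuracy is the share of all posts labeled correctly. The function and variable names are hypothetical.

```python
def evaluate(all_ids, true_spam, predicted_spam):
    """Score a submission of predicted spam IDs against the true spam IDs."""
    all_ids = set(all_ids)
    true_spam = set(true_spam) & all_ids
    predicted = set(predicted_spam) & all_ids

    tp = len(true_spam & predicted)       # spam correctly flagged
    fn = len(true_spam - predicted)       # spam missed
    fp = len(predicted - true_spam)       # normal posts wrongly flagged
    tn = len(all_ids) - tp - fn - fp      # normal posts correctly left alone

    return {
        "false_negative_rate": fn / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(all_ids) if all_ids else 0.0,
    }

if __name__ == "__main__":
    # Ten posts, four of them spam; the submission misses one and flags one normal post.
    scores = evaluate(range(1, 11), {1, 2, 3, 4}, {1, 2, 3, 7})
    print(scores["false_negative_rate"])  # 0.25
    print(scores["accuracy"])             # 0.8
```

Run time, the fourth criterion, could be measured separately, for example by wrapping the classification call with `time.perf_counter()`.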
Third, after completing the challenge, participants need to submit:
- The IDs of the spam blogs;
- Source code;
- Program run time.
Official competition website: http://reg.powerlinux.csdn.net/
Register Now: http://reg.powerlinux.csdn.net/cview/reg/?project_id=973&identy_id=1011
Interview with Power 8 Programming Challenge contestant Huang Wenshu: a non-computer-science major's path into algorithm programming