[CSDN Live Report] On December 12-14, 2014, the 2014 China Big Data Technology Conference (Big Data Technology Conference 2014, BDTC 2014) and the Second CCF Big Data Academic Conference opened grandly at the Crowne Plaza Beijing New Yunnan hotel. The conference was hosted by the China Computer Federation (CCF), organized by the CCF Big Data Expert Committee, and co-organized by the Chinese Academy of Sciences and CSDN, with promoting big data research, application, and industry development as its main theme.
On the morning of the conference's second day, the Big Data Technology Forum was co-chaired by Shi Zhenghua, deputy director of Baidu's Big Data Department, and He Hongling, project manager in the Business Support System Department of China Mobile Group. Five experts gave talks centered on big data technology: Hu, leader of NetEase's NTSE/TNT storage engine; Dai, Intel's chief big data architect; Che Wenqing, senior business solution architect at VMware; Liu, mobile R&D manager at Sohu; and Lu Yilei, technical vice president of AdMaster.
Big Data Technology Forum moderator: Shi Zhenghua, deputy director of Baidu's Big Data Department
Big Data Technology Forum moderator: He Hongling, project manager in the Business Support System Department of China Mobile Group
Hu: NetEase database compression technology
On the morning of December 14, in the Big Data Technology Forum, Hu, leader of NetEase's NTSE/TNT storage engine, delivered a talk titled "NetEase Database Compression Technology." In Hu's view, an ideal compression technology should be intelligent regardless of which algorithm is used: it should achieve high efficiency when compressing data, when decompressing it, and when accessing data that is still compressed, while remaining flexible about how compression and decompression are carried out.
Hu, leader of NetEase's NTSE/TNT storage engine
On the characteristics of database compression, Hu summarized five points:
1. Lossless compression. Database compression generally uses general-purpose lossless algorithms.
2. Data distribution. The randomness of the content sets the limit of compressibility, so the same algorithm applied to different data can yield very different final compression ratios; highly redundant data may compress extremely well.
3. Hardware. Storage, CPU, and memory are all developing rapidly, and this strongly steers the choice of compression algorithm.
4. Compression unit. The unit of compression has a very large impact: compression can be applied at the level of a whole file, a data table, a block or page, or even a single row or attribute, and different units produce quite different results.
5. Throughput. Database compression demands a very high throughput rate; no matter which algorithm is chosen, throughput cannot be sacrificed too much. With a file-level strategy, decompressing the entire file on every access may be a net loss.
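The interaction between compression unit and access cost in the last two points can be sketched with a small page-level scheme: compressing fixed-size pages independently means a read decompresses one page rather than the whole file. This is a generic stdlib sketch, not NetEase's implementation; `PAGE_SIZE` and the sample data are invented for illustration.

```python
import zlib

PAGE_SIZE = 4096  # compress fixed-size pages so a read touches one page, not the whole file

def compress_pages(data: bytes, page_size: int = PAGE_SIZE) -> list:
    """Split data into pages and compress each page independently."""
    return [zlib.compress(data[i:i + page_size])
            for i in range(0, len(data), page_size)]

def read_page(pages: list, page_no: int) -> bytes:
    """Random access: decompress only the requested page."""
    return zlib.decompress(pages[page_no])

data = b"row-000,beijing,2014\n" * 1000  # highly redundant rows compress well
pages = compress_pages(data)
compressed_size = sum(len(p) for p in pages)
assert read_page(pages, 1) == data[PAGE_SIZE:2 * PAGE_SIZE]
assert compressed_size < len(data)
```

The trade-off the talk describes shows up directly: smaller pages mean cheaper random access but a worse compression ratio, since each page is compressed without context from its neighbors.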
Finally, Hu introduced NetEase's approach to big data compression: a dictionary is built globally, and data attributes are distinguished so that compression can be performed flexibly online; decompression and access efficiency improve by 2 to 10 times compared with traditional compression. Next, Hu plans to use richer statistical information to automatically partition the collected dictionary.
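The global-dictionary idea Hu describes can be pictured with a toy column encoder: each distinct value is mapped to a small integer code (most frequent values first), and the column is then stored and scanned as integers. This is a minimal sketch of the general technique, not NetEase's engine code; the sample `cities` column is invented.

```python
from collections import Counter

def build_dictionary(values):
    """Global dictionary: map each distinct value to a small integer code,
    assigning the smallest codes to the most frequent values."""
    ranked = [v for v, _ in Counter(values).most_common()]
    return {v: i for i, v in enumerate(ranked)}

def encode(values, dictionary):
    return [dictionary[v] for v in values]

def decode(codes, dictionary):
    rev = {i: v for v, i in dictionary.items()}
    return [rev[c] for c in codes]

cities = ["beijing", "shanghai", "beijing", "beijing", "guangzhou"] * 200
d = build_dictionary(cities)
codes = encode(cities, d)
assert decode(codes, d) == cities     # lossless round trip
assert d["beijing"] == 0              # most frequent value gets the smallest code
```

Because the dictionary is global rather than per-block, repeated values compress consistently across the whole dataset, and queries can often be evaluated on the codes without decompressing.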
Dai: Next-generation big data analysis based on the Spark software stack
Dai, Intel's chief big data architect
Dai, Intel's chief big data architect, said that deep analysis of big data falls broadly into two categories: SQL-like relational data analysis, and real-time analysis that demands fast processing speed. He argued that building next-generation big data analysis on Spark lets users construct new application scenarios and new analytical applications, and he illustrated how Spark works with SQL-structured data and with Hive and Parquet data processing.
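The SQL-on-structured-data style of analysis Dai mentions can be pictured with a toy columnar (Parquet-style) layout: each column is stored contiguously, and a group-by aggregate scans only the columns it needs. This is a plain-Python illustration of the storage idea, not Spark or Parquet code; the table contents are invented.

```python
from collections import defaultdict

# Row layout stores each record together; a columnar layout stores each
# column contiguously, so an aggregate reads only the columns involved.
rows = [("beijing", 2014, 1200), ("shanghai", 2014, 900), ("beijing", 2013, 800)]

columns = {
    "city":  [r[0] for r in rows],
    "year":  [r[1] for r in rows],
    "views": [r[2] for r in rows],
}

def sum_by(keys, values):
    """SELECT key, SUM(value) ... GROUP BY key -- touching two columns only."""
    acc = defaultdict(int)
    for k, v in zip(keys, values):
        acc[k] += v
    return dict(acc)

assert sum_by(columns["city"], columns["views"]) == {"beijing": 2000, "shanghai": 900}
```

Parquet and Spark SQL apply the same principle at scale: column pruning means the `year` column above is never read, which is a large part of why SQL-like scans over columnar files are fast.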
Che Wenqing: 12306 — NoSQL practice of changing traditional thinking to solve problems
Che Wenqing's talk used 12306 as an example, introducing how its ticket and order query system was built with NoSQL, achieving order queries of 10,000 orders per second with ticket data updated on a cycle of about 10 minutes.
Che Wenqing, senior business solution architect at VMware
Che Wenqing argued that a traditionally designed system architecture could not cope with the data traffic of the 12306 website. During the switchover, data is extracted from the SQL database and sent to the NoSQL cluster, where it is processed in parallel; the workload starts out split between the old and new systems at 90%-10%, and once operation is stable the load can be moved entirely onto the new system.
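A 90%-10% old/new workload split like the one described is often implemented with deterministic hash-based routing, so each request is consistently pinned to one system and the new system's share can be dialed up gradually. This is a hypothetical sketch of that pattern (the routing function and request IDs are made up, not 12306's actual mechanism):

```python
import hashlib

def route(request_id: str, new_system_share: int = 10) -> str:
    """Deterministically send new_system_share% of traffic to the new system.
    Hashing the request ID keeps a given request pinned to one system."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_system_share else "old"

routed = [route(f"order-{i}") for i in range(10000)]
new_share = routed.count("new") / len(routed)
assert route("order-42") == route("order-42")  # deterministic routing
assert 0.05 < new_share < 0.15                 # roughly 10% hits the new NoSQL path
```

Raising `new_system_share` from 10 toward 100 completes the cutover without a flag day, which matches the gradual migration Che describes.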
Liu: A news-client recommendation system based on full-web content
Liu's talk addressed the thorny problems news clients encounter: content classification and quality identification across text, images, video, audio, and games; data sparsity; content cold start; user cold start; noise handling; and the treatment of vulgar ("three vulgarities") content.
Liu, mobile R&D manager at Sohu
First, Liu introduced two features of Sohu's mobile news recommendation:
1. Advertising system. In advertising, Sohu pursues conversion rate, with ROI and user effect as auxiliary metrics.
2. Search system. The search engine pursues content understanding: content crawling, keyword and topic extraction from text, text classification, topic classification, content indexing, spam filtering, page rank, anti-cheating, and so on.
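The text keyword extraction step in the search pipeline above can be sketched with a crude term-frequency extractor: tokenize, drop stopwords, keep the most frequent terms. Production systems use much richer models (TF-IDF, topic models); this stdlib toy, with an invented stopword list and sample document, only shows the shape of the step.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}  # toy list for illustration

def top_keywords(text: str, k: int = 3):
    """Crude term-frequency keyword extraction: tokenize, drop stopwords,
    return the k most frequent remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

doc = ("Spark powers big data analysis; big data pipelines feed the "
       "recommendation system, and the recommendation system ranks news.")
assert "data" in top_keywords(doc)
```

In a real pipeline the extracted keywords feed the downstream steps Liu lists: text classification, topic classification, and content indexing.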
Subsequently, Liu described how the recommendation system treats vulgar content. He said: "When vulgar content appears in the recommendation system, it can raise the conversion rate by 18%-20%; although it temporarily increases clicks, it badly hurts user stickiness. We screen out vulgar content by classifying users' reading distributions and by collecting and refining statistics on user-attribute distributions. After the overall cleanup, the conversion rate dropped to 15%, total recommendations rose by 20%, and user visit frequency also rose by 20%."
Lu Yilei: The practice of Hadoop in advertising monitoring technology
Lu Yilei's talk covered six topics: the advertising-marketing data flow, the characteristics of advertising monitoring technology, discrepancies in advertising monitoring data, the framework of the advertising data-mining platform, ADH's specializations for advertising-marketing data mining, and AdMaster's data analysis platform.
Lu Yilei, technical vice president of AdMaster
In the talk, Lu Yilei said that ADH is a Hadoop distribution tailored for the advertising industry, with the following five features:
1. Log information or data landed in Hadoop automatically generates the required data formats.
2. Built-in advertising algorithms, provided as Hadoop MapReduce services.
3. Modifications to HBase, with corresponding optimizations such as ordering and indexing.
4. An optimized Hadoop scheduling system.
5. Integrated Spark.
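The "built-in algorithms provided as MapReduce services" idea can be pictured with a toy map/reduce pass over ad logs, counting impressions per campaign. This is a plain-Python stand-in for illustration, not ADH code; the log format and campaign names are invented.

```python
from collections import defaultdict
from itertools import chain

# Invented log format: "<date> <campaign> <event>"
logs = [
    "2014-12-14 campaignA impression",
    "2014-12-14 campaignB impression",
    "2014-12-14 campaignA impression",
]

def mapper(line):
    """Map phase: emit (campaign, 1) for each impression record."""
    _, campaign, event = line.split()
    if event == "impression":
        yield campaign, 1

def reducer(pairs):
    """Reduce phase: sum the counts per campaign key."""
    acc = defaultdict(int)
    for key, count in pairs:
        acc[key] += count
    return dict(acc)

result = reducer(chain.from_iterable(mapper(line) for line in logs))
assert result == {"campaignA": 2, "campaignB": 1}
```

On a real cluster the same map and reduce logic would be sharded across machines, which is what makes log-scale advertising aggregation feasible.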
On advertising monitoring data, Lu Yilei summarized the main factors behind data discrepancies: different IP libraries lead to different geographic conclusions; monitoring code is deployed at different points in time; monitoring mechanisms and metric definitions differ; and mobile apps run in a less stable network environment.
For more highlights, follow the live topic "2014 China Big Data Technology Conference (BDTC)", follow @CSDN Cloud Computing on Sina Weibo, and subscribe to the CSDN Big Data WeChat account.