Contract advertising systems: common open source tools for advertising systems

Source: Internet
Author: User
Tags: hypertable

Common advertising system open source tools

Can an advertising system be built quickly with open source tools? Basically, yes. Many of the important modules, such as those in the architecture described in the previous chapter, can be implemented with open source tools.

In a real advertising system we use many open source tools, and I will introduce several of them in turn. Basically every large advertising company except Google uses the Hadoop platform. Hadoop is the foundation for modeling in an advertising platform. It has many sub-projects: some are closely tied to Hadoop, while others only borrow ideas from the Hadoop framework. As a brief introduction, Hadoop now has only two core projects, HDFS and MapReduce. It used to have many sub-projects (the projects in the middle two columns of the figure were once Hadoop sub-projects), but many of them are now independent top-level projects. Hadoop itself is a platform for storing and computing over big data.

The tools in the figure can be divided into two categories: offline data processing and online processing. For offline data processing, a commonly used tool is HBase, a column-oriented database built on Hadoop. It is not a relational database but a NoSQL database. Many databases offer functionality similar to HBase, such as Google Bigtable. The counterpart of HBase is Hypertable, which is written in C++ and is more efficient than HBase. Another is Cassandra, open-sourced by Facebook. All of them address the problem of semi-structured storage on big data and may be used in real-world systems.
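To illustrate what semi-structured storage means here, the following is a minimal Python sketch of a column-oriented table in which each row may carry a different set of columns. It is a toy in-memory model, not the HBase, Hypertable or Cassandra API; the class name, row keys and column names are assumptions made for illustration.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy in-memory table: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store a value under a column; rows need not share columns."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell, returning `default` if the row or column is absent."""
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """List the columns that actually exist for this row."""
        return sorted(self._rows.get(row_key, {}))


# Hypothetical user rows with different columns per row.
table = ColumnFamilyTable()
table.put("user42", "profile:gender", "f")
table.put("user42", "behavior:last_ad", "ad1")
table.put("user7", "profile:age", "30")
print(table.columns("user42"))
```

Unlike a relational table, no fixed schema forces every row to have every column, which is why this style suits sparse user or log attributes.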

Oozie is a workflow control tool on Hadoop. Take our daily log processing as an example: after the logs have been collected, various analyses run, such as CTR prediction, audience targeting, and jobs for the BI system, and these analyses may depend on one another. The tool Hadoop provides for managing these dependencies is Oozie. Oozie is relatively low-level, a framework that you develop against through its API. Personally I find it hard to use and have not seen many people using it, but its design is fairly complete.
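The idea of ordering jobs by their dependencies can be sketched in a few lines of Python. This is not Oozie's API or workflow format; the job names are hypothetical, and the ordering uses the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical daily jobs mapped to the jobs they depend on.
# These names are illustrative assumptions, not Oozie workflow definitions.
DEPENDENCIES = {
    "log_collection": set(),
    "ctr_forecast": {"log_collection"},
    "audience_targeting": {"log_collection"},
    "bi_report": {"ctr_forecast", "audience_targeting"},
}


def execution_order(dependencies):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
print(order)
```

Log collection always comes first and the BI report last; the two analyses in between may appear in either order because they do not depend on each other.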

There are two very important scripting languages on Hadoop: Pig and Hive. They serve a similar purpose: when you need to work with unstructured data on Hadoop, they let you access and process it in a SQL-like way. So to pull some simple data reports you do not have to write a MapReduce program; you only write a Pig or Hive script. The difference between them is that if your data is separated by a delimiter, a Pig script can access it directly, which makes Pig more direct and convenient to use. Hive requires a preprocessing step that creates index-like data before it can operate. Pig is therefore closer to a programmer's habits, and Hive is better suited to BI.
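To make concrete what such scripts save you from writing, here is a minimal pure-Python sketch of the map and reduce steps behind a simple report: clicks per ad, computed from tab-delimited log lines. The log layout (`ad_id`, `event`) and the sample lines are assumptions for illustration, and a real job would run on Hadoop rather than in a single process.

```python
from collections import defaultdict

# Hypothetical tab-delimited log lines: ad_id <TAB> event
LOG_LINES = [
    "ad1\timpression",
    "ad1\tclick",
    "ad2\timpression",
    "ad1\tclick",
    "ad2\tclick",
]


def map_phase(line):
    """Emit (ad_id, 1) for every click record, nothing otherwise."""
    ad_id, event = line.split("\t")
    if event == "click":
        yield ad_id, 1


def reduce_phase(pairs):
    """Sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def clicks_per_ad(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(pairs)


report = clicks_per_ad(LOG_LINES)
print(report)
```

In Pig or Hive the same report is a short filter, group and count over the delimited file, which is why these languages are preferred for simple reports.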

Mahout is a tool for data mining and machine learning that runs on Hadoop with MapReduce. Personally, I feel that apart from a few algorithms, most of them are not very useful. This reflects the nature of machine learning algorithms: they must be adjusted and tuned to the data and the actual situation, so if you do not understand the internal implementation, tuning the model is troublesome.

In the online part, the most commonly used tool is ZooKeeper, an open source solution to consistency problems in a distributed environment. It corresponds to the better-known Google Chubby. The theoretical foundations of ZooKeeper and Chubby are not exactly the same: Chubby is implemented strictly according to the Paxos algorithm, Lamport's complete algorithm for solving the consistency problem in a distributed environment, while ZooKeeper uses a simplified version of this algorithm.

Not many people know Avro, because it is not commonly used. The more commonly used tool is Facebook's Thrift, which solves the problem of cross-language communication in a distributed environment and is very useful. Google's Protocol Buffers (protobuf) is a similar tool, and Avro is simply an alternative implemented by the author of Hadoop.

S4 is analogous to the more commonly used Twitter Storm. It is the stream computing platform from the architecture in the previous chapter, which processes logs quickly to produce feedback. S4 was originally developed by Yahoo! for analysis in its search advertising system, and Storm was designed for fast computing tasks at Twitter, such as quickly counting followers and reposts. The two differ: S4 is optimized mainly for throughput and does not use disk at all, whereas Storm performs a small amount of disk I/O because it guarantees data consistency. Storm guarantees that each piece of data is processed at least once, but not that it is processed only once.
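The difference between "at least once" and "only once" can be shown with a small Python simulation. It is not Storm's API or its actual acknowledgement mechanism; the function name and the `fail_first_ack` parameter are hypothetical, and a lost acknowledgement is simply simulated by replaying the message.

```python
def process_at_least_once(messages, handler, fail_first_ack):
    """Deliver every message until it is acknowledged.

    `fail_first_ack` is a set of messages whose first acknowledgement is
    "lost", which forces a replay and therefore a duplicate delivery.
    """
    pending = list(messages)
    lost_once = set()
    while pending:
        message = pending.pop(0)
        handler(message)  # processing happens before the ack
        if message in fail_first_ack and message not in lost_once:
            lost_once.add(message)
            pending.append(message)  # ack lost: replay the message


processed = []
process_at_least_once(["a", "b", "c"], processed.append, fail_first_ack={"b"})
print(processed)
```

Every message is handled, but `"b"` is handled twice. Downstream logic such as counters therefore has to tolerate duplicates, for example by being idempotent.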

Chukwa corresponds to Facebook's Scribe. It is the data highway mentioned in the architecture of the previous chapter: a distributed log collection tool. An advertising system has many ad delivery engines, and the data highway collects the logs from these many servers in near real time and delivers them, also in near real time, to Hadoop or Storm. Another commonly used tool today is Flume.

Elephant-bird is an interesting small tool used together with Pig. If a system makes heavy use of Thrift or Protocol Buffers, its data is serialized in binary form and cannot be read directly from Pig. Twitter provides elephant-bird to solve this problem: with it, Pig can access the serialized data directly.

The shaded items in the figure are the tools that I personally find useful, but this is only my opinion. Here I would also like to share some personal views on the open source community. Today's open source community is completely different from that of a few years ago: it is no longer an environment of unguaranteed quality and mixed content. The reason is that many big companies spare no effort in supporting open source projects, typically Twitter, Yahoo! and Facebook. We consider open source tools a viable choice because they have been validated at Twitter and Facebook under data volumes that other companies would struggle to reproduce. For example, when Facebook tested Scribe, the data rate reached 10T per second. The quality and reliability of these open source tools is therefore very high.

Cross-language service building tools

When building an advertising system, the first problem you may encounter is that the modules are not written in one language. Facebook is relatively open in this respect and provides a tool, Thrift, that lets engineers develop in the language they prefer or are used to. Thrift is a tool for rapidly developing cross-language services. It is very simple to use. The first step is to define a language-independent communication data structure as a struct, described in an IDL. The KV struct below, for example, has two fields: key, a 32-bit int, and value, a string. Thrift then converts the IDL definition into a definition in the language you need, such as C.

struct KV
{
  1: optional i32 key = 10;
  2: optional string value = "x";
}

Services are also defined in the IDL file. For example, Kvcache is a service that holds key-value pairs in memory. It has several interfaces: set, get and Delete. These definitions are language-independent; you implement your own logic according to them.

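service Kvcache
{
  // Note: set is also the name of a Thrift container type; rename this
  // method (for example to put) if your Thrift compiler rejects it.
  void set(1: i32 key, 2: string value);
  string get(1: i32 key);
  void Delete(1: i32 key);
}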
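As a minimal sketch of "implementing your own logic according to the definition", the Python class below provides in-memory behavior for the three Kvcache interfaces. The class name `KvcacheHandler` is hypothetical, the Thrift-generated stubs and server wiring are deliberately omitted, and returning an empty string for a missing key is an assumption of this sketch rather than something the IDL specifies.

```python
class KvcacheHandler:
    """In-memory logic for the Kvcache interface: set, get and Delete."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        """Store the string value under the integer key."""
        self._store[key] = value

    def get(self, key):
        """Return the stored value, or "" if the key is missing (assumption)."""
        return self._store.get(key, "")

    def Delete(self, key):
        """Remove the key if present; a missing key is ignored."""
        self._store.pop(key, None)


handler = KvcacheHandler()
handler.set(10, "x")
print(handler.get(10))
```

Because the interface is defined in the IDL, a client written in another language could call the same three methods without knowing that the logic is written in Python.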

Unless you need special protocol-layer optimization, Thrift can meet the requirements. It also supports backward compatibility of structures and interfaces. Similar tools include Hadoop's Avro and Google's Protocol Buffers.
