- Advertising engine
- Overall design
- Retrieval service
- Ad retrieval process
- Ad targeting
- Targeting dimensions with few, enumerable options, such as gender, age, network, and operating system
- How our system handles this kind of targeting now
- Targeting dimensions with a cascade relationship: province, city, district
- Targeting within n km of a coordinate
- The retrieval service is a copy of the database
- CTR calculation
- Second-price calculation
- ADX ads
- Exposure service
- Billing service
- Billing service master/standby
- Floating-point problem
Overall design of advertising engine
In our basic architecture, clients call the API layer, which forwards each request over RPC to our backend services. The services are managed through a service registry.
Retrieval service: builds in-memory indexes from the ads in the database and subscribes to a Redis channel, so that notifications keep the in-memory ads aligned with the database.
Exposure service: receives exposure and click events, aggregates them, and pushes the aggregated results to a message queue.
Billing service: interacts with the payment system. It is mainly responsible for deducting advertisers' money and for taking ads offline when the balance is insufficient or the budget is used up.
- The services are described one by one below
Retrieval service: ad retrieval process
- Parameter parsing, mainly to filter out illegal requests
- Start the ADX ad request. A simple targeting pass first decides which DSPs should receive a request, then the flow-control module confirms that the request may be sent. The HTTP request is sent asynchronously and the call returns immediately, so the logic below continues without waiting
- Ad targeting: screen ads against the targeting conditions and collect all eligible ad IDs
- Ad filtering: drop ad IDs that fail certain logical rules
- CTR calculation and eCPM ranking: compute the click-through rate for every creative of every ad, then sort by eCPM
- Select the winning ads and calculate the second price
- Join the ADX auction: the ADX requests were sent asynchronously earlier, so at this point we wait for each request to return its result
- Format the ad content: produce the client protocol according to the type of ad format
- Encrypt the exposure and click parameters, in particular the exposure price and click price of the ad
Ad targeting
The problem to solve
A user has attributes such as gender, age, network environment, device type, and geographic location, and advertisers want to show their ads to a particular audience. Ad targeting is the process of retrieving the ads that are eligible for a user according to that user's attributes.
For example, for a male user aged 0-18 on Wi-Fi and iOS, Ad 2 in the targeting table below does not meet the targeting conditions, while Ad 1 and Ad 3 do.
Targeting dimensions with few, enumerable options, such as gender, age, network, and operating system
Ad targeting table
Ad ID | Gender | Age segment | Network | Operating system
Ad 1 | Unlimited | Unlimited | Unlimited | Unlimited
Ad 2 | Female | 0-18 | Unlimited | Unlimited
Ad 3 | Male | 0-18 | Unlimited | iOS
The targeting table above looks a lot like a database table. If we treat the traffic from a male user aged 0-18 on Wi-Fi and iOS as a query against it, the statement would be
WHERE (gender = male OR gender = unlimited)
AND (age = 0-18 OR age = unlimited)
AND (network = wifi OR network = unlimited)
AND (operating system = ios OR operating system = unlimited)
Without "unlimited", a combined index would clearly be the best choice. We combine gender-age-network-operating system into one index key, so the index space is
- 2 genders x 5 age segments x 2 network environments x 2 operating systems = 40 possible keys, each of which may map to a list of ad IDs. To use the combined index, the OR clauses must be eliminated, which can be done by writing unlimited ads redundantly into every index entry they match. The most extreme example is Ad 1: it is unlimited on every dimension, so all 40 combinations carry a redundant copy of Ad 1
- Our system originally retrieved ads this way. The advantage is that a single index lookup returns the targeted ads; the obvious drawback is the extra redundancy in the index data
If the WHERE clause looked at the gender dimension alone, both sides of the OR are equality queries, so each can certainly use an index: find the two sets, take their union, and then intersect the result with the other dimensions.
- Our program also simulates this union-then-intersect operation. Where a database organizes its index as a B-tree, our program uses a hash table: given the key of a query, it returns the contents of the index entry
Set operations
Taking a union works like the merge step for two sorted arrays.
There are generally three ways to take an intersection
- General method: use the small set as the driver and binary-search each of its elements in the large set
- Large set intersected with a small set: if the small set fits in memory, build a hash index over it in memory, then use the large set as the driver and probe the hash index for each element
- Two large sets: sort both, after which the intersection costs O(m+n). Take the smaller of the two current elements, then skip every element in the other set that is smaller than it
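The three intersection strategies above can be sketched as follows. This is a minimal illustration, assuming ad IDs are plain ints; the class and method names are hypothetical and not taken from the system described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AdIdSets {
    /** Small set as driver, binary search in the sorted large set. */
    static List<Integer> intersectByBinarySearch(int[] small, int[] largeSorted) {
        List<Integer> out = new ArrayList<>();
        for (int id : small) {
            if (Arrays.binarySearch(largeSorted, id) >= 0) out.add(id);
        }
        return out;
    }

    /** Large set as driver, probing a hash index built over the small set. */
    static List<Integer> intersectByHash(int[] large, int[] small) {
        Set<Integer> index = new HashSet<>();
        for (int id : small) index.add(id);
        List<Integer> out = new ArrayList<>();
        for (int id : large) {
            if (index.contains(id)) out.add(id);
        }
        return out;
    }

    /** Both inputs ascending: advance two cursors, O(m + n). */
    static List<Integer> intersectSorted(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;          // skip values smaller than b[j]
            else if (a[i] > b[j]) j++;     // skip values smaller than a[i]
            else { out.add(a[i]); i++; j++; }
        }
        return out;
    }
}
```

Which variant fits depends on the relative sizes of the two sets and on whether they are already kept sorted.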
How our system handles this kind of targeting now
An inverted hash index is built in memory for each dimension, so no redundancy across combinations of dimensions is needed and no union has to be built. The dimension with the least matching data serves as the driving table; each of its ad IDs is then probed against the hash results of the other dimensions, and an ad is removed if a probe misses.
Note that targeting makes only one in-memory copy of ad IDs per retrieval, into the context of that request, and the copy is of the smallest set. Everything else is constant-time hash probing against the other dimensions, which narrows the set down.
The targeting method above relies mainly on inverted indexes, plus some techniques for intersection. Inverted indexes suit equality queries and dynamic combinations of multiple dimensions very well, but they handle range queries poorly. If we supported targeting on arbitrary age ranges, for example, that would be a range query, and an ordered structure such as a balanced tree would be a better choice.
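A minimal sketch of per-dimension inverted indexes with a driving set and hash probes is shown below. It assumes, as an illustration only, that each dimension keeps a separate "unlimited" entry that is probed alongside the user's value; the class name, the `dims` helper, and the string keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TargetingIndex {
    static final String UNLIMITED = "unlimited";
    // dimension -> value -> IDs of ads that target exactly that value
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    /** Hypothetical helper that builds a four-dimension attribute map. */
    static Map<String, String> dims(String gender, String age, String network, String os) {
        Map<String, String> m = new HashMap<>();
        m.put("gender", gender);
        m.put("age", age);
        m.put("network", network);
        m.put("os", os);
        return m;
    }

    void add(int adId, Map<String, String> targeting) {
        targeting.forEach((dim, value) ->
            index.computeIfAbsent(dim, d -> new HashMap<>())
                 .computeIfAbsent(value, v -> new HashSet<>())
                 .add(adId));
    }

    private Set<Integer> lookup(String dim, String value) {
        return index.getOrDefault(dim, Collections.emptyMap())
                    .getOrDefault(value, Collections.emptySet());
    }

    List<Integer> match(Map<String, String> user) {
        // Pick the dimension with the fewest matching ads as the driver.
        String driver = null;
        int best = Integer.MAX_VALUE;
        for (String dim : user.keySet()) {
            int size = lookup(dim, user.get(dim)).size() + lookup(dim, UNLIMITED).size();
            if (size < best) { best = size; driver = dim; }
        }
        List<Integer> result = new ArrayList<>();
        if (driver == null) return result;
        // The only copy made per request: the smallest candidate set.
        result.addAll(lookup(driver, user.get(driver)));
        result.addAll(lookup(driver, UNLIMITED));
        // Every other dimension is a hash probe that can only shrink the result.
        for (String dim : user.keySet()) {
            if (dim.equals(driver)) continue;
            Set<Integer> exact = lookup(dim, user.get(dim));
            Set<Integer> any = lookup(dim, UNLIMITED);
            result.removeIf(id -> !exact.contains(id) && !any.contains(id));
        }
        Collections.sort(result);
        return result;
    }
}
```

With the three ads from the targeting table loaded, a male user aged 0-18 on Wi-Fi and iOS is driven by the gender dimension, which has only two candidates, and the other three dimensions are checked by probing.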
Targeting dimensions with a cascade relationship: province, city, district
The regional targeting conditions of the ads are as follows
Ad region targeting table
Ad ID | Province | City | District
Ad 1 | Unlimited | Unlimited | Unlimited
Ad 2 | Beijing | Beijing | Chaoyang
Ad 3 | Shanghai | Shanghai City | Unlimited
Suppose a user located in Beijing (province), Downtown (city), Chaoyang (district) searches for ads. The SQL is as follows:
WHERE (province = unlimited AND city = unlimited AND district = unlimited)
OR (province = Beijing AND city = Downtown AND district = unlimited)
OR (province = Beijing AND city = Downtown AND district = Chaoyang)
Earlier we said that an unlimited value can be written redundantly into every data item to avoid the OR. For province, city, and district this redundancy is not workable, for two reasons
- We cannot be sure where to write the redundant copies, because the list of provinces and cities may not be fixed
- If an ad is unlimited on province and city, the amount of redundancy is too large: it would have to be copied into every district and county in the country
Observe that the SQL is AND first, then OR. As analyzed earlier, multiple ANDs suit a combined index, so if we use province-city-district as the key of one index, the query amounts to three lookups in the combined index followed by a union. The resulting set still has to be intersected with the sets of the other dimensions, and because the OR makes the set change from request to request, we have to copy the lists of ad IDs into a temporary table.
I can think of two ways to optimize the temporary table, but neither has been done yet
- Cache the temporary table for fixed query conditions, so that later queries can reuse it
- By the distributive law of set operations, A intersect (B union C) = (A intersect B) union (A intersect C). If this dimension is processed last, A is guaranteed to be relatively small, and B and C never change, so hash filtering is enough
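The second idea can be sketched as a hash filter over an already-narrowed candidate list, so the union of the region sets is never built. This is a sketch under assumptions: the combined key format `province|city|district`, the `*` marker for unlimited, and the class name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RegionFilter {
    // Combined key "province|city|district" -> ad IDs; "*" stands for unlimited.
    private final Map<String, Set<Integer>> byRegion = new HashMap<>();

    void add(int adId, String province, String city, String district) {
        String key = province + "|" + city + "|" + district;
        byRegion.computeIfAbsent(key, k -> new HashSet<>()).add(adId);
    }

    /**
     * Keeps the candidates that appear in any of the three region sets,
     * i.e. A intersect (B union C union D) computed as per-element hash probes.
     */
    List<Integer> filter(List<Integer> candidates, String province, String city, String district) {
        List<Set<Integer>> sets = new ArrayList<>();
        for (String key : Arrays.asList(
                "*|*|*",
                province + "|" + city + "|*",
                province + "|" + city + "|" + district)) {
            Set<Integer> s = byRegion.get(key);
            if (s != null) sets.add(s);
        }
        List<Integer> out = new ArrayList<>();
        for (Integer id : candidates) {
            for (Set<Integer> s : sets) {
                if (s.contains(id)) { out.add(id); break; }
            }
        }
        return out;
    }
}
```

Because the region sets are only read, nothing is copied per request apart from the small output list.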
Targeting within n km of a coordinate
The coordinate is first converted to a Geohash. The n km condition is then handled by an index lookup on the Geohash followed by a distance filter on the results.
There are a few points to note:
- Geohash precision: the cell should be large enough to cover the surrounding n km. A 4-character Geohash corresponds to roughly ±20 km and a 3-character one to roughly ±78 km
- A Geohash only gives an approximate location, so either write the ad ID redundantly into the 8 neighboring cells, in which case a query only needs to check one cell, or query all 9 cells
Overall, this approach is an equality lookup in the index (where geo = ABCD) as a first sieve, followed by a filter.
Filtering is not necessarily slow. Even in a database there are cases where going through an index is slower, and those are the cases best suited to filtering.
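The first sieve plus filter can be sketched as below: a standard Geohash encoder, a cell-keyed index, and a haversine distance filter. This is an illustration under assumptions. The class and field names are hypothetical, and for brevity each ad is stored only in its own cell, so the neighbor-cell redundancy described above is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GeoSieve {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static final class Ad {
        final int id; final double lat; final double lon;
        Ad(int id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    private final Map<String, List<Ad>> byCell = new HashMap<>();
    private final int length; // Geohash length in characters

    GeoSieve(int length) { this.length = length; }

    /** Standard Geohash: interleave longitude and latitude bits, 5 bits per character. */
    static String geohash(double lat, double lon, int length) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder sb = new StringBuilder();
        boolean lonBit = true; // bits alternate, starting with longitude
        int bits = 0, ch = 0;
        while (sb.length() < length) {
            if (lonBit) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonLo = mid; } else { ch <<= 1; lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latLo = mid; } else { ch <<= 1; latHi = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { sb.append(BASE32.charAt(ch)); bits = 0; ch = 0; }
        }
        return sb.toString();
    }

    /** Great-circle distance in km on a sphere of radius 6371 km. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * 6371.0 * Math.asin(Math.sqrt(a));
    }

    void add(Ad ad) {
        byCell.computeIfAbsent(geohash(ad.lat, ad.lon, length), k -> new ArrayList<>()).add(ad);
    }

    /** Equality lookup on the user's cell, then an exact distance filter. */
    List<Integer> nearby(double lat, double lon, double nKm) {
        List<Integer> out = new ArrayList<>();
        for (Ad ad : byCell.getOrDefault(geohash(lat, lon, length), Collections.emptyList())) {
            if (haversineKm(lat, lon, ad.lat, ad.lon) <= nKm) out.add(ad.id);
        }
        return out;
    }
}
```

In a fuller version, `add` would also write the ad into the 8 neighboring cells, or `nearby` would look up all 9 cells, so that users near a cell border are not missed.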
The retrieval service is a copy of the database
On startup the retrieval service loads all ads from the database and builds the forward and inverted indexes of the ads. Its multiple replicas are kept consistent with the database through message notifications.
Messages arrive by subscribing to a Redis channel. They cover: taking an advertiser's ads offline, syncing a campaign, syncing an ad, syncing a creative, and reloading all data at once.
The online service keeps running while data is reloaded, so we build the new data and swap the reference. To guarantee enough memory for a full reload, only half of the memory can be used in normal operation.
The message mechanism may be unreliable, so every hour the retrieval service also runs a full sync against the database.
- When memory is not enough to load the forward and inverted indexes for all ads, there are two directions to consider. One is compressed storage: load only the necessary information and compress the forward index data. The other is data partitioning, for example by region, which also requires thinking about data skew. We have not hit the memory limit yet
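The reference swap during a reload can be sketched as follows. This is a minimal illustration, assuming the ad data is a simple map from ad ID to content; the class name and map shape are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

final class AdStore {
    // Readers always see a complete snapshot: either the old one or the new one.
    private final AtomicReference<Map<Integer, String>> current =
            new AtomicReference<>(Collections.<Integer, String>emptyMap());

    String get(int adId) {
        return current.get().get(adId);
    }

    /** Builds the new snapshot off to the side, then swaps the reference in one step. */
    void reload(Map<Integer, String> freshFromDb) {
        Map<Integer, String> next = Collections.unmodifiableMap(new HashMap<>(freshFromDb));
        current.set(next); // the old snapshot becomes garbage once readers release it
    }
}
```

While `reload` runs, the old and new snapshots exist side by side, which is why only about half of the memory can be used in steady state.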
CTR calculation
Every creative has its own CTR, so the amount of CTR calculation is very large. We compute CTR asynchronously: when a query for an ad misses the CTR cache, it returns the default CTR, and the creative's CTR is then computed asynchronously to populate the cache.
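A minimal sketch of this return-default-then-fill pattern is shown below. The default value of 0.01, the class name, and the `model` function standing in for the real CTR computation are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.IntToDoubleFunction;

final class CtrCache {
    static final double DEFAULT_CTR = 0.01; // hypothetical default value

    private final ConcurrentHashMap<Integer, Double> cache = new ConcurrentHashMap<>();
    private final Set<Integer> inFlight = ConcurrentHashMap.newKeySet();
    private final Executor executor;
    private final IntToDoubleFunction model; // stand-in for the real CTR computation

    CtrCache(Executor executor, IntToDoubleFunction model) {
        this.executor = executor;
        this.model = model;
    }

    /** Never blocks on the model: a miss returns the default and schedules a fill. */
    double ctr(int creativeId) {
        Double cached = cache.get(creativeId);
        if (cached != null) return cached;
        if (inFlight.add(creativeId)) { // schedule at most one computation per creative
            executor.execute(() -> {
                try {
                    cache.put(creativeId, model.applyAsDouble(creativeId));
                } finally {
                    inFlight.remove(creativeId);
                }
            });
        }
        return DEFAULT_CTR;
    }
}
```

In production the executor would be a thread pool; passing `Runnable::run` makes the fill synchronous, which is convenient for testing.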
Second-price calculation
CPC ads
- first.clickprice = second.ecpm / first.quality / first.ctr + 0.01 * 1000
CPM ads
- first.displayprice = second.ecpm / first.quality + 0.01
ADX ads
Problems faced
ADX ad requests go out over the public network in large volume and must return within 100 ms, and a timeout at any one DSP must not affect the overall ad retrieval.
We control these problems in the following ways
- Flow control: a QPS limit for each DSP
- Asynchronous requests using NIO
- Long-lived connections: use HTTP/1.1 keep-alive wherever possible, and do not let third parties use short-lived HTTPS connections
- A cap on the number of long-lived connections to each DSP
- When the timeout rate exceeds 40%, halve the traffic repeatedly until the configured minimum QPS is reached; when the request success rate exceeds 40%, double the traffic until the QPS value configured for that DSP is reached
- Timeout control: our own DspFuture wrapper uses a HashedWheelTimer to enforce the timeout
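The halve-and-double rule for per-DSP traffic can be sketched as follows. This is an illustration under assumptions: the success rate is taken as one minus the timeout rate, the timeout check takes precedence when both conditions hold, and the class name and evaluation window are hypothetical.

```java
final class DspThrottle {
    private final int minQps;
    private final int maxQps; // the QPS configured for this DSP
    private int qps;

    DspThrottle(int minQps, int maxQps) {
        this.minQps = minQps;
        this.maxQps = maxQps;
        this.qps = maxQps;
    }

    /** Called once per evaluation window with that window's counters; returns the new limit. */
    int adjust(int sent, int timedOut) {
        if (sent == 0) return qps; // nothing observed, keep the current limit
        double timeoutRate = (double) timedOut / sent;
        if (timeoutRate > 0.4) {
            qps = Math.max(minQps, qps / 2);   // back off, but never below the minimum
        } else if (1.0 - timeoutRate > 0.4) {
            qps = Math.min(maxQps, qps * 2);   // recover, but never above the configured QPS
        }
        return qps;
    }
}
```

A DSP that keeps timing out is therefore reduced to its minimum QPS within a few windows, and recovers just as quickly once it responds again.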
Where the ADX still needs strengthening
- Connection warm-up: long-lived connections to each DSP can be established ahead of time, before serving traffic
- Network link selection: when the network between us and a DSP has problems, try a different link to that DSP, or give priority to the most economical link
Exposure service
Receives exposure and click events, aggregates them in memory, and pushes the aggregated data to the message queue every minute.
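A minimal sketch of in-memory aggregation with a periodic flush is shown below. The class name is hypothetical, counts are keyed by ad ID only, and the `sink` consumer is a stand-in for a real message queue producer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class ExposureAggregator {
    private Map<Integer, Long> counts = new HashMap<>();

    /** Counts one exposure for the given ad. */
    synchronized void record(int adId) {
        counts.merge(adId, 1L, Long::sum);
    }

    /** Hands back everything aggregated so far and starts a fresh batch. */
    synchronized Map<Integer, Long> flush() {
        Map<Integer, Long> batch = counts;
        counts = new HashMap<>();
        return batch;
    }

    /** Pushes one aggregated batch per minute to the sink (stand-in for the message queue). */
    void start(ScheduledExecutorService scheduler, Consumer<Map<Integer, Long>> sink) {
        scheduler.scheduleAtFixedRate(() -> sink.accept(flush()), 1, 1, TimeUnit.MINUTES);
    }
}
```

Aggregating before publishing means the queue carries one message per minute per instance rather than one per exposure.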
Billing service
The billing service consumes from the message queue the exposures and clicks aggregated by the exposure service. Exposures are consumed once a minute; clicks are consumed in real time.
Exposure consumption is a collaboration between two threads: one pulls data from the message queue and merges the aggregates from the N exposure service instances, and the other does the consuming.
- If the consumer thread blocks, the aggregated exposures held in memory are at risk of being lost, so the billing service limits the rate at which it pulls messages from the queue and holds at most a few minutes of data
Billing service master/standby
The billing service runs as one master with standbys, and the lease is implemented through Redis.
A key named lock in Redis holds the name of the current working node as its value and has an expiration time.
The master calls get(lock) every second and checks whether the value matches its own node name; if so, it renews the lease with the expire method.
A standby tries every second to write its own node name to the lock key with a set-if-absent write (Redis SETNX, or SET with the NX and EX options). While the key still exists, the write returns false.
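The lease logic can be sketched without a real Redis by using a small in-memory stand-in that supports the three operations mentioned above. Everything here is hypothetical and only illustrates the control flow: `FakeRedis`, its manually advanced clock, and the `tick` helper are not part of any Redis client.

```java
import java.util.HashMap;
import java.util.Map;

final class LeaseDemo {
    /** In-memory stand-in for get, set-if-absent with TTL, and expire. */
    static final class FakeRedis {
        private final Map<String, String> values = new HashMap<>();
        private final Map<String, Long> expiresAt = new HashMap<>();
        long now = 0; // seconds, advanced by the caller

        private void purge(String key) {
            Long t = expiresAt.get(key);
            if (t != null && t <= now) { values.remove(key); expiresAt.remove(key); }
        }

        String get(String key) { purge(key); return values.get(key); }

        boolean setIfAbsent(String key, String value, long ttl) {
            purge(key);
            if (values.containsKey(key)) return false;
            values.put(key, value);
            expiresAt.put(key, now + ttl);
            return true;
        }

        boolean expire(String key, long ttl) {
            purge(key);
            if (!values.containsKey(key)) return false;
            expiresAt.put(key, now + ttl);
            return true;
        }
    }

    /** One per-second tick of a node; returns true if the node is master afterwards. */
    static boolean tick(FakeRedis redis, String node, long ttl) {
        if (node.equals(redis.get("lock"))) {
            return redis.expire("lock", ttl);        // master: renew the lease
        }
        return redis.setIfAbsent("lock", node, ttl); // standby: succeeds only after expiry
    }
}
```

Against a real Redis the get and expire calls are two separate commands, so a production version may want to make the renewal atomic.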
Floating-point problem
Billing involves checks on monetary amounts, so floating-point precision becomes an issue.
Subtraction uses the BigDecimal.subtract() method
Multiplication uses the BigDecimal.multiply() method
Division uses the BigDecimal.divide() method
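A short sketch of the three operations follows. The helper names, the scale of 2, and the HALF_UP rounding mode are assumptions for illustration; amounts are passed as strings so that no binary floating-point value is ever created.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class MoneyMath {
    /** Balance left after a charge, e.g. remaining("1.0", "0.9") is exactly 0.1. */
    static BigDecimal remaining(String balance, String charge) {
        return new BigDecimal(balance).subtract(new BigDecimal(charge));
    }

    /** Total cost of count units at a decimal unit price. */
    static BigDecimal cost(String unitPrice, long count) {
        return new BigDecimal(unitPrice).multiply(BigDecimal.valueOf(count));
    }

    /** Division needs an explicit scale and rounding mode, or 10 / 3 would throw. */
    static BigDecimal perUnit(String total, long count) {
        return new BigDecimal(total).divide(BigDecimal.valueOf(count), 2, RoundingMode.HALF_UP);
    }
}
```

For comparison, the double expression `1.0 - 0.9` does not equal `0.1`, which is exactly the kind of discrepancy a billing check would trip over.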