1. Basic Concepts
Statistics systems can be divided into distinguishable statistics (discriminative statistics) and indistinguishable statistics (non-discriminative statistics), depending on whether a single device can be tracked. Umeng provides distinguishable statistics: it uses an identifier (a unique ID, hereafter simply "ID") to track data for a single device over a long period. Early website statistics, by contrast, were indistinguishable statistics such as page views and unique IP counts. Modern website statistics are distinguishable statistics based on cookies or hardware fingerprints. Because smart devices provide sufficient hardware fingerprints and computing power, Umeng has focused on distinguishable statistics from day one.
The IDs used by most mobile statistics are derived from system IDs, including but not limited to the IMEI, the MAC address, and the Android ID. The best-known system ID is the UDID; under privacy pressure, Apple eventually withdrew access to both the UDID and the MAC address.
Most website statistics are cookie-based and therefore rely on transient IDs (temporal IDs). OpenUDID is a typical transient ID.
Apple's IDFA and IDFV are system IDs, but they are also transient IDs.
Because distinguishable statistics involve user privacy, Umeng's calculations use its own UMID rather than a system ID. Umeng does not provide third parties with data containing the original ID or the UMID [1]; it provides only aggregated results. UMID is neither a system ID nor a transient ID; it is an evolving ID scheme. This article explains why Umeng designed UMID and why the scheme needs continuous improvement.
2. ID Quality
Distinguishable statistics rest on a reliable identifier. This looks simple: pick an ID, or construct a cookie-like ID, and you can compute unique users, retention, and other metrics. Unfortunately, apart from the UDID that Apple has retired, hardly any ID comes close to perfect.
To simplify the discussion, ignore fake data for now and assume that each device has a true identity X. The goal of distinguishable statistics is to choose an identifier I such that statistics based on I are as consistent as possible with statistics based on X.
First, we introduce two concepts: ID conflict (collision) and ID drift (jitter).
ID Conflict
For a device cohort, we can always count the distinct values of X and I observed in a given time period, denoted count(X) and count(I). If, within a sufficiently short period,
count(X) > count(I)
then we call I an ID that has a conflict.
ID Drift
For a device cohort, we can always count the distinct values of X and I observed in a given time period, denoted count(X) and count(I). If, over a sufficiently long period,
count(X) < count(I)
then we call I an ID that has drift.
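The two definitions can be illustrated with a minimal sketch. It assumes a test cohort where the true identity X of each device is known; the function names and the sample data are hypothetical. The caller chooses the window (short for conflicts, long for drift), and an ID that has both problems may show counts that partly cancel out.

```python
def id_counts(observations):
    """Return (count_x, count_i) for one time window.

    `observations` is a list of (true_identity, observed_id) pairs.
    """
    observations = list(observations)
    count_x = len({x for x, _ in observations})
    count_i = len({i for _, i in observations})
    return count_x, count_i


def classify(observations):
    """Apply the definitions: conflict if count(X) > count(I), drift if count(X) < count(I)."""
    count_x, count_i = id_counts(observations)
    if count_x > count_i:
        return "conflict"
    if count_x < count_i:
        return "drift"
    return "consistent"


# Hypothetical short window: two devices report the same IMEI.
short_window = [("dev1", "imei-A"), ("dev2", "imei-A"), ("dev3", "imei-B")]
# Hypothetical long window: one device reports two different MAC addresses.
long_window = [("dev1", "mac-1"), ("dev1", "mac-2"), ("dev2", "mac-3")]

short_result = classify(short_window)
long_result = classify(long_window)
```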
The IMEI of Android devices is an ID with serious conflicts: by our estimate, its conflict rate exceeds 3%, because many shanzhai (knockoff) phones share the same IMEI.
The MAC address of Android devices also has conflicts, because many virtual machines share the same MAC address. In addition, the MAC address is a typical example of an ID with serious drift, because code in the Android source that randomly generates the last 24 bits of the MAC address has been widely abused (further reading: MAC address drift problem).
Qualitative Analysis
Next, we can qualitatively analyze the impact of ID collisions and drift on statistics:
When an ID only has conflicts, DAU and installs measured with it are underestimated, while retention may be overestimated. These effects are moderate: for example, a 5% ID conflict rate causes DAU to be underestimated by at most 5%, and its effect on retention is negligible.
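The "at most 5%" bound can be checked with a small worked example. This is a sketch with made-up data: 100 active devices, of which 5 report an IMEI that another device already uses. The device names and the `measured_dau` helper are hypothetical.

```python
def measured_dau(active_devices, observed_id):
    """Count distinct observed IDs among the active devices."""
    return len({observed_id[d] for d in active_devices})


devices = [f"dev{n}" for n in range(100)]
observed = {d: f"imei-{d}" for d in devices}

# Hypothetical worst case: the last 5 devices (5%) all report dev0's IMEI.
for d in devices[95:]:
    observed[d] = observed["dev0"]

dau = measured_dau(devices, observed)
underestimate = 1 - dau / len(devices)
```

In this worst case every conflicting device disappears into an ID that is already counted, so the measured DAU is 95 and the underestimate equals the conflict rate. If the conflicting devices collided only with each other, the loss would be smaller.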
When an ID only drifts, DAU and installs measured with it are overestimated, and retention is affected as well. When drift is large, the impact on the metrics is dramatic. For example, an ID with 5% daily drift may cause DAU to be overestimated by 2%, but it produces 5% false installs per day, because drift affects all users, including inactive ones. Retention of these false installs is high in the short term but low in the long term (drift is high over short periods and lower over long ones). Any cookie-style ID behaves similarly, which is why traditional website statistics are moving to more reliable device fingerprints.
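One way to see why drift distorts retention is a simplified model, which is an assumption of this sketch and not taken from measured data: each device's ID drifts independently with a fixed daily probability and never drifts back. A device that is truly retained is then only recognized if its original ID has survived every day since install.

```python
def measured_retention(true_retention, daily_drift, days):
    """Retention as seen through a drifting ID.

    Assumes (hypothetically) independent drift with probability
    `daily_drift` per device per day, and that a drifted ID never returns.
    """
    id_survival = (1.0 - daily_drift) ** days
    return true_retention * id_survival


# Hypothetical true retention figures, with the 5% daily drift from the text.
day7 = measured_retention(true_retention=0.40, daily_drift=0.05, days=7)
day30 = measured_retention(true_retention=0.20, daily_drift=0.05, days=30)
```

Under these assumptions a true day-7 retention of 40% would be measured as roughly 28%, and a true day-30 retention of 20% as roughly 4%. The retained devices that are no longer recognized show up as new installs instead.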
When an ID has both conflicts and drift, DAU and installs measured with it are completely unreliable. Take the MAC address: on the subset of devices affected by drift, the MAC address changes frequently, producing a large number of spurious installs with very low retention. For applications with a small user base, choosing such an ID is catastrophic.
In summary, when the drift and conflicts of an ID are small enough, their effect on distinguishable statistics can be ignored. When they are not negligible, the impact of ID conflicts is moderate, while ID drift can seriously distort install and retention statistics.
3. ID Selection
iOS Platform
With Apple retiring the UDID and MAC address access, and with OpenUDID no longer able to share its ID across apps through the pasteboard as of iOS 7, control of the device ID is back in Apple's hands. This also signals Apple's determination to protect user privacy.
In the post-iOS 7 era the choice of ID is clearer. The IDs in common use across the industry are IDFA (the advertising identifier, advertisingIdentifier) and IDFV (the vendor identifier, identifierForVendor). IDFA suits cross-application user tracking such as external advertising and cross-promotion; IDFV suits tracking user behavior within an app.
A mobile statistics platform must, of course, also guarantee compatibility and fault tolerance in its statistics. That is why we keep emphasizing a continuously optimized UMID scheme instead of any one specific ID.
Android Platform
As for the Android platform, the choice of ID has always been a headache due to the openness of the ecosystem.
(1) Single ID
As mentioned earlier, neither the IMEI nor the MAC address is a good ID. The MAC address in particular is almost unusable.
(2) Combination ID
Some developers combine several IDs into a single combination ID (CID), for example
CID = MD5(imei + mac + android_id)
From the preceding analysis it is easy to see that a combination ID greatly reduces conflicts but amplifies drift: drift in any one of the source IDs causes the combination ID to drift.
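The amplification can be sketched in a few lines. The function names and sample values below are hypothetical, and the drift formula assumes that the source IDs drift independently of one another, which real devices may not satisfy.

```python
import hashlib


def combination_id(imei, mac, android_id):
    """CID = MD5(imei + mac + android_id), returned as a hex string."""
    raw = (imei + mac + android_id).encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def combined_drift(*source_drift_rates):
    """Probability that the CID changes, assuming independent source drift."""
    stable = 1.0
    for rate in source_drift_rates:
        stable *= 1.0 - rate
    return 1.0 - stable


# Hypothetical device whose MAC address changes while the other IDs stay fixed.
before = combination_id("356938035643809", "02:00:00:aa:bb:cc", "9774d56d682e549c")
after = combination_id("356938035643809", "02:00:00:11:22:33", "9774d56d682e549c")

# Hypothetical per-source drift rates for IMEI, MAC, and Android ID.
cid_drift = combined_drift(0.01, 0.05, 0.02)
```

Here `before` and `after` differ even though only one of the three inputs changed, and `cid_drift` is larger than the drift rate of the worst single source.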
Developers should avoid CIDs where possible; if a CID must be used, leave the MAC address out of it. If you are already using a CID, persist it as a cookie ID in your next release and regenerate it only if the cookie is lost. This strategy preserves the continuity of the ID as far as possible while mitigating the impact of drift.
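The persist-then-regenerate strategy can be sketched as follows. This is an illustration only: a plain file stands in for whatever persistent storage the app uses, and `stable_cid`, the file name, and the sample values are hypothetical.

```python
import tempfile
from pathlib import Path


def stable_cid(store_path, compute_cid):
    """Return the persisted CID, computing a new one only if none is stored."""
    path = Path(store_path)
    if path.exists():
        saved = path.read_text(encoding="utf-8").strip()
        if saved:
            return saved  # reuse the stored ID even if the source IDs have changed
    cid = compute_cid()
    path.write_text(cid, encoding="utf-8")
    return cid


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "cid.txt"
    # First launch: nothing stored, so the CID is computed and saved.
    first = stable_cid(store, lambda: "cid-before-drift")
    # Later launch: a source ID has drifted, but the stored CID is reused.
    second = stable_cid(store, lambda: "cid-after-drift")
```

The stored value behaves like a cookie: it survives drift in the source IDs, and a fresh CID is generated only when the stored copy is gone.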
4. Umeng ID Scheme
UMID
Because UMID is still evolving, we can only describe it briefly in terms of its life cycle. UMID is an extremely conservative ID: once a device has been assigned a UMID, Umeng tries to ensure that the UMID never changes. UMID's generation strategy is therefore constrained by Umeng's historical data, and its most important design goals are stability and data consistency. Umeng continuously monitors conflicts and drift, minimizing drift and keeping conflicts within reasonable bounds. Just as there is no perpetual motion machine, UMID is not a perfect ID.
Prime Radiant
To further improve ID quality, Umeng has launched a new SDK. It took almost a year from design to release, and its internal code name, Prime Radiant, comes from Asimov's science fiction. With the new features in Prime Radiant, Umeng can better monitor the quality of each ID signal source and adjust its strategy based on actual data, weighing the strengths and weaknesses of device IDs and transient IDs. Prime Radiant also uses the computing power of smart devices, together with cryptography, to improve data quality and reliability.
Thanks to the data collected during the Prime Radiant test phase, Umeng can accurately quantify the quality of each type of ID, and many of the conclusions in this article depend on that data. Developers who care about data quality and data security are encouraged to upgrade to the new version of the Umeng SDK to try it out and evaluate it.
New version of the SDK: iOS Statistics SDK, Android Statistics SDK
[1] Umeng exchanges data containing IDs with third parties only for advertising reconciliation and data calibration.