Technology Selection and Architecture Realization of User Portrait

Source: Internet
Author: User
Tags: redis cluster

This article explains the technical architecture and overall implementation of a user portrait (user profiling) system, then discusses one possible architecture (the author's personal take) from three aspects: data collation, the data platform, and application-oriented services.

Data collation

1. The data indicators are combed from the logs that the business systems accumulate daily. They can be imported into HDFS through Sqoop, or pulled in code, for example by using Spark's JDBC connector to read from a traditional relational database. Another approach is to write the data to a local file first and then import it into HDFS using Spark SQL's load or Hive's import/export facilities.
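As a minimal sketch of the Sqoop route, the snippet below assembles a `sqoop import` command line for pulling one relational table into HDFS. The JDBC URL, database, table, and target directory are hypothetical placeholders, not values from the original system.

```python
# Minimal sketch: assemble a Sqoop import invocation for pulling a
# relational table into HDFS. All connection details are invented examples.
def build_sqoop_import(jdbc_url, table, target_dir, username, num_mappers=4):
    """Return the sqoop CLI invocation as a list of arguments."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "-m", str(num_mappers),  # number of parallel map tasks
    ]

cmd = build_sqoop_import(
    "jdbc:mysql://db-host:3306/billing", "user_log", "/data/raw/user_log", "etl")
print(" ".join(cmd))
```

Building the argument list in code (rather than hand-editing shell scripts) makes it easy to loop over many source tables in one nightly ETL job.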

2. Hive UDFs or HiveQL are written to stitch together the ETL steps according to the business logic, so that each user is mapped to different user tag data (here an indicator can be understood as a label attached to each user). This generates the corresponding source tables, from which the downstream user portrait system builds the tag wide table according to different rules.
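The tagging step can be pictured as a per-user rule function, as a Hive UDF or a HiveQL `CASE` expression would express it. The thresholds and field names below are illustrative assumptions, not the original business rules:

```python
# Hedged sketch of the tagging step: map each user's raw indicators to label
# values. Thresholds and field names are invented for illustration.
def tag_user(record):
    """Derive simple labels from a user's monthly indicator record."""
    tags = {}
    tags["heavy_traffic"] = record.get("traffic_mb", 0) > 200
    tags["high_spender"] = record.get("spend", 0) > 100
    tags["age_band"] = "youth" if record.get("age", 0) < 30 else "adult"
    return tags

print(tag_user({"traffic_mb": 350, "spend": 120, "age": 25}))
# {'heavy_traffic': True, 'high_spender': True, 'age_band': 'youth'}
```

Materializing one column per tag like this is what produces the "wide" shape of the downstream tag table.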

Data platform

1. The data platform uses Hadoop HDFS as the distributed file system, because after Hadoop 2.0 any big data application (for example spark-submit or Hive) can request resources and register its service through the YARN ResourceManager. Memory-based computing frameworks do not use Hadoop's MapReduce. Of course, many people still lean toward Hadoop for offline batch processing, but the functions Hadoop encapsulates are limited to map and reduce, which is too simplistic; a framework like Spark encapsulates many more operators (see this blog's Spark column), which can greatly improve development efficiency.
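To make the "richer operators" point concrete, the sketch below expresses a filter-then-aggregate job as a short pipeline; in raw MapReduce the same job would need a hand-written map phase and reduce phase. Plain Python stands in for Spark's RDD/DataFrame API, and the records are made-up examples:

```python
# Illustrative sketch: filter -> map -> reduceByKey as a short operator
# chain, using plain Python in place of Spark's API. Data is invented.
from collections import Counter

records = [
    {"user": "u1", "city": "HZ", "spend": 120},
    {"user": "u2", "city": "SH", "spend": 80},
    {"user": "u3", "city": "HZ", "spend": 150},
]

high_spenders = [r for r in records if r["spend"] > 100]   # filter
city_counts = Counter(r["city"] for r in high_spenders)    # map + reduceByKey

print(dict(city_counts))  # {'HZ': 2}
```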

2. The computing frameworks used are Spark and RHadoop. Spark serves two main purposes: one is data processing and rule-based filtering for the upper-layer application (Spark code written in Scala and submitted through spark-submit); the other is Spark SQL serving the upper-layer application (by starting the Spark Thrift Server, to which the front-end app connects). RHadoop is mainly used for scoring the tag data, for example applying collaborative filtering and other recommendation algorithms to score the data from various aspects.
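The scoring idea can be sketched with a minimal item-based collaborative filtering score using cosine similarity. The rating matrix and item names are invented; a real RHadoop or Spark job would compute this at scale:

```python
# Hedged sketch of collaborative-filtering scoring: cosine similarity
# between item interaction vectors. All data here is invented.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# rows = items, columns = per-user interaction strengths (assumed data)
item_vectors = {
    "plan_A": [5, 0, 3],
    "plan_B": [4, 0, 4],
    "plan_C": [0, 5, 0],
}

# score how similar the other plans are to plan_A
scores = {item: cosine(item_vectors["plan_A"], vec)
          for item, vec in item_vectors.items() if item != "plan_A"}
print(scores)
```

Items with high similarity to what a user already consumes become recommendation candidates; the scores can also be written back as graded tag values.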

3. MongoDB, as an in-memory-oriented document store, mainly serves real-time queries for individual users. After Spark finishes carding the data, the tag wide table is converted to JSON format and loaded into MongoDB, so the front-end application can query a single user's tags by connecting to MongoDB and display them one by one. (Alternatively, the data can be converted into key-value form and imported into a Redis cluster.)
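The format conversion described above can be sketched as follows: one row of the tag wide table becomes a JSON document for MongoDB, or a set of key-value pairs for Redis. The field names and the `portrait:<user_id>` key scheme are assumptions for illustration:

```python
# Sketch of the wide-table format conversion. Field names and the Redis
# key scheme are invented examples, not the original system's layout.
import json

def row_to_mongo_doc(user_id, tags):
    """JSON document keyed by user id, ready for a MongoDB insert."""
    return json.dumps({"_id": user_id, "tags": tags}, sort_keys=True)

def row_to_redis_pairs(user_id, tags):
    """Flatten the same row into (key, value) pairs for Redis SET/HSET."""
    return [(f"portrait:{user_id}:{name}", str(value))
            for name, value in tags.items()]

tags = {"heavy_traffic": True, "high_spender": False}
print(row_to_mongo_doc("u1001", tags))
print(row_to_redis_pairs("u1001", tags))
```

The document form suits "show me everything about user X" queries, while the flattened key form suits very fast lookups of a single tag.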

4. MySQL's role is to store the tag rules for the upper-layer application, as well as the information displayed on its pages. In the background, Spark works against the data wide table: it reads and collates the cached metadata from MySQL, uses it to drive filter, select, map, and reduce operations, and then processes the real data stored on HDFS accordingly.
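The rule-driven filtering can be pictured as follows: rule rows read from MySQL are composed into a predicate that is applied to wide-table rows. The rule format (`field`/`op`/`value`) and all data are invented for illustration:

```python
# Minimal sketch of metadata-driven filtering: rules stored in MySQL become
# a predicate over wide-table rows. The rule schema here is an assumption.
OPS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}

def build_predicate(rules):
    """Compose rule rows (as MySQL would return them) into one filter."""
    def pred(row):
        return all(OPS[r["op"]](row.get(r["field"], 0), r["value"])
                   for r in rules)
    return pred

rules = [{"field": "traffic_mb", "op": ">", "value": 200},
         {"field": "spend", "op": ">", "value": 100}]
rows = [{"user": "u1", "traffic_mb": 350, "spend": 120},
        {"user": "u2", "traffic_mb": 150, "spend": 200}]
matched = [r["user"] for r in rows if build_predicate(rules)(r)]
print(matched)  # ['u1']
```

Keeping the rules in MySQL rather than in code means operators can change tag definitions from the front end without redeploying the Spark jobs.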

Application-oriented

1. The steps above (data collation and computation on the data platform) have produced the large tag wide table that serves the upper-layer application (the tag information each user corresponds to). The front end then follows the business logic: operators tick different tags and combine them by summing or culling, for example users whose traffic this month exceeds 200 MB (one tag) AND whose consumption exceeds 100 (another tag). The front-end code splices these selections into SQL to explore the matching customer count. It does so over a JDBC connection to Spark's Thrift Server, and the cluster computes the query against the wide table on HDFS. (Note that many SQL aggregate functions and multi-table joins are equivalent to Hadoop MapReduce shuffles, which can easily lead to memory overflow; for the relevant parameter tuning, refer to the configuration information in this blog's Spark column.) The matching customers are thereby located, enabling customer-base analysis, tag analysis, and matching product strategies for precision marketing.
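The SQL splicing step can be sketched like this: ticked (and culled) tags are combined into a `WHERE` clause over the wide table, and the resulting count query is sent to the Thrift Server. The table and column names below are hypothetical:

```python
# Hedged sketch of front-end SQL splicing from selected tags. Table and
# column names are invented; a real system would also sanitize inputs.
def splice_count_sql(table, selected_tags, excluded_tags=()):
    """Build a customer-count query from ticked (and culled) tag columns."""
    conds = [f"{t} = 1" for t in selected_tags]
    conds += [f"{t} = 0" for t in excluded_tags]
    where = " AND ".join(conds) if conds else "1 = 1"
    return f"SELECT COUNT(DISTINCT user_id) FROM {table} WHERE {where}"

sql = splice_count_sql("tag_wide_table",
                       ["traffic_over_200m", "spend_over_100"],
                       excluded_tags=["churned"])
print(sql)
```

Because every tag is just a column of the wide table, any combination the operator ticks reduces to one flat `WHERE` clause, which is what keeps the exploration interactive.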


