MongoDB mapreduce-based statistical analysis

This article, "MongoDB mapreduce-based statistical analysis", summarizes problems we ran into while developing the oecp community and the experience gained in solving them.
The previous article briefly introduced one application of MongoDB in the oecp community: the design and implementation of dynamic messages. That application used only MongoDB's most basic query functions. This article covers a more advanced use: statistical analysis with MongoDB.
In the oecp community we need to store page access records so that we can analyze website traffic more accurately and recommend content that users are interested in. This data has the following characteristics:
    • It is unrelated to the business logic. Access records should be stored separately from business data to reduce pressure on the business database. Consistency requirements are low.
    • Every page access stores one record, so the demands on real-time insertion are high. A cache can serve as a temporary buffer to absorb the frequent writes.
    • The data grows rapidly with traffic. If a page receives 100 pageviews in a day, 100 new records are added. The volume is far higher than that of the business data, and orders of magnitude larger than the dynamic message data discussed last time. The website should keep at least two months of data where possible, so once traffic is high, storage of massive data becomes a problem that must be solved.
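The cache-as-buffer idea from the second point can be sketched in plain Java. This is a minimal illustration under assumptions, not the oecp implementation: the class name `PageViewBuffer`, the `batchSize` parameter, and the `sink` callback are all hypothetical, and the sink stands in for whatever bulk insert the storage layer offers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory buffer that hands records to a sink in batches. */
class PageViewBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;          // e.g. a bulk insert into the store
    private final List<T> pending = new ArrayList<>();

    PageViewBuffer(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one record and flushes once batchSize records are pending. */
    synchronized void add(T record) {
        pending.add(record);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends a copy of the pending records to the sink and clears the buffer. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

Records still in the buffer are lost if the process stops before a flush, which is acceptable here only because the consistency requirements for this data are low.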

Therefore, we again chose MongoDB as the persistent store. NoSQL databases generally offer limited query variety, especially plain key-value stores, so they are typically used only to store logs, and the data has to be extracted into a relational database before statistical queries can be run. MongoDB, however, provides rich query and statistics functions: both group and mapreduce can implement the equivalent of group by, sum, and count in SQL. The group function handles simple statistics, but it does not cope well with very large data volumes, so we began using mapreduce for statistical analysis.
Let's take a look at the official introduction to mapreduce:
db.runCommand(
{ mapreduce: <collection>,
  map: <mapfunction>,
  reduce: <reducefunction>
  [, query: <query filter object>]
  [, sort: <sort the query. useful for optimization>]
  [, limit: <number of objects to return from collection>]
  [, out: <output-collection name>]
  [, keeptemp: <true|false>]
  [, finalize: <finalizefunction>]
  [, scope: <object where fields go into javascript global scope>]
  [, verbose: true]
}
);
The Java driver provides two methods:
DBCollection.mapReduce(String map, String reduce, String outputCollection, DBObject query);
DBCollection.mapReduce(DBObject command); // When I followed the introduction above, this interface always reported an error, and I have not worked out how to use it.
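To make the shape of the command document concrete, here is a minimal sketch that assembles it as an ordered map in plain Java, without the driver. The class and method names are hypothetical. With the legacy driver, a `BasicDBObject` filled with the same keys in the same order would presumably play this role; the source does not confirm whether that resolves the error mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that lays out a mapreduce command as an ordered map. */
class MapReduceCommandSketch {

    static Map<String, Object> build(String collection, String map, String reduce,
                                     String out, Map<String, Object> query) {
        // LinkedHashMap keeps insertion order; the command name goes first
        // because MongoDB reads the first key as the command to run.
        Map<String, Object> cmd = new LinkedHashMap<>();
        cmd.put("mapreduce", collection);
        cmd.put("map", map);        // JavaScript source of the map function
        cmd.put("reduce", reduce);  // JavaScript source of the reduce function
        if (query != null) {
            cmd.put("query", query);  // optional filter applied before map runs
        }
        if (out != null) {
            cmd.put("out", out);      // optional output collection name
        }
        return cmd;
    }
}
```

The key names are lowercase, as in the syntax listing; the remaining optional keys (sort, limit, finalize, scope, verbose) would be added the same way.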

PV data storage structure (these attributes are kept mainly so that the data can later be analyzed along various dimensions):
entityId: entity ID,
entityName: entity name,
userId: ID of the (logged-in) visitor,
sessionId: session ID,
referer: source URL,
url: current page URL,
title: the title displayed,
date: access time,
ip: visitor IP

First application scenario: when someone visits a user's space, fetch that user's most recent access records. If the same page was accessed repeatedly, only the latest access is returned.

    • The first is the map method, which defines the structure of the output collection: {_id: key, value: value}
Java code
    String mapfun = "function() { emit({url: this.url, title: this.title}, this.date); }"; // key = {url: this.url, title: this.title}; the stored value is whatever the reduce method returns.
    • The second is the reduce method.
Java code
    String reducefun = "function(key, vals) { var date = 0; for (var i in vals) { if (date == 0) { date = vals[i]; } else if (vals[i] > date) { date = vals[i]; } } return date; }"; // Compares the values that share a key and returns the latest time.
    • Run
Java code
    DBObject query = new BasicDBObject();
    query.put("userId", userId);
    query.put("date", new BasicDBObject("$gte", fromDate));
    *.getCollection().mapReduce(mapfun, reducefun, "pageview_results", query); // It is best to define a query so that the set of documents fed into the statistics is smaller.
    • Result of traversing the pageview_results collection: [{_id: {url: "/blog/yongtree/258", title: 'blog 1'}, value: '2017-10-11 20:30:56'}, {_id: {url: "/blog/Slx/2010", title: 'blog 2'}, value: '2017-10-01 02:23:33'}]
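What the map and reduce functions of this scenario compute can be imitated in plain Java, which makes the logic easy to check without a database. This is a sketch under assumptions: the class name is hypothetical, dates are simplified to epoch milliseconds, the composite key is flattened to a "url|title" string, and the grouping step stands in for work MongoDB does on the server.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory imitation of the "latest visit per page" job. */
class LatestVisitSketch {

    /** One emitted pair: key = "url|title", value = visit time in epoch millis. */
    static final class Emitted {
        final String key;
        final long date;
        Emitted(String key, long date) {
            this.key = key;
            this.date = date;
        }
    }

    /** Mirrors the map function: emit({url, title}, date). */
    static Emitted map(String url, String title, long date) {
        return new Emitted(url + "|" + title, date);
    }

    /** Mirrors the reduce function: keep the largest (latest) date for one key. */
    static long reduce(String key, List<Long> vals) {
        long latest = vals.get(0);
        for (long v : vals) {
            if (v > latest) {
                latest = v;
            }
        }
        return latest;
    }

    /** Groups the emitted values by key, then reduces each group to one value. */
    static Map<String, Long> mapReduce(List<Emitted> emitted) {
        Map<String, List<Long>> groups = new LinkedHashMap<>();
        for (Emitted e : emitted) {
            groups.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e.date);
        }
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> g : groups.entrySet()) {
            out.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return out;
    }
}
```

Because taking a maximum gives the same answer however the values are split into groups, this reduce stays correct even when it is applied again to its own partial results, which MongoDB may do for a key with many values.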

 
Note: the mapfun and reducefun strings are written in JavaScript, which MongoDB parses and runs on the server side. If the script is written incorrectly, the program cannot execute normally.
 
Second application scenario: when a specific piece of content is being viewed, return the articles that visitors of that content browsed within a certain period, as guidance for the current user.

    • First, find the visitors who accessed the content within a certain period; they form the statistical condition. We use sessionId instead of userId so that visits by users who are not logged in are counted as well.
Java code
    DBObject query = new BasicDBObject();
    query.put("entityId", entityId);
    query.put("entityName", entityName);
    // Both bounds go into one condition object; a second put("date", ...) would replace the first.
    query.put("date", new BasicDBObject("$gte", fromDate).append("$lt", toDate));
    List sessionIds = this.fetchService.getCollection().distinct("sessionId", query); // distinct(String key, DBObject query) returns the values of key with duplicates removed, equivalent to SQL: select distinct(name) from table
    • Define the map method, which defines the structure of the output collection: {_id: key, value: number of views}
Java code
    String mapfun = "function() { emit({url: this.url, title: this.title}, 1); }"; // key = {url: this.url, title: this.title}; the stored value is whatever the reduce method returns. Since we are counting occurrences, each record emits the constant 1.
    • Define the reduce method
Java code
    String reducefun = "function(key, vals) { var count = 0; for (var i in vals) { count += vals[i]; } return count; }"; // Sums the occurrences of records that share a key.
    • Run
Java code
    *.getCollection().mapReduce(mapfun, reducefun, "pageview_results", new BasicDBObject("sessionId", new BasicDBObject("$in", sessionIds.toArray())));
    • Result of traversing the pageview_results collection: [{_id: {url: "/blog/yongtree/258", title: 'blog 1'}, value: '45.0'}, {_id: {url: "/blog/Slx/288", title: 'blog 2'}, value: '30.0'}]
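The two steps of this scenario, collecting distinct sessions and then counting page views for those sessions, can be imitated in plain Java to check the logic without a database. This is a sketch under assumptions: the class and method names are hypothetical, the record is trimmed to the fields the scenario needs (entityName and title are left out for brevity), and dates are simplified to epoch milliseconds.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory imitation of the "others also viewed" job. */
class AlsoViewedSketch {

    /** A trimmed page-view record. */
    static final class PageView {
        final String entityId;
        final String sessionId;
        final String url;
        final long date;
        PageView(String entityId, String sessionId, String url, long date) {
            this.entityId = entityId;
            this.sessionId = sessionId;
            this.url = url;
            this.date = date;
        }
    }

    /** Step 1: distinct sessions that viewed entityId with from <= date < to. */
    static Set<String> sessionsFor(List<PageView> views, String entityId, long from, long to) {
        Set<String> ids = new LinkedHashSet<>();
        for (PageView v : views) {
            if (v.entityId.equals(entityId) && v.date >= from && v.date < to) {
                ids.add(v.sessionId);
            }
        }
        return ids;
    }

    /** Step 2: emit (url, 1) for every view by those sessions and sum per url. */
    static Map<String, Integer> countByUrl(List<PageView> views, Set<String> sessionIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (PageView v : views) {
            if (sessionIds.contains(v.sessionId)) {
                counts.merge(v.url, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

As in the run step above, the counting pass filters only by session, so the page currently being viewed also appears in the result; a caller that wants pure recommendations would drop that entry before display.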

 
Front-end display results:

    • When a user's space is opened, it displays other content that the space owner is interested in.

    • When a piece of content is being browsed, it shows what other viewers of that content are interested in.


 
Keep following the oecp community: we will continue to build and publish more MongoDB-based applications. In the spirit of sharing, this document may be reproduced and applied, provided the source is indicated. Follow the yongtree blog in the oecp community.
