MongoDB application case: Using MongoDB to store log data

Tags: createIndex, mongodb, sharding

Services running in production generate large volumes of operational and access logs that contain errors, warnings, and user-behavior information. Services usually write this information as plain-text logs, which are readable and convenient for day-to-day troubleshooting. Once the volume grows large, however, mining valuable content out of the logs requires storing and analyzing the data further.

Taking the access log of a Web service as an example, this article describes how to use MongoDB to store and analyze log data so as to maximize its value. The same approach applies to other log-storage scenarios.

Schema Design

A typical Web server access log looks like the following. It contains the access source, the user, the requested resource path, the response status, and the client's operating system and browser type.

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The simplest way to store these logs is to put each log line into its own document, where each line is stored in MongoDB like this:

{
    _id: ObjectId('4f442120eb03305789000000'),
    line: '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
}

Although this schema solves the storage problem, it makes the data hard to analyze, because MongoDB is not well suited to analyzing raw text. A better approach is to extract the value of each field before storing the log line in a MongoDB document, as shown below: the log line above becomes a document with many fields.

{
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    logname: null,
    user: 'frank',
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    request: "GET /apache_pb.gif HTTP/1.0",
    status: 200,
    response_size: 2326,
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}
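In practice this extraction happens in the application before the insert. The following is a minimal sketch of such a parse step in JavaScript (the regular expression, the parseLogLine helper, and the crude date handling are illustrative assumptions, not part of the original article):

// Parse one Apache "combined" log line into a document (illustrative sketch).
var LOG_RE = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;
function parseLogLine(line) {
    var m = LOG_RE.exec(line);
    if (m === null) return null;   // line does not match the expected format
    return {
        host: m[1],
        user: m[3] === "-" ? null : m[3],
        // naive date handling, good enough for a sketch: "10/Oct/2000:13:55:36 -0700" -> "10 Oct 2000 13:55:36 -0700"
        time: new Date(m[4].replace(/\//g, " ").replace(":", " ")),
        path: m[6],
        request: m[5] + " " + m[6] + " " + m[7],
        status: parseInt(m[8], 10),
        response_size: m[9] === "-" ? 0 : parseInt(m[9], 10),
        referer: m[10],
        user_agent: m[11]
    };
}
db.events.insert(parseLogLine('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'));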

During this conversion, fields that are not helpful for data analysis can be filtered out directly to reduce storage consumption. For example:

    • If the analysis does not care about the user, request, or status information, those fields do not need to be stored.
    • The ObjectId itself contains a timestamp, so a separate time field is not strictly required (although keeping one has advantages: the time field better represents when the request actually happened, and queries are easier to write; in general, prefer data types that take up less storage space).

Based on the considerations above, a log line might finally be stored as follows:

{
    _id: ObjectId('4f442120eb03305789000000'),
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
}
Writing logs

The log storage service needs to support a high volume of concurrent log writes. Users can tune the writeConcern to trade write throughput against durability; see the MongoDB documentation on writeConcern for details.

db.events.insert({
    host: "127.0.0.1",
    time: ISODate("2000-10-10T20:55:36Z"),
    path: "/apache_pb.gif",
    referer: "http://www.example.com/start.html",
    user_agent: "Mozilla/4.08 [en] (Win98; I ;Nav)"
})
    • To achieve the highest write throughput, specify a writeConcern of {w: 0} (see the sketch after this list).
    • If the logs are of high importance (for example, they are used as billing evidence), use a safer writeConcern level, such as {w: 1} or {w: "majority"}.
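For example (a minimal sketch; the document shown is abbreviated from the schema above):

db.events.insert(
    { host: "127.0.0.1", time: ISODate("2000-10-10T20:55:36Z"), path: "/apache_pb.gif" },
    { writeConcern: { w: 0 } }            // fire-and-forget: highest throughput, no acknowledgment
)
db.events.insert(
    { host: "127.0.0.1", time: ISODate("2000-10-10T20:55:36Z"), path: "/apache_pb.gif" },
    { writeConcern: { w: "majority" } }   // wait until a majority of replica set members acknowledge the write
)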

To further improve write efficiency, users can also consider batching writes, so that one network request writes multiple log entries:

db.events.insert([doc1, doc2, ...])
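A minimal sketch with insertMany (available since MongoDB 3.2; the documents are illustrative):

db.events.insertMany(
    [
        { host: "127.0.0.1", time: ISODate("2000-10-10T20:55:36Z"), path: "/apache_pb.gif" },
        { host: "10.0.0.2", time: ISODate("2000-10-10T20:55:37Z"), path: "/index.html" }
    ],
    { ordered: false }   // unordered: the server keeps inserting the rest even if one document fails
)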
Querying logs

Once the logs are stored in MongoDB as described above, a variety of query needs can be met.

Querying all requests for /apache_pb.gif
q_events = db.events.find({'path': '/apache_pb.gif'})

If this query is very frequent, you can create an index on the path field to serve it efficiently:

db.events.createIndex({path: 1})
Querying all requests on a given day
q_events = db.events.find({'time': {'$gte': ISODate("2016-12-19T00:00:00.00Z"), '$lt': ISODate("2016-12-20T00:00:00.00Z")}})

Indexing the time field speeds up this type of query:

db.events.createIndex({time: 1})
Querying all requests from a host over a time period
q_events = db.events.find({'host': '127.0.0.1', 'time': {'$gte': ISODate("2016-12-19T00:00:00.00Z"), '$lt': ISODate("2016-12-20T00:00:00.00Z")}})

This type of query can be accelerated by building a compound index on host and time:

db.events.createIndex({host: 1, time: 1})

Similarly, users can apply MongoDB's aggregation framework or map-reduce for more complex analysis; when doing so, build reasonable indexes to keep the queries efficient.
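As a hedged illustration, the following aggregation counts requests per path on a given day (field names follow the schema above):

db.events.aggregate([
    { $match: { time: { $gte: ISODate("2016-12-19T00:00:00Z"), $lt: ISODate("2016-12-20T00:00:00Z") } } },
    { $group: { _id: "$path", count: { $sum: 1 } } },   // count requests per path
    { $sort: { count: -1 } }                            // most-requested paths first
])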

Data sharding

As the number of service nodes producing logs grows, the log storage service must provide scalable write capacity as well as room for a massive amount of log data. This is where MongoDB sharding comes in: the log data is distributed across multiple shards, and the key problem becomes choosing the shard key.

Sharding by a timestamp field

One simple approach is to shard by a timestamp (such as the ObjectId-typed _id, or the time field), but it has the following problems:

    • Because timestamps grow monotonically, new writes all land on the same shard, so write capacity cannot be scaled out.
    • Many log queries target the most recent data, which is concentrated on only one or a few shards, so these queries fall on just part of the cluster.
Sharding by a random field

Sharding on a hash of the _id field spreads both data and writes evenly across the shards, and write capacity grows linearly with the number of shards. The drawback of this scheme is that data is scattered with no order at all, so every range query (which data analysis often needs) has to be sent to all shards and the results merged afterwards, which hurts query efficiency.
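A minimal sketch of this scheme in the mongos shell (the database name logdb is an assumption for illustration):

sh.enableSharding("logdb")
sh.shardCollection("logdb.events", { _id: "hashed" })   // hashed shard key spreads writes evenly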

Sharding by an evenly distributed key

Assuming that the path field in the scenario above is fairly evenly distributed, and that many queries break down along the path dimension, you can consider sharding the log data on the path field. The advantages are:

    • Write requests are spread across the shards.
    • Query requests on path are concentrated on one (or a few) shards, so such queries are efficient.

The shortcomings are:

    • If one particular path is accessed extremely often, its documents pile up in a single chunk that can live on only one shard, which easily becomes an access hotspot.
    • If path takes very few distinct values, the data likewise cannot be distributed well across the shards.

Of course, these shortcomings can be mitigated by introducing an extra factor into the shard key. For example, if the original shard key is {path: 1}, adding an extra factor turns it into

{path: 1, ssk: 1}, where ssk can be a random value, such as a hash of _id, or a timestamp, so that documents with the same path are still ordered by time.

The effect is that the shard key's values are richly distributed, and no single value appears in overwhelming numbers.
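A sketch of declaring such a compound shard key, assuming the same illustrative logdb database and that the application fills in ssk at insert time:

sh.shardCollection("logdb.events", { path: 1, ssk: 1 })
// the application supplies ssk on every insert, e.g. a random value or a hash of _id
db.events.insert({ host: "127.0.0.1", time: ISODate("2000-10-10T20:55:36Z"), path: "/apache_pb.gif", ssk: Math.floor(Math.random() * 1024) })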

Each of the sharding approaches above has its advantages and disadvantages; choose a scheme according to your actual needs.

Responding to data growth

Sharding provides support for massive data storage, but as the data grows, so does the storage cost. Log data usually has one helpful property: its value decreases over time. Historical data from a year ago, or even three months ago, that is no longer needed for analysis can be cleaned up to reduce storage cost, and MongoDB offers several ways to support this requirement.

TTL Index

A TTL index lets documents expire automatically after a certain period. For example, the time field above records when the request occurred; building a TTL index on that field causes documents to be deleted automatically after 30 days:

db.events.createIndex( { time: 1 }, { expireAfterSeconds: 2592000 } )   // 2592000 seconds = 30 days

TTL expiry is currently handled by a single background thread (running every 60 seconds by default) that removes expired documents. If the write volume is very high and a large backlog of expired documents builds up, deletion may not keep up and storage space will not be released in time.

Using a capped collection

If there is no strict requirement on how long logs are kept, but there is a limit on total storage space, consider storing the log data in a capped collection. You specify a maximum storage size (or document count), and once the threshold is reached MongoDB automatically deletes the oldest documents in the collection.

db.createCollection("event", {capped: true, size: 104857600000}
Periodically archiving by collection or database

For example, at the end of each month, rename the events collection to a name that includes the month, and create a new events collection for subsequent writes; the 2016 logs would then end up in the following 12 collections:

 events-201601 events-201602 events-201603 events-201604 .... events-201612
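A minimal sketch of one month-end rotation in the shell (the month in the name is illustrative):

db.events.renameCollection("events-201612")   // archive the current month's data under a dated name
db.createCollection("events")                 // start a fresh collection for new writes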

When historical data needs to be cleaned up, simply drop the corresponding collections:

db["events-201601"].drop() db["events-201602"].drop()

The shortcoming is that querying data spanning several months becomes slightly more complex: the query has to be run against multiple collections and the results merged.
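A rough sketch of such a merge in the shell, assuming the same filter is applied to two monthly collections:

var q = { path: "/apache_pb.gif" };
var results = db["events-201611"].find(q).toArray().concat(db["events-201612"].find(q).toArray());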
