Large-data paging: Twitter's cursor approach to paging through web data


The conventional technical implementation of such a paging feature, taking MySQL as an example:
SELECT * FROM msgs WHERE thread_id = ? LIMIT page * count, count
But looking at the Twitter API, we find that many endpoints use a cursor approach rather than the more intuitive page/count form, for example the followers/ids endpoint:
URL:
http://twitter.com/followers/ids.format
Returns an array of numeric IDs for every user following the specified user.
Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned in the response body's next_cursor and previous_cursor attributes to page back and forth in the list.
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903
As you can see, the http://twitter.com/followers/ids.xml call takes a cursor parameter for paging instead of the traditional url?page=n&count=n form. What are the advantages of doing this? Does each cursor maintain a snapshot of the current dataset? Does it prevent duplicates in the results when the underlying set changes in real time?
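To make the calling convention concrete, here is a minimal client-side sketch of paging with cursors. It assumes the JSON variant of the endpoint and the response fields named in the docs above (ids, next_cursor); treating a next_cursor of 0 as the end of the list follows Twitter's documented convention, and fetch_follower_ids is just an illustrative name.

import json
import urllib.request

def fetch_follower_ids(screen_name):
    # Walk the follower list block by block; -1 starts paging from the top.
    ids, cursor = [], -1
    while cursor != 0:  # a next_cursor of 0 marks the last page
        url = ("http://twitter.com/followers/ids/%s.json?cursor=%d"
               % (screen_name, cursor))
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        ids.extend(body["ids"])
        cursor = body["next_cursor"]  # opaque token for the next block
    return ids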
In the cursor expiration discussion on Google Groups, Twitter architect John Kalucki mentioned:
A cursor is an opaque deletion-tolerant index into a btree keyed by source userid and modification time. It brings you to a point in the reverse chron sorted list. So, since you can't change the past, other than erasing it, it's effectively stable. (Modifications bubble to the top.) But you have to deal with additions at the list head and also block shrinkage due to deletions, so your blocks begin to overlap quite a bit as the data ages. (If you cache cursors and read much later, you'll see the first few rows of cursor[n+1]'s block as duplicates of the last rows of cursor[n]'s block. The intersection cardinality is equal to the number of deletions in cursor[n]'s block.) Still, there may be value in caching these cursors and then heuristically rebalancing them when the overlap crosses some threshold.
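A hypothetical sketch of what a client that cached cursors would have to do with the overlapping blocks Kalucki describes: because the IDs themselves are stable, the duplicates at block boundaries can simply be dropped with a seen-set (the data below is made up for illustration).

def replay_cached_cursors(blocks):
    # Merge blocks fetched via cached cursors, dropping the rows that
    # reappear at the head of cursor[n+1]'s block after deletions shrink
    # cursor[n]'s block (blocks arrive in reverse-chron cursor order).
    seen, merged = set(), []
    for block in blocks:
        for user_id in block:
            if user_id not in seen:
                seen.add(user_id)
                merged.append(user_id)
    return merged

# One deletion in the first block makes the second overlap by one row:
print(replay_cached_cursors([[9, 8, 7], [7, 6, 5]]))  # [9, 8, 7, 6, 5]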
In another thread, "new cursor-based pagination not multithread-friendly", John also mentions:
The page-based approach does not scale with large sets. We can no longer support this kind of API without throwing a painful number of 503s.

Working with row-counts forces the data store to recount rows in an O(n^2) manner. Cursors avoid this issue by allowing practically constant-time access to the next block. The cost becomes O(n/block_size) which, yes, is O(n), but a graceful one given n < 10^7 and a block_size of 5000. The cursor approach provides a more complete and consistent result set.

Proportionally, very few users require multiple page-fetches with a page size of 5,000.

Also, scraping the social graph repeatedly at high speed could often be considered a low-value, borderline abusive use of the social graph API.
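To put rough numbers on that complexity claim (an illustrative cost model with the figures from the quote, not Twitter's actual measurements): with offset-based paging, each fetch must rescan all rows before its offset, while a cursor fetch seeks straight to its block.

# Illustrative cost model for the claim above (assumed numbers from the quote).
n, block_size = 10_000_000, 5_000
# Offset paging: the fetch at offset k rescans k rows first, so total rows
# touched across a full crawl grow quadratically in the number of blocks.
offset_rows_scanned = sum(offset for offset in range(0, n, block_size))
# Cursor paging: one near-constant-time fetch per block.
cursor_fetches = n // block_size
print(offset_rows_scanned)  # ~10^10 rows rescanned in total
print(cursor_fetches)       # 2000 block fetches, i.e. O(n / block_size)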
It is clear from these two quotes that the main purpose of using cursors over large result sets is a significant performance gain. Take MySQL as an example again: to page to around row 100,000 without a cursor, the corresponding SQL is
SELECT * FROM msgs LIMIT 100000, 100
On a table with millions of rows, the first execution of this SQL can take more than 5 seconds, because MySQL must scan and discard the first 100,000 rows before returning anything.
If we instead use the table's primary-key value as cursor_id, the cursor-style paging SQL can be optimized to
SELECT * FROM msgs WHERE id > cursor_id ORDER BY id LIMIT 100;
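A minimal sketch of the full keyset-paging loop built on that query, assuming the mysql-connector-python driver and the msgs table above (any DB-API driver works the same way); the last primary-key value of each block becomes the next cursor, and the ORDER BY id keeps each block deterministic.

import mysql.connector  # assumed driver; any DB-API client works similarly

def page_msgs(conn, block_size=100):
    # Keyset pagination: each query seeks to id > cursor_id via the
    # primary-key index instead of scanning and discarding an offset.
    cursor_id = 0  # start before the first row
    cur = conn.cursor(dictionary=True)
    while True:
        cur.execute(
            "SELECT * FROM msgs WHERE id > %s ORDER BY id LIMIT %s",
            (cursor_id, block_size),
        )
        rows = cur.fetchall()
        if not rows:
            break  # past the last block
        yield rows
        cursor_id = rows[-1]["id"]  # last PK value is the next cursor

# conn = mysql.connector.connect(user="...", database="...")  # hypothetical
# for block in page_msgs(conn):
#     handle(block)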
