How is the graph feature implemented technically? Taking MySQL as an example, paging is traditionally written as:
SELECT * FROM msgs WHERE thread_id = ? LIMIT page * count, count
But looking at the Twitter API, we find that many interfaces use a cursor rather than the more intuitive page/count form, for example the followers IDs interface:
URL:
http://twitter.com/followers/ids.format

Returns an array of numeric IDs for every user following the specified user.

Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned in the response body's next_cursor and previous_cursor attributes to page back and forth in the list.
o example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1
o example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903
As you can see, a call to http://twitter.com/followers/ids.xml has to pass a cursor parameter for paging instead of the traditional url?page=n&count=n form. What is the advantage of this? Does each cursor keep a snapshot of the current dataset, to prevent duplicates in the query results when the underlying result set changes in real time?
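Concretely, the client starts with cursor=-1, reads the next_cursor value out of each response (such as the -1300794057949944903 in the second example URL above), and passes it back verbatim on the following request; paging continues until the API returns a next_cursor of 0, which marks the end of the list. previous_cursor works the same way in the opposite direction.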
In a Google Groups discussion on cursor expiration, Twitter architect John Kalucki explained:
A cursor is an opaque deletion-tolerant index into a btree keyed by source userid and modification time. It brings you to a point in time in the reverse chron sorted list. So, since you can't change the past, other than erasing it, it's effectively stable. (Modifications bubble to the top.) But you have to deal with additions at the list head and also blocks shrinking due to deletions, so your blocks begin to overlap quite a bit as the data ages. (If you cache cursors and read much later, you'll see the first few rows of cursor[n+1]'s block as duplicates of the last rows of cursor[n]'s block. The intersection cardinality is equal to the number of deletions in cursor[n]'s block.) Still, there may be value in caching these cursors and then heuristically rebalancing them when the overlap cardinality crosses some threshold.
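To picture this in MySQL terms, here is a rough sketch of what such a cursor could look like. The table and column names are hypothetical, not Twitter's actual schema; the point is only that the cursor encodes a position in a (user id, modification time) btree rather than a row offset:

-- hypothetical follow list, indexed the way Kalucki describes:
-- a btree keyed by source user id and modification time
CREATE INDEX idx_followers ON followers (user_id, modified_at, follower_id);

-- the cursor is the (modified_at, follower_id) pair of the last row of the
-- previous block; the next block starts just past that position in the
-- reverse-chron order
SELECT follower_id, modified_at
FROM followers
WHERE user_id = ?
  AND (modified_at, follower_id) < (?, ?)   -- values decoded from the cursor
ORDER BY modified_at DESC, follower_id DESC
LIMIT 5000;

Because the cursor names a position in the key space rather than an offset, deleted rows simply vanish from the block and new follows appear at the list head without shifting later blocks, which is exactly the deletion tolerance and block overlap behavior described in the quote.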
In another discussion, "new cursor-based pagination not multithread-friendly", John also mentions:
The page based approach does not scale with large sets. We can no longer support this kind of API without throwing a painful number of 503s.

Working with row-counts forces the data store to recount rows in an O(n^2) manner. Cursors avoid this issue by allowing practically constant time access to the next block. The cost becomes O(n/block_size) which, yes, is O(n), but a graceful one given n < 10^7 and a block_size of 5000. The cursor approach provides a more complete and consistent result set.

Proportionally, very few users require multiple page fetches with a page size of 5,000.

Also, scraping the social graph repeatedly at high speed could often be considered a low-value, borderline abusive use of the social graph API.
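To put numbers on that claim: with the page-based approach, fetching page k forces the store to count past roughly k * block_size rows, so reading a full list of n rows costs about block_size * (1 + 2 + ... + n/block_size), which is roughly n^2 / (2 * block_size) row visits. For n = 10^7 and block_size = 5000 that is on the order of 10^10 row visits, while the cursor approach touches each row about once: 10^7 rows across n/block_size = 2000 block fetches.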
It is clear from these two passages that the main purpose of using cursors for large result sets is performance. To illustrate with MySQL again: to page 100,000 records deep without a cursor, the corresponding SQL is
SELECT * FROM msgs LIMIT 100000, 100
On a table with millions of records, this SQL takes more than 5 seconds on its first (uncached) execution, because MySQL has to scan past and discard the 100,000 skipped rows.
Assuming we use the table's primary key value as the cursor_id, the cursor-style paging SQL can be optimized to
SELECT * FROM msgs WHERE id > cursor_id ORDER BY id LIMIT 100;
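The client then drives paging the same way a Twitter cursor is driven: remember the primary key of the last row in each block and hand it back as the next cursor_id. A minimal sketch, assuming an auto-increment id:

-- first page: no cursor yet, start from the beginning of the key range
SELECT * FROM msgs ORDER BY id LIMIT 100;

-- the client keeps the id of the last returned row (say it was 100123)
-- and passes it back as cursor_id for the next block
SELECT * FROM msgs WHERE id > 100123 ORDER BY id LIMIT 100;

Since id is the primary key, each query is an index seek followed by a 100-row scan no matter how deep the paging goes, whereas the LIMIT 100000, 100 form above has to walk past all of the skipped rows every time.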