Big Data discussion: How do you organize 170 billion tweets?


As social networking boomed, the Library of Congress found itself holding a Twitter archive that had grown to 133TB, and it urgently needed a way to manage such data.

To date, the Library of Congress has archived some 170 billion tweets, amounting to 133TB of stored files. Because each message has been shared and reposted across the social network, the library's technical team needs to find a way to offer users a practical search solution.

In its latest project report, library managers pointed out that the big data management tools available on the market cannot solve their practical difficulties. "It is clear that existing technologies can meet the access requirements of large datasets such as scholarly information, but they are weak at creating and distributing such data," the library said. "Because of the complexity of such tasks and their heavy resource demands, the private sector has not yet come up with a commercial solution that is reasonably cost-effective."

If even the private sector struggles with big data management jobs of this kind, how is a budget-strapped non-profit institution, even the world's largest library, supposed to tackle the problem? A practical, economical, and convenient indexing system that can handle 170 billion tweets still seems like a distant dream.

Twitter has signed an agreement allowing the Library of Congress to access every update posted on the social media site. Officials acknowledged that they must build a system to help researchers access this social platform data, since traditional channels of communication, represented by periodicals and other publications, are gradually being displaced by the growing popularity of online communication.

The first data dump, covering 2006 through 2010 when Twitter was still young, weighed in at 20TB and contained 21 billion tweets (along with metadata such as the user's location and a message description). The library recently received its second delivery; in total, the compressed replica files now amount to 133.2TB. Going forward, the library will work with the company Gnip to collect all new tweets within hours of posting. According to statistics published in February 2011, about 140 million messages were posted on Twitter every day, and by last October the figure had grown to roughly 500 million.
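To get a feel for what those figures imply, here is a rough back-of-envelope projection in Python. All inputs are the numbers quoted above; the derived per-tweet size and the projections are estimates for illustration, not official Library of Congress figures.

```python
# Back-of-envelope projection based on the figures quoted in the article.
# The per-tweet size is derived from the first dump, not an official number.

TB = 1024 ** 4  # bytes per terabyte

first_dump_bytes = 20 * TB          # 2006-2010 dump: 20TB
first_dump_tweets = 21_000_000_000  # 21 billion tweets in that dump

bytes_per_tweet = first_dump_bytes / first_dump_tweets
print(f"Approx. size per archived tweet: {bytes_per_tweet:.0f} bytes")

# Daily volume grew from ~140 million (Feb 2011) to ~500 million tweets/day.
daily_tweets = 500_000_000
daily_bytes = daily_tweets * bytes_per_tweet
print(f"Projected daily ingest:  {daily_bytes / TB:.2f} TB/day")
print(f"Projected yearly ingest: {daily_bytes * 365 / TB:.0f} TB/year")
```

At roughly a kilobyte per archived tweet, the archive would grow by close to half a terabyte every day at current posting rates, which is why the indexing problem keeps getting harder rather than easier.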

Researchers have urged the Library of Congress to open up data access as quickly as possible; the library says it has already received more than 400 such requests. A project being carried out in parallel by the library and Twitter will provide users with a history of their Twitter usage and can list every message they have posted through their accounts.

The Library of Congress is no stranger to big data management: according to staff, the library has been archiving data from government websites since 2000 and has accumulated more than 300TB in total. Twitter, however, has brought the archiving work to an impasse, because the library cannot find a suitable way to keep the information easily searchable. If it continues to rely on the tape storage scheme it has long used, a single query against just the 2006-2010 tweets would take up to 24 hours, and that slice accounts for only about one-eighth of the total data. "Twitter information is difficult to organize, partly because the volume of data is so large and partly because new data arrives every day, at a rate that is still rising," the officials said. "In addition, Twitter messages come in more and more varieties: ordinary tweets, automatic replies sent by software clients, manual replies, messages containing links or pictures, and so on. All of this leaves us unsure where to start."

The road to a solution is tortuous. The Library of Congress has begun to consider distributed and parallel computing schemes, but both are too expensive. "To achieve a truly significant reduction in search time, we would need to build a huge infrastructure of hundreds or even thousands of servers. That is too costly and unrealistic for an institution like ours."

So what exactly should the library do? Big data experts have offered a series of suggestions. For the Library of Congress, the technical team might be better off splitting the work: one tool to handle data storage, another for retrieval, and a third to respond to query requests, points out Mark Phillips, community and developer evangelism manager at Basho, the company behind the open-source database Riak (a tool that scales well for key-value storage).
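As a minimal sketch of the key-value approach Phillips describes, the snippet below stores and retrieves a single tweet by ID using the Python client for Riak. The connection settings, bucket name, and record layout are illustrative assumptions, not anything the library has published.

```python
# Minimal key-value sketch using the Python Riak client: store one tweet
# under its ID and fetch it back. Settings and schema are assumptions.

import riak

client = riak.RiakClient(protocol="pbc", host="127.0.0.1", pb_port=8087)
bucket = client.bucket("tweets")

# Store one tweet under its ID; Riak distributes keys across the cluster.
record = {
    "user": "example_user",
    "text": "Reading room is open late tonight.",
    "posted_at": "2010-06-01T12:00:00Z",
}
bucket.new("tweet:1234567890", data=record).store()

# Retrieval by key is a single lookup, largely independent of archive size.
fetched = bucket.get("tweet:1234567890")
print(fetched.data["text"])
```

The appeal of this split is that raw storage and key lookup stay cheap and horizontally scalable, while full-text search can be delegated to a separate, purpose-built tool.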

Big data management tools have grown into a thriving new industry, and users can choose between proprietary software and open-source solutions depending on their requirements and expected costs. The biggest question for the Library of Congress's technical staff is how to begin building and managing the whole system. If the library takes the open-source path, there is a wealth of database creation and management tools to choose from, from Hadoop clusters to Greenplum databases built for high input/output reads and writes, and these can be paired with Apache Solr, an open-source search tool. Open source gives developers a clear path: free access to the source code to build the ideal system on commodity hardware. But it also means devoting substantial human and material resources to back-end development. Of course, the Library of Congress could instead take the more expensive but more carefree path of proprietary software, buying database products directly from industry giants such as Oracle or SAP.
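To illustrate the open-source path, here is a hedged sketch of indexing a few tweets into Apache Solr and querying them with the pysolr client. The Solr URL, core name ("tweets"), and field names are assumptions made for the example, not a schema the Library of Congress has published.

```python
# Illustrative full-text indexing with Apache Solr via pysolr.
# Core name, URL, and field names are assumptions for this sketch.

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/tweets", timeout=10)

# Index a couple of fictional tweet documents.
solr.add([
    {"id": "1", "user_s": "example_user",
     "text_t": "Reading at the Library of Congress today.",
     "posted_dt": "2010-06-01T12:00:00Z"},
    {"id": "2", "user_s": "another_user",
     "text_t": "New archive dump released.",
     "posted_dt": "2011-02-15T08:30:00Z"},
])
solr.commit()

# Full-text search across the indexed tweets.
results = solr.search("text_t:library", rows=10)
for doc in results:
    print(doc["id"], doc["text_t"])
```

In a setup like this, a Hadoop or Greenplum layer would hold the raw archive and feed batches of documents into the search index, so that interactive queries never have to touch the bulk storage directly.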

Either way, the monstrous volume of data in the Twitter project remains hard to overcome. But Phillips's attitude is reassuring. He points out that although the archive has already reached 133TB and is still growing rapidly, Basho has worked with customers holding petabytes of data and has handled those workloads successfully on its own platform. As long as the Library of Congress tracks and summarizes the growth of its database capacity every month or quarter, and provisions adequate hardware for storage based on those figures, Basho's database software should be able to solve the library's problem.
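A small sketch of the month-by-month tracking Phillips recommends might look like this; the growth figures and cluster capacity below are invented placeholders, not actual Library of Congress measurements.

```python
# Illustrative capacity-planning loop for monthly archive tracking.
# All numbers are hypothetical placeholders.

monthly_archive_tb = [118.0, 123.5, 128.2, 133.2]  # month-end totals (TB)

# Average month-over-month growth
growth = [b - a for a, b in zip(monthly_archive_tb, monthly_archive_tb[1:])]
avg_growth_tb = sum(growth) / len(growth)

# Project a year ahead and flag when to order more hardware
current_tb = monthly_archive_tb[-1]
provisioned_tb = 200.0  # assumed usable capacity of the current cluster
for month in range(1, 13):
    projected = current_tb + avg_growth_tb * month
    if projected > provisioned_tb * 0.8:  # reorder at 80% utilization
        print(f"Month +{month}: projected {projected:.1f} TB -> plan expansion")
        break
```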

So would a cloud solution be a better fit? In theory, the Library of Congress could keep the data in a public cloud such as Amazon Web Services, letting AWS handle the necessary hardware expansion automatically as the Twitter archive grows. But Basho engineer Seth Thomas questions the long-term cost-effectiveness of such a scheme. Because the library clearly intends to keep the data permanently, a hybrid architecture may be more economical: keep the data stored locally and use cloud services for the analytics. The library would then pay only for the dynamic resources needed to answer queries, scaled to the volume of search requests, and the on-premises system would only have to handle the workload of serving that data.
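A minimal sketch of that hybrid pattern, assuming AWS as the cloud side: the archive stays on local storage, and a slice is staged to S3 only when an analytics request arrives, so compute and transfer are paid for per request. The bucket name, local paths, and the job-submission step are hypothetical.

```python
# Hybrid "store locally, analyze in the cloud" sketch using boto3.
# Bucket names, paths, and the analytics step are hypothetical.

import boto3

s3 = boto3.client("s3")

def stage_slice_for_analysis(local_path: str, bucket: str, key: str) -> str:
    """Upload one locally stored archive slice to S3 for on-demand analysis."""
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"

# Stage a (hypothetical) monthly partition when a researcher requests it,
# run the cloud analytics job against it, then remove the temporary copy.
uri = stage_slice_for_analysis("/archive/tweets/2010-06.tar.gz",
                               "loc-tweet-analysis", "slices/2010-06.tar.gz")
print("Analytics job can now read", uri)
# ... submit a Hadoop/EMR job against `uri` here ...
s3.delete_object(Bucket="loc-tweet-analysis", Key="slices/2010-06.tar.gz")
```

The design choice here is that the permanent copy never leaves the library; the cloud only ever holds transient working copies for the duration of a query, which is what keeps the long-term storage bill off the cloud provider's meter.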

In any case, the Library of Congress has decided to bring these tweets into its search system. As ordinary users, we should keep in mind that a message is recorded in the archive as soon as it is posted on Twitter.

Original link: http://www.networkworld.com/news/2013/010813-loc-tweets-265627.html?hpg1=bn

