Handling terabytes of data with ease: open source GraphLab pushes the limits of graph computing

Source: Internet
Author: User

Graph data processing used to be the preserve of data scientists. As data applications spread and big data analysis becomes an essential part of the analytics field, there is a growing need for simple, accessible graph analysis tools. GraphLab is a very popular open source project whose developers keep pushing graph computing forward so that it can meet the demands of processing massive datasets. SFrame made a low-key, almost secretive debut, but its capabilities should not be underestimated: it extends GraphLab to tabular data so that users can easily manage terabytes of data.

Social media graphs have attracted the attention of many companies, and similar datasets exist in many other areas, such as health sciences, security, and financial services. Graph data calls for specialized tools and techniques, which in the past were too complex for the average user and remained the preserve of data scientists. Fortunately, graph data analysis has attracted many enthusiastic entrepreneurs and developers, and the tools have improved considerably and are becoming simpler to use.

There are many examples of machine learning applied to graph data analysis, such as finding influential users (PageRank) and communities, fraud detection, and recommender systems (collaborative filtering is especially popular among GraphLab users). Tools developed for one field are often applied to others. Besides GraphLab, distributed graph analysis tools include Giraph, GraphX, Faunus, and Grappa, and graph databases such as Neo4j and YarcData also come with some analytic functions.
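To make the "influential users" example concrete, here is a minimal sketch of PageRank by power iteration in plain Python. It is not GraphLab's implementation; the function name, the damping factor of 0.85, the iteration count, and the small follower graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of outgoing links."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                # Each node passes an equal share of its rank along every outgoing link.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical follower graph: an edge u -> v means "u follows v".
follows = {
    "ann": ["cat"],
    "bob": ["cat"],
    "cat": ["dan"],
    "dan": ["cat"],
}
ranks = pagerank(follows)
most_influential = max(ranks, key=ranks.get)
```

In this toy graph the user followed by the most accounts ends up with the highest score. A distributed engine performs the same kind of iteration, but partitions the graph across machines.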

Recently, the founding of a new company has given the open source GraphLab project a major boost. The company was set up by GraphLab's developers and has raised funding to build analysis tools for graph datasets. The GraphLab company will continue to use open source GraphLab to "break the limits of computation and strive to innovate."

GraphLab's SFrame is an interesting tool that was unveiled quietly, almost secretively, at Strata Santa Clara. It is a disk-based, two-dimensional table structure that extends GraphLab to tabular data. With SFrame, users can apply many of GraphLab's algorithms to data held in either graphs or tables. More importantly, SFrame widens GraphLab's coverage of the data science workflow: users can clean terabytes of data and create new features directly in GraphLab. SFrame performance scales linearly as cores are added.

GraphLab is reportedly working to integrate its engine with YARN. In the meantime, the SFrame beta can already read data from local disk, HDFS, S3, or a URL, and save it either as human-readable .csv or in a more efficient native format. Once an SFrame has been created and saved to disk, the data does not need to be processed again. In Python, the typical workflow is to read a .csv file into an SFrame, create a new feature column, and save the result to S3.
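The read, transform, and save workflow described above can be sketched as follows. This stand-in uses only Python's standard `csv` module, not the SFrame API, which it does not attempt to reproduce. The file names, the columns, and the `rating_squared` feature are hypothetical, and the result is written to a local temporary directory rather than to S3.

```python
import csv
import os
import tempfile


def add_feature(src_path, dst_path):
    """Read a CSV row by row, add a derived column, and write a new CSV."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames) + ["rating_squared"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        rows = 0
        for row in reader:
            # New feature derived from an existing column (hypothetical example).
            row["rating_squared"] = float(row["rating"]) ** 2
            writer.writerow(row)
            rows += 1
    return rows


# Create a tiny sample file so the sketch is self-contained.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "ratings.csv")
dst_path = os.path.join(workdir, "ratings_with_feature.csv")
with open(src_path, "w", newline="") as f:
    f.write("user,item,rating\nann,book,4\nbob,film,5\n")

rows_written = add_feature(src_path, dst_path)
```

Because rows are read and written one at a time, memory use stays flat however large the input file is. That is the same basic idea behind a disk-based table, although SFrame adds a columnar native format and parallel execution on top.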

GraphLab Create is designed for software engineers and data scientists who want to build data products such as recommender systems. Even people unfamiliar with machine learning can get started quickly, and seasoned developers can save a lot of time.

With GraphLab Create, you can build data products or analyze data with machine learning and graph analysis methods: connect to your data, transform it through an iterative hierarchical model, easily evaluate model and system performance, and run applications on your own machine or on instances in AWS.

SFrame is part of GraphLab Create, a Python package due for release in March that simplifies building scalable analytics products (such as recommender systems and graph analysis systems). With GraphLab Create, users will be able to build and maintain analytics pipelines from within Python or IPython and deploy them on a single server or an entire cluster (on premises or in the cloud).

In the past, GraphLab was seen as scalable and fast, but difficult to use and limited in scope. Over the past few months, GraphLab has tackled both of these issues, and the tools it is developing should significantly increase its appeal to data scientists. Integration with IPython opens GraphLab's fast, scalable analysis modules to the PyData community (an end-to-end recommender in six lines of Python). SFrame and GraphLab Create extend the data science workflow to cover data transformation and data ingestion.

Before graph tools can be used for analysis, the data has to be converted into a graph. GraphBuilder is an open source project from Intel that uses Hadoop MapReduce to build graphs from large datasets. Other options include the combination of GraphX and Spark, or the general-purpose data wrangling tool developed by a startup called Trifacta.
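As a small illustration of the conversion step, the sketch below turns tabular records into a directed adjacency list in plain Python. It is not how GraphBuilder or GraphX work internally, since those tools distribute the job across a cluster. The function name, column names, and sample rows are assumptions made for the example.

```python
def build_graph(rows, source_key, target_key):
    """Turn tabular records into a directed adjacency list (node -> set of neighbours)."""
    graph = {}
    for row in rows:
        source, target = row[source_key], row[target_key]
        graph.setdefault(source, set()).add(target)
        # Register the target too, so nodes without outgoing edges still appear.
        graph.setdefault(target, set())
    return graph


# Hypothetical table of "who follows whom" records.
table = [
    {"follower": "ann", "followee": "cat"},
    {"follower": "bob", "followee": "cat"},
    {"follower": "cat", "followee": "dan"},
    {"follower": "ann", "followee": "cat"},  # duplicate row collapses into one edge
]
graph = build_graph(table, "follower", "followee")
edge_count = sum(len(targets) for targets in graph.values())
```

Storing neighbours in a set removes duplicate rows automatically. A real pipeline would usually also clean identifiers and attach edge or vertex properties at this stage.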

Because SFrames resemble pandas (PyData) and R data frames, data scientists can pick them up quickly and become productive. As for why SFrames attracted so much attention from Strata attendees, I think it is because they scale to larger datasets: SFrame lets users work with large tabular datasets without being limited by memory size.
