Notes on social network-based Data Mining

Source: Internet
Author: User

Social networks have moved from fad to mainstream, and some suggest that the World Wide Web (WWW) will give way to a Giant Global Graph (GGG). Further, the Semantic Web is seen as the direction in which the web is heading.


The Natural Language Toolkit (NLTK) provides a large number of tools for text analysis, including calculation of common metrics, information extraction, and NLP. The simplest way to answer "what are people discussing?" is basic word frequency analysis. Graphviz is a mainstay of the visualization community. The DOT language is a simple text-based format used by Graphviz. Canviz lets you draw Graphviz diagrams on the <canvas> element of a web browser.
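As a rough illustration of basic word frequency analysis, here is a minimal sketch that uses only the Python standard library as a stand-in for NLTK's frequency tools. The function name `word_frequencies` and the naive regex tokenizer are assumptions made for this example, not anything from NLTK.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Return the top_n most common lowercase word tokens in text."""
    # Naive tokenization: runs of lowercase letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "Graphs are everywhere. Social graphs are graphs of people."
print(word_frequencies(sample, top_n=2))
```

A real analysis would usually remove stopwords first, since words like "are" otherwise crowd out the interesting terms.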


Microformats provide an effective mechanism for embedding "smarter data" into web pages and are easy for content authors to implement. Microformats are simply conventions for unambiguously including structured data in web pages in an entirely value-added way. Typical microformats include XFN (http://gmpg.org/xfn), geo, hRecipe, and hReview. Among them, geo is particularly worth noting: KML output may be the simplest way to visualize geo data.
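To make the geo-to-KML idea concrete, the following is a minimal sketch that pulls latitude/longitude pairs out of geo microformat markup and wraps them in KML placemarks. It assumes the simplest form of the markup, where elements with class "latitude" and "longitude" contain the numbers as text. `GeoParser` and `to_kml` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class GeoParser(HTMLParser):
    """Collect (latitude, longitude) pairs from simple geo microformat markup."""

    def __init__(self):
        super().__init__()
        self.points = []     # completed (lat, lon) pairs
        self._field = None   # "latitude" or "longitude" while inside that element
        self._current = {}   # fields collected for the point in progress

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in ("latitude", "longitude"):
            if name in classes:
                self._field = name

    def handle_data(self, data):
        if self._field is None:
            return
        self._current[self._field] = float(data.strip())
        self._field = None
        if len(self._current) == 2:
            self.points.append(
                (self._current["latitude"], self._current["longitude"])
            )
            self._current = {}


def to_kml(points):
    """Render (lat, lon) pairs as a KML document; KML orders coordinates lon,lat."""
    placemarks = "".join(
        f"<Placemark><Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        for lat, lon in points
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        f"{placemarks}</Document></kml>"
    )


html = ('<span class="geo"><span class="latitude">37.5</span>; '
        '<span class="longitude">-122.25</span></span>')
parser = GeoParser()
parser.feed(html)
print(to_kml(parser.points))
```

Note the coordinate order: the microformat lists latitude first, while KML expects longitude first.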


The BeautifulSoup package can be used to implement simple web crawling. The two criteria for evaluating a crawling algorithm are performance and quality. The open source SocialGraph Node Mapper project standardizes URLs with respect to details such as the presence of "www".
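URL standardization matters to a crawler because the same page can be reached under several spellings. The sketch below shows the general idea with the standard library. It is not the Node Mapper's actual rule set: the function name `normalize_url` is hypothetical, and the trailing-slash rule is an extra assumption beyond the "www" case mentioned above.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of url so equivalent spellings compare equal."""
    parts = urlsplit(url)
    host = parts.hostname or ""      # urlsplit lowercases the hostname
    if host.startswith("www."):
        host = host[4:]              # treat www.example.com and example.com alike
    if parts.port:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"   # assumption: ignore trailing slashes
    # Drop the fragment, which never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


print(normalize_url("http://WWW.Example.com/blog/"))
```

A crawler can keep a set of normalized URLs to avoid fetching the same page twice.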


Captured data can be imported into CouchDB, a document-oriented database that provides map/reduce functionality for indexing data. It also exposes a fully REST-based interface, so it can be integrated into any web architecture, and its easy-to-use replication lets others copy and analyze your database. CouchOne provides binary downloads and Cloudant provides online hosting.
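The map/reduce style of indexing can be sketched in plain Python. This only models the concept; it does not use CouchDB's API or its view syntax, and the names `map_doc`, `reduce_values`, `run_view`, and the "tags" field are made up for the illustration.

```python
from collections import defaultdict


def map_doc(doc):
    """Map step: emit one (key, value) pair per tag in a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_view(docs):
    """Apply the map step to every document, group by key, then reduce."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(key, values)
            for key, values in sorted(grouped.items())}


docs = [
    {"_id": "1", "tags": ["oauth", "gmail"]},
    {"_id": "2", "tags": ["oauth"]},
    {"_id": "3"},
]
print(run_view(docs))
```

The result is an index from key to aggregated value, here a count of documents per tag.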


Lucene is a Java-based, high-performance full-text indexing and search library that lets you build keyword search into applications. The couchdb-lucene project is a web service wrapper around Lucene's core functionality that can index CouchDB documents.


SIMILE Timeline is an easy-to-use yet powerful tool for visualizing event-centric data, and it is especially useful for studying mail data. getmail, poplib, and imaplib are all good mail-oriented Python tools, and the Graph Your Inbox Chrome extension allows authorized access to mail data.


OAuth 2.0 is an emerging authorization scheme that promotes a better user experience: client applications are authorized to access protected resources without users handing over their user names and passwords.


Redis is a data structure server that is fast, easy to install, and has a well-documented Python client. "Redis: Under the Hood" is a good article. Redis provides native operations on common collection types.


Infochimps is an organization that provides a large data catalog and a "Strong Links" API for Twitter measurement and analysis. Ubigraph is a 3D interactive graph visualization tool with Python bindings.


A fundamental aspect of human intelligence is the urge to classify things and derive hierarchies. A taxonomy is essentially a hierarchical structure that classifies elements into parent/child relationships. Folksonomy is a term used to describe the universe of collaborative tagging and social indexing efforts that emerge in various ecosystems of the web. In essence, it describes the decentralized universe of tags that emerges as a mechanism of collective intelligence. For more information about how to find commonalities, see


A tag cloud is the most obvious choice for visualizing entities extracted from social data. The open source rotating tag cloud WP-Cumulus is a good option. Kevin Hoffman's paper "In Search of the Perfect Tag Cloud" ( B /insearchofperfecttagcloud) provides a good overview of various design strategies for building a tag cloud.


LinkedIn firmly believes that personal professional network data is private. Once you obtain authorization credentials, you can use the LinkedIn API to mine the full richness of the available data, although the API also enforces rate limits.


Intelligent clustering can make for a remarkable user experience. Common similarity metrics used for clustering:

1) edit distance

2) n-gram similarity: compute all possible n-grams of two strings and score their similarity by counting the n-grams they have in common.

3) Jaccard distance: measures how dissimilar two sets are. The Jaccard similarity is the number of items the two sets have in common divided by the total number of distinct items in the two sets; the distance is one minus that value.

4) MASI distance: /~ Becky/pubs/lrec06masi.pdf
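The first three metrics in the list are short enough to sketch in plain Python. This is an illustrative implementation, not NLTK's; the function names are chosen for this example, and the n-gram score here is simply the count of shared character n-grams. MASI is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def ngrams(s, n=2):
    """Set of all character n-grams in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def common_ngrams(a, b, n=2):
    """Number of distinct n-grams the two strings share."""
    return len(ngrams(a, n) & ngrams(b, n))


def jaccard_distance(x, y):
    """1 - |x & y| / |x | y|; defined as 0.0 for two empty sets."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0
    return 1 - len(x & y) / len(union)


print(edit_distance("kitten", "sitting"))       # 3
print(common_ngrams("night", "nacht"))          # 1 (only "ht" is shared)
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
```

Edit distance and n-gram overlap compare strings, while Jaccard distance compares sets, so the latter fits naturally when each item is a bag of tokens such as the words in a job title.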

The greedy clustering approach is based mainly on the MASI measure. Hierarchical clustering computes the full matrix of distances between all items, then walks the matrix and clusters the items that meet a minimum distance threshold. k-means clustering places n points in a multidimensional space and partitions them into k clusters.
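A greedy, threshold-based clustering pass might look like the sketch below. Two assumptions to flag: it swaps in Jaccard distance for MASI to keep the example short, and it compares each item only against the first member of each existing cluster. `greedy_cluster` and the sample titles are hypothetical.

```python
def jaccard_distance(x, y):
    """1 - |x & y| / |x | y| for two sets."""
    union = x | y
    return 1 - len(x & y) / len(union) if union else 0.0


def greedy_cluster(items, threshold=0.5):
    """Put each item in the first cluster whose representative is close enough."""
    clusters = []
    for item in items:
        tokens = set(item.lower().split())
        for cluster in clusters:
            representative = set(cluster[0].lower().split())
            if jaccard_distance(tokens, representative) <= threshold:
                cluster.append(item)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([item])
    return clusters


titles = [
    "Chief Executive Officer",
    "Chief Technology Officer",
    "Software Engineer",
    "Senior Software Engineer",
]
for cluster in greedy_cluster(titles):
    print(cluster)
```

The result depends on input order, which is the usual price of a greedy approach compared with computing the full distance matrix.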


k-means clustering is generally well suited to clustering geographic information. Google Earth provides a geocoder. The Dorling cartogram in Protovis is essentially a bubble chart of geographic clusters. The open source project geodict is a good attempt at studying geo data.
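For intuition, here is a minimal k-means sketch over (latitude, longitude) pairs. It makes simplifying assumptions: centroids are seeded with the first k points so the run is deterministic, and distance is plain Euclidean distance on the raw coordinates, which ignores the curvature of the Earth. The function name and sample points are illustrative.

```python
def kmeans(points, k, iterations=10):
    """Partition 2-D points into k clusters; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]   # deterministic seeding
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                   # converged
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

In practice the seeding strategy matters a great deal; random restarts or smarter initialization are common refinements.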


Google Buzz, which sits somewhere between Twitter and blogs, provides RESTful APIs. Zipf's law, an empirical law of natural language, asserts that the frequency of a word in a corpus is inversely proportional to its rank in the frequency table. ('s_law) The Brown Corpus is a reasonable starting point.
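One way to eyeball Zipf's law on a corpus is to build a rank/frequency table and look at the product of rank and frequency, which the law predicts should stay roughly constant. The sketch below is a hypothetical helper for that check; the toy token list is far too small to show the effect and is only there to make the example runnable.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


tokens = "a a a a b b c".split()
for row in rank_frequency(tokens):
    print(row)
```

On a large corpus, plotting frequency against rank on log-log axes gives a clearer picture than the raw table.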


TF-IDF (term frequency-inverse document frequency) lets you query a corpus by computing a normalized score that expresses the relative importance of a term in a document. It is the product of term frequency and inverse document frequency: TF-IDF = TF * IDF. TF indicates the importance of a term in a specific document, and IDF indicates the importance of a term across the entire corpus.
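The product TF * IDF can be computed in a few lines. The sketch below uses one common variant, which is an assumption since several weightings exist: TF is the term count divided by the document length, and IDF is the natural log of the corpus size divided by the number of documents containing the term. Documents are represented as lists of tokens.

```python
import math


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to term."""
    return doc.count(term) / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = [
    ["social", "graph", "mining"],
    ["social", "web", "data"],
    ["graph", "databases"],
]
for term in ("social", "mining"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

A term that appears in every document gets an IDF of zero under this variant, so it contributes nothing to the score no matter how often it occurs.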


The TF-IDF model treats a document as an unordered collection of words. Another way to model documents is the vector space model: each document is represented as a vector in a multidimensional space, and the distance between two vectors indicates the similarity of the corresponding documents. To compute the similarity between two documents, you only need to generate a term vector for each document and compute the dot product of their unit vectors. That dot product is the cosine similarity, which makes comparing documents easy.
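Here is a minimal sketch of cosine similarity between two tokenized documents. As a simplifying assumption, the vector components are raw term counts rather than TF-IDF weights; swapping in weighted vectors would not change the structure of the calculation.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-count vectors of two documents."""
    va, vb = Counter(doc_a), Counter(doc_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = "mining the social web".split()
b = "mining social data".split()
print(round(cosine_similarity(a, b), 4))
```

Dividing by the two norms is equivalent to taking the dot product of the unit vectors, so document length does not affect the score.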


The xoauth tool provides access to Gmail: xoauth.py can generate an OAuth token and secret for an anonymous consumer. Dumbo is a project that lets you write and run Hadoop programs in Python. Scrapy is an easy-to-use yet sophisticated web scraping and crawling framework.


A typical NLP pipeline using NLTK:

1) End-of-sentence (EOS) detection

2) Tokenization

3) Part-of-speech (POS) tagging

4) Chunking

5) Extraction

Sentences can be parsed with regular expressions. For details, see "Unsupervised Multilingual Sentence Boundary Detection" ( /~ Strunk/ks2005final.pdf). The premise of Luhn's summarization algorithm is that the important sentences in a document are the ones that contain frequently occurring words. Luhn's algorithm does not try to understand the data at a deeper semantic level. Entity-centric analysis of sentences can draw on the Penn Treebank tag set. Word stemming in NLTK can help when analyzing semantic triples, and WordNet can be used to find additional meanings for the items in a triple.
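The premise behind Luhn's approach can be sketched as follows. This is a deliberately simplified version, not Luhn's full algorithm: it scores each sentence by how many of the document's most frequent words it contains, and it skips the word-clustering step of the original. The regex sentence splitter, the tiny stopword list, and all names are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "that"}


def split_sentences(text):
    """Naive EOS detection: split after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    return re.findall(r"[a-z']+", text.lower())


def summarize(text, top_words=5, num_sentences=1):
    """Return the num_sentences highest-scoring sentences in document order."""
    sentences = split_sentences(text)
    content = [w for w in words(text) if w not in STOPWORDS]
    important = {w for w, _ in Counter(content).most_common(top_words)}

    def score(sentence):
        return sum(1 for w in words(sentence) if w in important)

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


text = "Graphs model networks. Social graphs model people as graphs. Birds sing."
print(summarize(text, top_words=2))
```

Because the score is purely frequency-based, the summary can miss sentences that are important for reasons a word count cannot see.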


Facebook applications must be hosted in your own server environment, and development builds on the Open Graph protocol. Interactive visualizations can use the JavaScript InfoVis Toolkit. A sunburst is a space-filling visualization of hierarchical structures such as trees.


Web 3.0 appears to be the Semantic Web, and FuXi is a powerful logical reasoning system for the Semantic Web. It uses a technique called forward chaining to deduce new information from existing information.
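Forward chaining itself is simple to illustrate. The sketch below is a toy propositional version and says nothing about FuXi's API or internals: facts are plain strings, each rule is a hypothetical (premises, conclusion) pair, and rules are applied repeatedly until no new fact can be derived.

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from the starting facts via the rules."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule when all of its premises are already known.
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


rules = [
    ({"socrates is a man"}, "socrates is mortal"),
    ({"socrates is mortal"}, "socrates will die"),
]
print(sorted(forward_chain({"socrates is a man"}, rules)))
```

The loop starts from the data and works toward conclusions, in contrast to backward chaining, which starts from a goal and searches for supporting facts.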


Do not spoil what you have by desiring what you have not; remember that what you now have was once among the things you only hoped for.
