Get more data

Source: Internet
Author: User
Tags: representational state transfer, xignite


In many cases, existing data is enough to build valuable intelligence into your own applications. In other cases, however, building intelligent application components requires access to external information. Figure 1.6 shows the mashup website HousingMaps (http://www.housingmaps.com). By combining housing data from Craigslist (http://www.craigslist.com) with Google's map service (http://code.google.com/apis/maps/index.html), the site lets users view houses for sale or rent in any geographic region.


Similarly, a news website can combine news and events with location information obtained from a map. This is an improvement for any application, but it does not make the application smart unless the application processes the information obtained from the map intelligently.

Maps are a good example of external information, but much other information is available on the network. Let's see how to obtain it.

Crawling and screen scraping

A crawler, also known as a spider, is a program that retrieves public content from the Internet. A crawler typically starts from a list of URLs and follows every link on each page it visits. This process repeats, and the number of repetitions is called the crawl depth. After visiting a page, the crawler stores its content locally for further processing. Large amounts of data can be collected this way, but storage and copyright problems quickly arise, so be cautious when collecting data. In Chapter 2 we will show the implementation of a crawler, and the appendix gives an overview of web crawling that covers our own crawler and several open-source implementations.
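The crawling loop just described can be sketched in a few lines. This is a minimal illustration only, not the Chapter 2 implementation; the `fetch` and `extract_links` functions are injected (all names here are invented), which also lets the sketch run without network access.

```python
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, depth):
    """Breadth-first crawl: visit each URL, store its content locally,
    and follow every discovered link, repeating `depth` times."""
    visited = {}                       # url -> page content (local storage)
    frontier = list(seed_urls)
    for _ in range(depth):             # crawl depth = number of repetitions
        next_frontier = []
        for url in frontier:
            if url in visited:
                continue               # avoid re-fetching the same page
            content = fetch(url)
            visited[url] = content
            for link in extract_links(url, content):
                next_frontier.append(urljoin(url, link))
        frontier = next_frontier
    return visited
```

With a real `fetch` (e.g. one built on `urllib.request`), the storage and politeness concerns mentioned above would dominate the design; here they are deliberately omitted.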

Screen scraping means retrieving information from HTML web pages. It sounds simple but is actually quite tedious. For example, to build a search engine dedicated to dining out (such as http://www.foodiebytes.com), the first task is to obtain the menu from each restaurant's web page. Screen scraping can benefit from the techniques introduced in this book. In the restaurant search engine example, you want to rate each restaurant based on the comments of diners who have eaten there. Sometimes a numerical score is available, but more often the diners' comments are written in natural language. Reading those comments one by one and ranking the restaurants accordingly is obviously a poor solution. Applying intelligent techniques during scraping makes it possible to classify reviews automatically and score each restaurant accordingly; Boorah (http://www.boorah.com) is an example.
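At its core, screen scraping means parsing the page markup and pulling out the fragments you care about. A minimal sketch using Python's standard `html.parser` follows; the page markup and the assumption that menu entries appear as list items are both invented for illustration.

```python
from html.parser import HTMLParser

class MenuScraper(HTMLParser):
    """Collects the text of every <li> element on a page --
    a crude stand-in for scraping a restaurant's menu list."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

scraper = MenuScraper()
scraper.feed("<html><ul><li>Pad Thai</li><li>Green Curry</li></ul></html>")
# scraper.items now holds the menu entries found on the page
```

Real pages are far messier than this, which is exactly why the paragraph above calls scraping tedious: each site's markup needs its own extraction rules.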

RSS feeds

Website syndication is another way to obtain external data, and one that does not require repeated crawler visits. Structured feed content is generally more machine-friendly than ordinary web pages. There are three common feed formats: RSS 1.0, RSS 2.0, and Atom.

As its name implies, RDF Site Summary (RSS) 1.0 derives from the Resource Description Framework [1] (RDF), whose main purpose is to make information on the web understandable by both machines and people. Humans can infer the meaning of content (what a word or phrase means in a specific context), which is hard for machines to do. RDF was introduced to attach semantic descriptions to content on the web, so that useful data can be extracted according to specific requirements. The RSS 1.0 specification can be found at http://web.resource.org/rss/1.0/.

Really Simple Syndication (RSS 2.0) grew out of Netscape's Rich Site Summary 0.91, at the very least reusing its original abbreviation, and it aims to simplify the RDF-based format. It uses an XML-based syndication language without XML namespaces or RDF references. Today almost all mainstream websites provide RSS 2.0 feeds, which are typically free for non-commercial use by individuals and non-profit organizations. Yahoo!'s RSS feed site (http://developer.yahoo.com/rss) offers a good introduction, and you can find the RSS 2.0 specification and other related information at http://cyber.law.harvard.edu/rss.

Finally, there is Atom-based syndication. Shortcomings in RSS 2.0 led the Internet Engineering Task Force (IETF) to develop the standard described in RFC 4287 (http://tools.ietf.org/html/rfc4287). Atom is not based on RDF; it offers both RSS 1.0's flexibility and RSS 2.0's simplicity. It is essentially a compromise among the existing standards, designed with backward compatibility in mind, and it is now as popular as RSS 2.0, supported by most web aggregators (such as Yahoo! and Google). For more information about the Atom syndication format, see http://www.ibm.com/developerworks/XML/standards/x-atomspec.html on IBM developerWorks.
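Whatever the format, a feed is just XML, which is why feeds are more machine-friendly than scraped pages. A sketch of reading item titles from an RSS 2.0 document with Python's standard XML parser (the feed content below is invented):

```python
import xml.etree.ElementTree as ET

RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)
# In RSS 2.0 every <item> lives under <channel>, and no XML
# namespaces are involved, so a bare tag name is enough.
titles = [item.findtext("title") for item in root.iter("item")]
```

Parsing RSS 1.0 or Atom works the same way, except that those formats place their elements in XML namespaces, so the tag lookups must be namespace-qualified.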

RESTful services

REST (Representational State Transfer) was proposed by Roy T. Fielding [2] in his doctoral dissertation. It is a software architecture style for building distributed, hypermedia applications. REST is a stateless client/server architecture that maps each service onto a URL. If your non-functional requirements are not complex and you do not need a formal contract with the service provider, REST is a good option for easily accessing the many services available on the Internet. For more information about this important technology, see RESTful Web Services, by Leonard Richardson and Sam Ruby.

Many websites provide RESTful services that you can use in your own applications. Digg offers an API that accepts REST requests (http://apidoc.digg.com/) and returns responses in several formats, including XML, JSON, JavaScript, and serialized PHP. Through this API you can obtain stories, users, friends, or fans that match various criteria.
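Because a RESTful service is just a URL plus query parameters, building a request in any language is trivial. The sketch below composes a GET URL in the general style of such APIs; the host, endpoint path, and parameter names are hypothetical and not taken from any provider's actual documentation.

```python
from urllib.parse import urlencode

def build_rest_url(base, endpoint, **params):
    """Compose a REST GET request URL from a base host, an endpoint
    path, and query parameters (sorted for a stable result)."""
    query = urlencode(sorted(params.items()))
    return "%s/%s?%s" % (base.rstrip("/"), endpoint.strip("/"), query)

# Hypothetical endpoint and parameters, for illustration only.
url = build_rest_url("http://services.digg.com", "stories/popular",
                     appkey="http://example.com", count=5, type="json")
```

The resulting URL can be fetched with any HTTP client (e.g. `urllib.request.urlopen`), and the response parsed as XML or JSON depending on the `type` requested.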

The Facebook API is also a RESTful interface, so any programming language can communicate with this exciting platform; all you need to do is send an HTTP GET or POST request to the Facebook API REST server. The Facebook API is well documented and will be used later in this book. For more information, see http://wiki.developers.facebook.com/index.php/api.

Web Services

Web services are APIs for communication between applications. There are many web service frameworks, most of them open source. Apache Axis (http://ws.apache.org/axis/) is an open-source implementation of the Simple Object Access Protocol (SOAP), a protocol for "exchanging structured and typed information between peers in a decentralized, distributed environment." [3] Apache Axis is a very popular framework and was completely redesigned for version 2. Apache Axis2 supports SOAP 1.1 and SOAP 1.2, as well as the widely popular REST style of web services, among many other features.

Another noteworthy Apache project is Apache CXF (http://incubator.apache.org/cxf/), the result of merging IONA's Celtix with Codehaus's XFire. Apache CXF supports the following standards: JAX-WS 2.0, JAX-WSA, JSR-181, SAAJ, SOAP 1.1 and 1.2, WS-I Basic Profile, WS-Security, WS-Addressing, WS-RM, WS-Policy, and WSDL 1.1 and 2.0. It also supports multiple transports, bindings, and formats. If you plan to use web services, you should take a look at this project.
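To make the SOAP exchange concrete: a framework like Axis2 or CXF ultimately wraps each call in an XML envelope like the one built below. This is a hand-rolled sketch of a SOAP 1.1 message; the `GetQuote` operation and its namespace are hypothetical, standing in for whatever the service's WSDL would define.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace, as used in the SOAP specification.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"

def soap_envelope(operation, ns, params):
    """Build a SOAP 1.1 envelope whose Body holds one operation
    call with the given named parameters."""
    env = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(env, "{%s}Body" % SOAP_ENV)
    op = ET.SubElement(body, "{%s}%s" % (ns, operation))
    for name, value in params.items():
        ET.SubElement(op, "{%s}%s" % (ns, name)).text = str(value)
    return ET.tostring(env, encoding="unicode")

# Hypothetical stock-quote operation, for illustration only.
msg = soap_envelope("GetQuote", "http://example.com/stock", {"symbol": "IBM"})
```

In practice you would never build these envelopes by hand; the point of frameworks such as Axis2 and CXF is to generate this "structured and typed" message exchange from a service description.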

In addition to these web service frameworks, there are even more web service providers. Almost every company uses web services to integrate applications that differ in function and in the technologies they use; this may be the result of mergers, or of poorly coordinated development within a large company. Nearly every large financial institution uses web services for seamless integration. Xignite (http://preview.xignite.com/Default.aspx) provides a variety of financial web services, and software giants such as SAP, Oracle, and Microsoft also support web services. In short, web-service-based integration is everywhere, and as the dominant integration technology it is an important building block in the design of intelligent applications.

By now you should have some ideas about how to improve your own applications, or perhaps new ideas for your next exciting startup. We have made sure that all the necessary data can be obtained, or at least accessed. Now let's look at the kinds of intelligence we will add to applications and how they relate to other concepts you may already know.


This article is excerpted from Algorithms of the Intelligent Web.

Book details: http://blog.csdn.net/broadview2006/article/details/6702401


 

 

 
