Semantic Web and semantic grid

The rapid development of the Internet has gradually exposed its shortcomings, such as the monotonous functionality of Web pages and the low intelligence of search engines. The reason is that most Web content is designed to be read by people rather than manipulated by programs according to its meaning. A computer can skillfully parse a page's layout and find the title and the links to other pages, but it cannot tell a personal home page from a weather forecast: it has no reliable way to handle semantics, and therefore no way to intelligently understand the content of a Web page or act on it.

The Semantic Web is an attempt to make up for this shortcoming by extending Web pages with semantic information that computers can process. On the Semantic Web, resources of all kinds are explicitly annotated with semantic information that computers can distinguish, recognize, and automatically interpret, exchange, and process. The Semantic Web and artificial intelligence are nevertheless two different things: its research objects and methods differ from those of traditional natural language processing. It extends the existing Web with semantics so that content can be understood and processed by computers, turning the Web into an intelligent network that can "understand" human information. Initial efforts to integrate the Semantic Web into today's Web architecture are already under way. In the near future, as machines gain more power to process and "understand" data, we will see many important new capabilities. For example, if someone wants to attend a seminar, a computer could automatically work out the best schedule and route and book a hotel.

Tim Berners-Lee, the inventor of the World Wide Web, introduced the concept and architecture of the Semantic Web in 2000.

In this architecture, the first layer is Unicode and URI, the foundation of the entire Semantic Web: Unicode handles the encoding of resources, and URI (Uniform Resource Identifier) identifies them. The second layer is XML plus namespaces plus XML Schema, used to represent the content and structure of data. The third layer is RDF plus RDF Schema, used to describe resources and their types. The fourth layer is the ontology vocabulary, used to describe the relationships among resources. The fifth layer is logic, which performs logical reasoning on top of the four layers below. The sixth layer is proof, which validates logical statements. The seventh layer is trust, which establishes trust relationships between users.

The second and third layers together form the key part of the Semantic Web: they express the semantics of Web information and are the current focus of Semantic Web research. XML (eXtensible Markup Language) lets anyone create their own tags to annotate a page or parts of its text. Scripts or programs can make use of these tags in sophisticated applications, but the program writer has to know how the page author uses each tag. In short, XML allows users to add arbitrary structure to their documents, but it says nothing about what that structure means. The basic building block of RDF (Resource Description Framework) is the object-attribute-value triple, which plays the role of subject, verb, and object in a sentence; these triples can be written in XML syntax. Describing the vast amounts of data processed by machines with this structure is very natural. RDF Schema is a vocabulary of properties and classes for describing RDF resources, providing semantics for hierarchies of those properties and classes.
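
The object-attribute-value triples described above map directly onto RDF toolkits. Below is a minimal sketch using the rdflib Python library (one common choice, not prescribed by the text); the http://example.org/ resource is invented for illustration, while foaf:name and foaf:Person come from the standard FOAF vocabulary.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")  # hypothetical namespace for this sketch

g = Graph()
g.bind("foaf", FOAF)

# One triple: subject ex:alice, attribute foaf:name, value "Alice"
g.add((EX.alice, FOAF.name, Literal("Alice")))
# A second triple giving the resource a type (its "class")
g.add((EX.alice, RDF.type, FOAF.Person))

# As the text notes, the triples can be written out in XML syntax
print(g.serialize(format="xml"))
```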

Because two systems may use different identifiers for the same concept, or the same identifier for different meanings, a program that wants to compare or merge information from two databases must be able to tell when two identifiers refer to the same thing. One solution to this problem is the ontology. An ontology is an explicit description of a conceptualization, consisting of a taxonomy and a set of inference rules. The taxonomy defines classes of objects and the relations among them, so that a large number of relationships between entities can be expressed, and from the inference rules programs can draw conclusions automatically. Simply put, a shared dictionary is defined between different systems so that they agree on the entities and the relationships among them and can therefore communicate and share information.
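
To make the idea concrete, here is a plain-Python sketch (with an invented vocabulary, not taken from the text) of how a small taxonomy plus an equivalence mapping lets a program decide that identifiers from two systems denote the same concept, and then apply a simple inference rule over the class hierarchy.

```python
# Taxonomy: each class points to its parent class (None for the root).
taxonomy = {
    "Publication": None,
    "Article": "Publication",
    "JournalArticle": "Article",
}

# Equivalence mapping between identifiers used by two different systems.
equivalent = {
    "sysA:paper": "JournalArticle",         # system A's term
    "sysB:journal_item": "JournalArticle",  # system B's term
}

def is_a(cls, ancestor):
    """Inference rule: cls is an ancestor if reachable by following parents."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = taxonomy.get(cls)
    return False

# Both systems' identifiers map to the same class, so a program can merge them.
a = equivalent["sysA:paper"]
b = equivalent["sysB:journal_item"]
print(a == b)                   # True: same concept
print(is_a(a, "Publication"))   # True: inferred from the class hierarchy
```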

The Semantic Web needs a way to formally describe the meaning of terms in Web documents. DAML+OIL (the DARPA Agent Markup Language plus the Ontology Inference Layer) and OWL (Web Ontology Language) are important extensions and refinements built on the RDF and RDF Schema standards, and they are ontology languages grounded in knowledge representation from artificial intelligence. They provide a natural way to describe class and subclass relationships among terms used on the Web, together with restrictions on the relationships between classes (or between subclasses). Compared with RDF Schema, they add more vocabulary for describing properties and classes, such as disjointness between classes, equivalence, richer property types, property characteristics, and so on.
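
The sketch below, again using rdflib and invented example.org class names, shows the kind of vocabulary OWL adds beyond RDF Schema: declaring two classes equivalent and two classes disjoint, alongside an ordinary subclass statement.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/schema#")  # hypothetical vocabulary
g = Graph()

# Two classes from different vocabularies declared equivalent
g.add((EX.Lecturer, RDF.type, OWL.Class))
g.add((EX.Teacher, RDF.type, OWL.Class))
g.add((EX.Lecturer, OWL.equivalentClass, EX.Teacher))

# Two classes declared disjoint: nothing can be an instance of both
g.add((EX.Student, RDF.type, OWL.Class))
g.add((EX.Student, OWL.disjointWith, EX.Teacher))

# A subclass relationship, already expressible in RDF Schema
g.add((EX.Lecturer, RDFS.subClassOf, EX.Person))

print(g.serialize(format="turtle"))
```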

Of course, this alone is far from enough to realize the Semantic Web; the main technical challenge is to let computers do more "thinking" and "inference". For the Semantic Web to work, computers must have access to structured collections of information and to sets of inference rules with which to reason automatically. Adding logic, that is, using rules to reason, to choose how to act, and to answer questions, is a task now facing the Semantic Web community.
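
As a toy illustration of "rules to reason with", the following sketch (invented facts and a single hand-written rule, not any standard rule language) forward-chains a transitivity rule over a set of triples until no new facts can be derived.

```python
# Facts are (subject, predicate, object) triples; the vocabulary is invented.
facts = {
    ("conference", "locatedIn", "Berlin"),
    ("Berlin", "locatedIn", "Germany"),
}

def transitivity_rule(facts):
    """If X locatedIn Y and Y locatedIn Z, derive X locatedIn Z."""
    derived = set()
    for (x, p1, y) in facts:
        for (y2, p2, z) in facts:
            if p1 == p2 == "locatedIn" and y == y2:
                derived.add((x, "locatedIn", z))
    return derived - facts

# Forward chaining: apply the rule until no new facts appear.
while True:
    new = transitivity_rule(facts)
    if not new:
        break
    facts |= new

print(("conference", "locatedIn", "Germany") in facts)  # True, inferred
```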

Once a large number of Web pages are rich in semantic information, it will be as if there were a huge, globally interconnected database. With the help of this semantic information, the software agents people develop will become far more intelligent and autonomous: they will collect Web content from different sources, search and process information, and exchange information with other programs, truly unleashing the power of the Semantic Web. As more machines, including more agents, become able to process Web content and services, information exchange and collaboration among agents will let the efficiency of information processing grow dramatically, better meeting users' needs.

Grid

Grid computing is a new technology that is still developing and changing. Simply put, a grid is an information infrastructure for the networked society: it takes many resources scattered across different geographic locations, including computing, storage, communication, software, information, and knowledge resources, connects them fully, distributes, manages, and coordinates them in a unified way, and composes them through logical relationships into a "virtual supercomputer". This machine treats every computer, including PCs, as one of its "nodes"; thousands of such nodes working in parallel form a grid with enormous computing power. In turn, every user who connects a computer to the grid "owns" this supercomputer and can invoke its computing and information resources anytime and anywhere, obtaining integrated information services and realizing resource sharing to the greatest possible extent. In the grid computing model, the data to be processed is first divided into pieces, and the computers at different nodes download one or more pieces according to their processing capacity; whenever the user of a node's computer is not using it, its idle computing power is mobilized. The advantages of the grid are not only its great data-processing capacity but also its ability to make full use of idle processing power on the network, saving computing cost, sharing resources, and eliminating resource islands.
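
The division of work described above can be sketched in a few lines of Python; the node names and capacity figures are invented, and a real grid scheduler would of course be far more elaborate.

```python
def assign_chunks(num_chunks, node_capacity):
    """Return {node: [chunk indices]} with shares proportional to capacity."""
    total = sum(node_capacity.values())
    assignment = {node: [] for node in node_capacity}
    nodes = list(node_capacity)
    for chunk in range(num_chunks):
        # Hand each chunk to the node currently furthest below its
        # capacity-proportional target share.
        def deficit(node):
            target = num_chunks * node_capacity[node] / total
            return target - len(assignment[node])
        best = max(nodes, key=deficit)
        assignment[best].append(chunk)
    return assignment

# Hypothetical nodes with different processing capacities.
capacity = {"office-pc": 1.0, "lab-server": 4.0, "home-laptop": 0.5}
for node, chunks in assign_chunks(11, capacity).items():
    print(node, len(chunks), chunks)
```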

Grid computing technology first appeared in scientific research, in large-scale scientific computation and engineering projects. Industries that need massive computing power, such as pharmaceuticals, manufacturing, meteorology, and exploration, will be the first beneficiaries. As more computing resources are connected to grid systems, the technology will also benefit small businesses and consumers: home PC users will be able to use faster, cheaper services offered by public and private organizations, and any device, anywhere, will be able to tap into some level of resources without worrying about where those resources come from, much as we draw on the electrical power grid today.

The U.S. National Science Foundation launched its advanced computing partnership program (PACI) in 1997, and the European Union launched EuroGrid and DataGrid in 2000 and 2001 respectively. The Global Grid Forum (GGF), established in 2001, is the international organization that coordinates grid research and sets grid standards. Just as the TCP/IP protocol suite is the core of the Internet, building the grid also requires standard protocols and services. So far the grid has no formal standards, but on the core technology the relevant institutions and companies have reached agreement: the Globus Toolkit, developed by Argonne National Laboratory and the Information Sciences Institute (ISI) of the University of Southern California, has become the de facto standard for grid computing. Web Services are the part of grid-related research and development that matters most to the business world, and several industry giants have already reached consensus on a number of underlying standard protocols, including XML, SOAP, WSDL, UDDI, and so on.

Semantic grid

Combining the strengths of the Semantic Web, the grid, and Web services, and making up for their respective shortcomings, researchers proposed the concept of the semantic grid. The figure illustrates the relationship among the Web, the grid, the Semantic Web, and the semantic grid: the grid enhances the Web's computing power, and the semantic grid extends the grid with semantic capability; seen from the other direction, the Semantic Web adds semantic capability to the existing Web, and the semantic grid extends the Semantic Web with computing power.

The UK's e-Science programme found a gap between existing grid efforts and the e-Science vision: achieving the ease of use and seamless automation that e-Science requires means as much machine processing and as little human intervention as possible, a goal similar to that of the Semantic Web. The idea of the semantic grid was therefore first put forward in 2001, and a semantic grid research group (SEM-GRD) was set up within the Global Grid Forum (GGF) in 2002. The key to this vision of the semantic grid is to describe all resources, including services, in a machine-processable way, with the goal of achieving semantic interoperability. One way to reach this goal is to apply Semantic Web technology throughout grid computing, from the underlying infrastructure up to grid applications. It is worth noting that "semantics" thus pervades the entire grid from the bottom up, rather than merely being added as a semantic (knowledge) layer on top.

The Knowledge Grid research group at the Institute of Computing Technology, Chinese Academy of Sciences, led by researcher Zhuge, is carrying out research on the semantic grid. By adopting a new computing model and new models for organizing and managing resources, it aims to help users effectively acquire, share, and manage resources, collaborate, and make decisions, and to provide deeper, more comprehensive, and more intelligent services. The work focuses on three scientific problems: the normalized organization of resources, semantic interconnection, and intelligent aggregation.

• Normalized organization. Develop the theory, methods, techniques, and tools of the resource space model for organizing and managing resources in a normalized way, so that all kinds of unordered resources (information, knowledge, and services) can be regulated and organized, and so that users and services can operate on them effectively and correctly according to their semantics, improving the efficiency with which resources are used.

• Semantic interconnection. Through multi-layer semantic interconnection and a single semantic image, interconnect all kinds of network resources distributed around the world at the semantic level and eliminate resource islands, so that machines can understand the semantics of resources; this is achieved mainly through a typed semantic link network (see the sketch after this list).

• Intelligent aggregation. Solve the problem of how to let resources understand one another and, according to users' needs, aggregate various resources effectively, dynamically, and intelligently; this is achieved mainly through soft devices.
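
As a rough, purely illustrative sketch of a typed semantic link network (the node names and link types below are invented), resources can be modelled as nodes whose links each carry a semantic type that programs can follow.

```python
from collections import defaultdict

links = defaultdict(list)  # node -> [(link_type, target)]

def add_link(source, link_type, target):
    """Record a typed semantic link from source to target."""
    links[source].append((link_type, target))

add_link("dataset-A", "derivedFrom", "instrument-1")
add_link("paper-X", "cites", "dataset-A")
add_link("paper-X", "authoredBy", "researcher-Li")

def follow(node, link_type):
    """Return all targets reachable from node over links of the given type."""
    return [t for (lt, t) in links[node] if lt == link_type]

print(follow("paper-X", "cites"))          # ['dataset-A']
print(follow("dataset-A", "derivedFrom"))  # ['instrument-1']
```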


[Figure: The concept and architecture of the Semantic Web]


[Figure: The relationship between the Web, the grid, the Semantic Web, and the semantic grid]

On the system architecture of large, high-concurrency, high-load Web sites (repost)

I worked on the CERNET dial-up access platform, then on front-end platform development for the Yahoo 3721 search engine, and later on the architecture upgrade of the large MOP community; I have also developed modules for a number of large and medium-sized Web sites. I have therefore accumulated some experience with solutions for high load and high concurrency on large sites, and I would like to discuss them with everyone.


A small site, such as a personal site, can be built with the simplest static HTML pages and a few images for decoration, with all the pages stored in one directory. Such a site makes very simple demands on system architecture and performance. As Internet services have grown richer, Web site technology has, over years of development, been subdivided into very fine specialties. For large Web sites in particular, the technology involved is very broad, with high demands everywhere from hardware to software, programming languages, databases, Web servers, and firewalls; this is no longer something a simple static HTML site can match.

A large Web site, such as a portal, faces massive user access and highly concurrent requests. The basic solutions concentrate on a few points: high-performance servers, high-performance databases, efficient programming languages, and high-performance Web containers. But these measures alone still cannot solve the high-load, high-concurrency problems that large Web sites face.

The solutions above also imply greater investment, and they have bottlenecks and poor scalability. Below I will share some of my experience from the perspective of low cost, high performance, and high scalability.

1. Static HTML
As everyone knows, the most efficient pages, and the ones that consume the least, are pure static HTML pages, so we should try to implement the pages of our site as static pages wherever possible; the simplest method is often the most effective. For sites with a lot of content that is updated frequently, however, we cannot do all of this by hand, which is where the familiar content management system (CMS) comes in. The news channels of the portals we visit every day, and even their other channels, are managed and generated through an information publishing system. At a minimum, a CMS can automatically generate static pages from newly entered information; it usually also provides channel management, permission management, automatic crawling, and other functions. For a large Web site, an efficient, manageable CMS is essential.

Besides portals and information-publishing sites, highly interactive community sites should also make things static wherever possible as a way of improving performance: rendering posts and articles into static pages in real time and regenerating them when they are updated is a widely used strategy. The hodgepodge channel of MOP uses this strategy, as do the NetEase community and others.

Making HTML static is also a kind of caching strategy: for content that the system queries from the database frequently but updates rarely, static HTML is worth considering. For example, the public settings of a forum are typically stored in the database and managed by mainstream forum software; this information is read by the front-end code in large volumes but changes very rarely, so it can be regenerated as static content whenever it is updated in the back end, avoiding a large number of database requests.
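
A minimal sketch of the regenerate-on-update idea in this section follows; the file paths, template, and settings are placeholders, since a real CMS or forum has its own storage and templating layer.

```python
from pathlib import Path
from string import Template

STATIC_DIR = Path("static")
TEMPLATE = Template("<ul>\n$items\n</ul>\n")

def publish_settings(settings: dict) -> None:
    """Render rarely-changing settings to a static HTML fragment.

    Called from the back end whenever the settings change, so the front
    end can include the file directly instead of querying the database.
    """
    items = "\n".join(f"  <li>{k}: {v}</li>" for k, v in settings.items())
    STATIC_DIR.mkdir(exist_ok=True)
    (STATIC_DIR / "forum_settings.html").write_text(TEMPLATE.substitute(items=items))

# Back-end update path: save to the database (omitted here), then re-render.
publish_settings({"Board name": "Tech Talk", "Posts per page": 20})
```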

2. Separate image servers
As everyone knows, for a Web server, whether Apache, IIS, or another container, images consume the most resources, so we need to separate images from pages. This is a strategy basically every large site adopts: they set up an independent image server, often many image servers. Such an architecture reduces the pressure on the servers that handle page requests and ensures the system will not crash because of image problems. The application servers and the image servers can then be tuned differently; for example, Apache on an image host can be configured to support as few content types and load as few modules as possible, keeping resource consumption low and execution efficiency high.
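
One small way this split shows up in application code is rewriting image references so that browsers fetch them from the dedicated image host; the host name below is a placeholder.

```python
import re

IMAGE_HOST = "https://img.example.com"  # hypothetical host served by the image servers

def rewrite_image_urls(html: str) -> str:
    """Point relative <img src="/images/..."> references at the image host."""
    return re.sub(
        r'src="(/images/[^"]+)"',
        lambda m: f'src="{IMAGE_HOST}{m.group(1)}"',
        html,
    )

page = '<p>Logo: <img src="/images/logo.png" alt="logo"></p>'
print(rewrite_image_urls(page))
# <p>Logo: <img src="https://img.example.com/images/logo.png" alt="logo"></p>
```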

3. Database clusters and database/table hashing
Large Web sites have complex applications, and those applications all rely on databases. Faced with massive access, the database bottleneck soon appears: a single database quickly cannot keep up with the application, so we need to use database clusters or hash data across databases and tables.

For database clustering, many databases have their own solutions; Oracle, Sybase, and others all offer good ones, and the Master/Slave replication provided by the widely used MySQL plays a similar role. Whichever database you use, refer to its corresponding solution.

The database clusters mentioned above are constrained in architecture, cost, and extensibility by the type of database used, so we also need to improve the system architecture from the application's point of view, which is the most common and most effective approach. We separate the application's business and functional modules onto different databases, with different modules mapped to different databases or tables, and then hash the data behind a given page or feature across smaller databases according to some strategy, for example hashing the user table by user ID. This improves system performance at low cost and scales well. Sohu's forum adopts this kind of architecture: users, settings, posts, and other information are split across databases, and posts and users are then hashed into databases and tables by board and by ID, so that in the end a simple change to a configuration file lets the system add a low-cost database at any time to boost performance.
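
A minimal sketch of hashing a user table across databases and tables by user ID might look like the following; the shard counts and naming scheme are invented for illustration.

```python
NUM_DBS = 4         # number of physical databases
TABLES_PER_DB = 16  # user_0 .. user_15 inside each database

def locate_user(user_id: int) -> tuple[str, str]:
    """Return the (database name, table name) that holds this user's row."""
    db_index = user_id % NUM_DBS
    table_index = (user_id // NUM_DBS) % TABLES_PER_DB
    return f"userdb_{db_index}", f"user_{table_index}"

for uid in (1, 42, 1000003):
    db, table = locate_user(uid)
    print(f"user {uid} -> {db}.{table}")
```

Note that changing the shard counts changes where existing rows map, which is one reason the mapping is usually driven by a configuration file, as in the Sohu example above.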

4. Caching
Anyone who works with technology has come across the word "cache", and caches are used in many places. Caching matters both in Web site architecture and in Web application development. Here I first describe the two most basic kinds; advanced and distributed caching will be described later.
On the architecture side, people familiar with Apache will know that it provides its own caching modules, and Squid can also be put in front of it for caching; either approach can noticeably improve Apache's ability to respond to requests.
On the application development side, memcached, available on Linux, is a commonly used caching service whose interfaces can be used in Web development; Java code, for example, can call it to cache data and share it between programs, and some large communities use such an architecture. Beyond that, every Web language has its own caching modules and methods: PHP has PEAR's Cache module; I am less familiar with Java and .NET, but I am sure they have their own as well.
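
The basic look-aside pattern behind all of these caches can be sketched as follows; this uses a plain in-process dictionary with expiry times rather than a shared service such as memcached, which a real deployment would more likely use.

```python
import time

_cache: dict[str, tuple[float, object]] = {}

def cached_get(key: str, loader, ttl: float = 60.0):
    """Return the cached value for key, calling loader() on a miss."""
    now = time.time()
    entry = _cache.get(key)
    if entry and entry[0] > now:
        return entry[1]                  # cache hit
    value = loader()                     # cache miss: do the expensive lookup
    _cache[key] = (now + ttl, value)
    return value

def load_forum_settings():
    print("querying the database ...")   # stands in for a real DB query
    return {"Posts per page": 20}

print(cached_get("forum_settings", load_forum_settings))  # queries the DB
print(cached_get("forum_settings", load_forum_settings))  # served from cache
```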

5. Mirroring
Mirroring is a technique large Web sites often use to improve performance and data safety. Mirroring can smooth out the differences in access speed between different networks and regions; the gap between ChinaNet and the education network (CERNET), for example, has prompted many sites to build mirror sites inside the education network, with data updated on a schedule or in real time. I will not go into the details of mirroring here; there are many professional, off-the-shelf solution architectures and products to choose from, as well as inexpensive ways to do it in software, such as rsync and similar tools on Linux.
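
For the cheap software route, a mirror push might be as simple as wrapping rsync, as in the sketch below; the remote host and paths are placeholders, and rsync plus SSH access to the mirror is assumed to be set up.

```python
import subprocess

def push_to_mirror(local_dir: str = "static/",
                   remote: str = "mirror.example.edu:/var/www/static/") -> None:
    """Synchronize local_dir to the mirror, deleting files removed locally."""
    subprocess.run(
        ["rsync", "-az", "--delete", local_dir, remote],
        check=True,
    )

if __name__ == "__main__":
    push_to_mirror()
```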

6. Load balancing
Load balancing will be the ultimate answer for large Web sites dealing with heavy load and large numbers of concurrent requests.
Load balancing technology has been developing for many years, and there are many professional service providers and products to choose from. I have personally worked with a few solutions, two of whose architectures are worth describing for reference.
Hardware layer-4 switching
Layer-4 switching uses the header information of layer-3 and layer-4 packets to identify traffic flows by application, and directs the whole flow for an application to the appropriate application server for processing. A layer-4 switch acts like a virtual IP that points to the physical servers. The traffic it forwards follows a variety of application protocols, such as HTTP, FTP, NFS, and Telnet, and these services require complex load-balancing algorithms over the physical servers. In the IP world, the service type is determined by the TCP or UDP port of the endpoint, while the application flow in layer-4 switching is determined by the source and destination IP addresses together with the TCP and UDP ports.
Among hardware layer-4 switching products there are some well-known names to choose from, such as Alteon and F5. They are expensive but worth the money, offering excellent performance and very flexible management. Yahoo China, in its early days, handled nearly 2,000 servers with just three or four Alteon units.
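
To make the forwarding decision concrete, here is a toy sketch (ordinary Python, not a real switch or LVS) in which a flow identified by its addresses and ports is consistently mapped to one back-end server behind a virtual IP; the server addresses are invented.

```python
import hashlib

BACKENDS = ["10.0.0.11:80", "10.0.0.12:80", "10.0.0.13:80"]

def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Hash the flow 4-tuple so one TCP flow always maps to one server."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    index = int(hashlib.md5(flow).hexdigest(), 16) % len(BACKENDS)
    return BACKENDS[index]

# Packets of the same flow go to the same backend; a new flow may not.
print(pick_backend("203.0.113.7", 51324, "198.51.100.1", 80))
print(pick_backend("203.0.113.7", 51324, "198.51.100.1", 80))
print(pick_backend("203.0.113.9", 40211, "198.51.100.1", 80))
```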

Software layer-4 switching
Once you understand the principle of the hardware layer-4 switch, software layer-4 switching based on the OSI model follows naturally. The principle is the same, but the performance is somewhat lower; still, it handles a fair amount of load with ease, and some would say the software approach is actually more flexible, with processing capacity depending entirely on how familiar you are with the configuration.
For software layer-4 switching we can use LVS (Linux Virtual Server), which is commonly used on Linux. It provides real-time failover based on heartbeat links, which improves the robustness of the system, and it offers flexible configuration and management of virtual IPs (VIPs) to meet a variety of application needs; this is essential for a distributed system.

A typical load-balancing strategy is to build a Squid cluster on top of software or hardware layer-4 switching. This approach is used by many large Web sites, including search engines: it is a low-cost, high-performance, highly extensible architecture, and adding or removing nodes is very easy at any time. I plan to write this architecture up separately and discuss it with everyone.

For a large Web site, each of the methods mentioned above may be used at the same time. My introduction here is fairly superficial, and many details of concrete implementation still have to be learned through familiarity and experience; sometimes a single small Squid or Apache parameter can have a large impact on system performance. I hope we can discuss these things together.
