Java character types

Java 1.1 introduces a number of new classes for handling char data. These new classes can convert characters from a platform-specific encoding into Unicode characters. In this article we look at which classes were added and how to put them to work.

Char type

---- In the C language, the workhorse basic data type is char. The char type gets used constantly because it was defined to be 8 bits, and for the last 25 years 8 bits has also been the smallest indivisible unit of storage in computer memory. Combine that fact with the ASCII character set, which fits comfortably in 7 bits, and the char type turns out to be a remarkably "usable" data type. Further, in C a pointer to a char variable doubles as the generic pointer type, because anything that can be pointed to can also be pointed to as char data.

---- In C, this use and abuse of the char type led to many incompatibilities between compilers, so the ANSI C standard made two significant changes: the generic pointer was redefined to have type void *, so that the programmer must state its actual type explicitly before using it; and the numeric value of character data was defined to be signed, which settled how characters would be treated in numeric computations. Then, in the mid-1980s, engineers and users figured out that 8 bits were not nearly enough to represent all the characters in the world. Unfortunately, by that time C was so deeply entrenched that people were unwilling, and perhaps unable, to change the definition of its char type. Fast-forward to the early 1990s, when Java was first being designed. One of the many decisions made then was that characters in Java would be 16 bits wide. That choice supports the use of the Unicode standard, which describes the characters of many of the world's written languages. It also set the stage for a variety of problems, which are only now being corrected.

What is a character?

---- A character is a letter; letters group together into words, and words group together into sentences. In the computer, however, the shape of a character drawn on the screen (called its glyph) and the numeric value assigned to that character (called its code value) are not directly connected.

---- ASCII defines 96 printable characters, which is enough to write English. Set that against the more than 20,000 glyphs needed to represent written Chinese and the difference is night and day. Starting with the early Morse and Baudot codes, the relative simplicity of English (a small set of symbols, ranked by average frequency of use) made it the lingua franca of the digital age. But as more people entered the digital age, more and more computer users outside the English-speaking world found it intolerable that their machines could handle only ASCII and therefore could express only English. This greatly expanded the set of characters a computer had to cope with; as a first step, the number of characters a computer could encode had to be doubled.

---- The number of available characters doubled when the venerable 7-bit ASCII code was folded into the 8-bit ISO Latin-1 encoding (also written ISO 8859_1, where "ISO" stands for the International Organization for Standardization). As the name suggests, this standard allowed many European countries to represent their languages on the computer. Merely publishing the standard, however, did not make it universal: by that time, many computer makers had already put the "spare" 128 values of an 8-bit code to their own uses. Two surviving examples of this are the IBM personal computer (PC) and what was once the most popular computer terminal of all, the DEC VT-100. The latter lives on in the form of terminal-emulator software.

---- The actual time of death of the 8-bit character will no doubt be debated for decades, but I place it at the introduction of the Macintosh in 1984. The Macintosh brought two revolutionary ideas into mainstream computing: character fonts stored in RAM, and WorldScript, which could be used to represent characters from any of the world's languages. Granted, this was simply a copy of what Xerox had shipped on its Dandelion-class machines in the form of the Star word-processing system, but the Macintosh brought these new character sets and fonts to an audience that was still using "dumb" terminals. Once started, the use of richer character sets could not be stopped; it was simply too appealing to too many people. By the late 1980s, a consortium called the Unicode Consortium was formed to bring some order and standardization to all these characters, and in 1990 it published its first Unicode specification. Unfortunately, through the 1980s and into the 1990s the number of character sets had multiplied, and very few of the engineers creating new character codes at the time believed the fledgling Unicode standard would last, so they kept inventing their own mappings from codes to glyphs. So while Unicode was not widely accepted, the notion that only 128 or 256 characters were available was gone for good. After the Macintosh, support for many different fonts became a requirement for word processing. Eight-bit characters were fading away.

Java and Unicode

---- The char type in the Java base language is defined as a 16-bit unsigned value; it is the only unsigned type in Java. The rationale for a 16-bit character was that Java should be able to hold any Unicode character, and therefore Java programs could be written to represent strings in any language that Unicode supports. But being able to represent a string in some language and being able to display or print that string are two different problems. Since most of the early development of Oak (the first incarnation of Java) took place on Unix and Unix-derived systems, the character set that mattered most to the developers was ISO Latin-1. Along with that Unix heritage, Java's I/O system was modeled in large part on the Unix notion of streams, in which every I/O device can be represented as a sequence of 8-bit bytes. This combination, a language with 16-bit characters layered on 8-bit input devices, left Java with something of a wart: everywhere a Java program reads or writes strings over an 8-bit stream, a small piece of code, a hack, has to map 8-bit characters into 16-bit Unicode or split 16-bit Unicode characters back into 8 bits.
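---- To make that mapping concrete, here is a minimal sketch (not the JDK's own conversion code) of how an 8-bit ISO Latin-1 byte widens into a 16-bit Java char and back; the class and method names are mine, chosen only for illustration.

public class Latin1Mapping {
    // Latin-1 occupies Unicode code points 0x00 through 0xFF, so widening is enough.
    static char latin1ToChar(byte b) {
        return (char) (b & 0xFF);            // mask to avoid sign extension
    }

    // Going the other way loses information for any char above 0xFF.
    static byte charToLatin1(char c) {
        if (c > 0xFF) {
            throw new IllegalArgumentException("Not representable in Latin-1: " + (int) c);
        }
        return (byte) c;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                        // 'é' in ISO Latin-1
        char c = latin1ToChar(b);
        System.out.println((int) c);                 // prints 233 (U+00E9)
        System.out.println(charToLatin1(c) & 0xFF);  // prints 233 again
    }
}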

---- In the 1.0 release of the Java Developers Kit (JDK), the input hack lives in the DataInputStream class, and the output hack is essentially the entire PrintStream class. (Actually, there was an input class named TextInputStream in the alpha 2 release of Java, but it was supplanted by DataInputStream in the official release.) This still causes headaches for programmers new to Java as they search in vain for the Java equivalent of the C function getc(). Consider the following Java 1.0 program:

import java.io.*;

public class bogus {
    public static void main(String args[]) {
        FileInputStream fis;
        DataInputStream dis;
        char c;

        try {
            fis = new FileInputStream("data.txt");
            dis = new DataInputStream(fis);
            while (true) {
                c = dis.readChar();
                System.out.print(c);
                System.out.flush();
                if (c == '\n') break;
            }
            fis.close();
        } catch (Exception e) { }
        System.exit(0);
    }
}

---- At first glance, this program appears to open a file, read it one character at a time, and exit when the first newline is read. In practice, however, what you get is a stream of junk. The reason you cannot read data.txt correctly is that readChar reads 16-bit Unicode characters, while System.out.print writes what it assumes are 8-bit ISO Latin-1 characters. If, however, you change the program above to use the readLine function of DataInputStream, it will appear to work, because readLine reads the data in a format defined, with a passing nod to the Unicode specification, as "modified UTF-8." (UTF-8 is a format that Unicode defines for representing Unicode characters in an 8-bit stream.) So the situation in Java 1.0 is that Java strings are composed of 16-bit Unicode characters, but only one mapping is available, the one from ISO Latin-1 characters to Unicode. Fortunately, the Unicode "page zero" code page, that is, the 256 characters whose upper 8 bits are all zero, corresponds exactly, one for one, with the ISO Latin-1 character set. That makes the mapping trivial, and as long as you use only ISO Latin-1 text files, you will run into no problems while the data is read from a file, manipulated by Java classes, and written back out to a file.
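---- For reference, here is a minimal sketch, assuming a Latin-1 data.txt file, of the readLine-based variant described above; the class name bogusReadLine is mine. (DataInputStream.readLine was itself deprecated in Java 1.1 in favor of BufferedReader.readLine.)

import java.io.*;

public class bogusReadLine {
    public static void main(String args[]) {
        try {
            FileInputStream fis = new FileInputStream("data.txt");
            DataInputStream dis = new DataInputStream(fis);
            // readLine reads 8-bit bytes and widens each one into a char,
            // so page-zero (ISO Latin-1) text comes through intact.
            String line = dis.readLine();
            System.out.println(line);
            fis.close();
        } catch (IOException e) { }
        System.exit(0);
    }
}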

---- There are two problems with burying the conversion this way: not every platform stores its multilingual text files in the "modified UTF-8" format; and, more to the point, the applications on those platforms do not necessarily expect non-Latin characters in that form. In this respect the Java 1.0 implementation was incomplete, with no easy way to bolt on the needed support in a later release.

Java 1.1 and Unicode

---- Java 1.1 introduces a whole new set of classes for handling characters, under the names Reader and Writer. As an example, I have renamed the bogus class shown above cool; the cool class uses the InputStreamReader class rather than DataInputStream to process the file. Note that InputStreamReader is a subclass of the new Reader class, just as the new PrintWriter class is a subclass of the Writer class. The code for this example follows:

import java.io.*;

public class cool {
    public static void main(String args[]) {
        FileInputStream fis;
        InputStreamReader irs;
        char c;

        try {
            fis = new FileInputStream("data.txt");
            irs = new InputStreamReader(fis);
            System.out.println("Using encoding : " + irs.getEncoding());
            while (true) {
                c = (char) irs.read();
                System.out.print(c);
                System.out.flush();
                if (c == '\n') break;
            }
            fis.close();
        } catch (Exception e) { }
        System.exit(0);
    }
}

---- The only differences between this example and the previous one are the use of the InputStreamReader class in place of DataInputStream, and one added line that prints out the encoding being used by the InputStreamReader.

---- The important point is that the existing conversion code, buried inside DataInputStream where you could neither see it nor change it, has been superseded (more precisely, the affected methods are deprecated and slated for removal in a later release). In Java 1.1 the conversion machinery is instead encapsulated in the Reader classes. That encapsulation gives the Java class libraries a way to support many encodings of non-Latin characters while still using Unicode internally.
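---- As a rough sketch of what that encapsulation buys you, you can also name the conversion explicitly instead of taking the platform default; the class name ExplicitEncoding, the file name, and the "8859_1" encoding string here are assumptions for illustration.

import java.io.*;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        FileInputStream fis = new FileInputStream("data.txt");
        // Name the encoding directly; the Reader performs the byte-to-char conversion.
        InputStreamReader isr = new InputStreamReader(fis, "8859_1");
        BufferedReader br = new BufferedReader(isr);
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        br.close();
    }
}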

---- As you would expect, matching the Reader classes there is a corresponding set of classes to handle the "write" side. For example, the OutputStreamWriter class can be used to write strings to an output stream, and BufferedWriter adds a layer of buffering on top of it.
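---- Here is a minimal sketch of that write path, wrapping a FileOutputStream in an OutputStreamWriter and a BufferedWriter; the class name WriterExample, the output file name, and the "8859_1" encoding are assumptions, not anything mandated by the library.

import java.io.*;

public class WriterExample {
    public static void main(String[] args) throws IOException {
        FileOutputStream fos = new FileOutputStream("out.txt");
        OutputStreamWriter osw = new OutputStreamWriter(fos, "8859_1");
        BufferedWriter bw = new BufferedWriter(osw);   // buffers character output
        bw.write("Hello, Unicode world!");
        bw.newLine();
        bw.close();                                    // flushes the buffer and closes the stream
    }
}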

Are we really making progress?

---- The Reader and Writer classes have an ostensible goal: to provide a standard way of converting between the characters a particular machine uses (be it Macintosh Greek or Windows Spanish) and Unicode. That means the Java classes that manipulate strings do not have to change as data moves from one platform to another. Now that the conversion code has been hidden away, the question that remains is what encodings that code actually understands.

---- While thinking about this article, I was reminded of a remark attributed to a Xerox executive (back before Xerox was Xerox, when it was still the Haloid Company) to the effect that the photocopier was pointless, because a secretary could easily slip a sheet of carbon paper into the typewriter and make a copy of a document while creating the original. What that view lacked was foresight: the copier serves the people who receive documents far better than it serves the people who create them. JavaSoft has shown a similar lack of foresight in the design of its character encoding and decoding classes, and in just the same way you can see where this part of the system falls short.

---- At Haloid, the executive was considering only the production of documents, not the situation of people who already had documents in hand. In Java, the chief concern of these classes is converting Unicode into the characters understood by the local, underlying platform, that is, whatever platform the Java virtual machine happens to be running on. The typical use of the InputStreamReader class (and the easiest) is the one shown in the example above: you simply wrap a Reader object around a byte stream. For a disk file, the whole process collapses into handing the file to the class named FileReader. In the typical case, when the class is instantiated it installs the default codec for its platform. On Windows and Unix machines, that default codec is "8859_1", the codec that reads ISO Latin-1 files and converts them into Unicode. But just as the copying machine turned out to have another use, character conversion has another use as well: reading characters that did not originate on the local platform, converting them into Unicode, and only then, perhaps, converting that Unicode into the local platform's characters. Later in this article I will show that with the current Java design there is no supported way to do this (that is, no way to plug in converters for other platforms).
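---- A minimal sketch of that typical, default-codec path, assuming a local file named data.txt (the class name DefaultCodec is mine), looks like this:

import java.io.*;

public class DefaultCodec {
    public static void main(String[] args) throws IOException {
        FileReader fr = new FileReader("data.txt");     // picks up the platform default encoding
        System.out.println("Default encoding: " + fr.getEncoding());
        int ch;
        while ((ch = fr.read()) != -1) {                // read() returns -1 at end of file
            System.out.print((char) ch);
        }
        fr.close();
    }
}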

---- Besides "8859_1" there are other codecs: converters that translate between JIS (Japanese) characters and Unicode, between Chinese characters and Unicode, and so on. Yet if you look through the documentation shipped with the JDK (also available from JavaSoft's web site), you will find no mention of these codecs. Worse, there is no documented API for listing the codecs that are available. The only practical way to find out which codecs you can use is to read the source code of InputStreamReader, and unfortunately, reading that source is like opening a can of worms.

---- In the InputStreamReader source, the conversion code is encapsulated in a Sun-specific class named sun.io.ByteToCharConverter. Like other sun.* classes, its source is not shipped with the JDK and it is not described in the documentation, so what the class actually does is something of a mystery. Fortunately, you can obtain the complete JDK source from Sun and see for yourself. InputStreamReader has a constructor that takes a string naming the codec to use. In Sun's implementation, that string is pasted into the template "sun.io.ByteToCharXXX", with the "XXX" part replaced by the string you passed to the constructor. Poking around in the classes.zip file bears this out: there you will find, for example, ByteToChar8859_1 and ByteToCharCP1255, converters for several specific encodings. These too are sun.* classes, so they carry no documentation whatsoever. You can, however, find some older material on the JavaSoft web site in the form of an internationalization specification; one of its pages describes the encodings Sun supports (or at least a large fraction of them), and a link to that page appears in the Resources section of this article. The problem I hinted at earlier now comes into focus: this architecture gives you no way to add converters of your own.
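---- The naming convention is easy to illustrate. The following sketch (the class name ConverterNameDemo is mine) simply builds the class names described above and probes for them with Class.forName; the sun.io classes are internal, undocumented, and absent from later JDKs, so treat this as illustration rather than a supported technique.

public class ConverterNameDemo {
    // Splice an encoding name into the "sun.io.ByteToCharXXX" template.
    static String converterClassFor(String encoding) {
        return "sun.io.ByteToChar" + encoding;    // e.g. "sun.io.ByteToChar8859_1"
    }

    public static void main(String[] args) {
        String[] encodings = { "8859_1", "CP1255" };
        for (int i = 0; i < encodings.length; i++) {
            String className = converterClassFor(encodings[i]);
            try {
                Class.forName(className);         // present only on Sun-derived 1.1-era JDKs
                System.out.println(className + " is available");
            } catch (ClassNotFoundException e) {
                System.out.println(className + " is not available");
            }
        }
    }
}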

---- I believe the design went wrong for two reasons: the designers never acknowledged that a uniform Java platform might need to support converters other than the one from the base, local platform's characters to Unicode; and there is no way to enumerate the converters that are supported. This amounts to assuming that a data file stored on a given platform must have been created on that platform. But as you will see, files that arrive over the network or on foreign media are, in many cases, not files created on the local platform at all, and they require a specific codec to read. That need leads to the final problem with this design.

---- Suppose, now, that in your Java application you want to read (or parse) text that comes from a non-local platform, say, from a Commodore 64. The Commodore used its own variant of the ASCII character set, so to read a Commodore-64 file and convert it to Unicode, I need a special ByteToChar subclass that understands the Commodore character format. The Reader design allows for this, but in the JDK implementation of that design the classes needed for the conversion live in Sun's private sun.io package. To pull it off, I would have to program against undocumented interfaces and install my converter into the sun.io package. As the Xerox example suggests, many of the customers of this system are experienced programmers who need to work with data left behind by earlier systems, and those users are forced to work around this particular wart. Even so, being able to process such data at all would be a workable outcome, and we would at least have made some progress.
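---- One workaround, sketched below without relying on sun.io at all, is to write your own Reader subclass that maps each byte through a translation table. The class name TableReader and the idea of supplying a 256-entry table are mine; the table itself would have to hold the real Commodore mapping, which is not shown here.

import java.io.*;

public class TableReader extends Reader {
    private final InputStream in;
    private final char[] table;    // 256-entry byte-to-Unicode translation table

    public TableReader(InputStream in, char[] table) {
        this.in = in;
        this.table = table;
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        byte[] buf = new byte[len];
        int n = in.read(buf, 0, len);
        if (n == -1) return -1;                  // end of the underlying stream
        for (int i = 0; i < n; i++) {
            cbuf[off + i] = table[buf[i] & 0xFF];
        }
        return n;
    }

    public void close() throws IOException {
        in.close();
    }
}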

Summary

---- Java, unlike C, nails down what a char value means: it is a Unicode character, which binds each numeric code value to a well-defined character. Simple as that idea is, it often trips up programmers with years of experience, because for many of them this is the first time they have written programs meant to run in an environment where the strings are not English. Beyond that, a great deal of work has gone into folding the world's many written languages into a single standard, and the result of that work is the Unicode character standard. The Unicode home page is the place to go for more information about Unicode and about the correspondence between Unicode characters and their code values.

---- Finally, the coupling between bytes and strings has become much cleaner in Java 1.1, even though all Java programmers are now obliged to think about how the characters on their own platforms are represented. This genuinely global design is a sign that people are moving toward using computers in whatever language suits them best, and that is an entirely positive result.
