The following program uses two Unicode
escapes, which represent Unicode characters by their hexadecimal numeric
codes. What does the program print?
public class EscapeRout {
    public static void main(String[] args) {
        // \u0022 is the Unicode escape for double quote (")
        System.out.println("a\u0022.length() + \u0022b".length());
    }
}
Solution 14: Escape Rout
A naive analysis of the program
suggests that it should print 26 because there are 26 characters
between the quotation marks that bound the string "a\u0022.length() +
\u0022b". A deeper analysis suggests that the program should print
16, as each of the two Unicode escapes requires six characters in the
source file but represents only one character in the string. The string is
therefore ten characters shorter than it appears. Running the program tells a
different story. It prints neither 26 nor 16 but
2.
The key to understanding this puzzle is that Java provides no special treatment for Unicode escapes
within string literals. The compiler translates Unicode escapes into the
characters they represent before it parses the program into tokens, such as
string literals [JLS 3.2]. Therefore, the first Unicode escape in the program
closes a one-character string literal ("a"), and the second one opens a
one-character string literal ("b"). The program prints the value of the
expression "a".length() + "b".length(), or 2.
If the author of the program had actually wanted this behavior,
it would have been much clearer to say:
System.out.println("a".length() + "b".length());
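One rough way to see the translation step is to mimic it in code. The sketch below is a simplification and not the compiler's actual algorithm: the class name TranslateEscapes and the method translate are hypothetical, and the code assumes every escape is well formed (a backslash, a single u, and exactly four hexadecimal digits), ignoring the JLS rules about repeated u characters and preceding backslashes. It rewrites the puzzle's source line the way the compiler sees it before tokenizing.

```java
class TranslateEscapes {
    // Replaces each six-character Unicode escape in the given source text
    // with the single character it denotes. Assumes well-formed escapes.
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < source.length()) {
            if (source.startsWith("\\u", i) && i + 6 <= source.length()) {
                // Parse the four hex digits that follow the backslash and u.
                int code = Integer.parseInt(source.substring(i + 2, i + 6), 16);
                out.append((char) code);
                i += 6;
            } else {
                out.append(source.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The puzzle's expression, written as data so the escapes stay literal.
        String line = "\"a\\u0022.length() + \\u0022b\".length()";
        System.out.println(translate(line));
        // prints: "a".length() + "b".length()
    }
}
```

The translated text contains two separate one-character string literals, which is why the original program prints 2.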
More likely, the author wanted to put the two double quote
characters into the string literal. You can't do this with Unicode escapes, but
you can do it with escape sequences [JLS 3.10.6].
The escape sequence representing a double quote is a backslash followed by a
double quote (\"). If the Unicode escapes in the original program are
replaced with this escape sequence, the program prints 16 as expected:
System.out.println("a\".length() + \"b".length());
There are escape sequences for many characters, including the
single quote (\'), linefeed (\n), tab (\t), and
backslash (\\). You can use escape sequences in character literals as
well as in string literals. In fact, you can put any ASCII character into a
string literal or a character literal by using a special kind of escape sequence
called an octal escape, but it is preferable to
use normal escape sequences where possible. Both normal escape sequences and
octal escapes are far preferable to Unicode escapes because unlike Unicode
escapes, escape sequences are processed after the program is parsed into
tokens.
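As a small illustration of the escape sequences described above, the following sketch uses a normal escape sequence and an octal escape for the double quote. The class name EscapeSequences and the method quotedLength are made up for this example.

```java
class EscapeSequences {
    // The corrected string from the puzzle: both quotes are inside the literal.
    static int quotedLength() {
        return "a\".length() + \"b".length();
    }

    public static void main(String[] args) {
        System.out.println(quotedLength());          // prints 16
        System.out.println("a\".length() + \"b");    // prints a".length() + "b

        // Octal 42 is decimal 34, the code for the double quote character.
        System.out.println('\42' == '"');            // prints true
        System.out.println("\42".equals("\""));      // prints true

        // The tab escape counts as a single character.
        System.out.println("tab\there".length());    // prints 8
    }
}
```

Because these escapes are interpreted only after the literal has been recognized as a token, none of them can end the string early.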
All the programs
in this book are written using the ASCII subset of Unicode. ASCII is the lowest
common denominator of character sets. ASCII has only 128 characters, but Unicode
has more than 65,000. A Unicode escape can be used to insert any Unicode
character into a program using only ASCII characters. A Unicode escape means
exactly the same thing as the character that it represents.
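The sketch below shows what "exactly the same thing" implies in practice: a Unicode escape can stand in for a character anywhere in the source, even inside an identifier. The class name SameMeaning and its methods are hypothetical, and the code is deliberately bad style, written only to demonstrate the rule.

```java
class SameMeaning {
    static int answer() {
        // The escape below is the letter a, so this declares a variable named a.
        int \u0061 = 42;
        return a;
    }

    // The escape for code 41 (hex) denotes the same character as a plain A.
    static boolean sameChar() {
        return '\u0041' == 'A';
    }

    public static void main(String[] args) {
        System.out.println(answer());    // prints 42
        System.out.println(sameChar());  // prints true
    }
}
```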
Unicode escapes are designed for use when a programmer needs to
insert a character that can't be represented in the source file's character set.
They are used primarily to put non-ASCII characters into identifiers, string
literals, character literals, and comments. Occasionally, a Unicode escape adds
to the clarity of a program by positively identifying one of several
similar-looking characters.
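A minimal sketch of both legitimate uses follows, assuming the Unicode code points U+00E9 (e with acute accent) and U+0391 (Greek capital alpha); the class and constant names are invented for illustration. It prints code points in hexadecimal rather than the characters themselves, so the output does not depend on the console's encoding.

```java
class LookAlikes {
    // A non-ASCII character written in an ASCII-only source file.
    static final String CAFE = "caf\u00e9";

    // Greek capital alpha looks like Latin A; the escape makes the choice explicit.
    static final char GREEK_ALPHA = '\u0391';
    static final char LATIN_A = 'A';

    public static void main(String[] args) {
        System.out.println(CAFE.length());                     // prints 4
        System.out.println(Integer.toHexString(CAFE.charAt(3))); // prints e9
        System.out.println(GREEK_ALPHA == LATIN_A);            // prints false
        System.out.println(Integer.toHexString(GREEK_ALPHA));  // prints 391
    }
}
```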
In summary, prefer escape sequences
to Unicode escapes in string and character literals. Unicode escapes can
be confusing because they are processed so early in the compilation sequence.
Do not use Unicode escapes to represent ASCII
characters. Inside string and character literals, use escape
sequences; outside of these literals, insert ASCII characters directly into the
source file.