Path: blob/master/src/java.base/share/classes/jdk/internal/icu/lang/UCharacter.java
/*
 * Copyright (c) 2009, 2020, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation. Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

/**
 *******************************************************************************
 * Copyright (C) 1996-2014, International Business Machines Corporation and
 * others. All Rights Reserved.
 *******************************************************************************
 */

package jdk.internal.icu.lang;

import jdk.internal.icu.impl.UBiDiProps;
import jdk.internal.icu.impl.UCharacterProperty;
import jdk.internal.icu.text.Normalizer2;
import jdk.internal.icu.text.UTF16;
import jdk.internal.icu.util.VersionInfo;

/**
 * <p>The UCharacter class provides extensions to the
 * <a href="http://java.sun.com/j2se/1.5/docs/api/java/lang/Character.html">
 * java.lang.Character</a> class. These extensions provide support for
 * more Unicode properties and together with the <a href=../text/UTF16.html>UTF16</a>
 * class, provide support for supplementary characters (those with code
 * points above U+FFFF).
 * Each ICU release supports the latest version of Unicode available at that time.
 *
 * <p>Code points are represented in these APIs using ints. While it would be
 * more convenient in Java to have a separate primitive datatype for them,
 * ints suffice in the meantime.
 *
 * <p>To use this class please add the jar file name icu4j.jar to the
 * class path, since it contains data files which supply the information used
 * by this file.<br>
 * E.g. In Windows <br>
 * <code>set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar</code>.<br>
 * Otherwise, another method would be to copy the files uprops.dat and
 * unames.icu from the icu4j source subdirectory
 * <i>$ICU4J_SRC/src/com.ibm.icu.impl.data</i> to your class directory
 * <i>$ICU4J_CLASS/com.ibm.icu.impl.data</i>.
 *
 * <p>Aside from the additions for UTF-16 support, and the updated Unicode
 * properties, the main differences between UCharacter and Character are:
 * <ul>
 * <li> UCharacter is not designed to be a char wrapper and does not have
 *      APIs that involve management of that single char.<br>
 *      These include:
 *      <ul>
 *        <li> char charValue(),
 *        <li> int compareTo(java.lang.Character, java.lang.Character), etc.
 *      </ul>
 * <li> UCharacter does not include Character APIs that are deprecated, nor
 *      does it include the Java-specific character information, such as
 *      boolean isJavaIdentifierPart(char ch).
 * <li> Character maps characters 'A' - 'Z' and 'a' - 'z' to the numeric
 *      values '10' - '35'. UCharacter also does this in digit and
 *      getNumericValue, to adhere to the Java semantics of these
 *      methods. New methods unicodeDigit, and
 *      getUnicodeNumericValue do not treat the above code points
 *      as having numeric values. This is a semantic change from ICU4J 1.3.1.
 * </ul>
 * <p>
 * Further detail on differences can be determined using the program
 *        <a href=
 * "http://source.icu-project.org/repos/icu/icu4j/trunk/src/com/ibm/icu/dev/test/lang/UCharacterCompare.java">
 *        com.ibm.icu.dev.test.lang.UCharacterCompare</a>
 * </p>
 * <p>
 * In addition to Java compatibility functions, which calculate derived properties,
 * this API provides low-level access to the Unicode Character Database.
 * </p>
 * <p>
 * Unicode assigns each code point (not just assigned character) values for
 * many properties.
 * Most of them are simple boolean flags, or constants from a small enumerated list.
 * For some properties, values are strings or other relatively more complex types.
 * </p>
 * <p>
 * For more information see
 * <a href="http://www.unicode.org/ucd/">"About the Unicode Character Database"</a>
 * (http://www.unicode.org/ucd/)
 * and the <a href="http://www.icu-project.org/userguide/properties.html">ICU
 * User Guide chapter on Properties</a>
 * (http://www.icu-project.org/userguide/properties.html).
 * </p>
 * <p>
 * There are also functions that provide easy migration from C/POSIX functions
 * like isblank(). Their use is generally discouraged because the C/POSIX
 * standards do not define their semantics beyond the ASCII range, which means
 * that different implementations exhibit very different behavior.
 * Instead, Unicode properties should be used directly.
 * </p>
 * <p>
 * There are also only a few, broad C/POSIX character classes, and they tend
 * to be used for conflicting purposes. For example, the "isalpha()" class
 * is sometimes used to determine word boundaries, while a more sophisticated
 * approach would at least distinguish initial letters from continuation
 * characters (the latter including combining marks).
 * (In ICU, BreakIterator is the most sophisticated API for word boundaries.)
 * Another example: There is no "istitle()" class for titlecase characters.
 * </p>
 * <p>
 * ICU 3.4 and later provides API access for all twelve C/POSIX character classes.
 * ICU implements them according to the Standard Recommendations in
 * Annex C: Compatibility Properties of UTS #18 Unicode Regular Expressions
 * (http://www.unicode.org/reports/tr18/#Compatibility_Properties).
 * </p>
 * <p>
 * API access for C/POSIX character classes is as follows:
 * <pre>{@code
 * - alpha:     isUAlphabetic(c) or hasBinaryProperty(c, UProperty.ALPHABETIC)
 * - lower:     isULowercase(c) or hasBinaryProperty(c, UProperty.LOWERCASE)
 * - upper:     isUUppercase(c) or hasBinaryProperty(c, UProperty.UPPERCASE)
 * - punct:     ((1<<getType(c)) & ((1<<DASH_PUNCTUATION)|(1<<START_PUNCTUATION)|
 *               (1<<END_PUNCTUATION)|(1<<CONNECTOR_PUNCTUATION)|(1<<OTHER_PUNCTUATION)|
 *               (1<<INITIAL_PUNCTUATION)|(1<<FINAL_PUNCTUATION)))!=0
 * - digit:     isDigit(c) or getType(c)==DECIMAL_DIGIT_NUMBER
 * - xdigit:    hasBinaryProperty(c, UProperty.POSIX_XDIGIT)
 * - alnum:     hasBinaryProperty(c, UProperty.POSIX_ALNUM)
 * - space:     isUWhiteSpace(c) or hasBinaryProperty(c, UProperty.WHITE_SPACE)
 * - blank:     hasBinaryProperty(c, UProperty.POSIX_BLANK)
 * - cntrl:     getType(c)==CONTROL
 * - graph:     hasBinaryProperty(c, UProperty.POSIX_GRAPH)
 * - print:     hasBinaryProperty(c, UProperty.POSIX_PRINT)
 * }</pre>
 * </p>
 * <p>
 * The C/POSIX character classes are also available in UnicodeSet patterns,
 * using patterns like [:graph:] or \p{graph}.
 * </p>
 *
 * There are several ICU (and Java) whitespace functions.
 * Comparison:<ul>
 * <li> isUWhiteSpace=UCHAR_WHITE_SPACE: Unicode White_Space property;
 *      most of general categories "Z" (separators) + most whitespace ISO controls
 *      (including no-break spaces, but excluding IS1..IS4 and ZWSP)
 * <li> isWhitespace: Java isWhitespace; Z + whitespace ISO controls but excluding no-break spaces
 * <li> isSpaceChar: just Z (including no-break spaces)</ul>
 * </p>
 * <p>
 * This class is not subclassable.
 * </p>
 * @author Syn Wee Quek
 * @stable ICU 2.1
 * @see com.ibm.icu.lang.UCharacterEnums
 */

public final class UCharacter
{

    /**
     * Joining Group constants.
     * @see UProperty#JOINING_GROUP
     * @stable ICU 2.4
     */
    public static interface JoiningGroup
    {
        /**
         * @stable ICU 2.4
         */
        public static final int NO_JOINING_GROUP = 0;
    }

    /**
     * Numeric Type constants.
     * @see UProperty#NUMERIC_TYPE
     * @stable ICU 2.4
     */
    public static interface NumericType
    {
        /**
         * @stable ICU 2.4
         */
        public static final int NONE = 0;
        /**
         * @stable ICU 2.4
         */
        public static final int DECIMAL = 1;
        /**
         * @stable ICU 2.4
         */
        public static final int DIGIT = 2;
        /**
         * @stable ICU 2.4
         */
        public static final int NUMERIC = 3;
        /**
         * @stable ICU 2.4
         */
        public static final int COUNT = 4;
    }

    /**
     * Hangul Syllable Type constants.
     *
     * @see UProperty#HANGUL_SYLLABLE_TYPE
     * @stable ICU 2.6
     */
    public static interface HangulSyllableType
    {
        /**
         * @stable ICU 2.6
         */
        public static final int NOT_APPLICABLE = 0; /*[NA]*/ /*See note !!*/
        /**
         * @stable ICU 2.6
         */
        public static final int LEADING_JAMO = 1; /*[L]*/
        /**
         * @stable ICU 2.6
         */
        public static final int VOWEL_JAMO = 2; /*[V]*/
        /**
         * @stable ICU 2.6
         */
        public static final int TRAILING_JAMO = 3; /*[T]*/
        /**
         * @stable ICU 2.6
         */
        public static final int LV_SYLLABLE = 4; /*[LV]*/
        /**
         * @stable ICU 2.6
         */
        public static final int LVT_SYLLABLE = 5; /*[LVT]*/
        /**
         * @stable ICU 2.6
         */
        public static final int COUNT = 6;
    }

    // public data members -----------------------------------------------

    /**
     * The lowest Unicode code point value.
     * @stable ICU 2.1
     */
    public static final int MIN_VALUE = UTF16.CODEPOINT_MIN_VALUE;

    /**
     * The highest Unicode code point value (scalar value) according to the
     * Unicode Standard.
     * This is a 21-bit value (21 bits, rounded up).<br>
     * Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE
     * @stable ICU 2.1
     */
    public static final int MAX_VALUE = UTF16.CODEPOINT_MAX_VALUE;

    // public methods ----------------------------------------------------

    /**
     * Returns the numeric value of a decimal digit code point.
     * <br>This method observes the semantics of
     * <code>java.lang.Character.digit()</code>. Note that this
     * will return positive values for code points for which isDigit
     * returns false, just like java.lang.Character.
     * <br><em>Semantic Change:</em> In release 1.3.1 and
     * prior, this did not treat the European letters as having a
     * digit value, and also treated numeric letters and other numbers as
     * digits.
     * This has been changed to conform to the Java semantics.
     * <br>A code point is a valid digit if and only if:
     * <ul>
     *   <li>ch is a decimal digit or one of the European letters, and
     *   <li>the value of ch is less than the specified radix.
     * </ul>
     * @param ch the code point to query
     * @param radix the radix
     * @return the numeric value represented by the code point in the
     * specified radix, or -1 if the code point is not a decimal digit
     * or if its value is too large for the radix
     * @stable ICU 2.1
     */
    public static int digit(int ch, int radix)
    {
        if (2 <= radix && radix <= 36) {
            int value = digit(ch);
            if (value < 0) {
                // ch is not a decimal digit, try Latin letters
                value = UCharacterProperty.getEuropeanDigit(ch);
            }
            return (value < radix) ? value : -1;
        } else {
            return -1; // invalid radix
        }
    }

    /**
     * Returns the numeric value of a decimal digit code point.
     * <br>This is a convenience overload of <code>digit(int, int)</code>
     * that provides a decimal radix.
     * <br><em>Semantic Change:</em> In release 1.3.1 and prior, this
     * treated numeric letters and other numbers as digits. This has
     * been changed to conform to the Java semantics.
     * @param ch the code point to query
     * @return the numeric value represented by the code point,
     * or -1 if the code point is not a decimal digit or if its
     * value is too large for a decimal radix
     * @stable ICU 2.1
     */
    public static int digit(int ch)
    {
        return UCharacterProperty.INSTANCE.digit(ch);
    }

    /**
     * Returns a value indicating a code point's Unicode category.
     * Up-to-date Unicode implementation of java.lang.Character.getType()
     * except for the above mentioned code points that had their category
     * changed.<br>
     * Return results are constants from the interface
     * <a href=UCharacterCategory.html>UCharacterCategory</a><br>
     * <em>NOTE:</em> the UCharacterCategory values are <em>not</em> compatible with
     * those returned by java.lang.Character.getType. UCharacterCategory values
     * match the ones used in ICU4C, while java.lang.Character type
     * values, though similar, skip the value 17.</p>
     * @param ch code point whose type is to be determined
     * @return category which is a value of UCharacterCategory
     * @stable ICU 2.1
     */
    public static int getType(int ch)
    {
        return UCharacterProperty.INSTANCE.getType(ch);
    }

    /**
     * Returns the bidirectional property of a code point.
     * For example, 0x0041 (letter A) has the LEFT_TO_RIGHT directional
     * property.<br>
     * Result returned belongs to the interface
     * <a href=UCharacterDirection.html>UCharacterDirection</a>
     * @param ch the code point whose direction is to be determined
     * @return direction constant from UCharacterDirection.
     * @stable ICU 2.1
     */
    public static int getDirection(int ch)
    {
        return UBiDiProps.INSTANCE.getClass(ch);
    }

    /**
     * Maps the specified code point to a "mirror-image" code point.
     * For code points with the "mirrored" property, implementations sometimes
     * need a "poor man's" mapping to another code point such that the default
     * glyph may serve as the mirror-image of the default glyph of the
     * specified code point.<br>
     * This is useful for text conversion to and from codepages with visual
     * order, and for displays without glyph selection capabilities.
     * @param ch code point whose mirror is to be retrieved
     * @return another code point that may serve as a mirror-image substitute,
     *         or ch itself if there is no such mapping or ch does not have the
     *         "mirrored" property
     * @stable ICU 2.1
     */
    public static int getMirror(int ch)
    {
        return UBiDiProps.INSTANCE.getMirror(ch);
    }

    /**
     * Maps the specified character to its paired bracket character.
     * For Bidi_Paired_Bracket_Type!=None, this is the same as getMirror(int).
     * Otherwise c itself is returned.
     * See http://www.unicode.org/reports/tr9/
     *
     * @param c the code point to be mapped
     * @return the paired bracket code point,
     *         or c itself if there is no such mapping
     *         (Bidi_Paired_Bracket_Type=None)
     *
     * @see UProperty#BIDI_PAIRED_BRACKET
     * @see UProperty#BIDI_PAIRED_BRACKET_TYPE
     * @see #getMirror(int)
     * @stable ICU 52
     */
    public static int getBidiPairedBracket(int c) {
        return UBiDiProps.INSTANCE.getPairedBracket(c);
    }

    /**
     * Returns the combining class of the argument code point.
     * @param ch code point whose combining class is to be retrieved
     * @return the combining class of the code point
     * @stable ICU 2.1
     */
    public static int getCombiningClass(int ch)
    {
        return Normalizer2.getNFDInstance().getCombiningClass(ch);
    }

    /**
     * Returns the version of Unicode data used.
     * @return the unicode version number used
     * @stable ICU 2.1
     */
    public static VersionInfo getUnicodeVersion()
    {
        return UCharacterProperty.INSTANCE.m_unicodeVersion_;
    }

    /**
     * Returns a code point corresponding to the two UTF16 characters.
     * @param lead the lead char
     * @param trail the trail char
     * @return code point if surrogate characters are valid.
     * @exception IllegalArgumentException thrown when argument characters do
     *            not form a valid codepoint
     * @stable ICU 2.1
     */
    public static int getCodePoint(char lead, char trail)
    {
        if (UTF16.isLeadSurrogate(lead) && UTF16.isTrailSurrogate(trail)) {
            return UCharacterProperty.getRawSupplementary(lead, trail);
        }
        throw new IllegalArgumentException("Illegal surrogate characters");
    }

    /**
     * Returns the "age" of the code point.</p>
     * <p>The "age" is the Unicode version when the code point was first
     * designated (as a non-character or for Private Use) or assigned a
     * character.
     * <p>This can be useful to avoid emitting code points to receiving
     * processes that do not accept newer characters.</p>
     * <p>The data is from the UCD file DerivedAge.txt.</p>
     * @param ch The code point.
     * @return the Unicode version number
     * @stable ICU 2.6
     */
    public static VersionInfo getAge(int ch)
    {
        if (ch < MIN_VALUE || ch > MAX_VALUE) {
            throw new IllegalArgumentException("Codepoint out of bounds");
        }
        return UCharacterProperty.INSTANCE.getAge(ch);
    }

    /**
     * Returns the property value for a Unicode property type of a code point.
     * Also returns binary and mask property values.</p>
     * <p>Unicode, especially in version 3.2, defines many more properties than
     * the original set in UnicodeData.txt.</p>
     * <p>The properties APIs are intended to reflect Unicode properties as
     * defined in the Unicode Character Database (UCD) and Unicode Technical
     * Reports (UTR). For details about the properties see
     * http://www.unicode.org/.</p>
     * <p>For names of Unicode properties see the UCD file PropertyAliases.txt.
     * </p>
     * <pre>
     * Sample usage:
     * int ea = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH);
     * int ideo = UCharacter.getIntPropertyValue(c, UProperty.IDEOGRAPHIC);
     * boolean b = (ideo == 1) ? true : false;
     * </pre>
     * @param ch code point to test.
     * @param type UProperty selector constant, identifies which binary
     *        property to check. Must be
     *        UProperty.BINARY_START <= type < UProperty.BINARY_LIMIT or
     *        UProperty.INT_START <= type < UProperty.INT_LIMIT or
     *        UProperty.MASK_START <= type < UProperty.MASK_LIMIT.
     * @return numeric value that is directly the property value or,
     *         for enumerated properties, corresponds to the numeric value of
     *         the enumerated constant of the respective property value
     *         enumeration type (cast to enum type if necessary).
     *         Returns 0 or 1 (for false / true) for binary Unicode properties.
     *         Returns a bit-mask for mask properties.
     *         Returns 0 if 'type' is out of bounds or if the Unicode version
     *         does not have data for the property at all, or not for this code
     *         point.
     * @see UProperty
     * @see #hasBinaryProperty
     * @see #getIntPropertyMinValue
     * @see #getIntPropertyMaxValue
     * @see #getUnicodeVersion
     * @stable ICU 2.4
     */
    // for BiDiBase.java
    public static int getIntPropertyValue(int ch, int type) {
        return UCharacterProperty.INSTANCE.getIntPropertyValue(ch, type);
    }

    // private constructor -----------------------------------------------

    /**
     * Private constructor to prevent instantiation
     */
    private UCharacter() { }

    /*
     * Copied from UCharacterEnums.java
     */

    /**
     * Character type Mn
     * @stable ICU 2.1
     */
    public static final byte NON_SPACING_MARK = 6;
    /**
     * Character type Me
     * @stable ICU 2.1
     */
    public static final byte ENCLOSING_MARK = 7;
    /**
     * Character type Mc
     * @stable ICU 2.1
     */
    public static final byte COMBINING_SPACING_MARK = 8;
    /**
     * Character type count
     * @stable ICU 2.1
     */
    public static final byte CHAR_CATEGORY_COUNT = 30;

    /**
     * Directional type R
     * @stable ICU 2.1
     */
    public static final int RIGHT_TO_LEFT = 1;
    /**
     * Directional type AL
     * @stable ICU 2.1
     */
    public static final int RIGHT_TO_LEFT_ARABIC = 13;
}
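
The Javadoc for `digit(int, int)` and `getCodePoint(char, char)` above describes two small rules: a code point is a valid digit only if it is a decimal digit or a Latin letter whose value is below the radix, and a lead/trail surrogate pair combines into one supplementary code point. Because `jdk.internal.icu.lang.UCharacter` is JDK-internal and not normally accessible to application code, the following is only a minimal sketch of those rules built on the public `java.lang.Character` API. The class and method names (`DigitSemanticsSketch`, `digitSketch`, `codePointSketch`) are hypothetical, and the letter handling is simplified to ASCII as an assumption; it is not the implementation used by this file.

```java
// Sketch only: mirrors the documented rules with public java.lang.Character
// calls. All names here are hypothetical and not part of UCharacter.
final class DigitSemanticsSketch {

    // Decimal digit first, then ASCII letters a-z / A-Z as 10..35
    // (assumption: the real code also covers other "European" letter forms).
    public static int digitSketch(int ch, int radix) {
        if (radix < Character.MIN_RADIX || radix > Character.MAX_RADIX) {
            return -1; // invalid radix, same outcome as in digit(int, int)
        }
        int value = Character.digit(ch, 10); // -1 unless ch is a decimal digit
        if (value < 0) {
            if ('a' <= ch && ch <= 'z') {
                value = ch - 'a' + 10;
            } else if ('A' <= ch && ch <= 'Z') {
                value = ch - 'A' + 10;
            }
        }
        // Reject non-digits and values too large for the radix.
        return (value >= 0 && value < radix) ? value : -1;
    }

    // Combines a valid surrogate pair, otherwise throws like getCodePoint.
    public static int codePointSketch(char lead, char trail) {
        if (Character.isHighSurrogate(lead) && Character.isLowSurrogate(trail)) {
            return Character.toCodePoint(lead, trail);
        }
        throw new IllegalArgumentException("Illegal surrogate characters");
    }

    public static void main(String[] args) {
        System.out.println(digitSketch('7', 10));
        System.out.println(digitSketch('f', 16));
        System.out.println(digitSketch('f', 10));
        System.out.println(Integer.toHexString(codePointSketch('\uD83D', '\uDE00')));
    }
}
```

The sketch keeps the same order of checks as `digit(int, int)`: validate the radix, try the decimal digit value, fall back to letters, then compare against the radix.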