Path: blob/master/src/java.base/share/classes/jdk/internal/icu/text/UTF16.java
/*
 * Copyright (c) 2005, 2020, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation. Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */
/**
 *******************************************************************************
 * Copyright (C) 1996-2014, International Business Machines Corporation and
 * others. All Rights Reserved.
 *******************************************************************************
 */

package jdk.internal.icu.text;

import jdk.internal.icu.impl.UCharacterProperty;

/**
 * <p>Standalone utility class providing UTF16 character conversions and
 * indexing conversions.
 * <p>Code that uses strings alone rarely needs modification.
 * By design, UTF-16 does not allow overlap, so searching for strings is a safe
 * operation. Similarly, concatenation is always safe. Substringing is safe if
 * the start and end are both on UTF-32 boundaries.
 * In normal code, the values
 * for start and end are on those boundaries, since they arose from operations
 * like searching. If not, the nearest UTF-32 boundaries can be determined
 * using <code>bounds()</code>.
 * <strong>Examples:</strong>
 * <p>The following examples illustrate use of some of these methods.
 * <pre>{@code
 * // iteration forwards: Original
 * for (int i = 0; i < s.length(); ++i) {
 *     char ch = s.charAt(i);
 *     doSomethingWith(ch);
 * }
 *
 * // iteration forwards: Changes for UTF-32
 * int ch;
 * for (int i = 0; i < s.length(); i += UTF16.getCharCount(ch)) {
 *     ch = UTF16.charAt(s, i);
 *     doSomethingWith(ch);
 * }
 *
 * // iteration backwards: Original
 * for (int i = s.length() - 1; i >= 0; --i) {
 *     char ch = s.charAt(i);
 *     doSomethingWith(ch);
 * }
 *
 * // iteration backwards: Changes for UTF-32
 * int ch;
 * for (int i = s.length() - 1; i >= 0; i -= UTF16.getCharCount(ch)) {
 *     ch = UTF16.charAt(s, i);
 *     doSomethingWith(ch);
 * }
 * }</pre>
 * <strong>Notes:</strong>
 * <ul>
 * <li>
 * <strong>Naming:</strong> For clarity, High and Low surrogates are called
 * <code>Lead</code> and <code>Trail</code> in the API, which gives a better
 * sense of their ordering in a string. <code>offset16</code> and
 * <code>offset32</code> are used to distinguish offsets to UTF-16
 * boundaries vs offsets to UTF-32 boundaries. <code>int char32</code> is
 * used to contain UTF-32 characters, as opposed to <code>char16</code>,
 * which is a UTF-16 code unit.
 * </li>
 * <li>
 * <strong>Roundtripping Offsets:</strong> You can always roundtrip from a
 * UTF-32 offset to a UTF-16 offset and back. Because of the difference in
 * structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and
 * back if and only if <code>bounds(string, offset16) != TRAIL</code>.
 * </li>
 * <li>
 * <strong>Exceptions:</strong> The error checking will throw an exception
 * if indices are out of bounds.
 * Other than that, all methods will
 * behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32
 * values are present. <code>UCharacter.isLegal()</code> can be used to check
 * for validity if desired.
 * </li>
 * <li>
 * <strong>Unmatched Surrogates:</strong> If the string contains unmatched
 * surrogates, then these are counted as one UTF-32 value. This matches
 * their iteration behavior, which is vital. It also matches common display
 * practice as missing glyphs (see the Unicode Standard Section 5.4, 5.5).
 * </li>
 * <li>
 * <strong>Optimization:</strong> The method implementations may need
 * optimization if the compiler doesn't fold static final methods. Since
 * surrogate pairs will form an exceedingly small percentage of all the text
 * in the world, the singleton case should always be optimized for.
 * </li>
 * </ul>
 * @author Mark Davis, with help from Markus Scherer
 * @stable ICU 2.1
 */

public final class UTF16
{
    // public variables ---------------------------------------------------

    /**
     * The lowest Unicode code point value.
     * @stable ICU 2.1
     */
    public static final int CODEPOINT_MIN_VALUE = 0;
    /**
     * The highest Unicode code point value (scalar value) according to the
     * Unicode Standard.
     * @stable ICU 2.1
     */
    public static final int CODEPOINT_MAX_VALUE = 0x10ffff;
    /**
     * The minimum value for Supplementary code points
     * @stable ICU 2.1
     */
    public static final int SUPPLEMENTARY_MIN_VALUE = 0x10000;
    /**
     * Lead surrogate minimum value
     * @stable ICU 2.1
     */
    public static final int LEAD_SURROGATE_MIN_VALUE = 0xD800;
    /**
     * Trail surrogate minimum value
     * @stable ICU 2.1
     */
    public static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;
    /**
     * Lead surrogate maximum value
     * @stable ICU 2.1
     */
    public static final int LEAD_SURROGATE_MAX_VALUE = 0xDBFF;
    /**
     * Trail surrogate maximum value
     * @stable ICU 2.1
     */
    public static final int TRAIL_SURROGATE_MAX_VALUE =
            0xDFFF;
    /**
     * Surrogate minimum value
     * @stable ICU 2.1
     */
    public static final int SURROGATE_MIN_VALUE = LEAD_SURROGATE_MIN_VALUE;
    /**
     * Lead surrogate bitmask
     */
    private static final int LEAD_SURROGATE_BITMASK = 0xFFFFFC00;
    /**
     * Trail surrogate bitmask
     */
    private static final int TRAIL_SURROGATE_BITMASK = 0xFFFFFC00;
    /**
     * Surrogate bitmask
     */
    private static final int SURROGATE_BITMASK = 0xFFFFF800;
    /**
     * Lead surrogate bits
     */
    private static final int LEAD_SURROGATE_BITS = 0xD800;
    /**
     * Trail surrogate bits
     */
    private static final int TRAIL_SURROGATE_BITS = 0xDC00;
    /**
     * Surrogate bits
     */
    private static final int SURROGATE_BITS = 0xD800;

    // constructor --------------------------------------------------------

    // /CLOVER:OFF
    /**
     * Prevent instance from being created.
     */
    private UTF16() {
    }

    // /CLOVER:ON
    // public method ------------------------------------------------------

    /**
     * Extract a single UTF-32 value from a string.
     * Used when iterating forwards or backwards (with
     * <code>UTF16.getCharCount()</code>), as well as random access. If a
     * validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">
     * UCharacter.isLegal()</a></code> on the return value.
     * If the char retrieved is part of a surrogate pair, its supplementary
     * character will be returned. If a complete supplementary character is
     * not found the incomplete character will be returned.
     * @param source array of UTF-16 chars
     * @param offset16 UTF-16 offset to the start of the character.
     * @return UTF-32 value for the UTF-32 value that contains the char at
     *         offset16.
     *         The boundaries of that codepoint are the same as in
     *         <code>bounds32()</code>.
     * @exception IndexOutOfBoundsException thrown if offset16 is out of
     *            bounds.
     * @stable ICU 2.1
     */
    public static int charAt(String source, int offset16) {
        char single = source.charAt(offset16);
        if (single < LEAD_SURROGATE_MIN_VALUE) {
            return single;
        }
        return _charAt(source, offset16, single);
    }

    private static int _charAt(String source, int offset16, char single) {
        if (single > TRAIL_SURROGATE_MAX_VALUE) {
            return single;
        }

        // Convert the UTF-16 surrogate pair if necessary.
        // For simplicity in usage, and because the frequency of pairs is
        // low, look both directions.

        if (single <= LEAD_SURROGATE_MAX_VALUE) {
            ++offset16;
            if (source.length() != offset16) {
                char trail = source.charAt(offset16);
                if (trail >= TRAIL_SURROGATE_MIN_VALUE && trail <= TRAIL_SURROGATE_MAX_VALUE) {
                    return UCharacterProperty.getRawSupplementary(single, trail);
                }
            }
        } else {
            --offset16;
            if (offset16 >= 0) {
                // single is a trail surrogate so
                char lead = source.charAt(offset16);
                if (lead >= LEAD_SURROGATE_MIN_VALUE && lead <= LEAD_SURROGATE_MAX_VALUE) {
                    return UCharacterProperty.getRawSupplementary(lead, single);
                }
            }
        }
        return single; // return unmatched surrogate
    }

    /**
     * Extract a single UTF-32 value from a string.
     * Used when iterating forwards or backwards (with
     * <code>UTF16.getCharCount()</code>), as well as random access. If a
     * validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">UCharacter.isLegal()
     * </a></code> on the return value.
     * If the char retrieved is part of a surrogate pair, its supplementary
     * character will be returned.
     * If a complete supplementary character is
     * not found the incomplete character will be returned.
     * @param source array of UTF-16 chars
     * @param offset16 UTF-16 offset to the start of the character.
     * @return UTF-32 value for the UTF-32 value that contains the char at
     *         offset16. The boundaries of that codepoint are the same as in
     *         <code>bounds32()</code>.
     * @exception IndexOutOfBoundsException thrown if offset16 is out of bounds.
     * @stable ICU 2.1
     */
    public static int charAt(CharSequence source, int offset16) {
        char single = source.charAt(offset16);
        if (single < UTF16.LEAD_SURROGATE_MIN_VALUE) {
            return single;
        }
        return _charAt(source, offset16, single);
    }

    private static int _charAt(CharSequence source, int offset16, char single) {
        if (single > UTF16.TRAIL_SURROGATE_MAX_VALUE) {
            return single;
        }

        // Convert the UTF-16 surrogate pair if necessary.
        // For simplicity in usage, and because the frequency of pairs is
        // low, look both directions.

        if (single <= UTF16.LEAD_SURROGATE_MAX_VALUE) {
            ++offset16;
            if (source.length() != offset16) {
                char trail = source.charAt(offset16);
                if (trail >= UTF16.TRAIL_SURROGATE_MIN_VALUE
                        && trail <= UTF16.TRAIL_SURROGATE_MAX_VALUE) {
                    return UCharacterProperty.getRawSupplementary(single, trail);
                }
            }
        } else {
            --offset16;
            if (offset16 >= 0) {
                // single is a trail surrogate so
                char lead = source.charAt(offset16);
                if (lead >= UTF16.LEAD_SURROGATE_MIN_VALUE
                        && lead <= UTF16.LEAD_SURROGATE_MAX_VALUE) {
                    return UCharacterProperty.getRawSupplementary(lead, single);
                }
            }
        }
        return single; // return unmatched surrogate
    }

    /**
     * Extract a single UTF-32 value from a substring. Used when iterating forwards or backwards
     * (with <code>UTF16.getCharCount()</code>), as well as random access. If a validity check is
     * required, use <code><a href="../lang/UCharacter.html#isLegal(char)">UCharacter.isLegal()
     * </a></code>
     * on the return value.
     * If the char retrieved is part of a surrogate pair, its supplementary
     * character will be returned. If a complete supplementary character is not found the incomplete
     * character will be returned.
     *
     * @param source Array of UTF-16 chars
     * @param start Offset to substring in the source array for analyzing
     * @param limit Offset to substring in the source array for analyzing
     * @param offset16 UTF-16 offset relative to start
     * @return UTF-32 value for the UTF-32 value that contains the char at offset16. The boundaries
     *         of that codepoint are the same as in <code>bounds32()</code>.
     * @exception IndexOutOfBoundsException Thrown if offset16 is not within the range of start and limit.
     * @stable ICU 2.1
     */
    public static int charAt(char source[], int start, int limit, int offset16) {
        offset16 += start;
        if (offset16 < start || offset16 >= limit) {
            throw new ArrayIndexOutOfBoundsException(offset16);
        }

        char single = source[offset16];
        if (!isSurrogate(single)) {
            return single;
        }

        // Convert the UTF-16 surrogate pair if necessary.
        // For simplicity in usage, and because the frequency of pairs is
        // low, look both directions.
        if (single <= LEAD_SURROGATE_MAX_VALUE) {
            offset16++;
            if (offset16 >= limit) {
                return single;
            }
            char trail = source[offset16];
            if (isTrailSurrogate(trail)) {
                return UCharacterProperty.getRawSupplementary(single, trail);
            }
        }
        else { // isTrailSurrogate(single), so
            if (offset16 == start) {
                return single;
            }
            offset16--;
            char lead = source[offset16];
            if (isLeadSurrogate(lead))
                return UCharacterProperty.getRawSupplementary(lead, single);
        }
        return single; // return unmatched surrogate
    }

    /**
     * Determines how many chars this char32 requires.
     * If a validity check is required, use <code>
     * <a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code> on
     * char32 before calling.
     * @param char32 the input codepoint.
     * @return 2 if char32 is in supplementary space,
     *         otherwise 1.
     * @stable ICU 2.1
     */
    public static int getCharCount(int char32)
    {
        if (char32 < SUPPLEMENTARY_MIN_VALUE) {
            return 1;
        }
        return 2;
    }

    /**
     * Determines whether the code value is a surrogate.
     * @param char16 the input character.
     * @return true if the input character is a surrogate.
     * @stable ICU 2.1
     */
    public static boolean isSurrogate(char char16)
    {
        return (char16 & SURROGATE_BITMASK) == SURROGATE_BITS;
    }

    /**
     * Determines whether the character is a trail surrogate.
     * @param char16 the input character.
     * @return true if the input character is a trail surrogate.
     * @stable ICU 2.1
     */
    public static boolean isTrailSurrogate(char char16)
    {
        return (char16 & TRAIL_SURROGATE_BITMASK) == TRAIL_SURROGATE_BITS;
    }

    /**
     * Determines whether the character is a lead surrogate.
     * @param char16 the input character.
     * @return true if the input character is a lead surrogate
     * @stable ICU 2.1
     */
    public static boolean isLeadSurrogate(char char16)
    {
        return (char16 & LEAD_SURROGATE_BITMASK) == LEAD_SURROGATE_BITS;
    }

    /**
     * Returns the lead surrogate.
     * If a validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param char32 the input character.
     * @return lead surrogate if the getCharCount(ch) is 2; <br>
     *         and 0 otherwise (note: 0 is not a valid lead surrogate).
     * @stable ICU 2.1
     */
    public static char getLeadSurrogate(int char32)
    {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char)(LEAD_SURROGATE_OFFSET_ +
                          (char32 >> LEAD_SURROGATE_SHIFT_));
        }

        return 0;
    }

    /**
     * Returns the trail surrogate.
     * If a validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param char32 the input character.
     * @return the trail surrogate if the getCharCount(ch) is 2; <br> otherwise
     *         the
     *         character itself
     * @stable ICU 2.1
     */
    public static char getTrailSurrogate(int char32)
    {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char)(TRAIL_SURROGATE_MIN_VALUE +
                          (char32 & TRAIL_SURROGATE_MASK_));
        }

        return (char) char32;
    }

    /**
     * Convenience method corresponding to String.valueOf(char). Returns a one
     * or two char string containing the UTF-32 value in UTF16 format. If a
     * validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param char32 the input character.
     * @return string value of char32 in UTF16 format
     * @exception IllegalArgumentException thrown if char32 is an invalid
     *            codepoint.
     * @stable ICU 2.1
     */
    public static String valueOf(int char32)
    {
        if (char32 < CODEPOINT_MIN_VALUE || char32 > CODEPOINT_MAX_VALUE) {
            throw new IllegalArgumentException("Illegal codepoint");
        }
        return toString(char32);
    }

    /**
     * Append a single UTF-32 value to the end of a StringBuffer.
     * If a validity check is required, use
     * <code><a href="../lang/UCharacter.html#isLegal(char)">isLegal()</a></code>
     * on char32 before calling.
     * @param target the buffer to append to
     * @param char32 value to append.
     * @return the updated StringBuffer
     * @exception IllegalArgumentException thrown when char32 does not lie
     *            within the range of the Unicode codepoints
     * @stable ICU 2.1
     */
    public static StringBuffer append(StringBuffer target, int char32)
    {
        // Check for irregular values
        if (char32 < CODEPOINT_MIN_VALUE || char32 > CODEPOINT_MAX_VALUE) {
            throw new IllegalArgumentException("Illegal codepoint: " + Integer.toHexString(char32));
        }

        // Write the UTF-16 values
        if (char32 >= SUPPLEMENTARY_MIN_VALUE)
        {
            target.append(getLeadSurrogate(char32));
            target.append(getTrailSurrogate(char32));
        }
        else {
            target.append((char) char32);
        }
        return target;
    }

    /**
     * Shifts offset16 by the argument
     * number of codepoints within a subarray.
     * @param source char array
     * @param start position of the subarray to be performed on
     * @param limit position of the subarray to be performed on
     * @param offset16 UTF16 position to shift relative to start
     * @param shift32 number of codepoints to shift
     * @return new shifted offset16 relative to start
     * @exception IndexOutOfBoundsException if the new offset16 is out of
     *            bounds with respect to the subarray or the subarray bounds
     *            are out of range.
     * @stable ICU 2.1
     */
    public static int moveCodePointOffset(char source[], int start, int limit,
                                          int offset16, int shift32)
    {
        int size = source.length;
        int count;
        char ch;
        int result = offset16 + start;
        if (start < 0 || limit < start) {
            throw new StringIndexOutOfBoundsException(start);
        }
        if (limit > size) {
            throw new StringIndexOutOfBoundsException(limit);
        }
        if (offset16 < 0 || result > limit) {
            throw new StringIndexOutOfBoundsException(offset16);
        }
        if (shift32 > 0) {
            if (shift32 + result > size) {
                throw new StringIndexOutOfBoundsException(result);
            }
            count = shift32;
            while (result < limit && count > 0)
            {
                ch = source[result];
                if (isLeadSurrogate(ch) && (result + 1 < limit) &&
                    isTrailSurrogate(source[result + 1])) {
                    result++;
                }
                count--;
                result++;
            }
        } else {
            if (result + shift32 < start) {
                throw new StringIndexOutOfBoundsException(result);
            }
            for (count = -shift32; count > 0; count--) {
                result--;
                if (result < start) {
                    break;
                }
                ch = source[result];
                if (isTrailSurrogate(ch) && result > start && isLeadSurrogate(source[result - 1])) {
                    result--;
                }
            }
        }
        if (count != 0) {
            throw new StringIndexOutOfBoundsException(shift32);
        }
        result -= start;
        return result;
    }

    // private data members -------------------------------------------------

    /**
     * Shift value for lead surrogate to form a supplementary character.
     */
    private static final int LEAD_SURROGATE_SHIFT_ =
            10;

    /**
     * Mask to retrieve the significant value from a trail surrogate.
     */
    private static final int TRAIL_SURROGATE_MASK_ = 0x3FF;

    /**
     * Value that all lead surrogates start with
     */
    private static final int LEAD_SURROGATE_OFFSET_ =
        LEAD_SURROGATE_MIN_VALUE -
        (SUPPLEMENTARY_MIN_VALUE
         >> LEAD_SURROGATE_SHIFT_);

    // private methods ------------------------------------------------------

    /**
     * <p>Converts argument code point and returns a String object representing
     * the code point's value in UTF16 format.
     * <p>This method does not check for the validity of the codepoint; the
     * results are not guaranteed if an invalid codepoint is passed as
     * argument.
     * <p>The result is a string whose length is 1 for non-supplementary code
     * points, 2 otherwise.
     * @param ch code point
     * @return string representation of the code point
     */
    private static String toString(int ch)
    {
        if (ch < SUPPLEMENTARY_MIN_VALUE) {
            return String.valueOf((char) ch);
        }

        StringBuilder result = new StringBuilder();
        result.append(getLeadSurrogate(ch));
        result.append(getTrailSurrogate(ch));
        return result.toString();
    }
}
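The iteration patterns and surrogate arithmetic described in the class comment can be tried outside the JDK with a small sketch. Because `jdk.internal.icu.text.UTF16` is an internal class that ordinary application code cannot normally import, the sketch below assumes the public `java.lang.String` and `java.lang.Character` methods as stand-ins. The class name `Utf16Sketch` and its method names are hypothetical and not part of this file. The `lead` and `trail` helpers are meant to mirror the shift-and-mask arithmetic of `getLeadSurrogate` and `getTrailSurrogate`, where 0xD7C0 is 0xD800 - (0x10000 >> 10).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; it is not part of UTF16.java.
final class Utf16Sketch {
    private Utf16Sketch() {
    }

    // Walks the string front to back, advancing by one or two chars per code point.
    static List<Integer> forward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);       // public-API stand-in for charAt(source, offset16)
            out.add(cp);
            i += Character.charCount(cp);    // public-API stand-in for getCharCount(char32)
        }
        return out;
    }

    // Walks the string back to front; i is always the index just past the next code point.
    static List<Integer> backward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }

    // Lead surrogate of a supplementary code point (cp >= 0x10000 is assumed).
    static char lead(int cp) {
        return (char) (0xD7C0 + (cp >> 10));
    }

    // Trail surrogate of a supplementary code point (cp >= 0x10000 is assumed).
    static char trail(int cp) {
        return (char) (0xDC00 + (cp & 0x3FF));
    }
}
```

For a string made of `a`, the surrogate pair D83D DE00, and `b`, `forward` yields the three code points 0x61, 0x1F600, 0x62 and `backward` yields them in reverse. An unmatched surrogate comes back as a single value, which matches the "Unmatched Surrogates" note in the class comment.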