ISO/IEC JTC 1/SC22/Java SG

ISO/IEC JTC 1/SC22/Java SG N 3-13

DATE: 1998-10-13
REPLACES: N/A
DOC TYPE: Plain DOS Text
TITLE: Normalization of Unicode string
SOURCE: Akio Kido (expert contribution)
PROJECT: N/A
STATUS: This document is circulated to National Bodies of JTC 1/SC22/Java SG for review and consideration at the October 1998 SC22/Java SG meeting in Tokyo.
ACTION ID: FYI
DUE DATE:
DISTRIBUTION: P and L Members
MEDIUM:
DISKETTE NO.:
NO. OF PAGES: 1

Text of contribution:

Whereas one important application area of Java is the World Wide Web, and in the WWW environment the character encoding of text objects in digital documents should be transparent to the end user, since the character encoding may be converted from one to another during data transfer from server to client, or may vary from one Web page to another; and whereas a CC data element, such as Latin capital letter "A" with acute accent, may be represented by a precomposed character in one encoding but by a combining sequence in another; therefore the Java standard may need a set of functions for accessing the CC data elements in a character string.

ISO/IEC 10646 and some other ISO/IEC coded character standards specify combining characters and composite sequences as well as precomposed forms of characters. For example, the glyph for Latin capital letter A with acute may be represented either by LATIN CAPITAL LETTER A WITH ACUTE (u00C1) or by the sequence LATIN CAPITAL LETTER A (u0041) followed by COMBINING ACUTE ACCENT (u0301).

Given this duplicate representation of a CC data element, the World Wide Web Consortium (W3C) is now working to establish a "Character Model in the Web" and has published a requirements document as a technical report (ref. http://www.w3.org/TR/1998/WD-charreq-19980710 ). In that document, the W3C states a requirement for comparison at the CC data element level, regardless of differences in the character representation of a CC data element.
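
The duplicate representation described above can be observed with the existing Java String API. The following is a minimal sketch only; the class and method names (DuplicateRepresentationDemo, codeUnits) are hypothetical and chosen for illustration. It shows that a plain code-unit comparison treats the precomposed character and the composite sequence as different strings.

```java
// Illustrative sketch; class and method names are hypothetical.
final class DuplicateRepresentationDemo {
    // Precomposed form: LATIN CAPITAL LETTER A WITH ACUTE.
    static final String PRECOMPOSED = "\u00C1";
    // Composite sequence: LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT.
    static final String COMPOSITE = "A\u0301";

    /** Formats each UTF-16 code unit of s as a 'u' plus four hex digits, space separated. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("u%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(PRECOMPOSED));        // prints u00C1
        System.out.println(codeUnits(COMPOSITE));          // prints u0041 u0301
        // String.equals compares code units, so the two forms are not equal.
        System.out.println(PRECOMPOSED.equals(COMPOSITE)); // prints false
    }
}
```

An application that needs the two forms to compare as equal therefore has to do extra work on top of String.equals.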
To achieve comparison at the CC data element level, normalization of ISO/IEC 10646 (Unicode) strings may be required as a preprocessing step before the comparison. The Unicode Consortium is also working on a normalization technique for Unicode strings (ref. http://www.unicode.org/unicode/reports/tr15/ ).

The question is whether such CC data element level access should be done at the application layer, using the existing Java API with Unicode values hardcoded in the application, or whether a set of new Java APIs that enables access to the CC data elements in a Java string object should be provided as a standard API of the ISO Java standard. In the latter case, the following two APIs may be required.

- Convert an un-normalized string object to a normalized one
- Detect CC data element boundaries in a string object
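
As an illustration of what the two functions listed above might look like, the following sketch uses two classes from the Java class library, java.text.Normalizer and java.text.BreakIterator. It is not a proposed API. The class and method names (CcDataElementSketch, normalize, sameCcDataElements, boundaries) are hypothetical, and two assumptions are made that this contribution does not specify: that the normalized form is the composed form NFC, and that the character boundaries reported by BreakIterator are an adequate stand-in for CC data element boundaries.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch; class and method names are hypothetical.
final class CcDataElementSketch {

    /** Converts a possibly un-normalized string to a normalized one (assumption: NFC). */
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compares two strings after normalizing both, so representation differences are ignored. */
    static boolean sameCcDataElements(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    /**
     * Returns the offsets (in UTF-16 code units) of the character boundaries in s,
     * including the start and end of the string.
     * Assumption: BreakIterator character boundaries approximate CC data element boundaries.
     */
    static List<Integer> boundaries(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // Precomposed and composite forms compare equal after normalization.
        System.out.println(sameCcDataElements("\u00C1", "A\u0301")); // prints true
        // The base letter and its combining accent are reported as one unit.
        System.out.println(boundaries("A\u0301B"));                  // prints [0, 2, 3]
    }
}
```

With functions of this kind, an application would not need to hardcode Unicode values in order to compare or step through CC data elements.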