In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.
Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as rules for normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts). [1]
The Unicode Consortium, the non-profit organization that coordinates Unicode's development, has the ambitious goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes, as many of the existing schemes are limited in size and scope and are incompatible with multilingual environments.
Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including XML, the Java programming language, the Microsoft .NET Framework and modern operating systems.
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses 1 byte for each ASCII character, with the same code values as in the standard ASCII encoding, and up to 4 bytes for other characters), the now-obsolete UCS-2 (which uses 2 bytes for all characters, but does not include every character in the Unicode standard), and UTF-16 (which extends UCS-2, using 4 bytes to encode characters missing from UCS-2).
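The byte counts described above can be checked with a short Python sketch. The sample characters and the helper name `encoded_lengths` are illustrative choices, not part of the standard; Python ships no separate UCS-2 codec, so that encoding is omitted, and UTF-32 is included only for comparison.

```python
# Illustrative sample characters (chosen for this sketch only).
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "Latin-1 letter e with acute",
    "\u20ac": "euro sign (BMP, outside Latin-1)",
    "\U0001D11E": "musical symbol G clef (outside the BMP)",
}

def encoded_lengths(ch):
    """Return the number of bytes one character occupies in each encoding.

    The explicit little-endian codecs are used so that no byte-order mark
    is prepended and only the character itself is counted.
    """
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for ch, description in SAMPLES.items():
    print(f"U+{ord(ch):04X} {description}: {encoded_lengths(ch)}")
```

ASCII characters take one byte in UTF-8, BMP characters take two bytes in UTF-16, and the character outside the BMP takes four bytes in all three encodings.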
Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Roman characters and the local script) but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).
Unicode, in intent, encodes the underlying characters (graphemes and grapheme-like units) rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).
In text processing, Unicode takes the role of providing a unique code point (a number, not a glyph) for each character. In other words, Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font or style) to other software, such as a web browser or word processor. This simple aim becomes complicated, however, by concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.
The first 256 code points were made identical to the content of ISO 8859-1 so as to make it trivial to convert existing western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs rather than at half the width. For other examples, see Duplicate characters in Unicode.
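Both concessions can be observed from Python's standard library. This is a minimal sketch: it assumes Python's `latin-1` codec as a stand-in for ISO 8859-1 (the codec also maps the control-code byte values, following the IANA definition), and it uses NFKC compatibility normalization to show that a fullwidth letter is a distinct code point that folds back to the ordinary Latin letter.

```python
import unicodedata

# Decode every possible byte value as Latin-1 and compare each resulting
# code point with the byte it came from.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
same_values = all(ord(ch) == byte for ch, byte in zip(decoded, latin1_bytes))
print(same_values)

# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A is encoded separately from U+0041,
# but compatibility normalization (NFKC) maps it to the ordinary letter.
fullwidth_a = "\uFF21"
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(unicodedata.name(fullwidth_a), "->", unicodedata.name(folded))
```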
Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). [2] It is normal to reference a Unicode code point by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g. U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g. U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD). Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits in order to indicate a code unit, not a code point. The Unicode codespace is divided into seventeen planes, each comprising 65,536 code points (256 rows of 256 code points).
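The notation and the plane arithmetic can be sketched in a few lines of Python. The function names `u_plus` and `plane` are hypothetical helpers introduced for this illustration.

```python
MAX_CODE_POINT = 0x10FFFF
PLANE_SIZE = 0x10000  # 65,536 code points per plane

def u_plus(code_point):
    """Format a code point in U+ notation: at least four hex digits,
    five or six when the value needs them."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"

def plane(code_point):
    """Return the plane number (0 is the BMP, 1 to 16 are supplementary)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode codespace")
    return code_point // PLANE_SIZE

for cp in (0x0058, 0xE0001, 0x10FFFD):
    print(u_plus(cp), "is in plane", plane(cp))

# Seventeen planes of 65,536 code points give the full codespace.
print(17 * PLANE_SIZE == MAX_CODE_POINT + 1 == 1_114_112)
```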
All code points in the BMP are accessed as a single code unit in the UTF-16 encoding, whereas the code points in Planes 1 through 16 (supplementary planes, or, informally, astral planes) are accessed as surrogate pairs in UTF-16.
Within each plane, characters are allocated in named blocks of related characters. Although blocks are of arbitrary size, they are always a multiple of 16 code points, and often a multiple of 128 code points. Characters required for a given script may be spread out over several different blocks.
The following categories of code points are defined:
Surrogates: code points in the range U+D800..U+DBFF (1,024 code points) are known as high-surrogate code points, and code points in the range U+DC00..U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point (also known as a leading surrogate) followed by a low-surrogate code point (also known as a trailing surrogate) together form a surrogate pair that represents a code point outside the Basic Multilingual Plane in the UTF-16 encoding form. High and low surrogate code points are not valid by themselves, and are only valid as surrogate pairs in UTF-16 encoded texts. Thus the range of code points that are available for use as characters is U+0000..U+D7FF and U+E000..U+10FFFF (1,112,064 code points). The hexadecimal value of these code points (i.e. excluding surrogates) is sometimes referred to as the character's scalar value.
Noncharacters are code points that are guaranteed never to be used for encoding characters, although applications may make use of these code points internally if they wish. There are sixty-six noncharacters: U+FDD0..U+FDEF and any code point ending in the value FFFE or FFFF (i.e. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF). The set of noncharacters is stable, and no new noncharacters will ever be defined. [3]
Reserved code points are those code points which are available for use as encoded characters, but are not yet defined as characters by Unicode.
Private use characters are set aside for private use. The semantics of these characters are not defined by Unicode, and so any interchange of such characters requires an agreement between sender and receiver on their interpretation.
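The surrogate and noncharacter rules above can be written out as a short Python sketch. The function names are hypothetical; the arithmetic is the standard UTF-16 mapping, in which 0x10000 is subtracted from the code point and the remaining 20 bits are split into two 10-bit halves.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def is_noncharacter(code_point):
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE

high, low = to_surrogates(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")

# Count the noncharacters and the non-surrogate code points in the codespace.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
scalar_value_count = 0x110000 - (0xDFFF - 0xD800 + 1)
print(noncharacter_count, scalar_value_count)
```

The two counts agree with the figures quoted in the text: 66 noncharacters and 1,112,064 code points available for use as characters.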
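There are three private use areas in the Unicode codespace: the Private Use Area in the BMP (U+E000..U+F8FF, 6,400 code points), Supplementary Private Use Area-A (U+F0000..U+FFFFD, 65,534 code points) and Supplementary Private Use Area-B (U+100000..U+10FFFD, 65,534 code points).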
Graphic characters are characters defined by Unicode to have a particular semantic, and either have a visible glyph shape or represent a visible space. As of Unicode 5.1 there are 100,507 graphic characters.
Format characters are characters that do not have a visible appearance, but may have an effect on the appearance or behaviour of neighbouring characters. For example, U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER may be used to change the default shaping behaviour of adjacent characters (e.g. to inhibit ligatures or request ligature formation). There are 141 format characters in Unicode 5.1.
Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are reserved as control codes, and correspond to the C0 and C1 control codes defined in ISO/IEC 6429. Of these, U+0009 (Tab), U+000A (Line Feed) and U+000D (Carriage Return) are widely used in Unicode-encoded texts.
Graphic characters, format characters, control code characters and private use characters are collectively known as assigned characters.
The set of graphic and format characters defined by Unicode does not correspond directly to the repertoire of abstract characters that is representable under Unicode. Unicode encodes characters by associating an abstract character with a particular code point. [4] However, not all abstract characters are encoded as a single Unicode character, and some abstract characters may be represented in Unicode by a sequence of two or more characters.
For example, Latin Small Letter I With Ogonek And Dot Above And Acute, which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301. Unicode maintains a list of uniquely named character sequences for abstract characters that are not directly encoded in Unicode. [5] All graphic, format and private use characters have a unique and immutable name by which they may be identified. Although a Unicode character name may not be changed under any circumstances (historically this was not the case), in cases where the name is seriously defective and misleading or has a serious typographical error, a formal alias may be defined, and applications are encouraged to use the formal alias in place of the official character name. For example, U+A015 YI SYLLABLE WU has the formal alias YI SYLLABLE ITERATION MARK, and U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET has the formal alias PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET. [6]
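The character sequence and the name alias mentioned above can be inspected with Python's `unicodedata` module. This is a sketch that assumes Python 3.3 or later, where `unicodedata.lookup` also resolves formal aliases; the variable names are illustrative.

```python
import unicodedata

# The Lithuanian letter is a sequence of three code points, not one character.
sequence = "\u012F\u0307\u0301"
component_names = [unicodedata.name(ch) for ch in sequence]
for ch, name in zip(sequence, component_names):
    print(f"U+{ord(ch):04X} {name}")

# The official (immutable) name of U+FE18 keeps its original misspelling,
# while the corrected spelling is accepted as a formal alias on lookup.
official_name = unicodedata.name("\uFE18")
alias_target = unicodedata.lookup(
    "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"
)
print(official_name)
print(f"alias resolves to U+{ord(alias_target):04X}")
```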
The Unicode Consortium, based in California, develops the Unicode standard. There are various levels of membership, and any company or individual willing to pay the membership dues may join this organization. Full members include most of the main computer software and hardware companies with any interest in text-processing standards, including Adobe Systems, Apple, Google, HP, IBM, Microsoft, Sun Microsystems and Yahoo.
The Consortium first published The Unicode Standard (ISBN 0-321-18578-1) in 1991, and continues to develop standards based on that original work. The latest major version of the standard, Unicode 5.0 (ISBN 0-321-48091-0), was published in 2007. The data files for the most recent minor version, Unicode 5.1, are available from the consortium's web site.
Unicode is developed in conjunction with the International Organization for Standardization and shares the character repertoire with ISO/IEC 10646: the Universal Character Set. Unicode and ISO/IEC 10646 function equivalently as character encodings, but The Unicode Standard contains much more information for implementers, covering in depth topics such as bitwise encoding, collation and rendering. The Unicode Standard enumerates a multitude of character properties, including those needed for supporting bidirectional text. The two standards do use slightly different terminology.
Thus far the following major and minor versions of the Unicode standard have been published (update versions, which do not include any changes to character repertoire, are omitted). [7]
| Version | Date | Book | Corresponding ISO/IEC 10646 Edition | Scripts | Characters |
|---|---|---|---|---|---|
| 1.0.0 | October 1991 | ISBN 0-201-56788-1 (Vol. 1) | | 24 | 7,161 |
| 1.0.1 | June 1992 | ISBN 0-201-60845-6 (Vol. 2) | | 25 | 28,359 |
| 1.1 | June 1993 | | ISO/IEC 10646-1:1993 | 24 | 34,233 |
| 2.0 | July 1996 | ISBN 0-201-48345-9 | ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7 | 25 | 38,950 |
| 2.1 | May 1998 | | ISO/IEC 10646-1:1993 plus Amendments 5, 6 and 7, and two characters from Amendment 18 | 25 | 38,952 |
| 3.0 | September 1999 | ISBN 0-201-61633-5 | ISO/IEC 10646-1:2000 | 38 | 49,259 |
| 3.1 | March 2001 | | ISO/IEC 10646-1:2000; ISO/IEC 10646-2:2001 | 41 | 94,205 |
| 3.2 | March 2002 | | ISO/IEC 10646-1:2000 plus Amendment 1; ISO/IEC 10646-2:2001 | 45 | 95,221 |
| 4.0 | April 2003 | ISBN 0-321-18578-1 | ISO/IEC 10646:2003 | 52 | 96,447 |
| 4.1 | March 2005 | | ISO/IEC 10646:2003 plus Amendment 1 | 59 | 97,720 |
| 5.0 | July 2006 | ISBN 0-321-48091-0 | ISO/IEC 10646:2003 plus Amendments 1 and 2, and four characters from Amendment 3 | 64 | 99,089 |
| 5.1 | April 2008 | | ISO/IEC 10646:2003 plus Amendments 1, 2, 3 and 4 | 75 | 100,713 |
Unicode 5.2, corresponding to ISO/IEC 10646:2003 plus Amendments 1-6, is tentatively scheduled for release in Summer 2009. [8]
Unicode covers almost all scripts (writing systems) in current use today. [9]
Although 75 writing systems (alphabets, syllabaries, and others) are included in the latest version of Unicode, more remain to be encoded, particularly some used in historical, liturgical and academic settings. Further additions of characters to the already-encoded scripts, as well as symbols, in particular for mathematics and music (in the form of notes and rhythmic symbols), also occur. The Unicode Roadmap Committee (Michael Everson, Rick McGowan, and Ken Whistler) maintains the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap page of the Unicode Consortium Web site. For some scripts on the Roadmap, encoding proposals have been made and are working their way through the approval process. For others, no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Among the scripts currently scheduled for encoding in the next version of Unicode are Avestan, Egyptian Hieroglyphics, Tai Tham, Tai Viet, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Nü Shu, Old South Arabian, Old Turkic, Samaritan and Tangut. [10]
Other scripts for which an encoding proposal is anticipated to be submitted in the near future include Classical Yi, Old Uyghur and Oracle Bone Script. However, there are a number of writing systems, such as Mayan, Rongorongo and Linear A, which are not currently being considered for encoding.
Modern invented scripts, most of which do not qualify for inclusion in Unicode due to lack of real-world usage, are listed in the ConScript Unicode Registry, along with unofficial but widely used Private Use Area code assignments.
Several mechanisms have been specified for implementing Unicode; which one implementers choose depends on available storage space, source code compatibility, and interoperability with other systems.
Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character Set (UCS) encodings. An encoding maps (possibly a subset of) the range of Unicode code points to sequences of values in some fixed-size range, termed code values. The numbers in the names of the encodings indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings. UCS-2 is an obsolete subset of UTF-16; UCS-4 and UTF-32 are functionally equivalent.
UTF encodings include:
UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginnings of text files, which may be used for byte ordering detection (or byte endianness detection). Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In this case it attempts to mark the file as containing Unicode text. The BOM, code point U+FEFF, has the important property of unambiguity on byte reorder, regardless of the Unicode encoding used; U+FFFE (the result of byte-swapping U+FEFF) does not equate to a legal character, and U+FEFF in places other than the beginning of text conveys the zero-width no-break space (a character with no appearance and no effect other than preventing the formation of ligatures). Also, the byte values FE and FF never appear in UTF-8. Converted to UTF-8, the BOM becomes the byte sequence EF BB BF.
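The BOM byte patterns described above can be reproduced with Python's `codecs` constants. The `sniff_bom` helper is a hypothetical illustration of BOM-based detection, not a complete encoding detector: it is a heuristic, it also checks the UTF-32 signatures (an addition beyond the text above), and it does so first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

BOM = "\uFEFF"

# Longer signatures come first so a UTF-32-LE BOM is not mistaken for UTF-16-LE.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return (encoding, payload without BOM), or (None, data) if no BOM is found."""
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            return encoding, data[len(signature):]
    return None, data

print(BOM.encode("utf-16-be").hex(" "))   # byte order as written: fe ff
print(BOM.encode("utf-16-le").hex(" "))   # byte-swapped: ff fe
print(BOM.encode("utf-8").hex(" "))       # ef bb bf

encoding, payload = sniff_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
print(encoding, payload.decode(encoding))
```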
In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code value actually manifests as an octet sequence). In the other cases, each code point may be represented by a variable number of code values. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since every Unix operating system which uses the gcc compilers to generate software uses it as the standard "wide character" encoding. Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for unicode strings, effectively disseminating such encoding in high-level coded software.
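The effect of endianness on the octet sequence can be seen in a small Python sketch. The sample character is an arbitrary choice; the explicit `utf-32-be` and `utf-32-le` codecs are used so that no byte-order mark is emitted.

```python
ch = "\U0001D11E"  # sample character outside the BMP

big_endian = ch.encode("utf-32-be")
little_endian = ch.encode("utf-32-le")

# The same 32-bit code value appears as two different octet sequences.
print(big_endian.hex(" "))      # 00 01 d1 1e
print(little_endian.hex(" "))   # 1e d1 01 00

# Read with the matching byte order, either sequence is the code point itself.
print(int.from_bytes(big_endian, "big") == ord(ch))
print(int.from_bytes(little_endian, "little") == ord(ch))
```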
Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, a system enabling the use of Internationalized Domain Names in all scripts that are supported by Unicode. Earlier and now historical proposals include UTF-5 and UTF-6.
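Python's standard library ships `punycode` and `idna` codecs, which make the transformation easy to observe. This is a sketch only: the label "bücher" and the domain `example` are arbitrary samples, and the built-in `idna` codec implements the older IDNA 2003 rules.

```python
label = "b\u00fccher"  # "bücher", an arbitrary non-ASCII label

# Raw Punycode: the ASCII letters first, then a delimiter and the encoded rest.
puny = label.encode("punycode")

# IDNA applies Punycode per label and adds the "xn--" prefix.
domain = "b\u00fccher.example".encode("idna")

print(puny)    # b'bcher-kva'
print(domain)  # b'xn--bcher-kva.example'

# Both transformations are reversible.
restored_label = puny.decode("punycode")
restored_domain = domain.decode("idna")
```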
GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC). BOCU-1 and SCSU are Unicode compression schemes. The April Fools' Day RFC of 2005 specified two parody UTF encodings, UTF-9 and UTF-18.
Unicode includes a mechanism for modifying character shape and so greatly extending the supported glyph repertoire. This covers the use of combining diacritical marks. They are inserted after the main character (several combining diacritics can be stacked over the same character). Unicode also contains precomposed versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler and allow applications to use Unicode as an internal text format without having to implement combining characters. For example, é can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute), but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute). So in many cases, users have several ways of encoding the same character. To deal with this, Unicode provides the mechanism of canonical equivalence.
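Canonical equivalence can be checked in practice by normalizing both strings to the same form before comparing them. The sketch below uses Python's `unicodedata` module; the helper name `canonically_equal` is an assumption made for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00e9"   # U+00E9 (Latin small letter e with acute)


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(decomposed == precomposed)                   # False: different code points
print(canonically_equal(decomposed, precomposed))  # True: same abstract character

# NFC composes, NFD decomposes; each maps one spelling onto the other.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)
```

A plain string comparison sees two different sequences; only the normalized comparison treats them as the same text.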
An example of this arises with hangul, the Korean alphabet. Unicode provides the mechanism for composing hangul syllables with their individual subcomponents, known as hangul Jamo. However, it also provides all 11,172 combinations of precomposed hangul syllables.
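The figure of 11,172 follows from the structure of the precomposed block: 19 leading consonants, 21 vowels and 28 trailing options (27 consonants plus "no trailing consonant"). The sketch below checks the arithmetic and decomposes one syllable into its jamo; the sample syllable is an arbitrary choice.

```python
import unicodedata

LEAD_COUNT, VOWEL_COUNT, TAIL_COUNT = 19, 21, 28  # 28 includes "no tail"

syllable_count = LEAD_COUNT * VOWEL_COUNT * TAIL_COUNT
block_size = 0xD7A3 - 0xAC00 + 1  # precomposed syllables occupy U+AC00..U+D7A3

syllable = "\ud55c"  # U+D55C, the syllable "han"
jamo = unicodedata.normalize("NFD", syllable)
jamo_code_points = [f"U+{ord(c):04X}" for c in jamo]

# Composing the jamo sequence again yields the precomposed syllable.
recomposed = unicodedata.normalize("NFC", jamo)

print(syllable_count, block_size)  # 11172 11172
print(jamo_code_points)            # ['U+1112', 'U+1161', 'U+11AB']
```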
The CJK (Chinese, Japanese, and Korean) ideographs currently have codes only for their precomposed form. Still, most of those ideographs comprise simpler elements (often called radicals in English), so in principle Unicode could have decomposed them, just as with hangul. This would have greatly reduced the number of required code points, while allowing the display of virtually every conceivable ideograph (which might do away with some of the problems caused by Han unification). A similar idea underlies some input methods, such as Cangjie and Wubi. However, attempts to do this for character encoding have stumbled over the fact that ideographs do not actually decompose as simply or as regularly as it seems they should.
A set of radicals was provided in Unicode 3.0 (CJK radicals between U+2E80 and U+2EFF, KangXi radicals in U+2F00 to U+2FDF, and ideographic description characters from U+2FF0 to U+2FFB), but the Unicode standard (ch. 11.1 of Unicode 4.1) warns against using ideographic description sequences as an alternate representation for previously encoded characters:
This process is different from a formal encoding of an ideograph. There is no canonical description of unencoded ideographs; there is no semantic assigned to described ideographs; there is no equivalence defined for described ideographs. Conceptually, ideograph descriptions are more akin to the English phrase, “an ‘e’ with an acute accent on it,” than to the character sequence <U+006E, U+0301> [sic; 'e' should be U+0065].
Many scripts, including Arabic and Devanagari, have special orthographic rules which require that certain combinations of letterforms be combined into special ligature forms. The rules governing ligature formation can be quite complex, requiring special script-shaping technologies such as ACE (the Arabic Calligraphic Engine, developed by DecoType in the 1980s and used to generate all the Arabic examples in the printed editions of the Unicode Standard), which became the proof of concept for OpenType (by Adobe and Microsoft), Graphite (by SIL International), or AAT (by Apple). Instructions are also embedded in fonts to tell the operating system how to properly output different character sequences. A simple solution to the placement of combining marks or diacritics is assigning the marks a width of zero and placing the glyph itself to the left or right of the left sidebearing (depending on the direction of the script they are intended to be used with). A mark handled this way will appear over whatever character precedes it, but will not adjust its position relative to the width or height of the base glyph; it may be visually awkward and it may overlap some glyphs.
Real stacking is impossible, but can be approximated in limited cases (for example, Thai top-combining vowels and tone marks can just be at different heights to start with). Generally this approach is only effective in monospaced fonts but can also be used as a fallback rendering method when more complex methods fail.
As of 2004, most software still cannot reliably handle many features not supported by older font formats, so combining characters generally will not work correctly. For example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically, both appearing as an e with a macron and acute accent, but in practice their appearance can vary greatly across software applications. Similarly, underdots, as needed in the romanization of Indic languages, will often be placed incorrectly. As a workaround, Unicode characters that map to precomposed glyphs can be used for many such characters. The need for such alternatives stems from the limitations of fonts and rendering technology, not from weaknesses of Unicode itself.
Several subsets of Unicode are standardized: Microsoft Windows since Windows NT 4.0 supports WGL-4 with 652 characters, which is considered to support all contemporary European languages using the Latin, Greek or Cyrillic script. Other standardized subsets of Unicode include the Multilingual European Subsets:[11] MES-1 (Latin scripts only, 335 characters), MES-2 (Latin, Greek and Cyrillic, 1062 characters)[12] and MES-3A & MES-3B (two larger subsets, not shown here). Note that MES-2 includes every character in MES-1 and WGL-4.
WGL-4, MES-1 and MES-2

| Row | Cells | Range(s) |
|---|---|---|
| 00 | 20–7E | Basic Latin (00–7F) |
| 00 | A0–FF | Latin-1 Supplement (80–FF) |
| 01 | 00–13, 14–15, 16–2B, 2C–2D, 2E–4D, 4E–4F, 50–7E, 7F | Latin Extended-A (00–7F) |
| 01 | 8F, 92, B7, DE–EF, FA–FF | Latin Extended-B (80–FF …) |
| 02 | 18–1B, 1E–1F | Latin Extended-B (… 00–4F) |
| 02 | 59, 7C, 92 | IPA Extensions (50–AF) |
| 02 | BB–BD, C6, C7, C9, D6, D8–DB, DC, DD, DF, EE | Spacing Modifier Letters (B0–FF) |
| 03 | 74–75, 7A, 7E, 84–8A, 8C, 8E–A1, A3–CE, D7, DA–E1 | Greek (70–FF) |
| 04 | 00, 01–0C, 0D, 0E–4F, 50, 51–5C, 5D, 5E–5F, 90–91, 92–C4, C7–C8, CB–CC, D0–EB, EE–F5, F8–F9 | Cyrillic (00–FF) |
| 1E | 02–03, 0A–0B, 1E–1F, 40–41, 56–57, 60–61, 6A–6B, 80–85, 9B, F2–F3 | Latin Extended Additional (00–FF) |
| 1F | 00–15, 18–1D, 20–45, 48–4D, 50–57, 59, 5B, 5D, 5F–7D, 80–B4, B6–C4, C6–D3, D6–DB, DD–EF, F2–F4, F6–FE | Greek Extended (00–FF) |
| 20 | 13–14, 15, 17, 18–19, 1A–1B, 1C–1D, 1E, 20–22, 26, 30, 32–33, 39–3A, 3C, 3E | General Punctuation (00–6F) |
| 20 | 44, 4A, 7F, 82 | Superscripts and Subscripts (70–9F) |
| 20 | A3–A4, A7, AC, AF | Currency Symbols (A0–CF) |
| 21 | 05, 13, 16, 22, 26, 2E | Letterlike Symbols (00–4F) |
| 21 | 5B–5E | Number Forms (50–8F) |
| 21 | 90–93, 94–95, A8 | Arrows (90–FF) |
| 22 | 00, 02, 03, 06, 08–09, 0F, 11–12, 15, 19–1A, 1E–1F, 27–28, 29, 2A, 2B, 48, 59, 60–61, 64–65, 82–83, 95, 97 | Mathematical Operators (00–FF) |
| 23 | 02, 0A, 20–21, 29–2A | Miscellaneous Technical (00–FF) |
| 25 | 00, 02, 0C, 10, 14, 18, 1C, 24, 2C, 34, 3C, 50–6C | Box Drawing (00–7F) |
| 25 | 80, 84, 88, 8C, 90–93 | Block Elements (80–9F) |
| 25 | A0–A1, AA–AC, B2, BA, BC, C4, CA–CB, CF, D8–D9, E6 | Geometric Shapes (A0–FF) |
| 26 | 3A–3C, 40, 42, 60, 63, 65–66, 6A, 6B | Miscellaneous Symbols (00–FF) |
| F0 | (01–02) | Private Use Area (00–FF …) |
| FB | 01–02 | Alphabetic Presentation Forms (00–4F) |
| FF | FD | Specials |
Rendering software which cannot process a Unicode character appropriately most often displays it as only an open rectangle, or as the Unicode “replacement character” (U+FFFD, �), to indicate the position of the unrecognized character. Some systems have made attempts to provide more information about such characters.
The Apple LastResort font will display a substitute glyph indicating the Unicode range of the character and the SIL Unicode fallback font will display a box showing the hexadecimal scalar value of the character.
Unicode has become the dominant scheme for the internal processing of text and sometimes for its storage (though a lot of text is still stored in legacy encodings). Early adopters tended to use UCS-2 and later moved to UTF-16 (as this was the least disruptive way to add support for non-BMP characters). The best known such system is Windows NT (and its descendants, Windows 2000, Windows XP and Windows Vista), which uses Unicode as the sole internal character encoding. The Java and .NET bytecode environments, Mac OS X, and KDE also use it for internal representation.
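The move from UCS-2 to UTF-16 was possible because UTF-16 represents each non-BMP code point as a pair of 16-bit surrogate code units, leaving BMP text unchanged. The arithmetic can be sketched as follows; the function name `to_surrogates` is an assumption made for illustration.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1D11E)  # U+1D11E, a musical symbol outside the BMP
print(f"{high:04X} {low:04X}")      # D834 DD1E

# Python's own UTF-16 encoder produces the same two code units.
encoded = "\U0001d11e".encode("utf-16-be")
```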
UTF-8 (originally developed for Plan 9) has become the main storage encoding on most Unix-like operating systems (though others are also used by some libraries) because it is a relatively easy replacement for traditional extended ASCII character sets.
Multilingual text-rendering engines which use Unicode include Uniscribe for Microsoft Windows, ATSUI for Mac OS X and Pango, a free software engine used by GTK+ (and hence the GNOME desktop).
Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.
ISO 14755[13], which standardises methods for entering Unicode characters from their code points, specifies several methods. There is the basic method, where a beginning sequence is followed by the hexadecimal representation of the code point and the ending sequence. There is also a screen-selection entry method, where the characters are listed in a table on screen, such as with a character map program.
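The core of the basic method, turning typed hexadecimal digits into a character, can be sketched in a few lines. The beginning and ending sequences are platform-specific and are not modelled here; the function name `char_from_hex` is hypothetical.

```python
def char_from_hex(hex_digits: str) -> str:
    """Return the character whose code point is given in hexadecimal."""
    return chr(int(hex_digits, 16))


print(char_from_hex("00E9"))  # Latin small letter e with acute
print(char_from_hex("20AC"))  # euro sign
```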
MIME defines two different mechanisms for encoding non-ASCII characters in e-mail, depending on whether the characters are in e-mail headers such as "Subject:" or in the text body of the message. In both cases, the original character set is identified as well as a transfer encoding. For e-mail transmission of Unicode, the UTF-8 character set and the Base64 transfer encoding are recommended. The details of the two different mechanisms are specified in the MIME standards and are generally hidden from users of e-mail software.
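Both mechanisms can be observed with Python's standard `email` package. This is a sketch under the assumption that the library defaults are acceptable; the sample text is arbitrary, and the exact header encoding chosen by the library may vary.

```python
from email.header import Header, decode_header
from email.mime.text import MIMEText

text = "Gr\u00fc\u00dfe"  # "Grüße", an arbitrary non-ASCII sample

# Header mechanism: an encoded-word that names the charset and the encoding.
encoded_subject = Header(text, "utf-8").encode()

# Body mechanism: charset in Content-Type plus a Content-Transfer-Encoding.
message = MIMEText(text, "plain", "utf-8")
message["Subject"] = encoded_subject

print(encoded_subject)
print(message["Content-Type"])
print(message["Content-Transfer-Encoding"])

# Decoding recovers the original text from both places.
raw_subject, subject_charset = decode_header(encoded_subject)[0]
decoded_subject = raw_subject.decode(subject_charset)
decoded_body = message.get_payload(decode=True).decode("utf-8")
```

In both places only ASCII bytes travel over the wire, and the declared character set tells the receiver how to reconstruct the text.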
The adoption of Unicode in e-mail has been very slow. Some East Asian text is still encoded in encodings such as ISO-2022, and some devices, such as cell phones, still cannot handle Unicode data correctly. Support has been improving, however. Many major free mail providers, such as Google (Gmail) and Microsoft (Hotmail), support it. The notable exception is Yahoo.
All W3C recommendations have used Unicode as their document character set since HTML 4.0. Web browsers have supported Unicode, especially UTF-8, for many years. Display problems result primarily from font-related issues; in particular, versions of Microsoft Internet Explorer up to version 6 do not render many code points unless explicitly told to use a font that contains them.[14]
Although syntax rules may affect the order in which characters are allowed to appear, both HTML 4 and XML (including XHTML) documents, by definition, comprise characters from most of the Unicode code points, with the exception of:
These characters manifest either directly as bytes according to the document's encoding, if the encoding supports them, or users may write them as numeric character references based on the character's Unicode code point. For example, the references `&#916;`, `&#1049;`, `&#1511;`, `&#1605;`, `&#3671;`, `&#12354;`, `&#21494;`, `&#33865;`, and `&#47568;` (or the same numeric values expressed in hexadecimal, with `&#x` as the prefix) display in browsers as Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.
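The relationship between a character and its numeric character reference is simple arithmetic on the code point. The sketch below builds decimal and hexadecimal references and resolves them again with Python's `html.unescape`; the helper names are assumptions made for illustration.

```python
import html


def decimal_reference(ch: str) -> str:
    """Build the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def hex_reference(ch: str) -> str:
    """Build the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


delta = "\u0394"  # Greek capital letter delta
print(decimal_reference(delta))  # &#916;
print(hex_reference(delta))      # &#x394;

# Resolving either reference yields the same character.
from_decimal = html.unescape("&#916;")
from_hex = html.unescape("&#x394;")
```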
When specifying URIs, for example as URLs in HTTP requests, non-ASCII characters must be percent-encoded.
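Percent-encoding replaces each byte of a character's encoded form with `%` followed by two hexadecimal digits. The sketch below uses Python's `urllib.parse`, which assumes UTF-8 by default; the sample path is hypothetical.

```python
from urllib.parse import quote, unquote

path = "/wiki/caf\u00e9"  # hypothetical path containing "é"

# "é" is C3 A9 in UTF-8, so it becomes %C3%A9; "/" is left alone by default.
encoded_path = quote(path)
print(encoded_path)  # /wiki/caf%C3%A9

# Decoding reverses the transformation.
decoded_path = unquote(encoded_path)
```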
Free and retail fonts based on Unicode are commonly available, since TrueType and OpenType support Unicode. These font formats map Unicode code points to glyphs.
Thousands of fonts exist on the market, but fewer than a dozen fonts (sometimes described as "pan-Unicode" fonts) attempt to support the majority of Unicode's character repertoire. Instead, Unicode-based fonts typically focus on supporting only basic ASCII and particular scripts or sets of characters or symbols. Several reasons justify this approach: applications and documents rarely need to render characters from more than one or two writing systems; fonts tend to demand resources in computing environments; and operating systems and applications show increasing intelligence in obtaining glyph information from separate font files as needed, i.e. font substitution. Furthermore, designing a consistent set of rendering instructions for tens of thousands of glyphs constitutes a monumental task; such a venture passes the point of diminishing returns for most typefaces.
Han unification (the identification of forms in the three East Asian languages which one can treat as stylistic variations of the same historical character) has become one of the most controversial aspects of Unicode, despite the presence of a majority of experts from all three regions in the Ideographic Rapporteur Group (IRG), which advises the Consortium and ISO on additions to the repertoire and on Han unification.[15]
Unicode has been criticized for failing to allow for older and alternative forms of kanji, which, critics argue, complicates the processing of ancient Japanese and uncommon Japanese names, although it follows the recommendations of Japanese language scholars and of the Japanese government and contains all of the same characters as previous widely used encoding standards.[16] There have been several attempts to create alternative encodings that preserve the minor, stylistic differences between Chinese, Japanese, and Korean characters in opposition to Unicode's policy of Han unification. Among them are TRON (although it is not widely adopted in Japan, some users who need to handle historical Japanese text favor it) and UTF-2000.
Many older forms were not included in early versions of the Unicode standard, but Unicode 4.0 contains more than 70,000 Han characters, and work continues on adding characters from the early literature of China, Korea, and Japan. Some argue, however, that this is not satisfactory, pointing out as an example the need to create new characters, representing words in various Chinese dialects, more of which may be invented in the future.
Despite these problems, the official encoding of China, GB18030, supports the full range of characters in Unicode.
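One way to see this coverage is to round-trip text through Python's `gb18030` codec, including a character outside the Basic Multilingual Plane. This is a minimal sketch on an arbitrary sample, not an exhaustive check of the repertoire.

```python
# ASCII, Latin-1, a Han ideograph and a non-BMP musical symbol.
sample = "A\u00e9\u4e2d\U0001d11e"

encoded = sample.encode("gb18030")
decoded = encoded.decode("gb18030")

# GB18030 is variable-length: each character takes 1, 2 or 4 bytes.
widths = [len(ch.encode("gb18030")) for ch in sample]

print(decoded == sample)
print(widths)
```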
Injective mappings must be provided between characters in existing legacy character sets and characters in Unicode to facilitate conversion to Unicode and allow interoperability with legacy software. Lack of consistency in various mappings between earlier Japanese encodings such as Shift-JIS or EUC-JP and Unicode led to round-trip format conversion mismatches, particularly the mapping of the character JIS X 0208 '〜' (1-33, WAVE DASH), heavily used in legacy database data, to either '～' U+FF5E FULLWIDTH TILDE (in Microsoft Windows) or '〜' U+301C WAVE DASH (other vendors).[17]
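The mismatch is visible in Python, whose `shift_jis` and `cp932` (Microsoft) codecs decode the same two bytes differently. The byte pair 0x81 0x60 is the Shift-JIS form of JIS X 0208 row 1, cell 33; the helper `round_trips` is an assumption made for illustration.

```python
WAVE_DASH_BYTES = b"\x81\x60"  # Shift-JIS bytes for JIS X 0208 row 1, cell 33

sjis_char = WAVE_DASH_BYTES.decode("shift_jis")  # U+301C WAVE DASH
cp932_char = WAVE_DASH_BYTES.decode("cp932")     # U+FF5E FULLWIDTH TILDE

print(f"shift_jis -> U+{ord(sjis_char):04X}")
print(f"cp932     -> U+{ord(cp932_char):04X}")


def round_trips(ch: str, codec: str) -> bool:
    """Report whether a character survives encoding and decoding in a codec."""
    try:
        return ch.encode(codec).decode(codec) == ch
    except UnicodeEncodeError:
        return False


# Text decoded with one vendor's table may not re-encode with the other's.
print(round_trips(cp932_char, "shift_jis"), round_trips(sjis_char, "cp932"))
```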
Some Japanese computer programmers objected to Unicode because it requires them to separate the use of '\' U+005C REVERSE SOLIDUS (backslash) and '¥' U+00A5 YEN SIGN, which was mapped to 0x5C in JIS X 0201, and there is a lot of legacy code with this usage.[18] (This encoding also replaces tilde '~' 0x7E with overline '¯', now 0xAF.) The separation of these characters exists in ISO 8859-1, from long before Unicode.
Thai alphabet support has been criticized for its illogical ordering of Thai characters. The vowels เ, แ, โ, ใ, ไ that are written to the left of the preceding consonant are in visual order instead of logical order, unlike the Unicode representations of other Indic scripts. This complication is due to Unicode inheriting the Thai Industrial Standard 620, which worked in the same way. This ordering problem complicates the Unicode collation process slightly, requiring table lookups to reorder Thai characters for collation.[16]
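The difference between visual and logical order shows up when the stored code points of a word are listed. The sketch below compares a Thai word with a Devanagari syllable whose vowel sign is also displayed to the left of its consonant; both samples are arbitrary choices.

```python
import unicodedata

thai_word = "\u0e44\u0e17\u0e22"    # the word "Thai" in Thai script
devanagari_syllable = "\u0915\u093f"  # the syllable "ki" in Devanagari

thai_names = [unicodedata.name(c) for c in thai_word]
devanagari_names = [unicodedata.name(c) for c in devanagari_syllable]

# Thai stores the left-side vowel first (visual order).
print(thai_names)
# Devanagari stores the consonant first (logical order), although the
# vowel sign is drawn to its left.
print(devanagari_names)
```

A collation routine therefore has to move the leading Thai vowel behind its consonant before comparing, a step the Devanagari data does not need.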
Indic scripts such as Tamil and Devanagari are each allocated only 128 code points, matching the ISCII standard. The correct rendering of Unicode Indic text requires transforming the stored logical-order characters into visual order and forming ligatures out of components. Some local scholars argued in favor of assigning Unicode code points to these ligatures, going against the practice for other writing systems, though Unicode contains some Arabic and other ligatures for backward compatibility purposes only.[19][20][21] Encoding of any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variations. The same kind of issue arose for Tibetan script (the Chinese National Standard organization failed to achieve a similar change).