Citizendia

The Universal Character Set (UCS) is defined by the ISO/IEC 10646 International Standard as a character set on which many encodings are based. It contains nearly a hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point.

Characters (letters, numbers, symbols, ideograms, logograms, etc.) from the many languages, scripts, and traditions of the world are represented in the UCS with unique code points. The inclusiveness of the UCS is continually improving as characters from previously unrepresented writing systems are added.

Since 1991, the Unicode Consortium has worked with ISO to develop The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of Unicode exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After the publication of Unicode 3.0 in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000.

The UCS has over 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2000 that all computer systems sold in its jurisdiction would have to support GB18030. This required computer systems intended for sale in the PRC to move beyond the BMP.

The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms.
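
The figures quoted above (over 1.1 million code points, of which the first 65,536 form the BMP) follow from dividing the code space into 17 planes of 65,536 code points each. The following is a minimal sketch of that arithmetic; the names `plane_of` and `in_bmp` are illustrative and do not come from either standard.

```python
# Sketch: the UCS code space viewed as 17 planes of 65,536 code points each.
PLANE_SIZE = 0x10000        # 65,536 code points per plane
PLANE_COUNT = 17            # plane 0 is the BMP; planes 1 to 16 lie beyond it
MAX_CODE_POINT = 0x10FFFF   # highest code point in the UCS


def plane_of(code_point: int) -> int:
    """Return the plane number (0 = BMP) that a code point falls in."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a UCS code point: {code_point:#x}")
    return code_point // PLANE_SIZE


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


total_code_points = PLANE_SIZE * PLANE_COUNT
print(total_code_points)      # 1114112
print(plane_of(ord("A")))     # 0
print(plane_of(0x20000))      # 2
```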

Encoding forms of the Universal Character Set

ISO 10646 defines several character encoding forms for the Universal Character Set. The simplest, UCS-2, uses a single code value (defined as one or more numbers representing a code point) between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP, as long as the code point represents a character. UCS-2 cannot represent code points outside the BMP.
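
As a rough illustration of the UCS-2 rule described above, the sketch below writes each BMP character as one 16-bit word and rejects everything else. The function name `encode_ucs2_be` and the choice of big-endian byte order are assumptions made for the example, since the text does not specify a byte order.

```python
def encode_ucs2_be(text: str) -> bytes:
    """Encode text as UCS-2: exactly two bytes per character.

    Big-endian byte order is an assumption made for this sketch.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no way to express code points beyond the BMP.
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            # These code points are not assigned to characters.
            raise ValueError(f"U+{cp:04X} does not represent a character")
        out += cp.to_bytes(2, "big")
    return bytes(out)


print(encode_ucs2_be("A\u20ac").hex())   # 004120ac
```

Passing a character outside the BMP, such as U+1D11E, raises `ValueError` in this sketch rather than producing output.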

The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Each pair consists of an "RC-element" (a two-octet sequence comprising the R-octet and the C-octet from the four octet sequence that corresponds to a cell in the coding space of a coded character set) from the high-half zone and an "RC-element" from the low-half zone. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
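
The pairing described above can be sketched as follows. The arithmetic is the standard UTF-16 surrogate calculation (high surrogates in D800 to DBFF, low surrogates in DC00 to DFFF); the helper names `to_surrogate_pair` and `from_surrogate_pair` are illustrative only.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need a pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")           # D834 DD1E
```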

Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.
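
A small sketch of the fixed-width property and the storage cost mentioned above. The function name `encode_ucs4_be` and the big-endian byte order are assumptions for illustration. For the comparison, Python's `utf-16-be` codec stands in for UCS-2, which is valid here only because the sample text stays within the BMP.

```python
def encode_ucs4_be(text: str) -> bytes:
    """Encode text as UCS-4: exactly four bytes per code point.

    Big-endian byte order is an assumption made for this sketch.
    """
    return b"".join(ord(ch).to_bytes(4, "big") for ch in text)


bmp_text = "A\u20ac"                     # two BMP characters
four_byte = encode_ucs4_be(bmp_text)
two_byte = bmp_text.encode("utf-16-be")  # same bytes as UCS-2 for BMP-only text

print(four_byte.hex())                   # 00000041000020ac
print(len(four_byte), len(two_byte))     # 8 4
```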

Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.

History of ISO 10646

The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects. That standard differed markedly from the current one. It defined 128 groups of 256 planes of 256 rows of 256 cells, for an apparent total of 2,147,483,648 characters, but actually the standard could code only 679,477,248 characters, as the policy forbade byte values of control characters (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) anywhere. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
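
Both totals in the paragraph above can be reproduced by counting the octet values that remain once the control-character ranges are excluded. This is a sketch of the arithmetic only; the helper name `is_allowed` is illustrative.

```python
# Apparent size of the 1990 draft: 128 groups x 256 planes x 256 rows x 256 cells.
apparent_total = 128 * 256 * 256 * 256
print(apparent_total)                    # 2147483648


def is_allowed(octet: int) -> bool:
    """True unless the octet is a control value (0x00-0x1F or 0x80-0x9F)."""
    return not (0x00 <= octet <= 0x1F or 0x80 <= octet <= 0x9F)


# The group octet ranges over 0x00-0x7F, the other three over 0x00-0xFF.
group_values = sum(is_allowed(b) for b in range(128))   # 96
other_values = sum(is_allowed(b) for b in range(256))   # 192

codable_total = group_values * other_values ** 3
print(codable_total)                     # 679477248

# Cells codable within a single plane (row octet x cell octet).
print(other_values ** 2)                 # 36864

# Every octet in the example location of Latin capital letter A is allowed.
print(all(is_allowed(b) for b in (0x20, 0x20, 0x20, 0x41)))   # True
```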

One could code the characters of this primordial ISO 10646 standard in one of three ways:

  1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;
  2. UCS-2, two bytes for every character, enabling the encoding of the first plane, 0x20, the Basic Multilingual Plane, containing the first 36,864 codepoints, straightforwardly, and other planes and groups by switching to them with ISO 2022 escape sequences;
  3. UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, each of which contains no control characters).

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. The ISO standardisers realised they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitation upon characters (prohibition of control character values), thus permitting characters like 0x0000101F; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.

Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 characters by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32. As for UTF-1, no one used it, because of its bad design (no way of distinguishing between single bytes, lead bytes and trail bytes, a problem similar to that of the Shift-JIS encoding of Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8.
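
The design flaw attributed to UTF-1 above (single bytes, lead bytes and trail bytes being indistinguishable) is one that UTF-8 avoids: each class of byte occupies its own range of values, so a decoder that lands in the middle of a character can skip forward to the next boundary. The sketch below illustrates this under the current UTF-8 definition, which is limited to U+10FFFF. The function names are illustrative.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of UTF-8 by its value alone."""
    if b <= 0x7F:
        return "single"      # a complete one-byte character
    if b <= 0xBF:
        return "trail"       # continuation byte, 0x80-0xBF
    if 0xC2 <= b <= 0xF4:
        return "lead"        # first byte of a 2- to 4-byte sequence
    return "invalid"         # 0xC0, 0xC1 and 0xF5-0xFF never occur


def next_boundary(data: bytes, pos: int) -> int:
    """Advance pos past any trail bytes to the start of the next character."""
    while pos < len(data) and classify_utf8_byte(data[pos]) == "trail":
        pos += 1
    return pos


data = "a\u20ac\U0001D11E".encode("utf-8")   # 1 + 3 + 4 bytes
print([classify_utf8_byte(b) for b in data])
print(next_boundary(data, 2))                # 4
```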

Differences between ISO 10646 and Unicode

According to Alain LaBonté, the head of the Canadian delegation to ISO/IEC JTC1/SC22/WG20, the internationalization (i18n) working group in 2000:

The ISO/IEC 10646-1 Standard is an International Standard that covers:

  • 16-bit or 32-bit code;
  • "transformed formats" for compatibility with existing transmission standards;
  • three levels of compliance for the internal representation of characters:

    1. no composition of characters (all characters fully formed instead of including basic characters followed by diacritics). This excludes many languages but simplifies the life of programming languages for all Western languages without exception (and for languages that do not require composition, such as all Far Eastern languages);
    2. the composition of some, but not all, characters (obscure level, not sufficiently thought out; will be little used);
    3. the mix of the technique of composition with the possibility of coding fully formed characters;
  • total openness in the use of characters (no canonical form, no equivalency between composed characters and fully formed characters);
  • possible support for dead languages, in addition to all the living languages;
  • developmental possibilities for all practical purposes unlimited (eventually up to two billion separate characters).

Unicode provides:

  • exclusively 16-bit code;
  • a transformed format to allow access to at most a million 32-bit coding characters from ISO/IEC 10646-1 (this is considered amply sufficient for the foreseeable future, even long term, for business purposes);
  • a canonical form allowing for "normative" equivalency of characters that are precomposed or formed in a predetermined order from a base character and diacriticals;
  • rigid methods of presentation (no exceptions);
  • in parallel with this, various other closed methods of processing and presentation (the advantage is that implementation is rigorously predictable); what is noteworthy is that this standard is directly related to a classification method that constitutes a "delta" framed within the ISO/IEC 10651 International Standard.

These are the essential differences, but the coding is essentially the same. Whatever complies with Unicode complies with the International Standard, but the opposite is not necessarily true.

from FAQ: ISO/IEC 10646-1 Versus Unicode

ISO 10646 and Unicode have an identical repertoire and numbers — the same characters with the same numbers exist on both standards. The difference between them is that Unicode adds rules and specifications that are outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO 10646; Unicode must be implemented.
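
Normalization is one place where the extra rules are easy to observe. In the sketch below, which uses Python's standard `unicodedata` module, two code point sequences for "é" are distinct as far as the character map is concerned, and only Unicode's normalization forms relate them. The variable names are illustrative.

```python
import unicodedata

precomposed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT

# As raw code point sequences the two strings differ.
print(precomposed == decomposed)                                 # False

# Unicode normalization maps each form onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```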

To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character’s default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number ‘8’, or the vulgar fraction ‘¼’, that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
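
The properties mentioned above can be inspected with Python's standard `unicodedata` module, which exposes values from the Unicode Character Database. This is a brief illustration, not a complete survey of the properties Unicode defines; the sample characters are chosen for the example.

```python
import unicodedata

# Default bidirectional class of a few characters.
print(unicodedata.bidirectional("A"))        # L  (left-to-right)
print(unicodedata.bidirectional("8"))        # EN (European number)
print(unicodedata.bidirectional("\u05d0"))   # R  (Hebrew letter alef)
print(unicodedata.bidirectional("\u0627"))   # AL (Arabic letter alef)

# Canonical combining class: non-zero for combining marks.
print(unicodedata.combining("\u0301"))       # 230 (combining acute accent)
print(unicodedata.combining("A"))            # 0

# Numeric value carried as a property of the character.
print(unicodedata.numeric("8"))              # 8.0
print(unicodedata.numeric("\u00bc"))         # 0.25 (vulgar fraction one quarter)
```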

Some applications support ISO 10646 characters but do not fully support Unicode. One such application, Linux xterm, can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly. For instance, selecting text in certain scripts in Mozilla Firefox causes the text to jump around.

Citing the Universal Character Set

ISO 10646, a general, informal citation for the ISO/IEC 10646 family of standards, is acceptable in most prose. And even though it is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative references to the UCS as a publication should cite a particular part and version, using the form ISO/IEC 10646-{part}:{year}; for example: ISO/IEC 10646-1:1993.

Correlation to Unicode

See §C.1 of The Unicode Standard and http://www.unicode.org/versions/Unicode5.1.0/ for more detail.

© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org