

BOCU-1 is a MIME-compatible Unicode compression scheme. BOCU stands for Binary Ordered Compression for Unicode. BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order. BOCU-1 is specified in a Unicode Technical Note. [1]

For comparison, SCSU was adopted as the standard Unicode compression scheme, with a byte/code point ratio similar to that of language-specific code pages. SCSU has not been widely adopted, as it is not suitable for MIME “text” media types. For example, SCSU cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, zip, bzip2, and other industry-standard algorithms usually compress more efficiently. [2]

Both SCSU[3] and BOCU-1[4] are IANA-registered charsets.

Details

All numbers in this section are hexadecimal, and all ranges are inclusive.

Code points from U+0000 to U+0020 are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, U+0021 through U+D7FF and U+E000 through U+10FFFF) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (U+0020). The initial state is U+0040. The normalization mapping is as follows:

Code range                                      Normalized code point       Notes
U+3040 to U+309F                                U+3070                      Hiragana
U+4E00 to U+9FA5                                U+7711                      Unihan
U+AC00 to U+D7A3                                U+C1D1                      Hangul
U+0020                                          encoder state kept as is    Space
U+xxxx00 to U+xxxx7F (excluding ranges above)   U+xxxx40                    middle of 128
U+xxxx80 to U+xxxxFF (excluding ranges above)   U+xxxxC0                    middle of 128
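
The normalization table can be read as a small state-update function. The following is a minimal sketch, not the reference implementation; the name `next_state` is hypothetical and chosen only for illustration.

```python
def next_state(code_point: int, state: int) -> int:
    """Return the encoder state after code_point has been encoded.

    Hypothetical helper that mirrors the normalization table row by row.
    """
    if code_point == 0x20:
        return state                      # space: state kept as is
    if 0x3040 <= code_point <= 0x309F:
        return 0x3070                     # Hiragana
    if 0x4E00 <= code_point <= 0x9FA5:
        return 0x7711                     # Unihan
    if 0xAC00 <= code_point <= 0xD7A3:
        return 0xC1D1                     # Hangul
    # Otherwise: middle of the aligned block of 128 code points,
    # i.e. U+xxxx40 for U+xxxx00..7F and U+xxxxC0 for U+xxxx80..FF.
    return (code_point & ~0x7F) + 0x40


if __name__ == "__main__":
    for cp in (0x0041, 0x00E9, 0x3042, 0x4E2D, 0xD55C):
        print(f"U+{cp:04X} -> U+{next_state(cp, 0x40):04X}")
```

Because the last rule masks off the low seven bits, any ASCII code point other than space yields U+0040 again.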

The difference between the current code point and the normalized previous code point is encoded as follows:

Difference range        Byte sequence range (see below)
-10FF9F to -2DD0D       21 F0 58 D9 to 21 FF FF FF
-2DD0C to -2912         22 01 01 to 24 FF FF
-2911 to -41            25 01 to 4F FF
-40 to 3F               50 to CF
40 to 2910              D0 01 to FA FF
2911 to 2DD0B           FB 01 01 to FD FF FF
2DD0C to 10FFBF         FE 01 01 01 to FE 19 B4 54

Each byte range is lexicographically ordered with the following thirteen byte values excluded: 00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20. For example, the byte sequence FC 06 FF, coding for a difference of 1156B, is immediately followed by the byte sequence FC 10 01, coding for a difference of 1156C.
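
The difference table and the excluded byte values can be combined into a short encoder for a single difference. This is a sketch under stated assumptions: `encode_diff`, `RANGES` and `TRAIL_BYTES` are hypothetical names, and the "anchor" columns are a convenience introduced here (the difference and lead byte at which all trail digits are zero), not part of the table itself.

```python
# The thirteen byte values that never appear as trail bytes.
EXCLUDED = {0x00, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C,
            0x0D, 0x0E, 0x0F, 0x1A, 0x1B, 0x20}
# Remaining 243 byte values in ascending order: digit i -> TRAIL_BYTES[i].
TRAIL_BYTES = [b for b in range(256) if b not in EXCLUDED]

# (lowest diff, highest diff, anchor diff, anchor lead byte, trail bytes)
RANGES = [
    (-0x10FF9F, -0x2DD0D, -0x2DD0C, 0x22, 3),
    (-0x2DD0C,  -0x2912,  -0x2911,  0x25, 2),
    (-0x2911,   -0x41,    -0x40,    0x50, 1),
    (-0x40,      0x3F,    -0x40,    0x50, 0),
    ( 0x40,      0x2910,   0x40,    0xD0, 1),
    ( 0x2911,    0x2DD0B,  0x2911,  0xFB, 2),
    ( 0x2DD0C,   0x10FFBF, 0x2DD0C, 0xFE, 3),
]


def encode_diff(diff: int) -> bytes:
    """Encode one difference as a lead byte plus 0 to 3 trail bytes."""
    for low, high, anchor, lead, trails in RANGES:
        if low <= diff <= high:
            break
    else:
        raise ValueError(f"difference out of range: {diff:#x}")
    offset = diff - anchor
    tail = []
    for _ in range(trails):
        # Floor division keeps each digit in 0..242, also for negative offsets.
        offset, digit = divmod(offset, 243)
        tail.append(TRAIL_BYTES[digit])
    return bytes([lead + offset] + tail[::-1])


if __name__ == "__main__":
    for d in (0x1156B, 0x1156C):
        print(f"{d:X} -> {encode_diff(d).hex(' ').upper()}")
```

With these definitions, 1156B encodes to FC 06 FF and 1156C to FC 10 01: the trail digit after 06 is 10, because 07 through 0F are excluded.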

Any ASCII input U+0000 to U+007F excluding space U+0020 resets the encoder to U+0040. Because these values include the line-end code points U+000D and U+000A, which are encoded as is (0D 0A), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. For comparison, the corruption of a single byte in UTF-8 affects at most one code point, whereas in SCSU it can affect the entire document.

BOCU-1 offers similar robustness for input texts that lack such values through the special reset code 0xFF. When a decoder finds this octet it resets its state to U+0040, as for a line end. The BOCU-1 specification does not recommend the use of 0xFF reset bytes, because it conflicts with other BOCU-1 design goals, notably the binary order.
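
The effect of line ends and of the 0xFF reset code is easiest to see from the decoder side. The sketch below is an illustration built only from the tables in this section, not the reference implementation; `bocu1_decode`, `LEADS` and `TRAIL_VALUE` are hypothetical names, and error handling is limited to truncated input.

```python
EXCLUDED = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
# Trail byte -> base-243 digit (the 243 allowed bytes in ascending order).
TRAIL_VALUE = {b: i for i, b in
               enumerate(b for b in range(256) if b not in EXCLUDED)}

# (first lead, last lead, anchor lead, anchor diff, trail bytes)
LEADS = [
    (0x21, 0x21, 0x22, -0x2DD0C, 3),
    (0x22, 0x24, 0x25, -0x2911, 2),
    (0x25, 0x4F, 0x50, -0x40, 1),
    (0x50, 0xCF, 0x50, -0x40, 0),
    (0xD0, 0xFA, 0xD0, 0x40, 1),
    (0xFB, 0xFD, 0xFB, 0x2911, 2),
    (0xFE, 0xFE, 0xFE, 0x2DD0C, 3),
]


def next_state(c: int, state: int) -> int:
    """State update from the normalization table."""
    if c == 0x20:
        return state
    for low, high, mid in ((0x3040, 0x309F, 0x3070),
                           (0x4E00, 0x9FA5, 0x7711),
                           (0xAC00, 0xD7A3, 0xC1D1)):
        if low <= c <= high:
            return mid
    return (c & ~0x7F) + 0x40


def bocu1_decode(data: bytes) -> str:
    out, state, i = [], 0x40, 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == 0xFF:            # reset code in lead position: no output
            state = 0x40
            continue
        if b <= 0x20:            # U+0000..U+0020 are stored as is
            c = b
        else:
            for first, last, anchor_lead, anchor_diff, trails in LEADS:
                if first <= b <= last:
                    break
            tail = data[i:i + trails]
            if len(tail) < trails:
                raise ValueError("truncated byte sequence")
            i += trails
            value = b - anchor_lead
            for t in tail:
                value = value * 243 + TRAIL_VALUE[t]
            c = state + anchor_diff + value
        out.append(chr(c))
        state = next_state(c, state)
    return "".join(out)


if __name__ == "__main__":
    # Same two characters, without and with a reset byte in between.
    print(bocu1_decode(bytes.fromhex("D076B9")))
    print(bocu1_decode(bytes.fromhex("D076FFD076")))
```

Both inputs decode to the same two-character string (U+00E9 twice). In the second one the reset byte returns the state to U+0040, so the second character needs the same two bytes as the first instead of the single byte B9.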

The optional use of a signature U+FEFF at the beginning of BOCU-1 encoded texts, i.e. the BOCU-1 byte sequence FB EE 28, changes the initial state U+0040 to U+FEC0 (the middle of the block U+FE80 to U+FEFF). In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (FB EE 28 FF) could avoid this effect, but the BOCU-1 specification does not recommend this practice.

In theory UTF-1 and UTF-8 could encode the original UCS-4 set with 31 bits up to 7FFFFFFF. BOCU-1 and UTF-16 can encode the modern Unicode set from U+0000 to U+10FFFF. Excluding the thirteen protected code points encoded as single octets, BOCU-1 can use 256 − 13 = 243 octets in multi-byte encodings. A multi-byte sequence is at most four bytes long, consisting of a lead byte and one to three trail bytes. The trail bytes encode a remaining "modulo 243" (base 243) difference; the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte 0xFF is not protected and can occur as a trail byte.
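
As a plausibility check, the base-243 arithmetic can be replayed against the bounds in the difference table. The snippet is a worked calculation rather than part of any implementation. The variable names are made up for this example, and the reading of the outermost bounds (highest code point minus lowest state, and lowest differenced code point minus highest state) is an interpretation of the tables above.

```python
TRAIL_VALUES = 256 - 13                               # 243 usable trail bytes

# Capacity of each positive multi-byte range: lead bytes * 243**trail_count.
two_byte = (0xFA - 0xD0 + 1) * TRAIL_VALUES           # 43 lead bytes
three_byte = (0xFD - 0xFB + 1) * TRAIL_VALUES ** 2    # 3 lead bytes
four_byte = 1 * TRAIL_VALUES ** 3                     # lead byte FE only

top_one = 0x3F                       # last single-byte difference
top_two = top_one + two_byte         # last two-byte difference
top_three = top_two + three_byte     # last three-byte difference

# Extreme differences that can occur at all.
largest_diff = 0x10FFFF - 0x40       # highest code point, lowest state
smallest_diff = 0x21 - 0x10FFC0      # lowest differenced code point, highest state

print(f"two-byte range ends at   {top_two:X}")
print(f"three-byte range ends at {top_three:X}")
print(f"largest difference       {largest_diff:X}")
print(f"smallest difference      -{-smallest_diff:X}")
print("four bytes suffice:", largest_diff - top_three <= four_byte)
```

The computed bounds agree with the table rows, and the single lead byte FE leaves far more four-byte sequences than the remaining differences require.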

References

  1. ^ UTN #6: BOCU-1. Retrieved on 2008-05-18.
  2. ^ UTN #14: A Survey of Unicode Compression. Retrieved on 2008-06-02.
  3. ^ IANA registration record for SCSU
  4. ^ IANA registration record for BOCU-1

See also

UTF-1
UTF-8
International Components for Unicode (ICU)
© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org