Citizendia

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.

UTF-8 encodes each character in one to four octets (8-bit bytes):

  1. One byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
  2. Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
  3. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
  4. Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.
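
The four size classes can be observed with Python's built-in `utf-8` codec. This is only an illustrative sketch: the sample characters and the helper name `utf8_length` are choices made for this example, not part of any standard.

```python
# Sample characters chosen for illustration, one per size class.
samples = {
    "\u0041": "U+0041, US-ASCII letter A",
    "\u00e9": "U+00E9, Latin letter with a diacritic",
    "\u05d0": "U+05D0, Hebrew aleph",
    "\u20ac": "U+20AC, elsewhere in the Basic Multilingual Plane",
    "\U0001d11e": "U+1D11E, outside the Basic Multilingual Plane",
}


def utf8_length(ch: str) -> int:
    """Return the number of octets UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


for ch, note in samples.items():
    print(f"{note}: {utf8_length(ch)} byte(s)")
```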

Four bytes may seem like a lot for one character (code point). However, code points outside the Basic Multilingual Plane are generally very rare. Furthermore, UTF-16 (the main alternative to UTF-8) also needs four bytes for these code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. However, the differences between different encoding schemes can become negligible with the use of traditional compression systems like DEFLATE. For short items of text where traditional algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.
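
How the two encodings compare depends on the text, which a short measurement makes concrete. The sketch below assumes Python's `utf-16-le` codec (chosen so that no byte-order mark is counted); the sample strings and the helper `encoded_sizes` are made up for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte counts of text in UTF-8 and in UTF-16 without a BOM."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


ascii_text = "hello"               # five US-ASCII characters
cjk_text = "\u65e5\u672c\u8a9e"    # three BMP characters above U+07FF
supplementary = "\U0001d11e"       # one character outside the BMP

for label, text in [("ASCII", ascii_text), ("CJK", cjk_text), ("non-BMP", supplementary)]:
    print(label, encoded_sizes(text))
```

ASCII text is half the size in UTF-8, the CJK sample is smaller in UTF-16, and the supplementary character takes four bytes either way.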

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[2] The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.[3]

History

By early 1992 a search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit characters. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.

In July 1992 the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only 8-bit characters, i.e. those where the high bit was set.

In August 1992 this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Laboratories then made a crucial modification to the encoding, to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string in order to find character boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[4]

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25–29, 1993.

Description

There are several current definitions of UTF-8 in various standards documents, which supersede the definitions given in earlier, now obsolete works. They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.

The bits of a Unicode character are divided into several groups which are then divided among the lower bit positions inside the UTF-8 bytes. A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In other cases, up to four bytes are required. The most significant bit of these bytes is 1, to prevent confusion with 7-bit ASCII characters and therefore keep standard byte-oriented string processing safe.

| Code range (hexadecimal) | Scalar value (binary) | UTF-8 (binary / hexadecimal) | Notes |
|---|---|---|---|
| 000000-00007F (128 codes) | 00000000 00000000 0zzzzzzz (seven z) | 0zzzzzzz (seven z); byte value 00-7F | ASCII equivalence range; byte begins with zero |
| 000080-0007FF (1920 codes) | 00000000 00000yyy yyzzzzzz (three y; two y, six z) | 110yyyyy 10zzzzzz (five y, six z); byte values C2-DF and 80-BF | First byte begins with 110, the following byte begins with 10 |
| 000800-00D7FF and 00E000-00FFFF (61440 codes [Note 1]) | 00000000 xxxxyyyy yyzzzzzz (four x, four y; two y, six z) | 1110xxxx 10yyyyyy 10zzzzzz (four x, six y, six z); byte values E0-EF and 2x 80-BF | First byte begins with 1110, the following 2 bytes begin with 10 |
| 010000-10FFFF (1048576 codes) | 000wwwxx xxxxyyyy yyzzzzzz (three w, two x; four x, four y; two y, six z) | 11110www 10xxxxxx 10yyyyyy 10zzzzzz (three w, six x, six y, six z); byte values F0-F4 and 3x 80-BF | First byte begins with 11110, the following 3 bytes begin with 10 |

Note 1: The range D800-DFFF is disallowed by Unicode. The encoding scheme reliably transforms values in that range, but they are not valid scalar values in Unicode. See Table 3-7 in the Unicode 5.0 standard.
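
The bit distribution described above can be written out by hand. The following is a minimal sketch, not a reference implementation: the function name `encode_code_point` is hypothetical, and it follows the RFC 3629 limits mentioned in this article (nothing above U+10FFFF, no D800-DFFF).

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value by placing its bits into UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:                       # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:                      # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                      # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_code_point(0x20AC).hex())
```

Comparing the result with `chr(cp).encode("utf-8")` for a few code points is a quick way to check such a sketch against the built-in codec.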

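For example, the character aleph (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:

  1. It falls into the range U+0080 to U+07FF, so it is encoded using two bytes, 110yyyyy 10zzzzzz.
  2. Hexadecimal 05D0 is equivalent to binary 101 1101 0000.
  3. These 11 bits are placed, in order, into the positions marked y and z: 11010111 10010000.
  4. The result is the two bytes D7 90 in hexadecimal.

When the number of bits to be filled is less than the maximum number of free bits available, the high bits are padded with 0's. For example, the Cent Sign (¢), which is Unicode U+00A2, is encoded into UTF-8 in this way:

  1. It falls into the range U+0080 to U+07FF, so it is encoded using two bytes, 110yyyyy 10zzzzzz.
  2. Hexadecimal 00A2 is equivalent to binary 1010 0010, which is only 8 bits.
  3. Padded with three leading zeros to 11 bits (000 1010 0010), the bits are placed into the positions marked y and z: 11000010 10100010.
  4. The result is the two bytes C2 A2 in hexadecimal.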
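
Both examples can be checked mechanically by printing the encoded bytes in binary. This sketch relies on Python's built-in codec; the helper name `show_bits` is made up for illustration.

```python
def show_bits(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


print(show_bits("\u05d0"))  # aleph, U+05D0
print(show_bits("\u00a2"))  # cent sign, U+00A2
```

In each output the leading `110` and `10` markers are visible, with the code point's bits filling the remaining positions.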

Width by first byte:

| Binary | Hexadecimal | Decimal | Width |
|---|---|---|---|
| 00000000-01111111 | 00-7F | 0-127 | 1 byte |
| 11000010-11011111 | C2-DF | 194-223 | 2 bytes |
| 11100000-11101111 | E0-EF | 224-239 | 3 bytes |
| 11110000-11110100 | F0-F4 | 240-244 | 4 bytes |

So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
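
The width table translates directly into a lookup on the first byte. The sketch below is one possible rendering; the name `sequence_length` and the convention of returning 0 for bytes that cannot start a sequence are assumptions made for this example.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence width implied by a first byte, or 0 if it cannot start one."""
    if 0x00 <= first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return 0  # a following byte (80-BF) or a value that never starts a legal sequence


print([sequence_length(b) for b in "a\u00e9\u20ac".encode("utf-8")])
```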

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the universal character set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003. With these restrictions, the following byte values never appear in a legal UTF-8 sequence:

| Codes (binary) | Codes (hexadecimal) | Notes |
|---|---|---|
| 1100000x | C0, C1 | Overlong encoding: lead byte of a 2-byte sequence, but code point <= 127 |
| 11110101, 1111011x | F5, F6, F7 | Restricted by RFC 3629: lead byte of a 4-byte sequence for a code point above 10FFFF |
| 111110xx, 1111110x | F8, F9, FA, FB, FC, FD | Restricted by RFC 3629: lead byte of a sequence 5 or 6 bytes long |
| 1111111x | FE, FF | Invalid: lead byte of a sequence 7 or 8 bytes long |

While the two categories labeled "Restricted by RFC" above were technically allowed by earlier UTF-8 specifications, no characters were ever assigned to the code points they represent, so they should never have appeared in UTF-8-encoded text.
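
A scan for these thirteen byte values is a cheap first check on suspect data. The sketch below uses hypothetical names (`NEVER_VALID`, `find_forbidden_bytes`); note that finding none of these bytes does not by itself prove the data is well-formed UTF-8.

```python
# Byte values listed in the table above: C0, C1 and F5 through FF.
NEVER_VALID = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def find_forbidden_bytes(data: bytes) -> list:
    """Return (offset, value) pairs for bytes that cannot occur in legal UTF-8."""
    return [(i, b) for i, b in enumerate(data) if b in NEVER_VALID]


print(find_forbidden_bytes(b"a\xc0\x80\xff"))
```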

UTF-8 derivations

Windows

Many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.
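
Software that reads such files usually needs to detect and skip the signature. A small sketch using Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec follows; the helper name `strip_utf8_bom` is made up for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading EF BB BF signature if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hi".encode("utf-8")

print(strip_utf8_bom(with_bom))       # raw bytes without the signature
print(with_bom.decode("utf-8-sig"))   # this codec skips the signature while decoding
print(with_bom.decode("latin-1"))     # what an ISO-8859-1 viewer would display
```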

Java

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.

However, Java also supports a non-standard variant of UTF-8 called modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files. There are two differences between modified and standard UTF-8.

The first difference is that the null character (U+0000) is encoded as 0xc0 0x80 rather than 0x00. (0xc0 0x80 is not legal standard UTF-8 because it is not the shortest possible representation.) This guarantees that if an extra null terminator byte 0x00 is placed at the end of the string, it will be the only 0x00 encountered if a string containing embedded null characters is processed in a language such as C using traditional ASCIIZ string functions. In standard UTF-8 the embedded nulls would be encoded as 0x00, signalling the end of the string and causing premature truncation.

The second difference is in the way characters outside the Basic Multilingual Plane are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence as in CESU-8, taking up 6 bytes in total. Each Java character represents a 16-bit value. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change.

Because modified UTF-8 is not UTF-8, one needs to be very careful to avoid mislabelling data in modified UTF-8 as UTF-8 when interchanging information over the Internet.
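
The two differences can be illustrated with a short encoder. This is a sketch written in Python purely for illustration, based only on the description above; it is not Java's implementation, and the function name `encode_modified_utf8` is hypothetical.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text following the two modified UTF-8 rules described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"                  # NUL as a two-byte form, never 0x00
        elif cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)           # UTF-16 high surrogate
            low = 0xDC00 | (v & 0x3FF)          # UTF-16 low surrogate
            for s in (high, low):               # each surrogate becomes 3 bytes
                out += bytes([
                    0xE0 | (s >> 12),
                    0x80 | ((s >> 6) & 0x3F),
                    0x80 | (s & 0x3F),
                ])
        else:
            out += ch.encode("utf-8")           # other BMP characters are unchanged
    return bytes(out)


print(encode_modified_utf8("a\x00b").hex(" "))
print(encode_modified_utf8("\U0001d11e").hex(" "))
```

Neither output is valid standard UTF-8, which is why the labelling caveat above matters.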

Tcl

Tcl uses the same modified UTF-8 as Java for internal representation of Unicode data.

Mac OS X

The Mac OS X Operating System uses canonically decomposed Unicode, encoded using UTF-8 for file names in the filesystem. This is sometimes referred to as UTF-8-MAC. In canonically decomposed Unicode, the use of precomposed characters is forbidden and combining diacritics must be used to replace them.

A common argument is that this makes sorting far simpler, but this argument is easily refuted: for one, sorting is language dependent (in German, the ä character sorts just after the a character, while in Swedish ä sorts after z). The decomposed form can also be confusing for software built around the assumption that precomposed characters are the norm and combining diacritics are only used to form unusual combinations. This is an example of the NFD variant of Unicode normalization. Most other platforms, including Windows and Linux, use the NFC form of Unicode normalization, which is also used by W3C standards, so NFD data must typically be converted to NFC for use on other platforms or the Web.
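
The difference between the two normalization forms is visible at the byte level. The sketch below uses the standard `unicodedata.normalize` function with plain NFD and NFC; it does not claim to reproduce the exact rules a particular filesystem applies, and the helper `to_nfc` is a made-up name.

```python
import unicodedata

precomposed = "\u00e4"                                   # one code point (NFC form)
decomposed = unicodedata.normalize("NFD", precomposed)   # "a" + U+0308 combining diaeresis

print(precomposed.encode("utf-8").hex(" "))
print(decomposed.encode("utf-8").hex(" "))


def to_nfc(name: str) -> str:
    """Convert a possibly decomposed name to NFC before using it elsewhere."""
    return unicodedata.normalize("NFC", name)
```

The two strings look identical when displayed but compare unequal and encode to different bytes until one of them is normalized.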

This is discussed in Apple Q&A 1173.[5]

CESU-8

See main article: CESU-8.

Oracle databases use CESU-8. Characters outside the BMP are first encoded as surrogate pairs, which are then each encoded as UTF-8. It is the same as modified UTF-8 from Java, but without the special encoding of the NUL character. It is not valid UTF-8.

Rationale behind UTF-8's design

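As a consequence of the design of UTF-8, the following properties of multi-byte sequences hold:

  1. The most significant bit of a single-byte character is always 0.
  2. The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence: 110 for two-byte sequences, 1110 for three-byte sequences, and 11110 for four-byte sequences.
  3. The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
  4. The bytes FE and FF never appear in a UTF-8 stream.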

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8–encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.
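
Both consequences, resynchronization and byte-wise searching, fit in a few lines. In this sketch the helper name `char_start` and the sample text are assumptions chosen for illustration.

```python
def char_start(data: bytes, i: int) -> int:
    """Step back from offset i to the first byte of the character containing it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:   # 10xxxxxx marks a following byte
        i -= 1
    return i


text = "na\u00efve caf\u00e9".encode("utf-8")

# Offset 3 lands in the middle of the two-byte sequence for U+00EF.
print(char_start(text, 3))

# Byte-wise substring search finds the encoded word at a character boundary.
needle = "caf\u00e9".encode("utf-8")
print(text.find(needle))
```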

Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes forming a single valid multi-byte UTF-8 character is roughly 3.1% for a 2-byte sequence, 0.39% for a 3-byte sequence, and even lower for longer sequences.
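
The two-byte figure can be checked by brute force over all 65,536 byte pairs. This sketch uses Python's strict decoder, which also rejects the overlong lead bytes C0 and C1, so it counts 30 possible lead bytes rather than 32 and lands slightly below 1/32 (about 3.1%). The variable names are made up for this example.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the strict built-in decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Count byte pairs that are valid UTF-8 but not two plain ASCII characters.
hits = sum(
    1
    for a in range(256)
    for b in range(256)
    if not (a < 0x80 and b < 0x80) and decodes_as_utf8(bytes([a, b]))
)
fraction = hits / 65536

print(hits, f"{fraction:.2%}")
```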

While natural languages encoded in traditional encodings are not random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted. For example, for ISO-8859-1 text to be misrecognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol. Pure ASCII text would pass a UTF-8 validity test and it would be interpreted correctly because the UTF-8 encoding for the same text is the same as the ASCII encoding.

The bit patterns can be used to identify UTF-8 characters. If the first hex digit of a byte is 0–7, it is an ASCII character. If it is C or D, the byte starts an 11-bit character (expressed in two bytes). If it is E, the character is 16-bit (expressed in 3 bytes), and if it is F, the character is 21 bits (expressed in 4 bytes). A first byte can never begin with 8 through B, but all following bytes must begin with a hex digit from 8 through B. Thus, at a glance, it can be seen that "0xA9" is not a valid UTF-8 character, but that "0x54" and "0xE3 0xB4 0xB1" are valid UTF-8 characters.
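
The at-a-glance rule can be expressed as a check on the first hex digit of each byte. The sketch below is only that structural check, with hypothetical names `byte_role` and `looks_like_one_character`; it does not reject overlong forms or out-of-range values, which a full validator must also do.

```python
def byte_role(b: int) -> str:
    """Classify a byte by its first hex digit."""
    high = b >> 4
    if high <= 0x7:
        return "ascii"
    if high <= 0xB:
        return "continuation"
    if high <= 0xD:
        return "lead-2"
    if high == 0xE:
        return "lead-3"
    return "lead-4"


def looks_like_one_character(seq: bytes) -> bool:
    """Check for one first byte followed by the right number of following bytes."""
    expected = {"ascii": 1, "lead-2": 2, "lead-3": 3, "lead-4": 4}
    if not seq:
        return False
    role = byte_role(seq[0])
    return (
        role in expected
        and len(seq) == expected[role]
        and all(byte_role(b) == "continuation" for b in seq[1:])
    )


for candidate in (b"\xa9", b"\x54", b"\xe3\xb4\xb1"):
    print(candidate.hex(" "), looks_like_one_character(candidate))
```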

There is no good validity test for traditional 8-bit encodings like ISO-8859-1. The encoding in use must be known by some other means; otherwise garbled text will be shown. This is called mojibake, among other names. The fact that there is a working validity test for UTF-8-encoded texts is a big advantage.
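
The asymmetry is easy to demonstrate: a strict UTF-8 decoder can refuse input, while an ISO-8859-1 decoder accepts every byte. A minimal sketch with Python's built-in codecs follows; the name `is_valid_utf8` and the sample word are choices made for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is accepted by the strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = "caf\u00e9".encode("latin-1")

print(is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))

# Decoding UTF-8 bytes as ISO-8859-1 never fails, but it produces mojibake.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)
```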

Overlong forms, invalid input, and security considerations

The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Insert a replacement character (usually '?' or '�' (U+FFFD)).
  2. Ignore the bytes.
  3. Interpret each byte according to another encoding (often ISO-8859-1 or CP1252).
  4. Not notice and decode as if the bytes were some similar bit of UTF-8.
  5. Stop decoding and report an error (possibly giving the caller the option to continue).

It is possible for a decoder to behave in different ways for different types of invalid input.
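
Several of these behaviours correspond to error-handling modes of Python's built-in decoder, which makes them easy to compare. In this sketch `decode_with_fallback` is a hypothetical helper, and it applies the fallback encoding to the whole input rather than byte by byte, which is a simplification of the third option.

```python
bad = b"a\xffb"   # 0xFF never appears in legal UTF-8

replaced = bad.decode("utf-8", errors="replace")   # insert U+FFFD
ignored = bad.decode("utf-8", errors="ignore")     # drop the offending byte


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; on error, reinterpret the whole input in another encoding."""
    try:
        return data.decode("utf-8")       # strict mode stops and reports an error
    except UnicodeDecodeError:
        # Assumption: every byte is defined in the fallback encoding. Python's
        # cp1252 codec leaves a few byte values undefined and would raise on them.
        return data.decode(fallback)


print(repr(replaced), repr(ignored), repr(decode_with_fallback(bad)))
```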

RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[6] The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high profile products including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.
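
The risk is easiest to see with the two-byte overlong form of a slash. In the sketch below, `naive_decode_2byte` is a deliberately unsafe toy decoder written for this example, and `strict_decode` wraps Python's built-in decoder, which rejects overlong forms.

```python
def naive_decode_2byte(seq: bytes) -> str:
    """Deliberately unsafe: decode a 2-byte form without rejecting overlong encodings."""
    return chr(((seq[0] & 0x1F) << 6) | (seq[1] & 0x3F))


def strict_decode(data: bytes):
    """Return the decoded text, or None if the input is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_slash = b"\xc0\xaf"   # overlong form of "/" (U+002F)

# A byte-level filter looking for "/" sees nothing suspicious here...
print(b"/" in overlong_slash)
# ...yet the naive decoder turns the same bytes into a slash afterwards.
print(naive_decode_2byte(overlong_slash))
# A strict decoder refuses the sequence instead.
print(strict_decode(overlong_slash))
```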

Another common problem is decoders that do not check that the trailing bytes are really trailing bytes. This will cause more characters to be lost than necessary if some bytes are lost or corrupted.

To maintain security in the case of invalid input, there are a few options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, either returns an error or text that the application knows to be harmless. A third possibility is to not decode the UTF-8 at all. This is quite practical if the system only treats some ASCII characters (like slash and NUL) specially and treats all other bytes as identifiers or other data, but it requires care to avoid passing invalid UTF-8 to other code (such as third party libraries or an operating system) that cannot safely handle it.

Advantages and disadvantages

A note on string length and character indexes

A common criticism from beginners of variable-length encodings such as UTF-8 is that the algorithms to find the number of characters between two points, or the point that is n characters after another point, are not O(1) (constant time), causing programs using them to be slower. However, the use of these algorithms by actual working software is often over-estimated.

So while the number of octets in a UTF-8 string or substring is related in a more complex way to the number of code points than for UTF-32, it is very rare to encounter a situation where this makes a difference in practice, and this cannot be used as either an advantage or disadvantage of UTF-8.
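For reference, the two algorithms in question can be sketched as linear scans over the bytes. The function names are hypothetical, and the input is assumed to be valid UTF-8; both rely on the fact that every code point starts with exactly one byte that is not of the form 10xxxxxx.

```python
def count_code_points(data: bytes) -> int:
    """Number of code points: count the bytes that are not trailing bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def offset_of_code_point(data: bytes, n: int) -> int:
    """Byte offset at which the n-th (0-based) code point starts."""
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # start of a code point
            if seen == n:
                return i
            seen += 1
    raise IndexError("fewer than n + 1 code points")


encoded = "a€b".encode("utf-8")  # bytes: 61 E2 82 AC 62

assert len(encoded) == 5                       # octets
assert count_code_points(encoded) == 3         # code points
assert offset_of_code_point(encoded, 2) == 4   # "b" starts at byte 4
```

Both functions take time proportional to the length of the data scanned, which is the O(n) cost the criticism refers to.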

General

Advantages

Disadvantages

Compared to single-byte encodings

Advantages

Disadvantages

Compared to other multi-byte encodings

Advantages

Disadvantages

Compared to UTF-7

Advantages

Disadvantages

Compared to UTF-16

Advantages

Disadvantages

See also

References

  1. ^ Moving to Unicode 5.1. Official Google Blog (May 5, 2008). Retrieved on 2008-05-08.
  2. ^ Alvestrand, H. (1998), “IETF Policy on Character Sets and Languages”, RFC 2277, Internet Engineering Task Force 
  3. ^ Using International Characters in Internet Mail. Internet Mail Consortium (August 1, 1998). Retrieved on 2007-11-08.
  4. ^ Pike, Rob (2003-04-03). UTF-8 history.
  5. ^ Text Encodings in VFS. Apple Inc. (February 10, 2003). Retrieved on 2007-11-08.
  6. ^ Yergeau, F. (2003), “UTF-8, a transformation format of ISO 10646”, RFC 3629, Internet Engineering Task Force 
  7. ^ Code Page Identifiers from MSDN. Accessed on January 5, 2008.

External links

© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org