Some writing systems of the world, notably the Arabic (including variants such as Nasta'liq) and Hebrew scripts, are written in a form known as right-to-left (RTL), in which writing begins at the right-hand side of a page and concludes at the left-hand side. This is different from the left-to-right (LTR) direction used by most languages in the world. When LTR text is mixed with RTL in the same paragraph, each type of text should be written in its own direction, which is known as bi-directional text.
This can get rather complex when multiple levels of quotation are used.
Many computer programs fail to display bi-directional text correctly. For example, the Hebrew name Sarah (שרה) should be spelled shin (ש) resh (ר) heh (ה) from right to left. Some Web browsers may display the Hebrew text in this article in the opposite direction.
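Whether a program has displayed the name correctly can be checked against the stored character sequence, which does not depend on rendering. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `letter_names` is made up for this illustration, and the name is written with escape sequences so that the stored order is unambiguous however this page is displayed.

```python
import unicodedata

# The Hebrew name Sarah, written as escapes: shin, resh, he.
sarah = "\u05e9\u05e8\u05d4"

def letter_names(text):
    """Return the Unicode name of each character in stored order."""
    return [unicodedata.name(ch) for ch in text]

# The first stored letter is shin; a correct renderer draws it rightmost.
for name in letter_names(sarah):
    print(name)
```

If the letters on screen appear with shin at the left, the display layer has reversed the text; the stored data is unchanged.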
There are very few scripts that can be written in either direction. Such was the case with Egyptian hieroglyphics, where the signs had a distinct "head" that faced the beginning of a line and "tail" that faced the end. Chinese characters can also be written in either direction, especially in signs (but the orientation of the individual characters is never changed).
Another variety of writing style, called boustrophedon, was used in some ancient Greek inscriptions, Tuareg, and Hungarian runes. This method of writing alternates direction on each successive line.
Bidirectional script support is the capability of a computer system to correctly display bi-directional text. The term is often shortened to the jargon term BiDi or bidi.
Early computer installations were designed to support only a single writing system, typically a left-to-right script based on the Latin alphabet. Adding new character sets and character encodings enabled a number of other left-to-right scripts to be supported, but did not easily support right-to-left scripts such as Arabic or Hebrew, and mixing the two was not practical. It is possible to simply flip the left-to-right display order to a right-to-left display order, but doing this sacrifices the ability to correctly display left-to-right scripts. With bidirectional script support, it is possible to mix characters from different scripts on the same page, regardless of writing direction.
In particular, the Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed.
In Unicode encoding, all non-punctuation characters are stored in writing order. The writing direction is a property of the character itself, and a character that carries a direction in this way is called "strong". Punctuation characters, however, can appear in both LTR and RTL languages. They are called "weak" characters because they do not contain any directional information, so it is up to the software to decide in which direction they will be placed. In mixed-direction text this sometimes leads to display errors: the bidi algorithm runs through the text, identifies LTR and RTL strong characters, and assigns a direction to weak characters according to its rules.
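The directional property of a character can be inspected directly. The sketch below uses `unicodedata.bidirectional` from the Python standard library, which returns the character's bidirectional class. The classes `L`, `R`, and `AL` are the strong ones. Note that this article uses "weak" loosely for every character without a direction of its own, whereas the Unicode bidirectional algorithm further divides those into weak and neutral types. The helper name `describe` is invented for this example.

```python
import unicodedata

# Bidirectional classes that carry a direction of their own.
STRONG_CLASSES = {"L": "LTR", "R": "RTL", "AL": "RTL"}

def describe(ch):
    """Return (bidi class, description) for a single character."""
    bidi_class = unicodedata.bidirectional(ch)
    direction = STRONG_CLASSES.get(bidi_class)
    kind = f"strong {direction}" if direction else "no direction of its own"
    return bidi_class, kind

# Latin A, Hebrew shin, Arabic beh, exclamation mark, trademark sign, space.
for ch in ["A", "\u05e9", "\u0628", "!", "\u2122", " "]:
    bidi_class, kind = describe(ch)
    print(f"U+{ord(ch):04X}  {bidi_class:<3} {kind}")
```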
In the algorithm, each sequence of concatenated strong characters is called a "run". A weak character that is located between two strong characters with the same orientation will inherit their orientation. A weak character that is located between two strong characters with different writing directions will inherit the main context's writing direction (in an LTR document the character will become LTR; in an RTL document, it will become RTL). If a "weak" character is followed by another "weak" character, the algorithm will look at the first neighbouring "strong" character. Sometimes this leads to unintentional display errors. To correct or prevent these errors, you can use "pseudo-strong" characters. These Unicode control characters are called "marks". A mark (U+200E, the left-to-right mark, or U+200F, the right-to-left mark) is inserted so that a weak character enclosed between the mark and a strong character of the same direction inherits that direction.
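The inheritance rules described above can be sketched in a few lines. This is a deliberately simplified illustration, not the full Unicode bidirectional algorithm (which also handles numbers, embeddings, overrides, and paired brackets). The function names are invented for the example, and treating the start and end of the text as having the base direction is an assumption of this sketch.

```python
import unicodedata

def strong_direction(ch):
    """Return 'LTR' or 'RTL' for strong characters, otherwise None."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "LTR"
    if bidi_class in ("R", "AL"):
        return "RTL"
    return None

def resolve_directions(text, base):
    """Assign a direction to every character; base is 'LTR' or 'RTL'."""
    strong = [strong_direction(ch) for ch in text]
    resolved = []
    for i, direction in enumerate(strong):
        if direction is not None:
            resolved.append(direction)
            continue
        # Nearest strong neighbour on each side, skipping other weak
        # characters; fall back to the base direction at the text edges
        # (an assumption of this sketch).
        before = next((d for d in reversed(strong[:i]) if d), base)
        after = next((d for d in strong[i + 1:] if d), base)
        resolved.append(before if before == after else base)
    return resolved

# '!' between Latin 'a' and Hebrew shin, in an RTL document.
print(resolve_directions("a!\u05e9", "RTL"))
# The same text with a left-to-right mark (U+200E) after the '!'.
print(resolve_directions("a!\u200e\u05e9", "RTL"))
```

In the first call the exclamation mark sits between strong characters of different directions and takes the RTL base direction. In the second, the mark acts as a strong LTR neighbour, so the exclamation mark is enclosed by two LTR characters and resolves to LTR.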
For example, to have the trademark symbol ™ (TM; U+2122, decimal 8482) for an English name brand (LTR) in an Arabic (RTL) passage display correctly, you need to add an LTR mark after the trademark symbol if the symbol is not followed by LTR text. This is because, if you do not add the LTR mark, the weak character ™ will be neighboured by a strong LTR character and a strong RTL character. Hence, in an RTL context, it will be considered to be RTL, and displayed in an incorrect order.
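The trademark fix can be applied programmatically. The following is a minimal sketch, assuming it is enough to look at the single character that follows the symbol; the helper `add_lrm_after_trademark` and the brand name "Acme" are hypothetical and exist only for this illustration.

```python
import unicodedata

TRADEMARK = "\u2122"  # TRADE MARK SIGN
LRM = "\u200e"        # LEFT-TO-RIGHT MARK

def add_lrm_after_trademark(text):
    """Insert an LTR mark after each trademark sign that is not
    immediately followed by a strong LTR character."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if ch == TRADEMARK:
            following = text[i + 1] if i + 1 < len(text) else ""
            # An existing LRM has class 'L', so no duplicate mark is added.
            if not following or unicodedata.bidirectional(following) != "L":
                out.append(LRM)
    return "".join(out)

# An English brand name followed by a space and an Arabic letter (beh).
passage = "Acme\u2122 \u0628"
fixed = add_lrm_after_trademark(passage)
print([f"U+{ord(ch):04X}" for ch in fixed])
```

Because the sketch inspects only the next character, it also adds a mark when the symbol is followed by a space and then more Latin text. The extra mark is invisible and does not change the result in that case.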
҉, or Combining Cyrillic Millions, is a Cyrillic numeral sign representing a modifier of one million. It has gained popularity as an internet meme; it is supposedly a character that is responsible for creating "backwards text" that applies to every character succeeding the symbol on the same line. However, the symbol itself has no special properties. Rather, it is simply the Combining Cyrillic Millions symbol surrounded by multiple style modifiers, including ones which reverse the left-to-right flow of text.
When the symbol is copied and pasted, the style formatting is copied along with it, unbeknownst to the person doing the copying. This has caused a great deal of confusion, leading the majority of users of the symbol to believe the backwards text is a property of the symbol itself rather than of the Unicode formatting characters that accompany it. [1]
The effect of the current iteration of the symbol can be seen as follows (although any symbol could actually be substituted).
҉Backwards Text.
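Invisible characters that travel with a pasted symbol can be exposed by listing every code point in the string. The sketch below is an illustration only: the article does not say which formatting characters accompany the symbol, so the sample string uses U+202E (right-to-left override) as an assumed stand-in, and the helper names `reveal` and `strip_format_characters` are invented here.

```python
import unicodedata

def reveal(text):
    """List (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

def strip_format_characters(text):
    """Drop characters in general category Cf (invisible formatting)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Assumed sample: an override character, the millions sign, then text.
pasted = "\u202e\u0489Backwards Text."

for row in reveal(pasted)[:2]:
    print(row)
print(reveal(strip_format_characters(pasted))[0])
```

Category `Cf` also covers the LTR and RTL marks, so stripping it wholesale from genuinely bidirectional text would remove marks that were placed deliberately.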