Web pages authored using hypertext markup language (HTML) may contain multilingual text represented with the Unicode universal character set.
The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike. The accurate representation of text in web pages from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, fonts, and varying levels of support by web browsers.
Web pages are typically HTML or XHTML documents. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.
An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML document character set: a character repertoire wherein each character is assigned a unique, non-negative integer code point. This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode and ISO/IEC 10646: the Universal Character Set (UCS).
Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML document, which, while not having an explicit "document character" layer of abstraction, nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.
Regardless of whether the document is HTML or XHTML, when stored on a file system or transmitted over a network, the document's characters are encoded as a sequence of octets (bytes) according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that cannot.
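The difference between the two kinds of encoding can be illustrated with a minimal Python sketch. The helper name `try_encode` is hypothetical and exists only for this illustration; `cp1252` is Python's codec name for Windows-1252.

```python
from typing import Optional


def try_encode(text: str, encoding: str) -> Optional[bytes]:
    """Return the encoded octets, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


leaf = "\u8449"  # the Chinese character for "leaf"

utf8_octets = try_encode(leaf, "utf-8")     # UTF-8 can encode any Unicode character
legacy_octets = try_encode(leaf, "cp1252")  # Windows-1252 has no slot for it

print(utf8_octets)    # three octets
print(legacy_octets)  # None
```

A character that the legacy encoding does contain, such as ß, encodes to a single octet in Windows-1252 and to two octets in UTF-8.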
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that make up the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like 33865 (decimal), which corresponds to a particular Chinese character, has to be preceded by &# and followed by ;, like this: &#33865;, which produces this: 葉 (if it doesn't look like a Chinese character, see the special characters note at bottom of article).
The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example &#9824; instead of &#x2660;).
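The decimal and hexadecimal forms can be generated and decoded with Python's standard library, as in the sketch below. The helper name `to_ncr` is hypothetical; `html.unescape` and the `xmlcharrefreplace` error handler are part of the standard library.

```python
import html


def to_ncr(char: str, hexadecimal: bool = False) -> str:
    """Spell out a single character as a numeric character reference."""
    code_point = ord(char)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


spade = "\u2660"
decimal_ref = to_ncr(spade)                 # decimal form
hex_ref = to_ncr(spade, hexadecimal=True)   # hexadecimal form, prefixed by x

# Both forms decode back to the same character.
round_trip = (html.unescape(decimal_ref), html.unescape(hex_ref))

# Encoding to a legacy charset can substitute references automatically.
ascii_only = "\u8449".encode("ascii", "xmlcharrefreplace")

print(decimal_ref, hex_ref, ascii_only)
```

Because the reference itself uses only ASCII characters, the resulting octets are valid in any ASCII-compatible encoding.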
In HTML, there is a standard set of 252 named character entities for characters (some common, some obscure) that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.
Character entities can be included in an HTML document via the use of entity references, which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents U+2014, the em dash character (like this: —), even if the character encoding used doesn't contain that character.
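The equivalence of the three forms can be checked with a short Python sketch. It relies on the standard library's `html.entities.name2codepoint` table, which maps entity names to code points; the variable names are chosen for this illustration only.

```python
import html
from html.entities import name2codepoint

# Look up the code point behind the entity name "mdash".
code_point = name2codepoint["mdash"]

# The named, decimal, and hexadecimal references for the same character.
forms = ["&mdash;", f"&#{code_point};", f"&#x{code_point:X};"]

# All three decode to the same single character, U+2014.
decoded = [html.unescape(form) for form in forms]

print(code_point, forms, decoded)
```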
In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. In order to do this, the web browser must know what encoding was used. When a document is transmitted via a MIME message or a transport that uses MIME content types such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=ISO-8859-1. Other external means of declaring encoding are permitted, but rarely used. The encoding may also be declared within the document itself, in the form of a META element, like <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. This requires an extension of ASCII to be used, like UTF-8, because the browser has to read the declaration before it knows the encoding. When there is no encoding declaration, the default varies depending on the localisation of the browser.
For a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation Windows-1252. For a browser from a location where multibyte character encodings are the norm, some form of autodetection is likely to be applied.
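Reading the charset parameter from a Content-Type header value, with a locale-style default when none is declared, might look like the following Python sketch. The function name `declared_charset` and the choice of `windows-1252` as default are assumptions for illustration; real browsers apply more elaborate rules. The sketch uses the standard library's `email.message.Message`, whose `get_content_charset` method returns the charset in lower case.

```python
from email.message import Message


def declared_charset(content_type: str, default: str = "windows-1252") -> str:
    """Return the charset named in a Content-Type value, or a default if absent."""
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return message.get_content_charset() or default


with_label = declared_charset("text/html; charset=ISO-8859-1")
without_label = declared_charset("text/html")

print(with_label)
print(without_label)
```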
Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. It is also a common misunderstanding that the encoding declaration effects a change in the actual encoding, whereas it is actually just a label that could be inaccurate.
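That the declaration is only a label can be seen in a small Python sketch: the octets on disk stay the same, and a wrong label merely causes them to be misread. The sample word and variable names are chosen for this illustration.

```python
text = "Stra\u00dfe"  # "Straße"

# The file is actually saved as UTF-8.
stored = text.encode("utf-8")

# A reader trusting an inaccurate Windows-1252 label misreads the same octets.
mislabeled = stored.decode("cp1252")

# A reader given the accurate label recovers the original text.
correct = stored.decode("utf-8")

print(mislabeled)
print(correct)
```

The two octets that encode ß in UTF-8 are each interpreted as a separate Windows-1252 character, producing the familiar garbled output.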
Many HTML documents are served with inaccurate encoding declarations, or no declarations at all. In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding autodetection algorithm that works in concert with the manual override. The manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. This has been addressed somewhat by XHTML, which, being XML, requires that encoding declarations be accurate and that no workarounds be employed when they're found to be inaccurate.
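A very simplified form of such fallback logic is sketched below in Python. It is not the algorithm of any particular browser; the function name `guess_decode` and the candidate order (declared label, then UTF-8, then Windows-1252, then ISO-8859-1 as a last resort) are assumptions made for illustration.

```python
def guess_decode(data, declared=None):
    """Decode octets using the declared label if usable, else by trial.

    Returns a (text, encoding_used) pair.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    # Strict UTF-8 rejects most non-UTF-8 byte patterns, so try it first.
    candidates += ["utf-8", "cp1252"]
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown label: move on to the next candidate.
            continue
    # ISO-8859-1 maps every octet to a character, so this never fails.
    return data.decode("latin-1"), "latin-1"


print(guess_decode("Stra\u00dfe".encode("utf-8")))
print(guess_decode(b"Stra\xdfe"))
```

Note that a label naming a single-byte encoding such as ISO-8859-1 always decodes without error, so this sketch cannot detect that kind of inaccurate declaration; that limitation is one reason a manual override remains useful.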
Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points:
| Code point | HTML character reference | Unicode name | What your browser displays |
|---|---|---|---|
| U+0041 | &#65; or &#x41; | Latin capital letter A | A |
| U+00DF | &#223; or &#xDF; | Latin small letter Sharp S | ß |
| U+00FE | &#254; or &#xFE; | Latin small letter Thorn | þ |
| U+0394 | &#916; or &#x394; | Greek capital letter Delta | Δ |
| U+0419 | &#1049; or &#x419; | Cyrillic capital letter Short I | Й |
| U+05E7 | &#1511; or &#x5E7; | Hebrew letter Qof | ק |
| U+0645 | &#1605; or &#x645; | Arabic letter Meem | م |
| U+0E57 | &#3671; or &#xE57; | Thai digit 7 | ๗ |
| U+1250 | &#4688; or &#x1250; | Ge'ez syllable Qha | ቐ |
| U+3042 | &#12354; or &#x3042; | Hiragana letter A (Japanese) | あ |
| U+53F6 | &#21494; or &#x53F6; | CJK Unified Ideograph-53F6 (Simplified Chinese "Leaf") | 叶 |
| U+8449 | &#33865; or &#x8449; | CJK Unified Ideograph-8449 (Traditional Chinese "Leaf") | 葉 |
| U+B5AB | &#46507; or &#xB5AB; | Hangul syllable Tteolp (Korean "Ssangtikeut Eo Rieulbieup") | 떫 |
| U+10346 | &#66374; or &#x10346; | Gothic letter Faihu | 𐍆 |

To display all of the characters above, you may need to install one or more large multilingual fonts, like Code2000 (and Code2001 for some extinct languages, for example Gothic).
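The columns of a table like the one above can be derived from a character with Python's standard `unicodedata` module, as in this sketch. The function name `describe` and the dictionary keys are hypothetical. The names come from the Unicode Character Database and may differ in spelling or case from the descriptive names used in the table.

```python
import unicodedata


def describe(char: str) -> dict:
    """Return the code point, both character references, and the Unicode name."""
    code_point = ord(char)
    return {
        "code_point": f"U+{code_point:04X}",
        "decimal_ref": f"&#{code_point};",
        "hex_ref": f"&#x{code_point:X};",
        "name": unicodedata.name(char, "<unnamed>"),
    }


for sample in ["A", "\u00df", "\u8449", "\U00010346"]:
    print(describe(sample))
```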
Some web browsers, such as Mozilla Firefox, Opera, and Safari, are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.
Internet Explorer version 6 for Windows is capable of displaying the full range of Unicode characters, but characters which are not present in the first available font specified in the web page will only display if they are present in the designated fallback font for the current international script[1] (for example, only Arial font will be considered for a block beginning with Latin text, or Arial Unicode MS if it is also installed; subsequent fonts specified in a list are ignored).[2] Otherwise, Internet Explorer will display placeholder squares. For characters not present in a web page's fonts, web page authors must guess which other appropriate fonts might be present on users' systems, and manually specify them as the preferred choices for each block or range of text containing such characters; Microsoft recommends using CSS to specify a font for each block of text in a different language or script. The characters in the table above haven't been assigned specific fonts, yet most should render correctly if appropriate fonts have been installed.
Older browsers, such as Netscape Navigator 4.77, can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they will display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.
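The misinterpretation of numeric character references can be modelled with a short Python sketch. Both function names are hypothetical, and the legacy model only covers numbers below 256 in a single-byte page encoding; it is an illustration of the idea, not a reconstruction of any browser's code.

```python
def compliant_ncr(number: int) -> str:
    """Standard behaviour: the number is a Unicode code point."""
    return chr(number)


def legacy_ncr(number: int, page_encoding: str) -> str:
    """Non-compliant behaviour: the number is treated as a code value
    in the page's own single-byte encoding."""
    return bytes([number]).decode(page_encoding)


# The reference &#200; on a page encoded as ISO-8859-7 (Greek):
standard_result = compliant_ncr(200)            # U+00C8, Latin capital E with grave
legacy_result = legacy_ncr(200, "iso-8859-7")   # U+0398, Greek capital theta

print(standard_result, legacy_result)
```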
For displaying characters outside the Basic Multilingual Plane, like the Gothic letter faihu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.
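Characters outside the Basic Multilingual Plane have code points above U+FFFF, so they occupy two 16-bit code units (a surrogate pair) in UTF-16 and four octets in UTF-8. The Python sketch below computes the pair for the Gothic letter faihu; the function name `surrogate_pair` is hypothetical, and the arithmetic follows the UTF-16 definition.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


faihu = "\U00010346"
pair = surrogate_pair(ord(faihu))
utf16_octets = faihu.encode("utf-16-be")
utf8_octets = faihu.encode("utf-8")

print([hex(unit) for unit in pair], utf16_octets, utf8_octets)
```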
According to internal data from Google's web index, in December 2007 the UTF-8 Unicode encoding became the most frequently used encoding on web pages, overtaking both ASCII (US) and 8859-1/1252 (Western European).[3]