In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. One popular instance of compression with which many computer users are familiar is the ZIP file format, which, as well as providing compression, acts as an archiver, storing many source files in a single destination output file.
As with any communication, compressed data communication only works when both the sender and receiver of the information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.
Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and decompress the data.
Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely without error. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is much more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very small.
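One common way to exploit such symbol-frequency redundancy is Huffman coding, which assigns shorter codewords to more frequent symbols. The following is a minimal illustrative sketch in Python, not a complete codec; it ignores serialization of the code table and single-symbol inputs:

    import heapq
    from collections import Counter

    def huffman_code(text):
        # One weighted leaf per distinct symbol; the counter i breaks ties so dicts are never compared.
        heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)                              # the two least frequent subtrees...
            hi = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in lo[2].items()}       # ...are merged, prefixing their
            merged.update({s: "1" + c for s, c in hi[2].items()}) # codewords with 0 and 1
            heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
            next_id += 1
        return heap[0][2]

    text = "this is an example of statistical redundancy"
    code = huffman_code(text)
    encoded = "".join(code[ch] for ch in text)
    print(f"{8 * len(text)} bits uncompressed -> {len(encoded)} bits compressed")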
Another kind of compression, called lossy data compression or perceptual coding, is possible if some loss of fidelity is acceptable. Generally, lossy data compression is guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than it is to variations in color. JPEG image compression works in part by "rounding off" some of this less-important information. Lossy data compression provides a way to obtain the best fidelity for a given amount of compression. In some cases, transparent (unnoticeable) compression is desired; in other cases, fidelity is sacrificed to reduce the amount of data as much as possible.
Lossless compression schemes are reversible so that the original data can be reconstructed, while lossy schemes accept some loss of data in order to achieve higher compression.
However, lossless data compression algorithms will always fail to compress some files; indeed, any compression algorithm will necessarily fail to compress any data containing no discernible patterns. This follows from a simple counting argument: there are 2^n possible files of n bits but fewer possible shorter files, so no lossless scheme can shrink every input. Attempts to compress data that has been compressed already will therefore usually result in an expansion, as will attempts to compress encrypted data.
In practice, lossy data compression will also come to a point where compressing again does not work, although an extremely lossy algorithm, such as one that always removes the last byte of a file, will keep compressing a file until it is empty.
An example of lossless vs. lossy compression is the following string:

25.888888888

This string can be compressed as:

25.[9]8

Interpreted as "twenty five point 9 eights", the original string is perfectly recreated, just written in a smaller form. In a lossy system, using

26

instead, the original data is lost, at the benefit of a smaller file size.
The above is a very simple example of run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and length of the run. This is an example of lossless data compression. It is often used to optimize disk space on office computers, or to make better use of the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).
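A minimal run-length encoder and decoder might look like the following Python sketch (the sample string is chosen only for illustration):

    from itertools import groupby

    def rle_encode(data):
        # Replace each run of identical values with a (value, run length) pair.
        return [(value, len(list(run))) for value, run in groupby(data)]

    def rle_decode(pairs):
        # Expand the (value, run length) pairs back into the original sequence.
        return "".join(value * count for value, count in pairs)

    original = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"
    encoded = rle_encode(original)
    print(encoded)                          # [('W', 12), ('B', 1), ('W', 12), ('B', 3), ...]
    assert rle_decode(encoded) == original  # lossless: the round trip is exact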
For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.
Lossy image compression is used in digital cameras, to increase storage capacities with minimal degradation of picture quality. Similarly, DVDs use the lossy MPEG-2 codec for video compression.
In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony, for example, while audio compression is used for CD ripping and is decoded by audio players.
The theoretical background of compression is provided by information theory (which is closely related to algorithmic information theory) and by rate-distortion theory. These fields of study were essentially created by Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Cryptography and coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.
Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.
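As a toy illustration of two of those stages, the following Python sketch applies simple delta prediction followed by uniform quantization; the frequency transform used by real codecs is omitted, and the step size is an arbitrary choice for the example:

    def lossy_encode(samples, step=4):
        # Predict each sample from its predecessor, then quantize the residual.
        encoded, previous = [], 0
        for s in samples:
            residual = s - previous              # prediction stage
            q = round(residual / step)           # quantization stage: information is lost here
            encoded.append(q)
            previous += q * step                 # track what the decoder will reconstruct
        return encoded

    def lossy_decode(encoded, step=4):
        # Undo quantization and prediction; the result only approximates the input.
        samples, previous = [], 0
        for q in encoded:
            previous += q * step
            samples.append(previous)
        return samples

    signal = [10, 12, 13, 13, 40, 41, 43, 42]
    print(lossy_decode(lossy_encode(signal)))    # close to, but not identical to, the original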
The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. DEFLATE is used in PKZIP, gzip and PNG. LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is LZX, used in Microsoft's CAB format.
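A minimal sketch of LZW-style dictionary compression in Python, restricted to single-byte characters for simplicity (a real implementation would also pack the emitted codes into bits):

    def lzw_compress(text):
        # Start the table with every single character; longer strings are learned on the fly.
        dictionary = {chr(i): i for i in range(256)}
        current, output = "", []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                      # keep extending the match
            else:
                output.append(dictionary[current])       # emit the code for the longest known string
                dictionary[candidate] = len(dictionary)  # add the new string to the table
                current = ch
        if current:
            output.append(dictionary[current])
        return output

    print(lzw_compress("TOBEORNOTTOBEORTOBEORNOT"))  # repeated substrings become single codes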
The very best compressors use probabilistic models whose predictions are coupled to an algorithm called arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard JBIG, and the document-compression standard DjVu. The text entry system Dasher is an inverse arithmetic coder.
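An illustrative floating-point sketch of the interval-narrowing idea behind arithmetic coding; a practical coder uses integer arithmetic with renormalization to avoid precision loss, and the symbol probabilities below are assumptions chosen only for the example:

    def arithmetic_encode(symbols, probabilities):
        # Assign each symbol a sub-interval of [0, 1) proportional to its probability.
        ranges, cumulative = {}, 0.0
        for sym, p in probabilities.items():
            ranges[sym] = (cumulative, cumulative + p)
            cumulative += p

        low, high = 0.0, 1.0
        for sym in symbols:                      # narrow the interval once per symbol
            span = high - low
            sym_low, sym_high = ranges[sym]
            high = low + span * sym_high
            low = low + span * sym_low
        return (low + high) / 2                  # any number in [low, high) identifies the message

    model = {"a": 0.6, "b": 0.3, "c": 0.1}       # assumed static model, for illustration only
    print(arithmetic_encode("aabac", model))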
There is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for "general intelligence".
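That connection can be made concrete by scoring a predictive model by the code length it implies: an arithmetic coder driven by the model spends about -log2 p bits on a symbol it predicted with probability p. A minimal sketch, assuming a hypothetical predict(history) function that returns a probability distribution over the next symbol:

    import math

    def compressed_size_in_bits(sequence, predict):
        # Total code length an ideal arithmetic coder would need under the given model.
        total = 0.0
        for i, symbol in enumerate(sequence):
            p = predict(sequence[:i])[symbol]    # model's probability for the symbol that occurred
            total += -math.log2(p)               # better predictions cost fewer bits
        return total

    # Toy stand-in for a learned predictor: a uniform distribution over two symbols.
    uniform = lambda history: {"0": 0.5, "1": 0.5}
    print(compressed_size_in_bits("0110", uniform))   # 4.0 bits: no prediction gain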
Data collections commonly used for comparing compression algorithms include the Canterbury Corpus, a collection of files intended as a benchmark for testing lossless data compression algorithms, and the Calgary Corpus, a collection of text and binary data files widely used for the same purpose.