Computable Minds - www.computableminds.com

Why can some files be compressed more than others?

Posted on: Aug 26th, 2012
Have you ever tried to compress a video or a picture with gzip, zip or rar, only to find that the result takes up more space than the original? Do you want to know why?

To explain this I'm going to use a statistical measure invented by Claude E. Shannon called the entropy of information. This mathematical tool lets us measure the quantity of information that something contains.

The applications of entropy inside and outside the field of computing are endless. Within computing it can be used in compression, speech recognition, text recognition, weather prediction, and so on. In archaeology, for instance, it helps determine whether ancient symbols belong to a language. In biology, it is used to calculate the quantity of information in DNA. In physics, it measures the disorder in a particle system (in that case it is calculated in a slightly different way and is called thermodynamic entropy), and a long list of other uses.

To understand entropy, suppose that we have a language made up of symbols. These can be defined however we want; that is, they can be words, characters, or groups of characters. Let's look at an example where the set of possible symbols consists of the words "Yes" and "No". Next we will generate a message in this language and calculate its entropy. In the message, each symbol will tell us whether or not we won something in the lottery each time we played:

M="No No No No Yes No No No No No"

Since "No" appears 9 times among the 10 symbols of the message, the probability of a "No" is P(No)=9/10=0.9,

while the probability of a "Yes" is P(Yes)=1/10=0.1.

With these data we can now calculate the entropy of the message, denoted H(M). This gives a value between 0 and log2(number of symbols)=log2(2)=1 that indicates the quantity of information the message contains:

H(M)=-(P(No)*log2(P(No))+P(Yes)*log2(P(Yes)))=

-(0.9*log2(0.9)+0.1*log2(0.1))=0.47 average bits per symbol.

These 0.47 bits represent the average quantity of information carried by the symbols of the message. It can also be defined statistically as the expected value of the uncertainty of the message. In this case, since it is hard to win anything in the lottery, the uncertainty of not winning is small, I(No)=-log2(0.9)=0.15: we hardly doubt that we will get a "No", because it is the most probable outcome. The uncertainty of winning, on the other hand, is high, I(Yes)=-log2(0.1)=3.32, because it is very hard to predict whether we are going to win. The entropy gives us the average of this uncertainty. Now let's continue with the main topic: why can't we compress a file several times until it takes up almost no space?
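As a quick check, the calculation above can be reproduced in a few lines of Python (the function name `entropy` is just for illustration):

```python
import math

def entropy(message):
    """Shannon entropy of a space-separated message, in bits per symbol."""
    symbols = message.split()
    n = len(symbols)
    # H(M) = -sum over distinct symbols of P(s) * log2(P(s))
    return -sum(symbols.count(s) / n * math.log2(symbols.count(s) / n)
                for s in set(symbols))

M = "No No No No Yes No No No No No"
print(round(entropy(M), 2))  # 0.47
```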

When a compression algorithm like Lempel-Ziv is applied to a file, the encoding is changed by grouping symbols of the language and assigning shorter symbols to the most probable combinations. Encoding is simply exchanging one symbol for another; for instance, for the previous message we could use the following code:

"No No" is equal to 0

"Yes No" is equal to 1

So the compressed message M' would look like this:

M'=00100
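This substitution can be sketched directly in Python; the dictionary `code` below is just the two-entry table from the text:

```python
# Code table from the text: each pair of symbols maps to one bit.
code = {("No", "No"): "0", ("Yes", "No"): "1"}

symbols = "No No No No Yes No No No No No".split()
# Group the message into consecutive pairs and replace each pair.
pairs = [(symbols[i], symbols[i + 1]) for i in range(0, len(symbols), 2)]
compressed = "".join(code[p] for p in pairs)
print(compressed)  # 00100

# Decoding needs the inverse table -- this is the "additional
# information" that must be stored alongside the compressed message.
decode = {v: k for k, v in code.items()}
restored = " ".join(" ".join(decode[bit]) for bit in compressed)
print(restored == " ".join(symbols))  # True
```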

The message is now smaller. However, to recover the original message we also need to store the additional information that tells us how to decode each symbol, so in this toy case we don't actually end up with a smaller file. This already tells us that compressing a very small file is not going to reduce its size much.

In practice, files aren't usually this small, and compressing them does save space, because the additional information stored for decompression is tiny compared with the total size of the file.

Then why don't some big files shrink when we compress them? To answer this question, notice another effect produced in the compressed file above: the entropy of the newly generated message has increased. It looks like this:

P(0)=4/5=0.8

P(1)=1/5=0.2

H(M')=-(P(0)*log2(P(0))+P(1)*log2(P(1)))=0.72 average bits per symbol.

Now each symbol of the message carries more information, so it is harder to compress. If we compress it again, not only will it not take up less space, it will take up more, because of the additional information we have to add in order to decompress it, and because the entropy cannot be condensed much further.
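This effect is easy to observe with Python's standard `zlib` module (any DEFLATE-based compressor behaves similarly):

```python
import zlib

# Low-entropy input: lots of repetition, like the lottery message.
original = b"No No No No Yes No No No No No " * 200
once = zlib.compress(original)
print(len(original), len(once))   # the first pass shrinks it a lot

# The output is high-entropy, so a second pass only adds overhead.
twice = zlib.compress(once)
print(len(twice) > len(once))     # True: compressing again grows the file
```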

In conclusion, languages that carry a lot of information per symbol, like the ones found inside a file already compressed as ZIP, RAR, JPEG, MP3, etc., have high entropy. Languages with a lot of repetition of words or symbols, for instance an internet chat, a text document, a web page, or a WAV file, give very low values. If the entropy is low, we can compress the information further, building a shorter message with other symbols in which the information is more condensed. But if it is high, there is nothing we can do to compress it: when we try, we get a bigger file, because the message will occupy roughly the same space and we still have to add the information needed to decompress it.
