  • lzw

    2012-12-19 11:22:00

    Note: I have an updated article on LZW posted here. Please check out the new article and tell me what you think. I hope it improves on this post and makes LZW easier to understand.

    Thanks to Jan Hakenberg for correction of a couple of errors! In Figure 4, the values for new table entries 259 and 265 were truncated.

    Thanks to David Littlewood for pointing out the missing line of pseudocode in Figure 6.

    Thanks to Joe Snyder for pointing out a line where a macro should replace a hard-coded constant.

    Any programmer working on mini or microcomputers in this day and age should have at least some exposure to the concept of data compression. In the MS-DOS world, programs like ARC, by System Enhancement Associates, and PKZIP, by PKware, are ubiquitous. ARC has also been ported to quite a few other machines, running UNIX, CP/M, and so on. CP/M users have long had SQ and USQ to squeeze and expand programs. Unix users have the COMPRESS and COMPACT utilities. Yet the data compression techniques used in these programs typically only show up in two places: file transfers over phone lines, and archival storage.

    Data compression has an undeserved reputation for being difficult to master, hard to implement, and tough to maintain. In fact, the techniques used in the previously mentioned programs are relatively simple, and can be implemented with standard utilities taking only a few lines of code. This article discusses a good all-purpose data compression technique: Lempel-Ziv-Welch, or LZW compression.

    The routines shown here belong in any programmer's toolbox. For example, a program that has a few dozen help screens could easily chop 50K bytes off by compressing the screens. Or 500K bytes of software could be distributed to end users on a single 360K byte floppy disk. Highly redundant database files can be compressed down to 10% of their original size. Once the tools are available, the applications for compression will show up on a regular basis.

    LZW Fundamentals

    The original Lempel-Ziv approach to data compression was first published in 1977, followed by an alternate approach in 1978. Terry Welch's refinements to the 1978 algorithm were published in 1984. The algorithm is surprisingly simple. In a nutshell, LZW compression replaces strings of characters with single codes. It does not do any analysis of the incoming text. Instead, it just adds every new string of characters it sees to a table of strings. Compression occurs when a single code is output instead of a string of characters.

    The code that the LZW algorithm outputs can be of any arbitrary length, but it must have more bits in it than a single character. The first 256 codes (when using eight bit characters) are by default assigned to the standard character set. The remaining codes are assigned to strings as the algorithm proceeds. The sample program runs as shown with 12 bit codes. This means codes 0-255 refer to individual bytes, while codes 256-4095 refer to substrings.

    Compression

    The LZW compression algorithm in its simplest form is shown in Figure 1. A quick examination of the algorithm shows that LZW is always trying to output codes for strings that are already known. And each time a new code is output, a new string is added to the string table.

    Routine LZW_COMPRESS

    CODE:
    1. STRING = get input character
    2. WHILE there are still input characters DO
    3.     CHARACTER = get input character
    4.     IF STRING+CHARACTER is in the string table THEN
    5.         STRING = STRING+CHARACTER
    6.     ELSE
    7.         output the code for STRING
    8.         add STRING+CHARACTER to the string table
    9.         STRING = CHARACTER
    10.     END of IF
    11. END of WHILE
    12. output the code for STRING

    The Compression Algorithm
    Figure 1

    A sample string used to demonstrate the algorithm is shown in Figure 2. The input string is a short list of English words separated by the '/' character. Stepping through the start of the algorithm for this string, you can see that on the first pass through the loop, a check is performed to see if the string "/W" is in the table. Since it isn't, the code for '/' is output, and the string "/W" is added to the table. Since we have 256 characters already defined for codes 0-255, the first string definition can be assigned to code 256. After the third letter, 'E', has been read in, the second string code, "WE" is added to the table, and the code for letter 'W' is output. This continues until in the second word, the characters '/' and 'W' are read in, matching string number 256. In this case, the code 256 is output, and a three character string is added to the string table. The process continues until the string is exhausted and all of the codes have been output.


    Input String = /WED/WE/WEE/WEB/WET
    Character Input   Code Output   New code value   New String
    /W                /             256              /W
    E                 W             257              WE
    D                 E             258              ED
    /                 D             259              D/
    WE                256           260              /WE
    /                 E             261              E/
    WEE               260           262              /WEE
    /W                261           263              E/W
    EB                257           264              WEB
    /                 B             265              B/
    WET               260           266              /WET
    EOF               T

    The Compression Process
    Figure 2

    The sample output for the string is shown in Figure 2 along with the resulting string table. As can be seen, the string table fills up rapidly, since a new string is added to the table each time a code is output. In this highly redundant input, 5 code substitutions were output, along with 7 characters. If we were using 9 bit codes for output, the 19 character input string would be reduced to a 13.5 byte output string. Of course, this example was carefully chosen to demonstrate code substitution. In real world examples, compression usually doesn't begin until a sizable table has been built, usually after at least one hundred or so bytes have been read in.

    Decompression

    The companion algorithm for compression is the decompression algorithm. It needs to be able to take the stream of codes output from the compression algorithm, and use them to exactly recreate the input stream. One reason for the efficiency of the LZW algorithm is that it does not need to pass the string table to the decompression code. The table can be built exactly as it was during compression, using the input stream as data. This is possible because the compression algorithm always outputs the STRING and CHARACTER components of a code before it uses it in the output stream. This means that the compressed data is not burdened with carrying a large string translation table.

    Routine LZW_DECOMPRESS

    CODE:
    1. Read OLD_CODE
    2. output OLD_CODE
    3. WHILE there are still input characters DO
    4.     Read NEW_CODE
    5.     STRING = get translation of NEW_CODE
    6.     output STRING
    7.     CHARACTER = first character in STRING
    8.     add OLD_CODE + CHARACTER to the translation table
    9.     OLD_CODE = NEW_CODE
    10. END of WHILE

    The Decompression Algorithm
    Figure 3

    The algorithm is shown in Figure 3. Just like the compression algorithm, it adds a new string to the string table each time it reads in a new code. All it needs to do in addition to that is translate each incoming code into a string and send it to the output.

    Figure 4 shows the output of the algorithm given the input created by the compression earlier in the article. The important thing to note is that the string table ends up looking exactly like the table built up during compression. The output string is identical to the input string from the compression algorithm. Note that the first 256 codes are already defined to translate to single character strings, just like in the compression code.


    Input Codes: / W E D 256 E 260 261 257 B 260 T
    Input/NEW_CODE   OLD_CODE   STRING/Output   CHARACTER   New table entry
    /                           /               /
    W                /          W               W           256 = /W
    E                W          E               E           257 = WE
    D                E          D               D           258 = ED
    256              D          /W              /           259 = D/
    E                256        E               E           260 = /WE
    260              E          /WE             /           261 = E/
    261              260        E/              E           262 = /WEE
    257              261        WE              W           263 = E/W
    B                257        B               B           264 = WEB
    260              B          /WE             /           265 = B/
    T                260        T               T           266 = /WET

    The Decompression Process
    Figure 4

    The Catch

    Unfortunately, the nice simple decompression algorithm shown in Figure 3 is just a little too simple. There is a single exception case in the LZW compression algorithm that causes some trouble to the decompression side. If there is a string consisting of a (STRING,CHARACTER) pair already defined in the table, and the input stream then sees a sequence of STRING, CHARACTER, STRING, CHARACTER, STRING, the compression algorithm will output a code before the decompressor gets a chance to define it.

    A simple example will illustrate the point. Imagine that the string JOEYN is defined in the table as code 300. Later on, the sequence JOEYNJOEYNJOEY occurs in the input. The compression output will look like that shown in Figure 5.


    Input String: ...JOEYNJOEYNJOEY
    Character Input   New Code/String   Code Output
    JOEYN             300 = JOEYN       288 (JOEY)
    A                 301 = NA          N
    ...               ...               ...
    JOEYNJ            400 = JOEYNJ      300 (JOEYN)
    JOEYNJO           401 = JOEYNJO     400 (???)

    A problem
    Figure 5

    When the decompression algorithm sees this input stream, it first decodes the code 300, and outputs the JOEYN string. After doing the output, it will add the definition for code 399 to the table, whatever that may be. It then reads the next input code, 400, and finds that it is not in the table. This is a problem. What do we do?

    Fortunately, this is the only case where the decompression algorithm will encounter an undefined code. Since it is in fact the only case, we can add an exception handler to the algorithm. The modified algorithm just looks for the special case of an undefined code, and handles it. In the example in Figure 5, the decompression routine sees a code of 400, which is undefined. Since it is undefined, it translates the value of OLD_CODE, which is code 300. It then adds the CHARACTER value, which is 'J', to the string. This results in the correct translation of code 400 to string "JOEYNJ".

    Routine LZW_DECOMPRESS

    CODE:
    1. Read OLD_CODE
    2. output OLD_CODE
    3. CHARACTER = OLD_CODE
    4. WHILE there are still input characters DO
    5.     Read NEW_CODE
    6.     IF NEW_CODE is not in the translation table THEN
    7.         STRING = get translation of OLD_CODE
    8.         STRING = STRING+CHARACTER
    9.     ELSE
    10.         STRING = get translation of NEW_CODE
    11.     END of IF
    12.     output STRING
    13.     CHARACTER = first character in STRING
    14.     add OLD_CODE + CHARACTER to the translation table
    15.     OLD_CODE = NEW_CODE
    16. END of WHILE

    The Modified Decompression Algorithm
    Figure 6

    The Implementation Blues

    The concepts used in the compression algorithm are very simple -- so simple that the whole algorithm can be expressed in only a dozen lines. Implementation of this algorithm is somewhat more complicated, mainly due to management of the string table.

    In the code accompanying this article, I have used code sizes of 12, 13, and 14 bits. In a 12 bit code program, there are potentially 4096 strings in the string table. Each and every time a new character is read in, the string table has to be searched for a match. If a match is not found, then a new string has to be added to the table. This causes two problems. First, the string table can get very large very fast. If string lengths average even as low as three or four characters each, the overhead of storing a variable length string and its code could easily reach seven or eight bytes per code.

    In addition, the amount of storage needed is indeterminate, as it depends on the total length of all the strings.

    The second problem involves searching for strings. Each time a new character is read in, the algorithm has to search for the new string formed by STRING+CHARACTER. This means keeping a sorted list of strings. Searching for each string will take on the order of log2 N string comparisons, where N is the number of strings in the table. Using 12 bit codes means potentially doing 12 string compares for each code. The computational overhead caused by this would be prohibitive.

    The first problem can be solved by storing the strings as code/character combinations. Since every string is actually a combination of an existing code and an appended character, we can store each string as single code plus a character. For example, in the compression example shown, the string "/WEE" is actually stored as code 260 with appended character E. This takes only three bytes of storage instead of 5 (counting the string terminator). By backtracking, we find that code 260 is stored as code 256 plus an appended character E. Finally, code 256 is stored as a '/' character plus a 'W'.

    Doing the string comparisons is a little more difficult. Our new method of storage reduces the amount of time for a string comparison, but it doesn't cut into the number of comparisons needed to find a match. This problem is solved by storing strings using a hashing algorithm. What this means is that we don't store code 256 in location 256 of an array. We store it in a location in the array based on an address formed by the string itself. When we are trying to locate a given string, we can use the test string to generate a hashed address, and with luck can find the target string in one search.

    Since the code for a given string is no longer known merely by its position in the array, we need to store the code for a given string along with the string data. In the attached program, there are three array elements for each string. They are code_value[i], prefix_code[i], and append_character[i].

    When we want to add a new code to the table, we use the hashing function in routine find_match to generate the correct value of i. First find_match generates an address, then checks to see if the location is already in use by a different string. If it is, it performs a secondary probe until an open location is found.

    The hashing function in use in this program is a straightforward XOR type hash function. The prefix code and appended character are combined to form an array address. If the contents of the prefix code and character in the array are a match, the correct address is returned. If that element in the array is in use, a fixed offset probe is used to search new locations. This continues until either an empty slot is found, or a match is found. By using a table about 25% larger than needed, the average number of searches in the table usually stays below 3. Performance can be improved by increasing the size of the table.

    Note that in order for the secondary probe to always work, the size of the table needs to be a prime number. This is because the probe can be any integer between 0 and the table size. If the probe and the table size were not mutually prime, a search for an open slot could fail even if there were still open slots available.

    Implementing the decompression algorithm has its own set of problems. One of the problems from the compression code goes away. When we are compressing, we need to search the table for a given string. During decompression, we are looking for a particular code. This means that we can store the prefix codes and appended characters in the table indexed by their string code. This eliminates the need for a hashing function, and frees up the array used to store code values.

    Unfortunately, the method we are using of storing string values causes the strings to be decoded in reverse order. This means that all the characters for a given string have to be decoded into a stack buffer, then output in reverse order. In the program here this is done in the decode_string function. Once this code is written, the rest of the algorithm turns into code very easily.

    One problem encountered when reading in data streams is determining when you have reached the end of the input data stream. In this particular implementation, I have reserved the last defined code, MAX_VALUE, as a special end of data indicator. While this may not be necessary when reading in data files, it is very helpful when reading compressed buffers out of memory. The expense of losing one defined code is minimal in comparison to the convenience.

    Results

    It is somewhat difficult to characterize the results of any data compression technique. The level of compression achieved varies quite a bit depending on several factors. LZW compression excels when confronted with data streams that have any type of repeated strings. Because of this, it does extremely well when compressing English text. Compression levels of 50% or better should be expected. Likewise, compressing saved screens and displays will generally show very good results.

    Trying to compress binary data files is a little more risky. Depending on the data, compression may or may not yield good results. In some cases, data files will compress even more than text. A little bit of experimentation will usually give you a feel for whether your data will compress well or not.

    Your Implementation

    The code accompanying this article works. However, it was written with the goal of being illuminating, not efficient. Some parts of the code are relatively inefficient. For example, the variable length input and output routines are short and easy to understand, but require a lot of overhead. An LZW program using fixed length 12 bit codes could experience real improvements in speed just by recoding these two routines.

    One problem with the code listed here is that it does not adapt well to compressing files of differing sizes. Using 14 or 15 bit codes gives better compression ratios on large files (because they have a larger string table to work with), but gives poorer performance on small files. Programs like ARC get around this problem by using variable length codes. For example, while the value of next_code is between 256 and 511, ARC inputs and outputs 9 bit codes. When the value of next_code increases to the point where 10 bit codes are needed, both the compression and decompression routines adjust the code size. This means that the 12 bit and 15 bit versions of the program will do equally well on small files.

    Another problem on long files is that frequently the compression ratio begins to degrade as more of the file is read in. The reason for this is simple. Since the string table is of finite size, after a certain number of strings have been defined, no more can be added. But the string table is only good for the portion of the file that was read in while it was built. Later sections of the file may have different characteristics, and really need a different string table.

    The conventional way to solve this problem is to monitor the compression ratio. After the string table is full, the compressor watches to see if the compression ratio degrades. After a certain amount of degradation, the entire table is flushed, and gets rebuilt from scratch. When this happens, the compression routine emits a special code to notify the expansion code.

    An alternative method would be to keep track of how frequently strings are used, and to periodically flush values that are rarely used. An adaptive technique like this may be too difficult to implement in a reasonably sized program.

    One final technique for compressing the data is to take the LZW codes and run them through an adaptive Huffman coding filter. This will generally squeeze out a few more percentage points of compression, but at the cost of considerably more complexity in the code, as well as quite a bit more run time.

    Portability

    The code linked below was written and tested on MS-DOS machines, and has successfully compiled and executed with several C compilers. It should be portable to any machine that supports 16 bit integers and 32 bit longs in C. MS-DOS C compilers typically have trouble dealing with arrays larger than 64 Kbytes, preventing an easy implementation of 15 or 16 bit codes in this program. On machines using different processors, such as a VAX, these restrictions are lifted, and using larger code sizes becomes much simpler.

    In addition, porting this code to assembly language should be fairly simple on any machine that supports 16 and 32 bit math. Significant performance improvements could be seen this way. Implementations in other high level languages should be straightforward.

    Bibliography

    1. Terry Welch, "A Technique for High-Performance Data Compression", IEEE Computer, June 1984
    2. J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977
    3. Rudy Rucker, "Mind Tools", Houghton Mifflin Company, 1987

    Source Code

    LZW.C

    Update - September 9, 2007

    Reader David McKellar offers up some source code with the following comments:

    I modified your fine 1989 lzw.c code so it outputs the same as the GNU/Linux compress command. Of course, I also made the corresponding changes in the expand part of your code.

    I made the smallest number of changes possible. You can diff to see what I did. I realize your program is an example program to show the concepts but I think it doesn't add to the complexity too much to be compress compatible.

    I should mention that my changes don't totally support the GNU/Linux compress. The real compress flushes its hash table when it notices the ratio goes down. It changes the bits used for encoding when the table is full. And probably more tricky stuff. However that is only an issue when lzw.c is attempting to read a compressed file.

    lzw_nelson.cpp

    Update - November 8, 2007

    Reader Bogdan Banica contributes source code which is an adapted version of the program that operates on memory buffers instead of files: stream_lzw.c

    Reposted from: https://my.oschina.net/lyr/blog/96925
  • LZW Encoding and Decoding Explained in Detail

    2019-06-07 20:41:41

    I've been looking at LZW encoding and decoding recently, and came across a good article, which I'm recording here:

    Reposted from: https://segmentfault.com/a/1190000011425787

    Recently, while tidying up some old code I had scribbled together on Github, I found that I had once written compression algorithms, probably as practice at some point. I had implemented three compression algorithms there: Huffman trees, LZW dictionary coding, and arithmetic coding. After several years I remembered almost none of it; for the latter two I had basically forgotten even the principles, so I'm pulling them out to reorganize them, partly to force myself to recall.

    This post covers the LZW algorithm. The introduction and examples in the Wikipedia entry are actually quite good, but I did not find many other clear explanations online; most just dump code and flowcharts. So here I'll walk through the principles of LZW again in my own words.

    A Simple Example

    The core idea of LZW encoding is fairly simple: map strings that have already appeared to symbols, so that a long string can potentially be represented by a short code, which is where the compression comes from. For example, take the string:

    ABABAB

    The substring AB reappears later, so we can represent AB with a special mark, say the number 2, and the original string can then be written as:

    AB22

    We say that 2 is the symbol for the substring AB. Do A and B get symbols of their own? Of course: say the number 0 stands for A and the number 1 for B. The compressed data we finally obtain is then really a symbol stream:

    0122

    This gives us a mapping table between symbols and strings, i.e. a dictionary:

    Symbol String
    0 A
    1 B
    2 AB

    Given the compressed encoding 0122 together with the dictionary, decoding the original string ABABAB is easy.

    Of course, in real LZW, A and B would not be represented by the numbers 0 and 1 but by their ASCII values. LZW actually starts with a default dictionary containing all 256 8-bit characters, where the symbol for a single character is the character itself, numerically its ASCII value. New symbol mappings added during encoding start from 256 and form the extended dictionary. In this example there are only two base characters, for simplicity, so we let 0 stand for A and 1 for B, and the extended entries begin at symbol 2.

    Building the Dictionary

    A question arises here: why isn't the first AB also represented by 2? Writing 222 would save one more symbol. This question leads straight to a core idea of LZW: the compressed encoding is self-explaining. What does that mean? The dictionary is never written into the compressed file. When decompression begins, the dictionary contains nothing beyond the default 0->A and 1->B; the entry 2->AB is added along the way as decompression proceeds. The compressed data must therefore be able to tell the decoder, by itself, how the full dictionary was built, e.g. how 2->AB came about, so that the dictionary used at encoding time can be reconstructed during decoding.

    Using the example above, we can imagine how ABABAB is encoded:

    1. See A; represent it with 0, encoding it as 0.
    2. See B; represent it with 1, encoding it as 1.
    3. A substring AB has now been seen; add the mapping 2->AB to the dictionary.
    4. Every later occurrence of the substring AB is encoded as 2.

    This is only an outline, not the actual LZW encoding process, but it conveys the idea. Notice that the leading A and B are what generated the entry 2->AB, so they must be kept in the compressed encoding as the "scene" where the entry 2->AB was created. When decoding 0122, the decoder first decodes the leading A and B directly from 0 and 1, and in doing so generates the entry 2->AB; only then can every later 2 be decoded as AB. In effect the decoder replays, on its own, how 2->AB was generated at encoding time.

    Both encoding and decoding move forward step by step, building the dictionary as they go, so decoding is also a process of continuously reconstructing the encoding dictionary: the decoder decodes and advances, while re-enacting the encoding process on the data already decoded, building the same dictionary the encoder used.

    LZW in Detail

    Below is the complete LZW encoding and decoding procedure, together with a slightly more complex example, to explain how LZW works. The crucial part is understanding how each step of decoding corresponds to, and restores, a step of encoding, thereby recovering the encoding dictionary.

    The Encoding Algorithm

    The encoder keeps reading new characters from the original string and tries to encode single characters or strings as symbols. We maintain two variables: P (previous), the string we currently hold that has not yet been encoded, and C (current), the character just read in.

     1. Initially the dictionary holds only the default entries, e.g. 0->a, 1->b, 2->c. P and C are both empty.
     2. Read a new character C and join it onto P, forming the string P+C.
     3. Look up P+C in the dictionary:
        - If P+C is in the dictionary, set P = P+C.
        - If P+C is not in the dictionary, output the symbol for P; create a symbol mapping for P+C in the dictionary; then set P = C.
     4. Return to step 2 and repeat until all characters of the original string have been read.

    The above describes the general, middle part of the encoding; the end needs special handling: in step 2, if the end of the string has been reached and there is no new C to read, output the symbol for the P in hand, and stop.

    The heart of the encoding is step 3, and we need to understand what P really is. P is the substring we currently maintain that can be encoded as a symbol — can be encoded, but has not yet been output. New characters C keep being read in and appended to the tail of P, and as long as P+C can still be found in the dictionary, P keeps growing through P = P+C. This is how a substring P as long as possible gets encoded as a single symbol, which is how the compression is realized. When the new P+C can no longer be found in the dictionary, there is nothing else to do: we output the symbol for the P we already have, create a dictionary entry for the new substring P+C, and then let the new P restart from the single character C and grow again, repeating the process above.

    Here is an example to illustrate the encoding process; lowercase letters are used for the string to distinguish it from P and C.

    ababcababac

    Initially the dictionary has three default mappings:

    Symbol String
    0 a
    1 b
    2 c

    Encoding begins:

    Step  P    C  P+C   P+C in Dict ?  Action                  Output
    1     -    a  a     Yes            set P = a               -
    2     a    b  ab    No             add 3->ab, set P = b    0
    3     b    a  ba    No             add 4->ba, set P = a    1
    4     a    b  ab    Yes            set P = ab              -
    5     ab   c  abc   No             add 5->abc, set P = c   3
    6     c    a  ca    No             add 6->ca, set P = a    2
    7     a    b  ab    Yes            set P = ab              -
    8     ab   a  aba   No             add 7->aba, set P = a   3
    9     a    b  ab    Yes            set P = ab              -
    10    ab   a  aba   Yes            set P = aba             -
    11    aba  c  abac  No             add 8->abac, set P = c  7
    12    c    -  -     -              -                       2

    Note steps 3-4, 7-8, and 8-10: the substring P grows until the new P+C cannot be found in the dictionary, at which point the current P is output and P is reset to the single character C to start growing again.

    The output is 0132372, and the complete dictionary is:

    Symbol String
    0 a
    1 b
    2 c
    3 ab
    4 ba
    5 abc
    6 ca
    7 aba
    8 abac

    (The original post includes a figure here showing how the original string maps onto the compressed encoding.)

    --

    The Decoding Algorithm

    Decoding is more involved than encoding. Its core idea is that decoding must recover the dictionary used at encoding time, so to understand how decoding works, we must analyze how it corresponds to the encoding process. First, the algorithm:

    The decoder's input is the compressed data, i.e. the symbol stream. As in encoding, we maintain two variables, pW (previous word) and cW (current word); the suffix W stands for word, and each is in fact a symbol: one symbol stands for one word, that is, one substring. pW is the symbol decoded just before; cW is the symbol newly read in.

    Note that cW and pW are both symbols; we write Str(cW) and Str(pW) for the original strings they decode to.

    1. Initially the dictionary holds only the default entries, e.g. 0->a, 1->b, 2->c. pW and cW are both empty.
    2. Read the first symbol cW, decode it, and output. Note that the first cW can always be decoded directly, and is necessarily a single character.
    3. Set pW = cW.
    4. Read the next symbol cW.
    5. Look up cW in the dictionary:
       a. cW is in the dictionary:
         (1) Decode cW, i.e. output Str(cW).
         (2) Let P = Str(pW), and C = the **first character** of Str(cW).
         (3) Add a new symbol mapping for P+C to the dictionary.
       b. cW is not in the dictionary:
         (1) Let P = Str(pW), and C = the **first character** of Str(pW).
         (2) Add a new symbol mapping for P+C to the dictionary; this new symbol is necessarily cW.
         (3) Output P+C.
    6. Return to step 3 and repeat until all symbols have been read.

    Clearly step 5 is the most important part, and the hardest to understand. In this step the decoder keeps simulating the encoding process on the data it has already deciphered, restoring the dictionary. We again use the earlier example: from the symbol stream

    0 1 3 2 3 7 2

    we need to decode:

    a b ab c ab aba c

    The spaces here show how the symbols correspond, one by one, to the decoded substrings. Of course, when decoding starts we know none of this; the dictionary in our hands holds only the default entries:

    Symbol String
    0 a
    1 b
    2 c

    Decoding begins:
    First read the first symbol cW=0, decode it to a, output it, and set pW=cW=0. Then start the loop, reading the following symbols in turn:

    Step  pW  cW  cW in Dict ?  Action                          Output
    1     0   1   Yes           P=a, C=b, P+C=ab, add 3->ab     b
    2     1   3   Yes           P=b, C=a, P+C=ba, add 4->ba     ab
    3     3   2   Yes           P=ab, C=c, P+C=abc, add 5->abc  c

    Let's stop decoding here for a moment; we have already decoded the first five characters, a b ab c. Stepping through it like this shows the idea behind decoding. First the leading a and b are decoded directly, and the mapping 3->ab is generated; in other words, the decoder uses the characters already decoded to faithfully replay how the dictionary was built during encoding. This is also why the first a and b must be kept rather than encoded directly as 3: at the beginning, the decoder simply does not know that 3 stands for ab. The second and all later occurrences of ab can then be deciphered from the symbol 3, because by that point the relation 3->ab has been established.

    Looking carefully at how new mappings are added shows how decoding reconstructs the encoding process. In decoding step 5.a, P = Str(pW) and C = the first character of Str(cW). (The original post illustrates this with a figure.)

    Note how P+C is formed: take the previous symbol pW, plus the first character of the newest symbol cW. This corresponds exactly to the point in encoding where P+C is not in the dictionary: P is encoded and output as pW, P is updated to C, and P starts growing again from the single character C.

    So far we have only needed case 5.a of decoding, where every newly read cW can be found in the dictionary; only then can we decode cW directly, take its first character as C, and form P+C with P. But there is another possibility, case 5.b, where cW is not in the dictionary. How can cW not be in the dictionary? Back to the example: we have decoded five characters so far; let's continue:

    Step  pW  cW  cW in Dict ?  Action                             Output
    4     2   3   Yes           P=c, C=a, P+C=ca, add 6->ca        ab
    5     3   7   No            P=ab, C=a, P+C=aba, add 7->aba     aba
    6     7   2   Yes           P=aba, C=c, P+C=abac, add 8->abac  c

    With that, the remaining ab aba c has been decoded as well, and decoding is finished. The crucial step is Step 5: a new cW of 7 is read in, yet 7 is not in the dictionary at this point. We happen to know that 7 should correspond to aba, but how is the decoder supposed to work that out?

    Why, by this step of decoding, has 7->aba still not been entered into the dictionary? Because decoding runs one step behind encoding. In fact, aba is formed precisely from the current P=ab plus the first character C of the still-unknown cW=7; so cW maps to exactly the substring P+C that is about to be added, and therefore the first character of cW is the first character of pW, namely a, and cW is aba.

    We see the decoder making an inference here: since cW has not yet been added to the dictionary and yet decoding has just run into it, cW's mapping cannot have been added long ago; it must have been created at this very step. In terms of the encoding process: the new mapping 7->aba had just been written into the dictionary when the very next substring already used it. The reader can walk through the encoding of the second half, ab aba c, and compare it with this reverse inference in decoding to understand the principle. This is the hardest part of the decoding algorithm.

    Summary

    That completes the walkthrough of LZW encoding and decoding. The idea itself is simple: represent substrings of the original data by symbols, rather like compiling a dictionary. The way substrings are cut and mappings are created is not actually unique, but the rigor of LZW lies in providing a scheme under which the compressed encoding can uniquely reconstruct the dictionary built during encoding, so the dictionary itself need not be written into the compressed file. Consider that if the dictionary had to be written in as well, the space it takes up could by itself be large, possibly wiping out the gains from compression.

    I've implemented LZW in C++; see github: https://github.com/Wzing0421/LZW
