  • hadoop lzo usage test

    2019-04-15 17:32:27
    Testing hadoop lzo usage: splittability and the LZO index

    LZO test

    The previous article, hadoop lzo configuration, covered setting LZO up.
    This one tests LZO in use.

    Prepare the raw data

       1.4G before compression, 213M after lzop compression:
        [root@spark001 hadoop]# du -sh *
        1.4G	baidu.log
        [root@spark001 hadoop]# lzop baidu.log
        [root@spark001 hadoop]# du -sh *
        1.4G	baidu.log
        213M	baidu.log.lzo
    

    Upload to HDFS

    [root@spark001 hadoop]# hdfs dfs -mkdir -p /user/hadoop/compress/log/200M
    [root@spark001 hadoop]# hdfs dfs -put baidu.log.lzo /user/hadoop/compress/log/200M/
    
    
    A plain .lzo file is not splittable
        ETL test
            hadoop jar hadoop-train-1.0.jar  com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo  /user/hadoop/compress/log/200M/ /user/hadoop/compress/log/etl_lzo/200/
    
          [root@spark001 hadoop]#  hadoop jar hadoop-train-1.0.jar  com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo  /user/hadoop/compress/log/200M/ /user/hadoop/compress/log/etl_lzo/200/
        19/04/15 17:06:34 INFO driver.LogETLDirverLzo: Processing trade with value: /user/hadoop/compress/log/etl_lzo/200/  
        19/04/15 17:06:34 INFO client.RMProxy: Connecting to ResourceManager at spark001/172.31.220.218:8032
        19/04/15 17:06:34 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
        19/04/15 17:06:35 INFO input.FileInputFormat: Total input paths to process : 1
        19/04/15 17:06:35 INFO mapreduce.JobSubmitter: number of splits:1  (only 1 split: this kind of .lzo file is not splittable)
    

    Making LZO splittable

    An LZO index must be created first.

    [root@spark001 hadoop]# hdfs dfs -mkdir -p /user/hadoop/compress/log/200M_index
    [root@spark001 hadoop]# hdfs dfs -put baidu.log.lzo /user/hadoop/compress/log/200M_index/
    Create the index:
     hadoop jar /opt/cloudera/parcels/GPLEXTRAS-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/lib/hadoop-lzo-0.4.15-cdh5.13.1.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer \
    /user/hadoop/compress/log/200M_index
    An .index file is generated in the same directory:
    [root@spark001 hadoop]# hdfs dfs -ls /user/hadoop/compress/log/200M_index/
    Found 2 items
    -rw-r--r--   3 root supergroup  222877272 2019-04-15 16:00 /user/hadoop/compress/log/200M_index/baidu.log.lzo
    -rw-r--r--   3 root supergroup      43384 2019-04-15 16:09 /user/hadoop/compress/log/200M_index/baidu.log.lzo.index
    Run the ETL job:
    [root@spark001 hadoop]#  hadoop jar hadoop-train-1.0.jar  com.bigdata.hadoop.mapreduce.driver.LogETLDirverLzo  /user/hadoop/compress/log/200M_index/ /user/hadoop/compress/log/etl_lzo/200_index/
    19/04/15 17:10:09 INFO driver.LogETLDirverLzo: Processing trade with value: /user/hadoop/compress/log/etl_lzo/200_index/  
    19/04/15 17:10:09 INFO client.RMProxy: Connecting to ResourceManager at spark001/172.31.220.218:8032
    19/04/15 17:10:09 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    19/04/15 17:10:09 INFO input.FileInputFormat: Total input paths to process : 2
    19/04/15 17:10:10 INFO mapreduce.JobSubmitter: number of splits:2                  (2 splits)
    This shows that for LZO to be splittable, an index must be created.
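
    As a sanity check on the index itself: in the hadoop-lzo format, the .lzo.index file is a flat sequence of 8-byte block offsets, so its size divided by 8 tells you how many compressed LZO blocks the file contains (a quick sketch, using the index size from the listing above):

```shell
#!/usr/bin/env bash
# The hadoop-lzo .lzo.index file is a flat sequence of 8-byte block
# offsets, so size/8 gives the number of compressed LZO blocks.
index_bytes=43384   # size of baidu.log.lzo.index from the listing above
echo "$(( index_bytes / 8 )) LZO blocks"
```

    For this file: 43384 / 8 = 5423 blocks, which is consistent with lzop's default 256 KB uncompressed block size applied to a 1.4G input.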
    

    Summary

       A 213M file should yield 2 splits if splittable, and only 1 if not.
        When splittable, a file larger than one block is handled by 2 map tasks, which improves efficiency.
        When not splittable, the file is handled by a single map no matter how large it is, which wastes time.
    	So when using LZO in production, control the size of the generated .lzo files and keep each within one block. Without an .lzo index file, the whole file goes to a single map, and an oversized file makes that map run far too long.

    	Alternatively, pair each file with an .lzo.index file so it supports splitting. The benefit is that file size is no longer constrained: files can be made somewhat larger, which helps reduce the total file count. The index files take little space, but generating them does add some overhead.
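
    The split arithmetic in this summary can be sketched in a few lines of shell (128 MB is assumed here as the HDFS block size; check your cluster's dfs.blocksize):

```shell
#!/usr/bin/env bash
# Estimate the number of map splits for an *indexed* .lzo file, assuming
# splits align with the HDFS block size (128 MB assumed here).
block_mb=128
for size_mb in 100 213 1400; do
  splits=$(( (size_mb + block_mb - 1) / block_mb ))   # ceiling division
  echo "${size_mb}MB -> ${splits} split(s)"
done
```

    Without an index the answer is always 1 split, regardless of file size.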
    
  • hadoop lzo configuration

    2019-04-15 16:39:44
    Configuring LZO on hadoop 2.6.0-cdh5.7.0

    Installing, building, and deploying the LZO stack

    Build and install LZO

    [root@hadoop001 tar]# wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
    [root@hadoop001 tar]# tar -zxvf lzo-2.10.tar.gz -C /home/hadoop/app/
    [root@hadoop001 app]# chown -R hadoop:hadoop lzo-2.10
    [root@hadoop001 app]# cd lzo-2.10/
    [root@hadoop001 lzo-2.10]# ./configure
    [root@hadoop001 lzo-2.10]# make install
    
    make[1]: Entering directory `/home/hadoop/app/lzo-2.10'
     /usr/bin/mkdir -p '/usr/local/lib'
     /bin/sh ./libtool   --mode=install /usr/bin/install -c   src/liblzo2.la '/usr/local/lib'
    libtool: install: /usr/bin/install -c src/.libs/liblzo2.lai /usr/local/lib/liblzo2.la
    libtool: install: /usr/bin/install -c src/.libs/liblzo2.a /usr/local/lib/liblzo2.a
    libtool: install: chmod 644 /usr/local/lib/liblzo2.a
    libtool: install: ranlib /usr/local/lib/liblzo2.a
    libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /usr/local/lib
    ----------------------------------------------------------------------
    Libraries have been installed in:
       /usr/local/lib
    
    If you ever happen to want to link against installed libraries
    in a given directory, LIBDIR, you must either use libtool, and
    specify the full pathname of the library, or use the '-LLIBDIR'
    flag during linking and do at least one of the following:
       - add LIBDIR to the 'LD_LIBRARY_PATH' environment variable
         during execution
       - add LIBDIR to the 'LD_RUN_PATH' environment variable
         during linking
       - use the '-Wl,-rpath -Wl,LIBDIR' linker flag
       - have your system administrator add LIBDIR to '/etc/ld.so.conf'
    
    See any operating system documentation about shared libraries for
    more information, such as the ld(1) and ld.so(8) manual pages.
    ----------------------------------------------------------------------
     /usr/bin/mkdir -p '/usr/local/share/doc/lzo'
     /usr/bin/install -c -m 644 AUTHORS COPYING NEWS THANKS doc/LZO.FAQ doc/LZO.TXT doc/LZOAPI.TXT '/usr/local/share/doc/lzo'
     /usr/bin/mkdir -p '/usr/local/lib/pkgconfig'
     /usr/bin/install -c -m 644 lzo2.pc '/usr/local/lib/pkgconfig'
     /usr/bin/mkdir -p '/usr/local/include/lzo'
     /usr/bin/install -c -m 644 include/lzo/lzo1.h include/lzo/lzo1a.h include/lzo/lzo1b.h include/lzo/lzo1c.h include/lzo/lzo1f.h include/lzo/lzo1x.h include/lzo/lzo1y.h include/lzo/lzo1z.h include/lzo/lzo2a.h include/lzo/lzo_asm.h include/lzo/lzoconf.h include/lzo/lzodefs.h include/lzo/lzoutil.h '/usr/local/include/lzo'
    make[1]: Leaving directory `/home/hadoop/app/lzo-2.10'
    

    Build and install lzop

    [root@hadoop001 tar]# wget http://www.lzop.org/download/lzop-1.04.tar.gz
    
    [root@hadoop001 tar]# tar -zxvf lzop-1.04.tar.gz -C /home/hadoop/app/
    [root@hadoop001 app]# cd lzop-1.04/
    [root@hadoop001 lzop-1.04]# ./configure
     ...
    Type `make' to build lzop. Type `make install' to install lzop.
    After installing lzop, please read the accompanied documentation.
    
    [root@hadoop001 lzop-1.04]# make  && make install
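
    An optional sanity check after installing: round-trip a small file through lzop and compare it with the original (a sketch; it skips cleanly when lzop is not on the PATH):

```shell
#!/usr/bin/env bash
# Round-trip a small file through lzop and verify the bytes survive.
if command -v lzop >/dev/null 2>&1; then
  tmp=$(mktemp -d)
  printf 'hello lzo\n' > "$tmp/sample.txt"
  lzop -f "$tmp/sample.txt"                        # writes sample.txt.lzo
  lzop -dcf "$tmp/sample.txt.lzo" > "$tmp/roundtrip.txt"
  if cmp -s "$tmp/sample.txt" "$tmp/roundtrip.txt"; then
    result="round-trip OK"
  else
    result="round-trip FAILED"
  fi
  rm -rf "$tmp"
else
  result="lzop not installed; skipping check"
fi
echo "$result"
```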
    

    Build hadoop-lzo

    [root@hadoop001 tar]# wget https://github.com/twitter/hadoop-lzo/archive/master.zip
    [root@hadoop001 tar]# unzip -d /home/hadoop/app/ master.zip 
    Set the Hadoop version in pom.xml:
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.current.version>2.6.0-cdh5.7.0</hadoop.current.version>    
        <hadoop.old.version>1.0.4</hadoop.old.version>
      </properties>
      Add the Cloudera repository:
          <repository>
          <id>cloudera</id>
          <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
        Set the environment variables described in the project README, then build:
    [hadoop@hadoop001 hadoop-lzo-master]$ export C_INCLUDE_PATH=/usr/local/include
    [hadoop@hadoop001 hadoop-lzo-master]$ export LIBRARY_PATH=/usr/local/lib
    [hadoop@hadoop001 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 02:41 min
    [INFO] Finished at: 2019-04-14T12:03:05+00:00
    [INFO] Final Memory: 37M/1252M
    [INFO] ------------------------------------------------------------------------
    Copy the native libraries into place:
    [hadoop@hadoop001 hadoop-lzo-master]$ cd target/native/Linux-amd64-64/
    [hadoop@hadoop001 Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~
    [hadoop@hadoop001 ~]$ cp ~/libgplcompression* $HADOOP_HOME/lib/native/
    Add the built jar to the Hadoop distribution:
    [hadoop@hadoop001 target]$ cp hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
    

    Configure core-site.xml

        <property>
            <name>io.compression.codecs</name>
            <value>org.apache.hadoop.io.compress.GzipCodec,
                    org.apache.hadoop.io.compress.DefaultCodec,
                    org.apache.hadoop.io.compress.BZip2Codec,
                    org.apache.hadoop.io.compress.SnappyCodec,
                    com.hadoop.compression.lzo.LzoCodec,
                    com.hadoop.compression.lzo.LzopCodec
            </value>
         </property>
            <!-- implementation class for the LZO codec -->
         <property>
             <name>io.compression.codec.lzo.class</name>
             <value>com.hadoop.compression.lzo.LzopCodec</value>
         </property>
    

    Configure mapred-site.xml

    <!-- codec for compressing intermediate map output -->
    <property>
       <name>mapred.map.output.compression.codec</name>
       <value>com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <!-- enable compression of job output -->
    <property>
        <name>mapreduce.output.fileoutputformat.compress</name>
        <value>true</value>
    </property> 
    <!-- codec for compressing job output -->
    <property>
       <name>mapreduce.output.fileoutputformat.compress.codec</name>
       <value>com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <!-- native library path for child task JVMs -->
    <property>
        <name>mapred.child.env</name>
        <value>LD_LIBRARY_PATH=/usr/local/lib</value>
    </property>
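
    Note that mapred.map.output.compression.codec is the deprecated Hadoop 1.x spelling. On Hadoop 2.x the equivalent settings are spelled as below (a sketch; verify the names against your distribution's mapred-default.xml):

```xml
<!-- Hadoop 2.x property names (the deprecated mapred.* aliases above still work) -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

    For intermediate map output, LzoCodec is commonly preferred over LzopCodec, since shuffle data never needs to be split or read by external tools.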
    

    Next article

    LZO usage test

  • hadoop lzo

    2013-10-28 18:34:04
    1. Install LZO: sudo apt-get install liblzo2-dev, or download lzo2 from http://www.oberhumer.com/opensource/lzo/download/. wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz ./configure ...

    1. Install LZO

    sudo apt-get install liblzo2-dev
    or download lzo2 from http://www.oberhumer.com/opensource/lzo/download/:
    wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
    ./configure --enable-shared
    make
    make install
    

    2. Install hadoop-lzo

    wget https://github.com/kevinweil/hadoop-lzo/archive/master.zip
    or: git clone https://github.com/kevinweil/hadoop-lzo.git
    
    On a 64-bit machine:
    export CFLAGS=-m64
    export CXXFLAGS=-m64
    
    On a 32-bit machine:
    
    export CFLAGS=-m32
    export CXXFLAGS=-m32
    
    Build and package: ant compile-native tar
    
    An error came up during the build:
    compile-native:
    [mkdir] Created dir: /home/caodaoxi/soft/hadoop-lzo/build/native/Linux-i386-32/lib
    [mkdir] Created dir: /home/caodaoxi/soft/hadoop-lzo/build/native/Linux-i386-32/src/com/hadoop/compression/lzo
    [javah] Error: cannot find class org.apache.hadoop.conf.Configuration.
    
    BUILD FAILED
    /home/caodaoxi/soft/hadoop-lzo/build.xml:269: compilation failed
    
    
    Fix:
     add <classpath refid="classpath"/> inside the javah task in build.xml:
    
     <javah classpath="${build.classes}" destdir="${build.native}/src/com/hadoop/compression/lzo" force="yes" verbose="yes">
    
       <class name="com.hadoop.compression.lzo.LzoCompressor" />
    
       <class name="com.hadoop.compression.lzo.LzoDecompressor" />
       <classpath refid="classpath"/>
    
     </javah>
    

    3. Copy the native folder from the hadoop-lzo build into Hadoop's lib directory

     cp -r /home/hadoop/soft/hadoop-lzo/build/native /home/hadoop/soft/hadoop/lib/
    

    4. Copy the hadoop-lzo jar into Hadoop's lib directory

    cp /home/hadoop/soft/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar /home/hadoop/soft/hadoop/share/hadoop/lib
    

    5. Edit the Hadoop configuration files

    Modify $HADOOP_HOME/conf/core-site.xml and add the block below. (Later testing showed this block can be left out; in fact, adding it caused frameworks such as sqoop to fail to load LzoCodec.class.)

    <property>
     <name>hadoop.tmp.dir</name>
     <value>/home/hadoop/soft/hadoop/tmp</value>
    </property>
    <property>
     <name>fs.trash.interval</name>
     <value>1440</value>
     <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
    </property>
    <property>
     <name>io.compression.codecs</name>
     <value>
    org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
     </value>
    </property>
    <property>
     <name>io.compression.codec.lzo.class</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    Modify $HADOOP_HOME/conf/mapred-site.xml:
    
    <property>
     <name>mapreduce.map.output.compress</name>
     <value>true</value>
    </property>
    
    <property>
     <name>mapred.child.java.opts</name>
     <value>-Djava.library.path=/home/hadoop/soft/hadoop/lib/native/Linux-i386-32/</value>
    </property>
    
    <property>
     <name>mapred.map.output.compression.codec</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    

    6. Restart the Hadoop cluster

     cd /home/hadoop/soft/hadoop/bin
    
     ./stop-all.sh
    
     ./start-all.sh
    

    7. Testing the cluster

    a. Tests in the test environment

    1. Install lzop

    wget http://www.lzop.org/download/lzop-1.03.tar.gz

    ./configure && make && sudo make install

    2. Compress the log file with lzop
    Download the original log: hadoop fs -copyToLocal /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log

    Original log file:

    -rw-r--r--  1 hadoop hadoop 497060688 Jul  1 10:36 pv.log

    Compress with lzop: lzop pv.log
    The compressed log file:

    -rw-r--r--  1 hadoop hadoop 497060688 Jul  1 10:36 pv.log
    -rw-r--r--  1 hadoop hadoop 163517168 Jul  1 10:36 pv.log.lzo

    Compression ratio: 163517168/497060688 ≈ 33%
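
    That ratio can be double-checked with a one-liner (awk used here for the floating-point division):

```shell
# Verify the compression ratio quoted above (bytes after / bytes before).
awk 'BEGIN { printf "%.1f%%\n", 163517168 / 497060688 * 100 }'
```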

    hadoop fs -put pv.log.lzo  /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/

    Verify the installation: hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04

    This failed:

               13/07/01 15:01:35 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
               java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
               at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
               at java.lang.Runtime.loadLibrary0(Runtime.java:845)
               at java.lang.System.loadLibrary(System.java:1084)
               at com.hadoop.compression.lzo.GPLNativeCodeLoader.<clinit>(GPLNativeCodeLoader.java:32)
               at com.hadoop.compression.lzo.LzoCodec.<clinit>(LzoCodec.java:71)
               at com.hadoop.compression.lzo.LzoIndexer.<init>(LzoIndexer.java:36)
               at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:134)
               at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
               at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
               at java.lang.reflect.Method.invoke(Method.java:601)
               at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
               13/07/01 15:01:35 ERROR lzo.LzoCodec: Cannot load native-lzo without native-hadoop
               13/07/01 15:01:36 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-14/pv.log.lzo, size 0.05 GB...
               Exception in thread "main" java.lang.RuntimeException: native-lzo library not available
               at com.hadoop.compression.lzo.LzopCodec.createDecompressor(LzopCodec.java:104)
               at com.hadoop.compression.lzo.LzoIndex.createIndex(LzoIndex.java:229)
               at com.hadoop.compression.lzo.LzoIndexer.indexSingleFile(LzoIndexer.java:117)
               at com.hadoop.compression.lzo.LzoIndexer.indexInternal(LzoIndexer.java:98)
               at com.hadoop.compression.lzo.LzoIndexer.index(LzoIndexer.java:52)
               at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:137)
               at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
               at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
               at java.lang.reflect.Method.invoke(Method.java:601)
               at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

           This means the native LZO library did not install correctly.

           Fixes:

           1. After much troubleshooting it turned out the JDK was 32-bit while the machine was 64-bit, so Hadoop's native build came out 32-bit; switching to a 64-bit JDK fixed this.

           2. Reading the hadoop-lzo and Hadoop source turned up a few relevant code fragments.

              com.hadoop.compression.lzo.GPLNativeCodeLoader:

              try {
                     //try to load the lib
                    System.loadLibrary("gplcompression");
                    nativeLibraryLoaded = true;
                    LOG.info("Loaded native gpl library");
             } catch (Throwable t) {
                    LOG.error("Could not load native gpl library", t);
                    nativeLibraryLoaded = false;
             }

              /home/hadoop/soft/hadoop/bin/hadoop:

              HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"

              Printing JAVA_LIBRARY_PATH just before this line showed that it did not include the LZO shared libraries. For LZO to work, their directory must be added to JAVA_LIBRARY_PATH,

              so at line 365 add: JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
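
              The one-line fix above appends unconditionally; a slightly more defensive sketch (a hypothetical helper, not part of the patched bin/hadoop, with an assumed example native-library path) avoids duplicate entries if the script is sourced more than once:

```shell
#!/usr/bin/env bash
# append_path VALUE DIR -> prints VALUE with DIR appended exactly once.
append_path() {
  if [ -z "$1" ]; then
    printf '%s\n' "$2"
  else
    case ":$1:" in
      *":$2:"*) printf '%s\n' "$1" ;;      # already present: unchanged
      *)        printf '%s\n' "$1:$2" ;;   # append
    esac
  fi
}

JAVA_LIBRARY_PATH=""                       # start empty for the demo
lzo_native="/usr/local/hadoop/lib/native/Linux-amd64-64"   # assumed path
JAVA_LIBRARY_PATH=$(append_path "$JAVA_LIBRARY_PATH" "$lzo_native")
JAVA_LIBRARY_PATH=$(append_path "$JAVA_LIBRARY_PATH" "$lzo_native")  # no-op
echo "$JAVA_LIBRARY_PATH"
```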

              Restart the Hadoop cluster

              Run again: hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04

              The output:

              13/07/01 17:40:53 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
              13/07/01 17:40:53 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
              13/07/01 17:40:54 INFO lzo.DistributedLzoIndexer: Adding LZO file hdfs://kooxoo1-154.kuxun.cn:9000/user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo to indexing list (no index currently exists)
              13/07/01 17:40:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
              13/07/01 17:40:54 INFO input.FileInputFormat: Total input paths to process : 1
              13/07/01 17:40:54 INFO mapred.JobClient: Running job: job_201307011738_0001
              13/07/01 17:40:55 INFO mapred.JobClient:  map 0% reduce 0%
              13/07/01 17:41:11 INFO mapred.JobClient:  map 100% reduce 0%
              13/07/01 17:41:16 INFO mapred.JobClient: Job complete: job_201307011738_0001
              13/07/01 17:41:16 INFO mapred.JobClient: Counters: 19
              13/07/01 17:41:16 INFO mapred.JobClient:   Job Counters
              13/07/01 17:41:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=15320
              13/07/01 17:41:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
              13/07/01 17:41:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
              13/07/01 17:41:16 INFO mapred.JobClient:     Launched map tasks=1
              13/07/01 17:41:16 INFO mapred.JobClient:     Data-local map tasks=1
              13/07/01 17:41:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
              13/07/01 17:41:16 INFO mapred.JobClient:   File Output Format Counters
              13/07/01 17:41:16 INFO mapred.JobClient:     Bytes Written=0
              13/07/01 17:41:16 INFO mapred.JobClient:   FileSystemCounters
              13/07/01 17:41:16 INFO mapred.JobClient:     HDFS_BYTES_READ=15388
              13/07/01 17:41:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=21849
              13/07/01 17:41:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=15176
              13/07/01 17:41:16 INFO mapred.JobClient:   File Input Format Counters
              13/07/01 17:41:16 INFO mapred.JobClient:     Bytes Read=15220
              13/07/01 17:41:16 INFO mapred.JobClient:   Map-Reduce Framework
              13/07/01 17:41:16 INFO mapred.JobClient:     Map input records=1897
              13/07/01 17:41:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=100438016
              13/07/01 17:41:16 INFO mapred.JobClient:     Spilled Records=0
              13/07/01 17:41:16 INFO mapred.JobClient:     CPU time spent (ms)=3770
              13/07/01 17:41:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=189202432
              13/07/01 17:41:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3543986176
              13/07/01 17:41:16 INFO mapred.JobClient:     Map output records=1897
              13/07/01 17:41:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=164

              This confirms hadoop-lzo installed successfully. The created index file:

              -rw-r--r--   3 hadoop caodx  163517168 2013-07-01 10:55 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo
              -rw-r--r--   3 hadoop caodx      15176 2013-07-01 17:41 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pv.log.lzo.index

        3. MapReduce test:

            Core WordCount code fragment:

            TextOutputFormat.setCompressOutput(job, true);
            TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

            Run WordCount:

            hadoop fs -put soft/hadoop/README.txt /user/hadoop

            hadoop jar lzotest.jar org.apache.hadoop.examples.WordCount /user/hadoop/README.txt /user/hadoop/lzo1

            13/07/01 18:12:40 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
            13/07/01 18:12:40 INFO input.FileInputFormat: Total input paths to process : 1
            13/07/01 18:12:40 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
            13/07/01 18:12:40 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
            13/07/01 18:12:40 INFO mapred.JobClient: Running job: job_201307011738_0004
            13/07/01 18:12:41 INFO mapred.JobClient:  map 0% reduce 0%
            13/07/01 18:12:55 INFO mapred.JobClient:  map 100% reduce 0%
            13/07/01 18:13:07 INFO mapred.JobClient:  map 100% reduce 100%
            13/07/01 18:13:12 INFO mapred.JobClient: Job complete: job_201307011738_0004

            Check the output file:

            hadoop fs -ls /user/hadoop/lzo1

            -rw-r--r--   3 hadoop supergroup       1037 2013-07-01 18:13 /user/hadoop/lzo1/part-r-00000.lzo

            The output has indeed been compressed.

     

        4. Hive test

           Enable compression of intermediate map output:

           hive (labrador)> set mapred.compress.map.output=true;

           hive (labrador)> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

           hive (labrador)>select count(*) from pvlog where ptdate='2013-06-04';

        5. Performance test (against a 9.1G log)

          The example job computes pv, uv, and ip statistics over the pv log.

          hadoop@kooxoo1-155:~$ ll -h

          -rw-r--r-- 1 hadoop hadoop 9.1G Jul  2 14:55 pvlog2013-06-04.txt

          a. Without compression:

          hive (labrador)> select count(*) pv, count(distinct visitsid) uv, count(distinct ip) ip from pvlog where ptdate='2013-06-04';

        View the page http://hadoop154.ikuxun.cn/jobconf.jsp?jobid=job_201307021641_0001:

       

       
          Mapper and reducer counts shown at run time:

          Hadoop job information for Stage-1: number of mappers: 37; number of reducers:

          Result:

              pv          uv        ip
          14569944    946643    685518
          Time taken: 204.92 seconds

          b. With intermediate map output compression:

          Recreate the table (the input and output formats must be specified when the table is created, not just before running the HQL; otherwise an error occurs):

         
          hive (labrador)> drop table pvlog; (dropping the external table removes only the metadata, not the data)

          hive (labrador)> CREATE EXTERNAL TABLE pvlog(ip string, current_date string, current_time string, entry_time string,

          hive (labrador)>visitor_id string, url string, first_refer string, last_refer string, fromid string, ifid string, external_source string,  internal_source       string, pagetype string,

          hive (labrador)>global_landing string, channel_landing string, visits_count string,     pv_count string, kuxun_id string, utm_source string, utm_medium string, utm_term string,

          hive (labrador)>utm_id string, utm_campaign string, pool string, reserve_a string, reserve_b string, reserve_c string, reserve_d string, city string, pvid string,

          hive (labrador)>lastpvid string, visitsid string, maxpvcount string, channelpv string, channelleads string)

          hive (labrador)>PARTITIONED BY (ptdate string)
          hive (labrador)>ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          hive (labrador)>LINES TERMINATED BY '\n'
          hive (labrador)>STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          hive (labrador)>OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
          hive (labrador)>LOCATION '/user/hive/warehouse/labrador.db/pvlog/';

          hive (labrador)>ALTER TABLE pvlog ADD PARTITION (ptdate='2013-06-04');

          hive (labrador)> set mapred.compress.map.output=true;

          hive (labrador)> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

          hive (labrador)> set hive.exec.compress.intermediate=true;

          hive (labrador)> set io.compression.codecs=com.hadoop.compression.lzo.LzopCodec

          hive (labrador)> select count(*) pv, count(distinct visitsid) uv, count(distinct ip) ip from pvlog where ptdate='2013-06-04';

          Mapper and reducer counts shown at run time:

          Hadoop job information for Stage-1: number of mappers: 37; number of reducers:

          View the page http://hadoop154.ikuxun.cn/jobconf.jsp?jobid=job_201307021641_0001:

          

          

           Result:
              pv          uv        ip
          14569944    946643    685518
          Time taken: 184.92 seconds

          Execution was just over 20 seconds faster. The improvement is modest, likely because of the test environment: 4 nodes, all virtual machines, with 6 map slots and 6 reduce slots in total.

          c. Testing automatic index creation:

            Drop the pvlog table and recreate it

            Load the data: hive (labrador)> LOAD DATA local INPATH '/home/hadoop/pvlog2013-06-04.txt' INTO TABLE pvlog PARTITION(ptdate='2013-06-04');

            Check the loaded data

            hadoop@kooxoo1-155:~$ hadoop fs -ls /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04

            -rw-r--r--   3 hadoop supergroup 9674697618 2013-07-04 10:12 /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pvlog2013-06-04.txt

            This confirms that loading data does not automatically compress it or create an index; to get compressed, indexed data, you must compress and index it manually.

            A script for manual compression and indexing:

           #!/bin/bash
           hadoop fs -copyToLocal /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/ /home/caodx/workspace/hadoopscript/lzo-test/
           cd /home/caodx/workspace/hadoopscript/lzo-test/ptdate=2013-06-04

           # create the .lzo compressed file
           /usr/local/bin/lzop pvlog2013-06-04.txt
           hadoop fs -moveFromLocal /home/caodx/workspace/hadoopscript/lzo-test/pvlog2013-06-04.txt.lzo /home/caodx/lzo-test
           hadoop fs -rmr /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/pvlog2013-06-04.txt
           cd /home/caodx/workspace/hadoopscript/lzo-test/

           # create the index file for the compressed file
           hadoop jar hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/labrador.db/pvlog/ptdate=2013-06-04/
           rm -rf ptdate=2013-06-04

  • hadoop lzo parallel maps

    2013-11-30 18:31:18
    After lzo is enabled on a Hadoop cluster, some extra configuration is needed before a single lzo file can be map-processed in parallel, speeding up jobs. First, create an index for the lzo files. The command below indexes the lzo files in a directory: $HADOOP_HOME...

    Once lzo is enabled on a Hadoop cluster, some additional configuration is needed before the cluster can run parallel map tasks over a single lzo file, speeding up job execution.

    First, create an index for the lzo files. The following command indexes the lzo files in a directory:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.10.jar com.hadoop.compression.lzo.LzoIndexer /log/source/cd/

    Creating the index takes some time: for one 7.5GB file of mine it took about 2 minutes 30 seconds. There is a second indexer, com.hadoop.compression.lzo.DistributedLzoIndexer; both options are described at https://github.com/kevinweil/hadoop-lzo. That page's explanation of the two is not entirely clear to me, but using the latter reduces the indexing time and has no effect on the MapReduce jobs that run afterwards. For example:

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.10.jar com.hadoop.compression.lzo.DistributedLzoIndexer /log/source/cd/    

    Then, when creating a table in Hive, specify the INPUTFORMAT and OUTPUTFORMAT; otherwise the cluster still cannot run parallel maps over the lzo files. Include the following in the table definition:

    SET FILEFORMAT      
    INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"   
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";  

    After these two steps, the improvement in Hive execution speed is clear. In our test with a 7.5GB lzo file and a moderately complex Hive query, the configured setup took only 34 seconds, versus 180 seconds before.

    README.md 
    Hadoop-LZO
    Hadoop-LZO is a project to bring splittable LZO compression to Hadoop. LZO is an ideal compression format for Hadoop due to its combination of speed and compression size. However, LZO files are not natively splittable, meaning the parallelism that is the core of Hadoop is gone. This project re-enables that parallelism with LZO compressed files, and also comes with standard utilities (input/output streams, etc) for working with LZO files.

    Origins
    This project builds off the great work done at http://code.google.com/p/hadoop-gpl-compression. As of issue 41, the differences in this codebase are the following.

    it fixes a few bugs in hadoop-gpl-compression -- notably, it allows the decompressor to read small or uncompressable lzo files, and also fixes the compressor to follow the lzo standard when compressing small or uncompressible chunks. it also fixes a number of inconsistently caught and thrown exception cases that can occur when the lzo writer gets killed mid-stream, plus some other smaller issues (see commit log).
    it adds the ability to work with Hadoop streaming via the com.hadoop.mapred.DeprecatedLzoTextInputFormat class
    it adds an easier way to index lzo files (com.hadoop.compression.lzo.LzoIndexer)
    it adds an even easier way to index lzo files, in a distributed manner (com.hadoop.compression.lzo.DistributedLzoIndexer)
    Hadoop and LZO, Together at Last
    LZO is a wonderful compression scheme to use with Hadoop because it's incredibly fast, and (with a bit of work) it's splittable. Gzip is decently fast, but cannot take advantage of Hadoop's natural map splits because it's impossible to start decompressing a gzip stream starting at a random offset in the file. LZO's block format makes it possible to start decompressing at certain specific offsets of the file -- those that start new LZO block boundaries. In addition to providing LZO decompression support, these classes provide an in-process indexer (com.hadoop.compression.lzo.LzoIndexer) and a map-reduce style indexer which will read a set of LZO files and output the offsets of LZO block boundaries that occur near the natural Hadoop block boundaries. This enables a large LZO file to be split into multiple mappers and processed in parallel. Because it is compressed, less data is read off disk, minimizing the number of IOPS required. And LZO decompression is so fast that the CPU stays ahead of the disk read, so there is no performance impact from having to decompress data as it's read off disk.

    Building and Configuring
    To get started, see http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ. This project is built exactly the same way; please follow the answer to "How do I configure Hadoop to use these classes?" on that page.

    You can read more about Hadoop, LZO, and how we're using it at Twitter at http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/.

    Once the libs are built and installed, you may want to add them to the class paths and library paths. That is, in hadoop-env.sh, set

        export HADOOP_CLASSPATH=/path/to/your/hadoop-lzo-lib.jar
        export JAVA_LIBRARY_PATH=/path/to/hadoop-lzo-native-libs:/path/to/standard-hadoop-native-libs
    Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out the line

        JAVA_LIBRARY_PATH=''
    because it prevents Hadoop from picking up the alteration you made to JAVA_LIBRARY_PATH above. (Update: see https://issues.apache.org/jira/browse/HADOOP-6453.) Make sure you restart your jobtrackers and tasktrackers after uploading and changing configs so that the changes take effect.

    Using Hadoop and LZO
    Reading and Writing LZO Data
    The project provides LzoInputStream and LzoOutputStream wrapping regular streams, to allow you to easily read and write compressed LZO data.

    Indexing LZO Files
    At this point, you should also be able to use the indexer to index lzo files in Hadoop (recall: this makes them splittable, so that they can be analyzed in parallel in a mapreduce job). Imagine that big_file.lzo is a 1 GB LZO file. You have two options:

    index it in-process via:

    hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer big_file.lzo
    index it in a map-reduce job via:

    hadoop jar /path/to/your/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer big_file.lzo
    Either way, after 10-20 seconds there will be a file named big_file.lzo.index. The newly-created index file tells the LzoTextInputFormat's getSplits function how to break the LZO file into splits that can be decompressed and processed in parallel. Alternatively, if you specify a directory instead of a filename, both indexers will recursively walk the directory structure looking for .lzo files, indexing any that do not already have corresponding .lzo.index files.
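    The split computation that the index enables can be sketched roughly as follows. This is a simplified stand-in for what LzoTextInputFormat's getSplits does, not the actual implementation: splits may only begin at LZO block boundaries, so each split is extended to the first boundary at or past the desired split size.

    ```python
    # Simplified sketch of the split logic an .lzo.index file enables (an
    # illustration, not the real LzoTextInputFormat.getSplits code).
    def compute_splits(block_offsets, file_size, split_size):
        """block_offsets: compressed-file offsets of LZO block starts, as
        recorded in the .lzo.index file. Returns (start, length) pairs, each
        aligned to a block boundary so it can be decompressed independently."""
        splits = []
        start = 0
        for off in block_offsets:
            if off - start >= split_size:      # far enough along: cut a split
                splits.append((start, off - start))
                start = off
        if start < file_size:                  # remainder of the file
            splits.append((start, file_size - start))
        return splits

    # Toy numbers: a 1,000,000-byte compressed file with hypothetical block
    # boundaries every 70,000 bytes, and a target split size of 256,000 bytes.
    offsets = list(range(0, 1_000_000, 70_000))
    splits = compute_splits(offsets, 1_000_000, 256_000)
    # Every split starts on a block boundary, so mappers can run in parallel,
    # and the split lengths cover the whole file exactly.
    assert all(s == 0 or s in offsets for s, _ in splits)
    assert sum(length for _, length in splits) == 1_000_000
    ```

    Without the index, getSplits has no way of knowing where block boundaries fall, which is why an unindexed .lzo file ends up as a single split.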

    Running MR Jobs over Indexed Files
    Now run any job, say wordcount, over the new file. In Java-based M/R jobs, just replace any uses of TextInputFormat by LzoTextInputFormat. In streaming jobs, add "-inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat" (streaming still uses the old APIs, and needs a class that inherits from org.apache.hadoop.mapred.InputFormat). For Pig jobs, email me or check the pig list -- I have custom LZO loader classes that work but are not (yet) contributed back.

    Note that if you forget to index an .lzo file, the job will work but will process the entire file in a single split, which will be less efficient.
