Big data product DataStage: extracting data from a big data platform - CSDN

    I. Big data has become an important link in the enterprise information supply chain

    A few years ago our understanding of big data was still stuck at the level of concepts and theory, but in the blink of an eye big data projects have sprung up everywhere, and big data has become one of today's hottest topics. With the emergence and rapid growth of Hadoop and its related open-source technologies, more and more enterprises are discovering that their long-dormant historical data, together with the transaction data, log data and customer-behaviour data generated every day at an exponential rate, actually hold enormous value, like unmined gold deposits: whoever moves first can dig out and realize huge commercial value. Internet companies understand this well; they use the results of big data analytics for product promotion and targeted marketing, greatly improving consumers' shopping experience and habits while earning both reputation and handsome profits. Meanwhile, traditional enterprises are transforming as well and bringing Hadoop big data platforms into their existing IT architectures and solutions. So how do you integrate, manage and analyze traditional data and big data efficiently? How do you guarantee the accuracy, consistency and reliability of the data? With these questions in mind, let's look at the DataStage big data integration solution offered by IBM, and everything will become clear.

    II. The challenges of big data integration

    1. New types of data stores

    • Big data introduces new types of data stores, such as Hadoop and NoSQL, and these new stores all need to be integrated.
    • There is no good traditional approach for integrating these new data stores effectively.

    2. New data types and formats

    • Unstructured data; semi-structured data; JSON, Avro ...
    • Video, documents, web logs ...
    • How to handle complex and diverse data effectively

    3. Larger data volumes

    • Data movement, transformation, cleansing and so on must be performed against much larger data volumes.
    • Better scalability is required.

    III. Information integration is the key to the success of a Hadoop project

    Most Hadoop solutions include the following phases:

    • Data collection
    • Data movement
    • Data transformation
    • Data cleansing
    • Data integration
    • Data exploration
    • Data analysis

    Because they face massive volumes of mutually isolated, heterogeneous data sources and data types, most enterprise Hadoop projects spend 80% of their effort on data integration and only 20% on data analysis. That shows how important data integration is to the success or failure of a Hadoop project.

    IV. IBM's big data integration solution: InfoSphere DataStage

    1. Centralized, batch processing: integrating, connecting, cleansing and transforming big data

    • Uses Hadoop big data as both source and target, integrated with existing enterprise information;
    • Uses the same development interface and logical architecture as existing integration jobs;
    • Pushes processing logic down into MapReduce, using the Hadoop platform to minimize network overhead;
    • Feeds real-time analytic flows through InfoSphere Streams stream processing;
    • Validates and cleanses the data quality of big data sources;
    • Tracks lineage and impact across big data and/or traditional data flows;

    2. Rich interfaces for big data and traditional data, supporting all enterprise data sources and targets

    • High-performance native APIs for DBMSs (DB2, Netezza, Oracle, Teradata, SQL Server, GreenPlum, ...);
    • Dedicated ERP connectors;
    • Flexible support through JDBC and ODBC connectors (MySQL);
    • Support for simple and complex file formats (Flat, Cobol, XML, native Excel);
    • Support for extended data sources: Web Services, Cloud, Java
    • Connection to the Hadoop file system (HDFS) with scalable parallel reads and writes
    • Direct connection to InfoSphere Streams for real-time analytic processing
    • Support for NoSQL data sources (Hive, HBase, MongoDB, Cassandra)

    3. The broadest heterogeneous platform support

    4. What the IBM big data integration solution brings to customers

    V. Best practices for connecting DataStage to Hadoop

    In DataStage, you can connect to a Hadoop platform through either the File Connector stage or the Big Data File stage, and load data from traditional RDBMS databases or local files into HDFS. By comparison, the Big Data File stage supports IBM BigInsights and offers better read/write performance, while the File Connector stage accesses HDFS through the WebHDFS or HttpFS interface, does not depend on any particular Hadoop distribution or version, and therefore offers broader compatibility.
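    Because File Connector talks to HDFS through plain REST calls (WebHDFS or HttpFS), no Hadoop client libraries are needed on the DataStage server. As a rough illustration of what such a write boils down to, the Python sketch below performs the standard WebHDFS two-step create: ask the NameNode where to write, then send the bytes to the DataNode it redirects to. The host name, port, target path and the unsecured setup (no Kerberos, no SSL) are assumptions for illustration only; the configuration described later in this article uses HTTPS and Kerberos instead.

    # A minimal sketch of the WebHDFS "two-step create" that a REST-based HDFS write
    # boils down to. Host, port (50070 is the default NameNode HTTP port in Hadoop 2.x),
    # the target path and the plain-HTTP, unauthenticated setup are illustrative
    # assumptions, not the article's actual configuration.
    import requests

    NAMENODE = "http://hadoop-namenode.example.com:50070"   # assumed host and port
    HDFS_PATH = "/user/dsadm/demo/part_0000.txt"             # assumed target file

    def webhdfs_write(data: bytes) -> None:
        # Step 1: the NameNode answers with a 307 redirect whose Location header
        # points at the DataNode that should receive the file content.
        url = f"{NAMENODE}/webhdfs/v1{HDFS_PATH}?op=CREATE&overwrite=true"
        r1 = requests.put(url, allow_redirects=False)
        r1.raise_for_status()
        datanode_url = r1.headers["Location"]

        # Step 2: send the actual bytes to the DataNode; 201 Created means success.
        r2 = requests.put(datanode_url, data=data)
        r2.raise_for_status()

    if __name__ == "__main__":
        webhdfs_write(b"1,Alice\n2,Bob\n")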

    File Connector is a new Hadoop-oriented stage introduced in DataStage v11.3. It provides the following capabilities:

    • Reads from and writes to the Hadoop file system (HDFS)
    • Supports parallel processing and linear scaling
    • Requires no additional Hadoop client software packages
    • Supports Kerberos authentication
    • Supports SSL secure access
    • Supports the Knox gateway
    • Supports access to Hadoop through WebHDFS and HttpFS
    • Supports access to local Hadoop nodes
    • Supports Hadoop more broadly (independent of version changes)

    The following uses Apache Hadoop v2.7 as an example to show how to configure File Connector to write Oracle table data into HDFS:

    1. Install DataStage v11.3.1 (see the following link)

    http://www-01.ibm.com/support/knowledgecenter/SSZJPZ_11.3.0/com.ibm.swg.im.iis.install.nav.doc/containers/cont_iis_information_server_installation.html?lang=en

    2. Configure Kerberos authentication

    Copy the krb5.conf file (the KDC configuration) from the Apache Hadoop server to the /etc directory on the DataStage server.

    3. Check the HDFS configuration files on Apache Hadoop and confirm that WebHDFS support is enabled

    How to configure the WebHDFS REST API for Apache Hadoop v2.7:

    http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
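    Before pointing File Connector at the cluster, a single LISTSTATUS call is a quick way to confirm that WebHDFS is enabled and reachable. The host, port and path below are assumptions; a Kerberos-secured cluster would additionally require SPNEGO authentication.

    # Quick sanity check that WebHDFS answers; host, port and path are placeholders.
    import requests

    resp = requests.get(
        "http://hadoop-namenode.example.com:50070/webhdfs/v1/user?op=LISTSTATUS",
        timeout=10,
    )
    resp.raise_for_status()
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["type"], entry["pathSuffix"])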

    4. Configure SSL access to Hadoop

    • Log in to the DataStage server and use the keytool command to create a truststore that will hold the SSL certificates coming from the Hadoop server. The truststore is named test.jks and is created in the /opt directory.

    keytool -genkey -alias test -keystore test.jks -storepass test

    • Copy the SSL certificate from the Hadoop server (for example cert.pem) to the DataStage server
    • Import the cert.pem certificate on the DataStage server with the keytool command

    keytool -import -trustcacerts -alias test -file cert.pem -keystore test.jks -storepass test -noprompt

    • Use the encrypt.sh command shipped with DataStage to encrypt the password of the truststore created above, producing an encrypted password (for example {iisenc} iWuRnROgFLbk0H1sjfIc7Q==)

    cd /opt/IBM/InformationServer/ASBNode/bin/

    [root@IBM-DataStage bin]# ./encrypt.sh

    Enter the text to encrypt: test

    Enter the text again to confirm: test

    {iisenc} iWuRnROgFLbk0H1sjfIc7Q==

    • Create a text file named properties.txt in the /opt directory with the following content

    password={iisenc}iWuRnROgFLbk0H1sjfIc7Q==

    • Edit the DataStage configuration file (dsenv) and add the following environment variables

    DS_TRUSTSTORE_LOCATION=/opt/test.jks

    DS_TRUSTSTORE_PROPERTIES=/opt/properties.txt

    • Restart DataStage

    5. Locate the File Connector stage in the DataStage Designer client

    6. Configure the File Connector stage properties

    • Access Apache Hadoop through the WebHDFS interface
    • Use Kerberos authentication (specify the keytab file)
    • Use the https protocol and the corresponding port
    • Automatically split the source table data into multiple files and write them to HDFS in parallel (to improve performance, 8 nodes write data simultaneously)

    7. Run the DataStage job; you can see that the data has been written to Hadoop HDFS successfully

    Although this test was run in a virtual machine environment, the performance DataStage delivered was still impressive: reading 464 million records from Oracle and writing them to HDFS took only about 10 minutes, with a peak rate of 619,495 rows per second. With more CPUs to increase the degree of parallelism, performance can grow close to linearly.

    On the target side, 8 part files were generated in HDFS.
    InfoSphere DataStage is a core component of IBM's unified data integration platform, InfoSphere Information Server, and one of the leading ETL (Extract, Transform, Load) products in the industry.

    In an earlier article we described how DataStage not only supports databases on a wide range of heterogeneous platforms, with many powerful database connectors, but also provides access to unstructured and file-based data such as TXT, CSV, XML, COBOL and Excel files.

    DataStage also provides strong support for ordinary database stored procedures. The following uses an Oracle stored procedure as an example to show how to call one from DataStage.

    Preparation

    1. Create stored procedure p2. p2 defines a cursor that queries the data in table A and returns the data through an output parameter.

    2. The structure of table A is shown below

    3. The data in table A is shown below

    4. Import the definition of stored procedure P2 into DataStage

    5. Use the Oracle data source (ODBC DSN) that has already been set up: oraodbc

    6. Select stored procedure P2 and start the import

    Using the stored procedure as a data source

    1. Create a DataStage job that uses the Stored Procedure Stage as the data source and sends the results to a Peek stage (the Peek stage is normally used for development and debugging; it prints the rows it receives directly to the job log for inspection).

    2. Configure the connection information for the stored procedure

    3. Set the stored procedure properties. Choose the procedure type Source (the stage acts as a source). In the Syntax section, manually set the value of the input parameter pa to 1, and use :1 as the placeholder for the output parameter pb, meaning that the procedure's result is passed on to the downstream DataStage stage (Peek).
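    In database terms, the stage is making a call with one IN parameter (pa) and one OUT parameter (pb). The Python sketch below shows the equivalent call made directly from a client. The connection details are placeholders, and the parameter types (a number in, a string out) are assumptions based on the column mappings shown later (PA to auuid, PB to aname); it illustrates the call pattern, not the stage's own implementation.

    # Hypothetical client-side equivalent of the call the Stored Procedure Stage makes.
    import oracledb  # python-oracledb, the successor of cx_Oracle

    conn = oracledb.connect(user="scott", password="tiger",
                            dsn="dbhost.example.com/orclpdb1")   # placeholder DSN
    with conn.cursor() as cur:
        pb = cur.var(str)            # bind variable that receives the OUT parameter
        cur.callproc("p2", [1, pb])  # pa = 1, pb filled in by the procedure
        print("pb =", pb.getvalue())
    conn.close()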

    4. On the Output tab, define the output columns. The aname column is defined manually; the ProCode and ProMess columns are generated automatically by the system.

    5. Run the job. The log shows that stored procedure P2 is called by DataStage and the correct results are returned.

    Using the stored procedure as an intermediate processing step

    1. Create a DataStage job that uses the Stored Procedure Stage as an intermediate step: it reads data from an Oracle table, transforms and processes it, and sends the final results to a Peek stage.

    2. View the data in the source Oracle table

    3. Configure the connection information for the stored procedure

    4. Set the stored procedure properties. Choose the procedure type Transform (the stage acts as an intermediate stage). In the Syntax section, use :1 as the placeholder for the input parameter pa, meaning that it reads the table data coming from the upstream Oracle stage, and use :2 as the placeholder for the output parameter pb, meaning that the procedure's results are passed on to the downstream DataStage stage (Peek).

    5. Review the Input tab

    6. Review the Output tab

    7. Map PA to the auuid column and set the parameter type to Input; map PB to the aname column and set the parameter type to Output.

    8. Run the job. The log shows that stored procedure P2 is called by DataStage and the correct results are returned.



    Partitioning
           The aim of most partitioning operations is to end up with a set of partitions
    that are as near equal size as possible, ensuring an even load across your
    processors.
           When performing some operations however, you will need to take control
    of partitioning to ensure that you get consistent results. A good example
    of this would be where you are using an aggregator stage to summarize
    your data. To get the answers you want (and need) you must ensure that
    related data is grouped together in the same partition before the summary
    operation is performed on that partition. DataStage lets you do this.
            There are a number of different partitioning methods available, note that
    all these descriptions assume you are starting with sequential data. If you
    are repartitioning already partitioned data then there are some specific
    considerations (see “Repartitioning” on page 2-25):

            The goal of partitioning is to spread massive amounts of data across a set of partitions as evenly as possible, so that each partition's data can be handled by a processor and the load stays balanced.
            For some operations you need to set the appropriate partitioning rules yourself, so that related data still hangs together correctly after it has been split across partitions.

    Round robin
            The first record goes to the first processing node, the second to the second
    processing node, and so on. When DataStage reaches the last processing
    node in the system, it starts over. This method is useful for resizing partitions
    of an input data set that are not equal in size. The round robin
    method always creates approximately equal-sized partitions. This method
    is the one normally used when DataStage initially partitions data.

    Round robin partitioning:
            Records are taken one at a time and dealt out to the partitions in turn. For example, suppose 10,000 records are to be split across 3 partitions: with round robin, DataStage puts the first record in partition 1, the second in partition 2 and the third in partition 3, then wraps around, so the fourth record goes to partition 1, the fifth to partition 2, and so on.
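    The dealing-out logic described above amounts to cycling through the partitions in order; the minimal sketch below illustrates the idea (it is not how the DataStage engine implements it).

    # Round robin partitioning: deal records out to partitions in turn,
    # giving near-equal partition sizes. Illustration only.
    from itertools import cycle

    def round_robin(records, num_partitions):
        partitions = [[] for _ in range(num_partitions)]
        for part, rec in zip(cycle(range(num_partitions)), records):
            partitions[part].append(rec)
        return partitions

    print(round_robin(range(10), 3))
    # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]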

    Random
            Records are randomly distributed across all processing nodes. Like round
    robin, random partitioning can rebalance the partitions of an input data set
    to guarantee that each processing node receives an approximately equal-sized
    partition. The random partitioning has a slightly higher overhead than round robin
    because of the extra processing required to calculate a random value for each record.

    Random partitioning:
            Records are distributed randomly across the partitions.

    Same
            The operator using the data set as input performs no repartitioning and
    takes as input the partitions output by the preceding stage. With this partitioning
    method, records stay on the same processing node; that is, they are not redistributed.
    Same is the fastest partitioning method. This is normally the method DataStage uses
    when passing data between stages in your job.

    Same partitioning: no repartitioning is done; the data keeps the partitions produced by the preceding stage, so records stay on the same processing node.

    Entire
      Every instance of a stage on every processing node receives the complete
    data set as input. It is useful when you want the benefits of parallel execution,
    but every instance of the operator needs access to the entire input data set.
    You are most likely to use this partitioning method with stages that create lookup
    tables from their input.

    Hash by field
      Partitioning is based on a function of one or more columns (the hash partitioning
    keys) in each record. The hash partitioner examines one or more
    fields of each input record (the hash key fields). Records with the same
    values for all hash key fields are assigned to the same processing node.
    This method is useful for ensuring that related records are in the same
    partition, which may be a prerequisite for a processing operation. For
    example, for a remove duplicates operation, you can hash partition records
    so that records with the same partitioning key values are on the
    same node. You can then sort the records on each node using the hash key
    fields as sorting key fields, then remove duplicates, again using the same
    keys. Although the data is distributed across partitions, the hash partitioner
    ensures that records with identical keys are in the same partition,
    allowing duplicates to be found.
            Hash partitioning does not necessarily result in an even distribution of
    data between partitions. For example, if you hash partition a data set
    based on a zip code field, where a large percentage of your records are
    from one or two zip codes, you can end up with a few partitions
    containing most of your records. This behavior can lead to bottlenecks
    because some nodes are required to process more records than other
    nodes.
            For example, the diagram shows the possible results of hash partitioning
    a data set using the field age as the partitioning key. Each record with a
    given age is assigned to the same partition, so for example records with age
    36, 40, or 22 are assigned to partition 0. The height of each bar represents
    the number of records in the partition.
            As you can see, the key values are randomly distributed among the
    different partitions. The partition sizes resulting from a hash partitioner
    are dependent on the distribution of records in the data set so even though
    there are three keys per partition, the number of records per partition
    varies widely, because the distribution of ages in the population is nonuniform.
            When hash partitioning, you should select hashing keys that create a large
    number of partitions. For example, hashing by the first two digits of a zip
    code produces a maximum of 100 partitions. This is not a large number for
    a parallel processing system. Instead, you could hash by five digits of the
    zip code to create up to 10,000 partitions. You also could combine a zip
    code hash with an age hash (assuming a maximum age of 190), to yield
    1,900,000 possible partitions.
            Fields that can only assume two values, such as yes/no, true/false,
    male/female, are particularly poor choices as hash keys.
            You must define a single primary collecting key for the sort merge
    collector, and you may define as many secondary keys as are required by
    your job. Note, however, that each record field can be used only once as a
    collecting key. Therefore, the total number of primary and secondary
    collecting keys must be less than or equal to the total number of fields in
    the record. You specify which columns are to act as hash keys on the Partitioning
    tab of the stage editor, see “Partitioning Tab” on page 3-21. An
    example is shown below. The data type of a partitioning key may be any
    data type except raw, subrecord, tagged aggregate, or vector (see
    page 2-32 for data types). By default, the hash partitioner does case-sensitive
    comparison. This means that uppercase strings appear before
    lowercase strings in a partitioned data set. You can override this default if
    you want to perform case-insensitive partitioning on string fields.

    Hash partitioning:
      Records are assigned to partitions based on a hash of one or more key columns, so all records with the same key values end up in the same partition; see the English description above for the details.
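    The essence of hash-by-field partitioning is to build a key from the chosen columns, hash it, and take the result modulo the number of partitions: identical keys always land in the same partition, while partition sizes follow the key distribution. The sketch below is illustrative only; the hash function DataStage uses internally will differ.

    # Hash-by-field partitioning: records with identical key values always end up
    # in the same partition, but partition sizes depend on the key distribution.
    def hash_partition(records, key_fields, num_partitions):
        partitions = [[] for _ in range(num_partitions)]
        for rec in records:
            key = tuple(rec[f] for f in key_fields)
            partitions[hash(key) % num_partitions].append(rec)
        return partitions

    rows = [{"zip": "10001", "age": 36}, {"zip": "10001", "age": 40},
            {"zip": "94105", "age": 22}, {"zip": "94105", "age": 36}]
    for i, part in enumerate(hash_partition(rows, ["zip"], 2)):
        print(i, part)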

    Modulus
      Partitioning is based on a key column modulo the number of partitions.
    This method is similar to hash by field, but involves simpler computation.
    In data mining, data is often arranged in buckets, that is, each record has a
    tag containing its bucket number. You can use the modulus partitioner to
    partition the records according to this number. The modulus partitioner
    assigns each record of an input data set to a partition of its output data set
    as determined by a specified key field in the input data set. This field can
    be the tag field.
      The partition number of each record is calculated as follows:
    partition_number = fieldname mod number_of_partitions
    where:
      • fieldname is a numeric field of the input data set.
      • number_of_partitions is the number of processing nodes on which
    the partitioner executes. If a partitioner is executed on three
    processing nodes it has three partitions.
    In this example, the modulus partitioner partitions a data set containing
    ten records. Four processing nodes run the partitioner, and the modulus
    partitioner divides the data among four partitions. The input data is as
    follows:
    The bucket is specified as the key field, on which the modulus operation
    is calculated.
    Here is the input data set. Each line represents a row:
    64123  1960-03-30
    61821  1960-06-27
    44919  1961-06-18
    22677  1960-09-24
    90746  1961-09-15
    21870  1960-01-01
    87702  1960-12-22
    4705   1961-12-13
    47330  1961-03-21
    88193  1962-03-12
      The following table shows the output data set divided among four partitions
    by the modulus partitioner.
    Partition 0: (no records)
    Partition 1: 61821 1960-06-27, 22677 1960-09-24, 4705 1961-12-13, 88193 1962-03-12
    Partition 2: 21870 1960-01-01, 87702 1960-12-22, 47330 1961-03-21, 90746 1961-09-15
    Partition 3: 64123 1960-03-30, 44919 1961-06-18
      Here are three sample modulus operations corresponding to the values of
    three of the key fields:
      • 22677 mod 4 = 1; the data is written to Partition 1.
      • 47330 mod 4 = 2; the data is written to Partition 2.
      • 64123 mod 4 = 3; the data is written to Partition 3.
      None of the key fields can be divided evenly by 4, so no data is written to
    Partition 0.
      You define the key on the Partitioning tab (see “Partitioning Tab” on
    page 3-21)

    Modulus partitioning: the partition number is simply the value of the key field modulo the number of partitions.
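    The formula partition_number = fieldname mod number_of_partitions can be checked directly against the sample data above; the short sketch below reproduces the four partitions listed, with partition 0 left empty.

    # Modulus partitioning applied to the sample bucket values from the text.
    rows = [(64123, "1960-03-30"), (61821, "1960-06-27"), (44919, "1961-06-18"),
            (22677, "1960-09-24"), (90746, "1961-09-15"), (21870, "1960-01-01"),
            (87702, "1960-12-22"), (4705, "1961-12-13"), (47330, "1961-03-21"),
            (88193, "1962-03-12")]

    partitions = {p: [] for p in range(4)}
    for bucket, date in rows:
        partitions[bucket % 4].append((bucket, date))

    for p, recs in partitions.items():
        print(p, recs)   # partition 0 stays empty, matching the text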

    Range
      Divides a data set into approximately equal-sized partitions, each of which
    contains records with key columns within a specified range. This method
    is also useful for ensuring that related records are in the same partition.
      A range partitioner divides a data set into approximately equal size partitions
    based on one or more partitioning keys. Range partitioning is often
    a preprocessing step to performing a total sort on a data set.
      In order to use a range partitioner, you have to make a range map. You can
    do this using the Write Range Map stage, which is described in Chapter 55.
      The range partitioner guarantees that all records with the same partitioning
    key values are assigned to the same partition and that the
    partitions are approximately equal in size so all nodes perform an equal
    amount of work when processing the data set.
      An example of the results of a range partition is shown below. The partitioning
    is based on the age key, and the age range for each partition is indicated by
    the numbers in each bar. The height of the bar shows the size of the partition.
      All partitions are of approximately the same size. In an ideal distribution,
    every partition would be exactly the same size. However, you typically
    observe small differences in partition size.
      In order to size the partitions, the range partitioner uses a range map to
    calculate partition boundaries. As shown above, the distribution of partitioning
    keys is often not even; that is, some partitions contain many
    partitioning keys, and others contain relatively few. However, based on
    the calculated partition boundaries, the number of records in each partition
    is approximately the same.
      Range partitioning is not the only partitioning method that guarantees
    equivalent-sized partitions. The random and round robin partitioning
    methods also guarantee that the partitions of a data set are equivalent in
    size. However, these partitioning methods are keyless; that is, they do not
    allow you to control how records of a data set are grouped together within
    a partition.
      In order to perform range partitioning your job requires a write range map
    stage to calculate the range partition boundaries in addition to the stage
    that actually uses the range partitioner. The write range map stage uses a
    probabilistic splitting technique to range partition a data set. This technique
    is described in Parallel Sorting on a Shared-Nothing Architecture Using
    Probabilistic Splitting by DeWitt, Naughton, and Schneider in Query
    Processing in Parallel Relational Database Systems by Lu, Ooi, and Tan, IEEE
    Computer Society Press, 1994. In order for the stage to determine the partition
    boundaries, you pass it a sorted sample of the data set to be range
    partitioned. From this sample, the stage can determine the appropriate
    partition boundaries for the entire data set. See Chapter 55, “Write
    Range Map Stage,” for details.
      When you come to actually partition your data, you specify the range map
    to be used by clicking on the property icon, next to the Partition type field,
    the Partitioning/Collection properties dialog box appears and allows you
    to specify a range map (see “Partitioning Tab” on page 3-21 for a description
    of the Partitioning tab).
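      Conceptually, a range map is an ordered list of boundary key values, and each record goes to the partition whose key interval contains it. The sketch below illustrates that lookup with hard-coded boundaries standing in for what the Write Range Map stage would compute from a sorted sample; it is an illustration of the idea, not of the stage itself.

    # Range partitioning with an assumed range map on the key "age".
    import bisect

    range_map = [25, 40, 60]   # assumed boundaries, giving 4 partitions

    def range_partition(records, key, boundaries):
        partitions = [[] for _ in range(len(boundaries) + 1)]
        for rec in records:
            partitions[bisect.bisect_right(boundaries, rec[key])].append(rec)
        return partitions

    rows = [{"age": a} for a in (22, 36, 40, 12, 66, 41, 59)]
    for i, part in enumerate(range_partition(rows, "age", range_map)):
        print(i, [r["age"] for r in part])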

    DB2
      Partitions an input data set in the same way that DB2 would partition it.
    For example, if you use this method to partition an input data set
    containing update information for an existing DB2 table, records are
    assigned to the processing node containing the corresponding DB2 record.
    Then, during the execution of the parallel operator, both the input record
    and the DB2 table record are local to the processing node. Any reads and
    writes of the DB2 table would entail no network activity.
      See the DB2 Parallel Edition for AIX, Administration Guide and Reference for
    more information on DB2 partitioning.
      To use DB2 partitioning on a stage, select a Partition type of DB2 on the
    Partitioning tab, then click the Properties button to the right. In the Partitioning/Collection
    properties dialog box, specify the details of the DB2 table whose partitioning
    you want to replicate (see “Partitioning Tab” on page 3-21 for a description
    of the Partitioning tab).

    Auto
      The most common method you will see on the DataStage stages is Auto.
      This just means that you are leaving it to DataStage to determine the best
    partitioning method to use depending on the type of stage, and what the
    previous stage in the job has done. Typically DataStage would use round
    robin when initially partitioning data, and same for the intermediate
    stages of a job.



    Introduction to and comparison of three commonly used ETL tools: Datastage, Informatica and Kettle

    http://www.sohu.com/a/249098751_100194412

    ETL is a critically important part of a data warehouse; it is the necessary step that links what comes before with what comes after. ETL is responsible for extracting data from distributed, heterogeneous sources, such as relational data and flat data files, into a temporary staging layer, then cleansing, transforming and integrating it, and finally loading it into the data warehouse or data mart, where it becomes the basis for online analytical processing and data mining. This article explains what ETL is and introduces three commonly used ETL tools (Datastage, Informatica and Kettle).

    1. What is ETL?

    ETL is short for Extract-Transform-Load and describes the process of moving data from a source through extraction, transformation and loading into a target. Put simply, data is extracted from the source systems, cleansed and transformed, and then loaded into a predefined data warehouse model. The goal is to bring together an enterprise's scattered, messy data with inconsistent standards so that it can support analysis and decision making. ETL is an important part of any BI project; how well it is designed determines the quality of the resulting data and has a direct bearing on whether the BI project succeeds or fails.

    2. Why use an ETL tool?

    ▶ When the data comes from different physical hosts, processing it with SQL statements alone becomes laborious and much more expensive.

    ▶ The data sources may be different databases or files, which have to be brought into a consistent format before the data can be processed; doing that in hand-written code is cumbersome.

    ▶ Stored procedures can of course be used to process data inside the database, but they struggle with very large data volumes and consume a lot of database resources, which can starve the database and hurt its performance.

    An ETL tool solves the problems above. Its advantages include:

    ● Connectivity to many kinds of heterogeneous data sources (varies by tool).

    ● A convenient graphical development environment.

    ● Fast processing of very large data volumes and clearer job flows.

    3. ETL tool overview

    Informatica and Datastage hold most of the market share in China.

    4. Differences between the ETL tools

    The characteristics and differences of the three ETL tools, Kettle, Datastage and Informatica:

    Usability

    All three are fairly easy to use; what matters most is how familiar the developers are with the tool. Informatica has four development and administration components, three of which have to be open during development, and it has no Ctrl+Z: once you change a job you cannot undo the change and return to the previous state, which makes testing and debugging less convenient than in Kettle or Datastage. Datastage does everything in a single interface; you do not have to switch windows, and you can see where the data comes from and the state of the whole job, which makes finding bugs easier than in Informatica. Kettle sits somewhere in between.

    Deployment

    Kettle only needs a JVM. Informatica requires a server and client installation. Datastage takes the most time to deploy and is somewhat harder.

    Data processing speed

    With large data volumes, Informatica and Datastage are fast and stable; Kettle is somewhat slower by comparison.

    Support

    Informatica and Datastage come with good commercial technical support, while Kettle has none; after-sales support for commercial software is far better than for free open-source software.

    Risk

    Risk is inversely related to cost, and rises with how much you must rely on your own technical capability.

    Extensibility

    Kettle is clearly the most extensible: it is open source, so you can develop extensions to its functionality yourself. Informatica and Datastage are commercial software and offer essentially no such extensibility.

    Job monitoring

    All three provide monitoring and logging tools. For monitoring the data itself, Datastage's real-time monitoring is, in my experience, the best: you can see directly how the data extraction is going and which stage the job is currently running. For tuning, that lets you locate a slow stage quickly and deal with it. Informatica has an equivalent feature, but it is less direct; you have to compare two screens to find the slow component, and sometimes you need other tricks to track it down.

    Online technical documentation

    Datastage < Informatica < Kettle. In other words, when you hit a problem, the chance of finding a solution online is relatively low for Datastage and Informatica, and much higher for Kettle.

    5. Lessons from projects

    Synchronizing many tables, a repetitive job: in many projects we need to synchronize production tables into the data warehouse. Synchronizing more than a hundred tables is repetitive work that tests a developer's care and patience. In that situation the tool developers like best is undoubtedly Kettle, because one program can drive the synchronization of many tables, so you do not have to build a separate program for every table. Informatica does provide tooling for designing them in bulk, but you still end up generating multiple programs and configuring them one by one, and Datastage is rather clumsy in this respect.

    Incremental tables: when loading a table incrementally, after each run you store the operation timestamp of the most recent record in the database, and the next run only takes rows with a timestamp greater than that value. Kettle has a step that can read this timestamp from the database straight into a variable. Informatica has no equivalent step, so our approach is to first read the timestamp from the database into a file and then pass that file to the main workflow as a parameter file, which achieves the same effect.
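    The same watermark approach can be written down in a few tool-agnostic lines: persist the newest operation time after each run and pull only rows newer than it on the next run. The table and column names and the sqlite3 connection below are placeholders for whatever source database is actually used.

    # Watermark-based incremental extraction; table, columns and connection are placeholders.
    import sqlite3
    from pathlib import Path

    WATERMARK_FILE = Path("last_op_time.txt")

    def incremental_extract(conn):
        last = WATERMARK_FILE.read_text().strip() if WATERMARK_FILE.exists() else "1970-01-01 00:00:00"
        cur = conn.execute(
            "SELECT id, payload, op_time FROM source_table "
            "WHERE op_time > ? ORDER BY op_time",
            (last,),
        )
        rows = cur.fetchall()
        if rows:
            WATERMARK_FILE.write_text(rows[-1][2])   # newest op_time becomes the new watermark
        return rows

    rows = incremental_extract(sqlite3.connect("source.db"))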

    As the saying goes: there is no best tool, only the right one. Every ETL tool has its strengths and weaknesses, and you have to weigh them against the actual project and choose the one that fits; the one that fits is the best one. More and more companies and their customers now want the latest (real-time) data presented, which traditional ETL tools may not satisfy; real-time stream processing and cloud technologies are a better match for that. So we too need to keep up with the times and learn the ETL tools of the big data era.
