  • 大数据环境部署7:SparkSQL配置使用


1. SparkSQL Configuration

1. Copy the $HIVE_HOME/conf/hive-site.xml configuration file to the $SPARK_HOME/conf directory.
2. Copy the $HADOOP_HOME/etc/hadoop/hdfs-site.xml configuration file to the $SPARK_HOME/conf directory (see the sketch below).
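A minimal sketch of the two copy steps above, assuming $HIVE_HOME, $HADOOP_HOME, and $SPARK_HOME are already set in the shell environment:

    # Copy the Hive and HDFS client configuration into Spark's conf directory
    cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
    cp $HADOOP_HOME/etc/hadoop/hdfs-site.xml $SPARK_HOME/conf/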

2. Running

1. Start the Spark cluster.
2. Start the SparkSQL client: /home/spark/opt/spark-1.2.0-bin-hadoop2.4/bin/spark-sql --master spark://172.16.107.9:7077 --executor-memory 1g
3. Run SQL against a Hive table: spark-sql> select count(*) from test.t1;

Notes:

When starting spark-sql, if no master is specified it runs in local mode; the master can be either a standalone master address or yarn.

When the master is set to yarn (spark-sql --master yarn), the execution of the whole job can be monitored at http://172.16.107.9:8088;

If spark.master spark://172.16.107.9:7077 is configured in $SPARK_HOME/conf/spark-defaults.conf, then spark-sql also runs on the standalone cluster even when started without specifying a master.
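A short recap of the three launch modes described in these notes, using the paths and addresses that appear earlier in this post:

    # local mode (no master specified)
    /home/spark/opt/spark-1.2.0-bin-hadoop2.4/bin/spark-sql
    # standalone cluster, given explicitly or taken from spark.master in spark-defaults.conf
    /home/spark/opt/spark-1.2.0-bin-hadoop2.4/bin/spark-sql --master spark://172.16.107.9:7077 --executor-memory 1g
    # YARN; the job can then be monitored at http://172.16.107.9:8088
    /home/spark/opt/spark-1.2.0-bin-hadoop2.4/bin/spark-sql --master yarn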

3. Possible Problems

If startup reports an error about an incorrect string input, edit $SPARK_HOME/conf/hive-site.xml as the message indicates and it will then start correctly.
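One hedged guess at what this refers to (the post does not show the exact message): with some Hive configurations, time values in hive-site.xml that carry a unit suffix, such as 5s, make the Hive client embedded in Spark fail at startup with a NumberFormatException; changing the value to a plain number usually fixes it. A hypothetical example:

    <!-- hypothetical: a value such as "5s" rewritten as a plain number -->
    <property>
      <name>hive.metastore.client.connect.retry.delay</name>
      <value>5</value>
    </property>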

     

     

References:

    http://doc.okbase.net/byrhuangqiang/archive/104202.html

    http://www.cnblogs.com/shishanyuan/p/4723604.html?utm_source=tuicool

     

• Original post ...
     
    

1. Environment

    • OS:Red Hat Enterprise Linux Server release 6.4 (Santiago)
    • Hadoop:Hadoop 2.4.1
    • Hive:0.11.0
    • JDK:1.7.0_60
• Spark: 1.1.0 (with built-in SparkSQL)
    • Scala:2.11.2

2. Spark Cluster Layout

• Account: ebupt
• master: eb174
• slaves: eb174, eb175, eb176

3. SparkSQL History

Spark 1.1.0 was released on September 11, 2014. SparkSQL was introduced in Spark 1.0, and the largest changes in Spark 1.1.0 are in SparkSQL and MLlib; see the release notes for details.

SparkSQL's predecessor was Shark. Because of Shark's own shortcomings, Reynold Xin announced on June 1, 2014 that development of Shark would stop. SparkSQL discarded Shark's code but kept some of its strengths, such as in-memory columnar storage and Hive compatibility, and was developed from scratch.

4. Configuration

1. Installation and configuration are the same as for Spark 0.9.1 (see the post: Spark/Shark cluster installation, deployment, and troubleshooting).
2. Copy the $HIVE_HOME/conf/hive-site.xml configuration file to the $SPARK_HOME/conf directory.
3. Copy the $HADOOP_HOME/etc/hadoop/hdfs-site.xml configuration file to the $SPARK_HOME/conf directory.

5. Running

1. Start the Spark cluster.
2. Start the SparkSQL client: ./spark/bin/spark-sql --master spark://eb174:7077 --executor-memory 3g
3. Run SQL against a Hive table: spark-sql> select count(*) from test.t1;
    14/10/08 20:46:04 INFO ParseDriver: Parsing command: select count(*) from test.t1
    14/10/08 20:46:05 INFO ParseDriver: Parse Completed
    14/10/08 20:46:05 INFO metastore: Trying to connect to metastore with URI thrift://eb170:9083
    14/10/08 20:46:05 INFO metastore: Waiting 1 seconds before next connection attempt.
    14/10/08 20:46:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb174:55408/user/Executor#1282322316] with ID 2
    14/10/08 20:46:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb176:56138/user/Executor#-264112470] with ID 0
    14/10/08 20:46:06 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb175:43791/user/Executor#-996481867] with ID 1
    14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb174:54967 with 265.4 MB RAM
    14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb176:60783 with 265.4 MB RAM
    14/10/08 20:46:06 INFO BlockManagerMasterActor: Registering block manager eb175:35197 with 265.4 MB RAM
    14/10/08 20:46:06 INFO metastore: Connected to metastore.
    14/10/08 20:46:07 INFO deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
    14/10/08 20:46:07 INFO MemoryStore: ensureFreeSpace(406982) called with curMem=0, maxMem=278302556
    14/10/08 20:46:07 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 397.4 KB, free 265.0 MB)
    14/10/08 20:46:07 INFO MemoryStore: ensureFreeSpace(25198) called with curMem=406982, maxMem=278302556
    14/10/08 20:46:07 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.6 KB, free 265.0 MB)
    14/10/08 20:46:07 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb174:49971 (size: 24.6 KB, free: 265.4 MB)
    14/10/08 20:46:07 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
    14/10/08 20:46:07 INFO SparkContext: Starting job: collect at HiveContext.scala:415
    14/10/08 20:46:08 INFO FileInputFormat: Total input paths to process : 1
    14/10/08 20:46:08 INFO DAGScheduler: Registering RDD 5 (mapPartitions at Exchange.scala:86)
    14/10/08 20:46:08 INFO DAGScheduler: Got job 0 (collect at HiveContext.scala:415) with 1 output partitions (allowLocal=false)
    14/10/08 20:46:08 INFO DAGScheduler: Final stage: Stage 0(collect at HiveContext.scala:415)
    14/10/08 20:46:08 INFO DAGScheduler: Parents of final stage: List(Stage 1)
    14/10/08 20:46:08 INFO DAGScheduler: Missing parents: List(Stage 1)
    14/10/08 20:46:08 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[5] at mapPartitions at Exchange.scala:86), which has no missing parents
    14/10/08 20:46:08 INFO MemoryStore: ensureFreeSpace(11000) called with curMem=432180, maxMem=278302556
    14/10/08 20:46:08 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.7 KB, free 265.0 MB)
    14/10/08 20:46:08 INFO MemoryStore: ensureFreeSpace(5567) called with curMem=443180, maxMem=278302556
    14/10/08 20:46:08 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 5.4 KB, free 265.0 MB)
    14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb174:49971 (size: 5.4 KB, free: 265.4 MB)
    14/10/08 20:46:08 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
    14/10/08 20:46:08 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[5] at mapPartitions at Exchange.scala:86)
    14/10/08 20:46:08 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
    14/10/08 20:46:08 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, eb174, NODE_LOCAL, 1199 bytes)
    14/10/08 20:46:08 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, eb176, NODE_LOCAL, 1199 bytes)
    14/10/08 20:46:08 INFO ConnectionManager: Accepted connection from [eb176/10.1.69.176:49289]
    14/10/08 20:46:08 INFO ConnectionManager: Accepted connection from [eb174/10.1.69.174:33401]
    14/10/08 20:46:08 INFO SendingConnection: Initiating connection to [eb176/10.1.69.176:60783]
    14/10/08 20:46:08 INFO SendingConnection: Initiating connection to [eb174/10.1.69.174:54967]
    14/10/08 20:46:08 INFO SendingConnection: Connected to [eb176/10.1.69.176:60783], 1 messages pending
    14/10/08 20:46:08 INFO SendingConnection: Connected to [eb174/10.1.69.174:54967], 1 messages pending
    14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb176:60783 (size: 5.4 KB, free: 265.4 MB)
    14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb174:54967 (size: 5.4 KB, free: 265.4 MB)
    14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb174:54967 (size: 24.6 KB, free: 265.4 MB)
    14/10/08 20:46:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb176:60783 (size: 24.6 KB, free: 265.4 MB)
    14/10/08 20:46:10 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 2657 ms on eb176 (1/2)
    14/10/08 20:46:10 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 2675 ms on eb174 (2/2)
    14/10/08 20:46:10 INFO DAGScheduler: Stage 1 (mapPartitions at Exchange.scala:86) finished in 2.680 s
    14/10/08 20:46:10 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    14/10/08 20:46:10 INFO DAGScheduler: looking for newly runnable stages
    14/10/08 20:46:10 INFO DAGScheduler: running: Set()
    14/10/08 20:46:10 INFO DAGScheduler: waiting: Set(Stage 0)
    14/10/08 20:46:10 INFO DAGScheduler: failed: Set()
    14/10/08 20:46:10 INFO DAGScheduler: Missing parents for Stage 0: List()
    14/10/08 20:46:10 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[9] at map at HiveContext.scala:360), which is now runnable
    14/10/08 20:46:10 INFO MemoryStore: ensureFreeSpace(9752) called with curMem=448747, maxMem=278302556
    14/10/08 20:46:10 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 9.5 KB, free 265.0 MB)
    14/10/08 20:46:10 INFO MemoryStore: ensureFreeSpace(4941) called with curMem=458499, maxMem=278302556
    14/10/08 20:46:10 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.8 KB, free 265.0 MB)
    14/10/08 20:46:10 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb174:49971 (size: 4.8 KB, free: 265.4 MB)
    14/10/08 20:46:10 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
    14/10/08 20:46:11 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[9] at map at HiveContext.scala:360)
    14/10/08 20:46:11 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    14/10/08 20:46:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, eb175, PROCESS_LOCAL, 948 bytes)
    14/10/08 20:46:11 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@513f39c
    14/10/08 20:46:11 INFO StatsReportListener: task runtime:(count: 2, mean: 2666.000000, stdev: 9.000000, max: 2675.000000, min: 2657.000000)
    14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:11 INFO StatsReportListener:     2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s   2.7 s
    14/10/08 20:46:11 INFO StatsReportListener: shuffle bytes written:(count: 2, mean: 50.000000, stdev: 0.000000, max: 50.000000, min: 50.000000)
    14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:11 INFO StatsReportListener:     50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B  50.0 B
    14/10/08 20:46:11 INFO StatsReportListener: task result size:(count: 2, mean: 1848.000000, stdev: 0.000000, max: 1848.000000, min: 1848.000000)
    14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:11 INFO StatsReportListener:     1848.0 B        1848.0 B        1848.0 B        1848.0 B        1848.0 B        1848.0 B    1848.0 B        1848.0 B        1848.0 B
    14/10/08 20:46:11 INFO StatsReportListener: executor (non-fetch) time pct: (count: 2, mean: 86.309428, stdev: 0.103820, max: 86.413248, min: 86.205607)
    14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:11 INFO StatsReportListener:     86 %    86 %    86 %    86 %    86 %    86 %    86 %    86 %    86 %
    14/10/08 20:46:11 INFO StatsReportListener: other time pct: (count: 2, mean: 13.690572, stdev: 0.103820, max: 13.794393, min: 13.586752)
    14/10/08 20:46:11 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:11 INFO StatsReportListener:     14 %    14 %    14 %    14 %    14 %    14 %    14 %    14 %    14 %
    14/10/08 20:46:11 INFO ConnectionManager: Accepted connection from [eb175/10.1.69.175:36187]
    14/10/08 20:46:11 INFO SendingConnection: Initiating connection to [eb175/10.1.69.175:35197]
    14/10/08 20:46:11 INFO SendingConnection: Connected to [eb175/10.1.69.175:35197], 1 messages pending
    14/10/08 20:46:11 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb175:35197 (size: 4.8 KB, free: 265.4 MB)
    14/10/08 20:46:12 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sparkExecutor@eb175:58085
    14/10/08 20:46:12 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 140 bytes
    14/10/08 20:46:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 1428 ms on eb175 (1/1)
    14/10/08 20:46:12 INFO DAGScheduler: Stage 0 (collect at HiveContext.scala:415) finished in 1.432 s
    14/10/08 20:46:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    14/10/08 20:46:12 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@6e8030b0
    14/10/08 20:46:12 INFO StatsReportListener: task runtime:(count: 1, mean: 1428.000000, stdev: 0.000000, max: 1428.000000, min: 1428.000000)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:     1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s   1.4 s
    14/10/08 20:46:12 INFO StatsReportListener: fetch wait time:(count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:     0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms  0.0 ms
    14/10/08 20:46:12 INFO StatsReportListener: remote bytes read:(count: 1, mean: 100.000000, stdev: 0.000000, max: 100.000000, min: 100.000000)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:     100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B 100.0 B
    14/10/08 20:46:12 INFO SparkContext: Job finished: collect at HiveContext.scala:415, took 4.787407158 s
    14/10/08 20:46:12 INFO StatsReportListener: task result size:(count: 1, mean: 1072.000000, stdev: 0.000000, max: 1072.000000, min: 1072.000000)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:     1072.0 B        1072.0 B        1072.0 B        1072.0 B        1072.0 B        1072.0 B    1072.0 B        1072.0 B        1072.0 B
    14/10/08 20:46:12 INFO StatsReportListener: executor (non-fetch) time pct: (count: 1, mean: 80.252101, stdev: 0.000000, max: 80.252101, min: 80.252101)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:     80 %    80 %    80 %    80 %    80 %    80 %    80 %    80 %    80 %
    14/10/08 20:46:12 INFO StatsReportListener: fetch wait time pct: (count: 1, mean: 0.000000, stdev: 0.000000, max: 0.000000, min: 0.000000)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:      0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %     0 %
    14/10/08 20:46:12 INFO StatsReportListener: other time pct: (count: 1, mean: 19.747899, stdev: 0.000000, max: 19.747899, min: 19.747899)
    14/10/08 20:46:12 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
    14/10/08 20:46:12 INFO StatsReportListener:     20 %    20 %    20 %    20 %    20 %    20 %    20 %    20 %    20 %
    5078
    Time taken: 7.581 seconds

Notes:

1. When starting spark-sql, if no master is specified it runs in local mode; the master can be either a standalone master address or yarn;
2. When the master is set to yarn (spark-sql --master yarn), the execution of the whole job can be monitored at http://$master:8088;
3. If spark.master spark://eb174:7077 is configured in $SPARK_HOME/conf/spark-defaults.conf, spark-sql also runs on the standalone cluster even when started without specifying a master.

6. Problems Encountered and Solutions

① Running an SQL statement in the spark-sql command-line client fails with an unresolvable host, UnknownHostException: ebcloud (ebcloud is Hadoop's dfs.nameservices name).

    14/10/08 20:42:44 ERROR CliDriver: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, eb174): java.lang.IllegalArgumentException: java.net.UnknownHostException: ebcloud
      org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
      org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:240)
      org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:144)
      org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:579)
      org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:524)
      org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:146)
      org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
      org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
      org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)

Cause: Spark cannot correctly resolve the HDFS address. Copy Hadoop's HDFS configuration file hdfs-site.xml into the $SPARK_HOME/conf directory.

② The connection to the HDFS NameNode is refused when running a query:

14/10/08 20:26:46 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0 
    14/10/08 20:26:46 INFO SparkContext: Starting job: collect at HiveContext.scala:415 
    14/10/08 20:29:19 WARN RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo over eb171/10.1.69.171:8020. Not retrying because failovers (15) exceeded maximum allowed (15) 
    java.net.ConnectException: Call From eb174/10.1.69.174 to eb171:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) 
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783) 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1414) 
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)

Cause: the HDFS connection failed because hdfs-site.xml had not been synchronized to all of the slave nodes.
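A minimal sketch of syncing the file, assuming the same $SPARK_HOME path on every node and the hostnames used in this post:

    # push the updated hdfs-site.xml to the other nodes
    scp $SPARK_HOME/conf/hdfs-site.xml eb175:$SPARK_HOME/conf/
    scp $SPARK_HOME/conf/hdfs-site.xml eb176:$SPARK_HOME/conf/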

7. References

• SparkSQL configuration (Hive as the data source)


Hive configuration (MySQL as the metastore storage, HDFS as the data storage):

1. Edit hive-env.sh (it can be created by copying hive-env.sh.template)

# Hadoop home directory
    export HADOOP_HOME=/usr/local/hadoop
    # Hive Configuration Directory can be controlled by:
    export HIVE_CONF_DIR=/usr/local/hive/conf
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
    export HIVE_AUX_JARS_PATH=/usr/local/hive/lib
    
    
2. Edit hive-site.xml (hive-default.xml.template can be used as a reference)
<!-- The main settings here are the MySQL connection details -->
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
    
    
    
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
    <value>yourpassword</value>
        <description>password to use against metastore database</description>
      </property>
    
    
    
     <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>Username to use against metastore database</description>
      </property>
    
At this point the basic Hive configuration is complete.
Then start $HIVE_HOME/bin/hive to check that it starts successfully.
-------------------------------------------------------
Configure Spark
1. Edit spark-env.sh
# Set the memory values according to your machine; note that if they are set too small, jobs will fail with "no resource" errors
    export SCALA_HOME=/usr/local/spark
    export JAVA_HOME=/usr/local/jdk1.8.0
    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    export SPARK_MASTER_IP=master
    export SPARK_WORKER_MEMORY=800m
    export SPARK_EXECUTOR_MEMORY=800m
    export SPARK_DRIVER_MEMORY=800m
    export SPARK_WORKER_CORES=4
    export MASTER=spark://master:7077
    
    
    
2. Edit spark-defaults.conf
    spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two thr"
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs://master:9000/historyserverforSpark
# Web UI for viewing the history of completed Spark jobs
    spark.yarn.historyServer.address        master:18080
    spark.history.fs.logDirectory   hdfs://master:9000/historyserverforSpark 
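The event log directory configured above normally has to exist on HDFS before Spark applications are started, otherwise event logging fails; a one-line sketch of creating it (using the HDFS address assumed above):

    hdfs dfs -mkdir -p hdfs://master:9000/historyserverforSpark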
    
    
3. Edit slaves (two worker nodes are configured)
    slave1
    slave2
    -------------------------------------------------------
Add hive-site.xml under spark/conf with the following content:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <configuration>
    <property>
    <name>hive.metastore.uris</name>  
        <value>thrift://master:9083</value>  
        <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description> 
    </property>
    
    
    </configuration>
    
    
4. Start the Hive metastore
 hive --service metastore
5. Start SparkSQL
./bin/spark-sql
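A quick sanity check after the two commands above, assuming the metastore came up correctly:

    spark-sql> show databases;
    spark-sql> show tables;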
    
        
    
    
    
    
    
                                                                               
                                                   



• Prerequisite components for Hive

• HDFS, used to store the content data (files) of Hive tables
      • MySQL, used to store the structural information of Hive databases and tables
• Installing Hive

1. Download, extract, and rename
      2. Edit the configuration files
        • hive-env.sh
          HADOOP_HOME=/opt/hadoop-2.7.7
          HIVE_CONF_DIR=/opt/hive-2.3.9/conf
JAVA_HOME=/opt/jdk1.8.0_291
      • hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Hive's defaults are provided in hive-default.xml; all available configuration options and their descriptions are documented at
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-HiveConfigurationProperties
For now we only need the 4 options for connecting to MySQL as the metastore.
-->
<configuration>
     <property>
       <name>javax.jdo.option.ConnectionURL</name>
       <value>jdbc:mysql://node01:3306/metastore?createDatabaseIfNotExist=true</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionDriverName</name>
       <value>com.mysql.cj.jdbc.Driver</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionUserName</name>
       <value>root</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionPassword</name>
       <value>123456</value>
     </property>
</configuration>
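One hedged note not covered above: for this metastore configuration Hive also needs the MySQL JDBC driver on its classpath, so the Connector/J jar (the file name below is only an example) is usually copied into Hive's lib directory:

    cp mysql-connector-java-8.0.x.jar /opt/hive-2.3.9/lib/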
3. Distribute the installation package

• Hive is a data warehouse tool and does not need any running service; you simply use the hive client to execute HQL or submit HQL scripts, so it only has to be installed on the machines where it is used.
4. Add environment variables

    echo 'export HIVE_HOME=/opt/hive-2.3.9' >> /etc/profile
    echo 'export PATH=.:$HIVE_HOME/bin:$PATH' >> /etc/profile
source /etc/profile
5. Start the client

    hive

3.2 Using SparkSQL from the Command Line

• SparkSQL ships with a built-in Hive.
  By simply copying the existing Hive configuration file into the Spark installation directory, SparkSQL can run its computations directly against the existing Hive databases.

1. Copy conf/hive-site.xml from the Hive installation directory into the conf directory of the Spark installation.
      2. Copy the JDBC driver jar that Hive uses to connect to MySQL into the jars directory of the Spark installation.
      3. Start the SparkSQL command-line client with spark-sql (a sketch of these steps follows this list).
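A minimal sketch of these three steps, assuming Hive is installed at /opt/hive-2.3.9 (as above), Spark at a hypothetical /opt/spark, and a MySQL Connector/J jar already present in Hive's lib directory:

    cp /opt/hive-2.3.9/conf/hive-site.xml /opt/spark/conf/
    cp /opt/hive-2.3.9/lib/mysql-connector-java-8.0.x.jar /opt/spark/jars/
    /opt/spark/bin/spark-sql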

3.3 SparkSQL Program Development

• SparkSQL supports development in Scala, Java, Python, R, and SQL
• The Scala API is used here

  1. Add the dependencies in Maven

  2. Create a SparkSession

  3. Load data through the SparkSession and run SQL operations (see the Scala sketch after this list)
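A minimal Scala sketch of steps 2 and 3, assuming the spark-sql and spark-hive Maven dependencies are on the classpath and that a Hive table such as the test.t1 table used earlier on this page exists:

    import org.apache.spark.sql.SparkSession

    object SparkSQLDemo {
      def main(args: Array[String]): Unit = {
        // Create the SparkSession with Hive support so existing Hive tables are visible
        val spark = SparkSession.builder()
          .appName("sparkSql")
          .master("local[*]")          // or a standalone/yarn master
          .enableHiveSupport()
          .getOrCreate()

        // Load data by running SQL against a Hive table and show the result
        spark.sql("select count(*) from test.t1").show()

        spark.stop()
      }
    }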

• SparkSQL parameter configuration

Reposted from: http://www.cnblogs.com/wwxbi/p/6114410.html. View the SQL parameter settings of the current environment with spark.sql("SET -v"): key/value pairs such as spark.sql.hive.version 1.2.1, spark.sql.sources.parallelPartitionDiscovery.threshold 32, spark.sql....
• Ranger has been used to manage Hive user permissions, and the system now needs to integrate SparkSQL (thriftserver), but Ranger has no SparkSQL plugin; based on material gathered from HORTONWORKS, SparkSQL can be configured with LLAP (for more details on LLAP, see...
• SparkSQL parameter configuration guide

With dynamic resource allocation enabled, an executor idle for more than 60 s is reclaimed, but the number of executors will not fall below spark.dynamicAllocation.minExecutors. When a job is short of resources it automatically requests more from YARN, but the number of executors will not exceed spark.dynamicAllocation...
• Required SparkSQL configuration

Required SparkSQL configuration. Driver and Executor permission issues: the Driver and Executors need to access Hive's metastore database in MySQL during execution, but which machine the Driver or an Executor is assigned to is not fixed, so every worker node in the cluster needs to...
• Common SparkSql parameter settings

Common SparkSql parameter settings: 1. Persistence: at the RDD level, cache persists to memory; MEMORY_ONLY_SER serializes the data (enable Spark Kryo serialization), which effectively reduces memory usage but costs more CPU for serialization and also requires registering the classes to be serialized; with yarn...
• Storing sparkSQL metadata in MySQL

Building a data warehouse with Spark at its core. 0. Notes: in the big data field Hive is the long-established, popular data warehouse, and Spark can be made compatible with Hive. But if you do not want to use Hive for the data... sparkSQL as a data warehouse puts its metadata in Derby by default, which is generally not used in production.
• SparkSQL ThriftServer configuration and connection test

• SparkSQL tuning parameter settings

    set spark.shuffle.file.buffer=128k set spark.shuffle.consolidateFiles=true set spark.shuffle.manager=hash set spark.shuffle.memoryFraction=0.5 set spark.serializer=org.apache.spark.serializer....
• Basic { def main(args: Array[String]): Unit = { // create the Spark configuration object val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkSql") // create the SparkSession object val spark = SparkSession....
• 1. Environment configuration 2. Example
• Configuring Tableau Desktop to connect to SparkSQL

Configuring Tableau Desktop to connect to SparkSQL. 1. Preparation: a. Install Tableau Desktop on Windows (Windows 7 or later), and download Tableau's SparkSQL ODBC driver (SimbaSparkODBC64.msi) from the official site and install it on Windows; b. ...
  • SparkSQL

The SparkSQL shuffle process; SparkSQL structured data; SparkSQL parsing. The SparkSQL shuffle process: the core of Spark SQL is to take an existing RDD, attach schema information to it, and register it as something like an SQL "Table" that can then be queried with SQL. This mainly has two parts; ...
• Configuring sparksql in cdh5.3

The Spark shipped with cdh5.3 already includes sparksql; only the following steps are needed to use it: 1) make sure the Hive CLI and JDBC both work correctly; 2) copy hive-site.xml to the SPARK_HOME/conf directory; 3) add Hive's libraries to Spark ...
  • SparkSql

Contents (SparkSql): its essence (what it is), its purpose (what it does), and its architecture (what it contains), each explained in turn. Spark SQL consists of four parts: core, catalyst, hive, and hive-thriftserver. 1. The Catalyst optimizer, UDF, UDAF, window...
• First configure Hive properly with its metadata stored in MySQL (look this part up yourself!). Then configure Spark SQL: 1. Configure hive-site.xml: on master1, create a hive-site.xml file in the /usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf directory with the following content: ...
  • sparkSQL

Introduction to sparkSQL: sparkSQL parses SQL into Spark jobs for execution and is friendlier to use. Shark was an SQL execution engine built on the Spark computing framework and compatible with Hive syntax; the underlying computation used Spark and was roughly 2x or more faster than MapReduce-based Hive. When all the data is loaded...
• 1. Required configuration files: the fully configured installation package is shown in the figure and can be downloaded here (extraction code c7tv). 2. JDK environment variables: open My Computer (/Computer) - Properties - Advanced system settings - Environment Variables, click New, and add: (the value in the figure is "%JAVA_HOME%\lib;%JAVA_...
  • sparksql

Introduction to sparksql: sparksql is Spark's module for processing structured data; it provides a programming abstraction called DataFrame and serves as a distributed SQL query engine. sparksql converts SQL into RDDs which are then submitted to the cluster for execution, so execution is efficient. Hive applications in fact...
