  • Big data

    2017-12-17 23:04:00
    Big data refers to data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy. There are three dimensions to big data, known as Volume, Variety, and Velocity.

    Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem." Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, practitioners of medicine, advertising, and governments alike regularly meet difficulties with large data sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research.

    Data sets grow rapidly, in part because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial (remote sensing) platforms, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[8] as of 2012, 2.5 exabytes (2.5×10^18 bytes) of data are generated every day. IDC predicts there will be 163 zettabytes of data by 2025. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.
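
    The doubling figure quoted above can be turned into simple arithmetic. The following is a minimal sketch, assuming decimal prefixes (1 exabyte = 10^18 bytes) and a constant 40-month doubling period; the function and variable names are illustrative and do not come from the original text.

```python
def growth_factor(months: float, doubling_months: float = 40.0) -> float:
    """Multiplicative growth after `months`, given one doubling every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


EXABYTE = 10 ** 18    # bytes, decimal prefix (assumption)
ZETTABYTE = 10 ** 21  # bytes, decimal prefix (assumption)

# 2.5 exabytes per day is the 2012 figure quoted in the text.
daily_bytes_2012 = 2.5 * EXABYTE
yearly_bytes_2012 = daily_bytes_2012 * 365

if __name__ == "__main__":
    # Storage capacity over one decade (120 months) at the quoted doubling rate.
    print(f"capacity growth over 10 years: x{growth_factor(120):.1f}")
    # Daily generation rate extrapolated to a full year, in zettabytes.
    print(f"data generated in a 2012-like year: {yearly_bytes_2012 / ZETTABYTE:.4f} ZB")
```

    Note that the doubling rate describes storage capacity, while the 2.5 exabyte figure describes data generated, so the two numbers should not be multiplied together.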

    Relational database management systems and desktop statistics- and visualization-packages often have difficulty handling big data. The work may require "massively parallel software running on tens, hundreds, or even thousands of servers". What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."

    Reposted from: https://www.cnblogs.com/Anei/p/8053783.html

  • Big Data

    2017-05-01 23:14:07
    Big Data

    Source: https://words.sdsc.edu/words-data-science/big-data
    The term ‘big data’ seems to be popping up everywhere these days. And there seem to be as many uses of this term as there are contexts in which you find it: ‘big data’ is often used to refer to any dataset that is difficult to manage using traditional database systems; it is also used as a catch-all term for any collection of data that is too large to process on a single server; yet others use the term to simply mean “a lot of data”; sometimes it turns out it doesn’t even have to be large. So what exactly is big data?

    A precise specification of ‘big’ is elusive. What is considered big for one organization may be small for another. What is large-scale today will likely seem small-scale in the near future; petabyte is the new terabyte. Thus, size alone cannot specify big data. The complexity of the data is an important factor that must also be considered.

    Most now agree with the characterization of big data using the 3 V’s coined by Doug Laney of Gartner:

    • Volume: This refers to the vast amounts of data that are generated every second/minute/hour/day in our digitized world.
    • Velocity: This refers to the speed at which data is being generated and the pace at which data moves from one point to the next.
    • Variety: This refers to the ever-increasing different forms that data can come in, e.g., text, images, voice, geospatial.

    A fourth V is now also sometimes added:

    • Veracity: This refers to the quality of the data, which can vary greatly.

    The above V’s are the dimensions that characterize big data, and also embody its challenges: We have huge amounts of data, in different formats and varying quality, that must be processed quickly.

    It is important to note that the goal of processing big data is to gain insight to support decision-making. It is not sufficient to just be able to capture and store the data. The point of collecting and processing volumes of complex data is to understand trends, uncover hidden patterns, detect anomalies, etc. so that you have a better understanding of the problem being analyzed and can make more informed, data-driven decisions. In fact, many consider value as the fifth V of big data:

    • Value: Processing big data must bring about value from insights gained.

    To address the challenges of big data, innovative technologies are needed. Parallel, distributed computing paradigms, scalable machine learning algorithms, and real-time querying are key to analysis of big data. Distributed file systems, computing clusters, cloud computing, and data stores supporting data variety and agility are also necessary to provide the infrastructure for processing of big data. Workflows provide an intuitive, reusable, scalable, and reproducible way to process big data, gain verifiable value from it, and apply the same methods to different datasets.
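
    The parallel, distributed paradigm mentioned above is often explained with a map/reduce word count. The following is a minimal sketch on a single machine, assuming threads as a stand-in for cluster nodes; the function names and the sample partitions are hypothetical and are not taken from the source article.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(partitions, workers=4):
    # Threads stand in for cluster nodes here; a real system would ship
    # each partition to a separate machine and merge results over the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(merge_counts, partials, Counter())


# Hypothetical input, already split into two partitions.
partitions = [
    ["big data is big", "data moves fast"],
    ["data comes in many forms"],
]
totals = word_count(partitions)

if __name__ == "__main__":
    print(totals.most_common(3))
```

    The point of the design is that the map step never needs to see another partition, so adding partitions and workers scales the computation without changing the code.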

    With all the data generated from social media, smart sensors, satellites, surveillance cameras, the Internet, and countless other devices, big data is all around us. The endeavor to make sense out of that data brings about exciting opportunities indeed!

  • bigdata

    2012-04-24 22:10:08
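    Big data is already a very hot topic, and the Hadoop system has become the fastest and cheapest way to solve problems in this field. Today I am reposting some material on big data. It introduces many of the currently popular big data technologies. It also mentions a next-generation big data architecture; whether it really is the next generation is worth discussing, and many companies have already started using such an architecture.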

    http://www.slideshare.net/eddodds/big-data-infrastructure1


  • BIG DATA

    2016-08-18 07:52:06
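    Missing data and collinearity
    At ZestFinance and OnDeck, and in large-scale risk-control modeling in general, the data typically comes from application reports, government data, credit reports, and data collected from websites and apps, as well as from corporate partnerships and public Internet data resources. The data has very high dimensionality, very broad sources, and very complex structure, and the raw sources are combined through data fusion. In practice, large amounts of missing data and overly strong correlation between variables appear, which leads to overfitting.
    Once a key variable is missing, the performance of linear models such as logistic regression is severely affected. In machine learning, regularized methods are used to limit model complexity. Collinearity causes serious spurious-correlation problems, and during rapid model iteration the synergy obtained is incomplete.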
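
    To illustrate why regularization helps with collinear inputs, here is a minimal sketch of L2-regularized logistic regression fitted by batch gradient descent. It is an illustration under stated assumptions, not the method of any company named above: the data set, the starting weights, and the `fit_logistic` helper are all hypothetical. With two identical features, only the sum of the two weights affects the predictions, so the unregularized fit keeps whatever lopsided split it started with, while the L2 penalty pulls the two weights together.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, init, l2=0.0, lr=0.1, steps=5000):
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2 (no bias term, for brevity)."""
    w = list(init)
    n = len(rows)
    for _ in range(steps):
        grad = [l2 * wj for wj in w]  # gradient of the L2 penalty
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n  # gradient of the mean log-loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Two perfectly collinear features: the second column duplicates the first.
values = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
rows = [[v, v] for v in values]
labels = [0, 0, 1, 0, 1, 1]  # not perfectly separable, so the weights stay finite

start = [3.0, -3.0]  # deliberately lopsided starting point
w_plain = fit_logistic(rows, labels, start, l2=0.0)
w_ridge = fit_logistic(rows, labels, start, l2=0.1)

if __name__ == "__main__":
    print("no penalty :", [round(w, 3) for w in w_plain])
    print("L2 penalty :", [round(w, 3) for w in w_ridge])
```

    Without the penalty, the large opposite-signed weights survive training even though they carry no information, which is one way collinearity produces misleading coefficients.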
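    In probabilistic graphical models, different imputation methods are generally used: median, mode, mean, nearest distance, and model-based imputation using regression, C4.5, hot-deck, or K-means. PCA or variable selection is then used to reduce dimensionality. This yields a single maximum-likelihood estimate; such a model often cannot fit the data accurately, and this signal data may degrade model performance. Taking into account missing data, data correlation, and causal analysis, and building on Bayesian theory, one can combine probabilistic principal component analysis (Probabilistic Principal Components Analysis, PPCA) for continuous data with Bayesian networks for discrete data inside a probabilistic graphical model, with the aim of a risk-control algorithm that is strong in both graphical interpretability and predictive ability.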
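
    The simplest of the imputation strategies listed above (mean, median, mode) can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `impute` helper and the sample `income` column are hypothetical, and the model-based strategies (regression, C4.5, hot-deck, K-means) are not covered.

```python
from statistics import mean, median, mode


def impute(column, strategy="median"):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in column if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in column]


# Hypothetical column with two missing entries.
income = [30, None, 50, 50, None, 90]

if __name__ == "__main__":
    for strategy in ("mean", "median", "mode"):
        print(strategy, impute(income, strategy))
```

    Each strategy fills every gap with the same single value, which is exactly the "single estimate" limitation the paragraph above describes.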
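    Probabilistic PCA is a transform method that maps multidimensional data onto a low-dimensional plane. Its core idea is that the observed samples are generated from latent variables through a conditional probability that follows a normal distribution. A Bayesian network is a directed acyclic probabilistic graphical model suited to discrete variables; correlations between variables and heat maps are used to construct the dependency network among the variables, which is determined through the joint distribution function and Euclidean distance calculations. According to Bayesian theory, given the known part of the data, the conditional probability of the unknown variables given the observed variables can be computed, which achieves the goal of filling in missing values.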
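
    The last step described above, filling a missing discrete variable from its conditional probability given the observed ones, can be sketched with a two-node Bayesian network. This is a minimal sketch, assuming a hypothetical network `employment -> repaid` with made-up probabilities; none of the names or numbers come from the original post.

```python
from fractions import Fraction as F

# Hypothetical two-node network: employment -> repaid. All probabilities are made up.
p_employment = {"stable": F(7, 10), "unstable": F(3, 10)}
p_repaid_given = {
    "stable": {"yes": F(9, 10), "no": F(1, 10)},
    "unstable": {"yes": F(6, 10), "no": F(4, 10)},
}


def posterior_employment(repaid):
    """P(employment | repaid) by Bayes' rule: prior times likelihood, then normalise."""
    joint = {e: p_employment[e] * p_repaid_given[e][repaid] for e in p_employment}
    total = sum(joint.values())
    return {e: p / total for e, p in joint.items()}


def impute_employment(repaid):
    """Fill a missing employment value with its most probable state."""
    post = posterior_employment(repaid)
    return max(post, key=post.get)


if __name__ == "__main__":
    for observed in ("yes", "no"):
        print(observed, posterior_employment(observed), impute_employment(observed))
```

    Unlike mean or mode imputation, the filled value here depends on what was observed for the same record, and the full posterior is available if a single guess is not enough.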

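    The essence of big data:

    A rational understanding of software usage and quantification gave rise, from an analytical perspective, to the concept of big data.
    The essence of machine learning
    As quantitative change in data leads to qualitative change and the complexity of the data space increases, the rules implicit in the data become more precise and complete. Machine learning brings out these relationships inside the data so that they can be perceived in the physical world.
    Where is most of the work in big data spent?
    Currently about 80% of the work goes into cleaning and validating data sets. This work is not hard, but it is tedious and time-consuming.
    Data collection and classification. Ad-hoc queries over massive data.
    On Hadoop, each use case has its own query-performance requirements. Parquet and ORC are common Hive storage formats.
    CarbonData is a data format introduced by Huawei that can support PB-scale data.
    Cost-saving techniques: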
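          Stream computing
          Layers built on top of stream computing
          StreamCQL, which Huawei built on Storm, handles many kinds of computation on streams:
          Data processing
          Ad-hoc queries
          Machine learning
          Reports
          Storage output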
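
    As an illustration of the kind of computation a stream layer performs, here is a minimal sketch of a tumbling-window count over a time-ordered event stream. It is plain Python, not StreamCQL syntax or the Storm API; the function name, the event tuples, and the 10-second window are all assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) for each fixed-size window of a time-ordered stream."""
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        # Align the timestamp to the start of its window.
        start = timestamp - (timestamp % window_seconds)
        if current_start is not None and start != current_start:
            # The stream moved into a new window: emit the finished one.
            yield current_start, counts
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts  # emit the last, still-open window


# Hypothetical (timestamp_in_seconds, event_type) stream.
events = [(0, "login"), (3, "click"), (9, "click"), (10, "login"), (14, "click"), (25, "click")]
windows = list(tumbling_window_counts(events, window_seconds=10))

if __name__ == "__main__":
    for start, counts in windows:
        print(start, dict(counts))
```

    Because the function is a generator, each window is emitted as soon as the stream moves past it, without holding the whole stream in memory. Windows that contain no events are skipped rather than emitted as empty.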
         

  • Alex Gorelik - The Enterprise Big Data Lake_ Delivering the Promise of Big Data and Data Science-O’Reilly Media (2019)
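  • bigdata_interview: interview summary
  • Tianchi - BigData: This repository hosts some of the code from my earlier participation in the Tianchi big data competitions. For content about the competitions, you are welcome to visit my blog, Snoopy_Yuan's blog - Tianchi competitions, or PnYuan - Homepages - Tianchi competitions. Here is a repository for my code during ...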
  • Big Data Visualization

    2018-08-15 17:52:47
    The target audience of this book is data analysts and those with at least a basic knowledge of big data analysis who now want to learn interesting approaches to big data visualization in order to ...
  • bigdata-examples: data visualization, large-screen dashboard template examples
  • BigData code

    2016-04-04 13:18:07
    BigData code
  • Big Data 2.0

    2017-12-05 17:28:09
    of Big Data processing systems. The book is not focused only on one research area or one type of data. However, it discusses various aspects of research and development of Big Data systems. It also ...
  • Pentaho Big Data Community Home - Pentaho Big Data - Pentaho Wiki Pentaho Big Data Community Home - Pentaho Big Data -Pentaho WikiTa...
  • Big Data Books

    2020-12-31 18:51:14
    Big Data Book list added [https://github.com/vimoxshah/Learning-Resources/blob/master/BigData/books.md]. This question comes from the open-source project GDGAhmedabad/Learning-Resources.
  • Oracle Big Data Handbook

    2018-10-29 10:41:26
    Oracle big data handbook for big data analysis
  • BigData: big data study notes

    2021-01-03 18:53:40
    BigData: big data study notes
  • Big Data Now

    2013-11-05 13:38:14
    Big Data Trend and Overview
  • This data is categorized as "Big Data" due to its sheer Volume, Variety and Velocity. Most of this data is unstructured, quasi-structured, or semi-structured, and it is heterogeneous in nature. The ...
  • The Enterprise Big Data Lake

    2019-02-25 13:35:06
    The Enterprise Big Data Lake
  • Title: Handbook of Big Data Technologies Length: 895 pages Edition: 1st ed. 2017 Language: English Publisher: Springer Publication Date: 2017-03-26 ISBN-10: 3319493396 ISBN-13: 9783319493398 Table of...
  • Huawei Big Data Developer: HCNP-Big Data-Developer (HCNP Big Data Developer) course material
  • "Big Data Management and Processing" English | ISBN: 1498768075 | 2017 | 487 pages | PDF | 23 MB "Big Data Management and Processing is [a] state-of-the-art book that deals with a wide range of ...
  • Big Data For Dummies

    2014-11-22 16:52:20
    Big Data For Dummies
  • big data computational intelligence networking.pdf
