• what is big data?

    2015-08-26 18:51:20

    link: http://opensource.com/resources/big-data

    Big data: everyone seems to be talking about it, but what is big data really? How is it changing the way researchers at companies, non-profits, governments, institutions, and other organizations are learning about the world around them? Where is this data coming from, how is it being processed, and how are the results being used? And why is open source so important to answering these questions?

    In this short primer, learn all about big data and what it means for the changing world we live in.

    What is big data?

    There is no hard and fast rule about exactly what size a database needs to be in order for the data inside of it to be considered “big.” Instead, what typically defines big data is the need for new techniques and tools in order to be able to process it. In order to use big data, you need programs which span multiple physical and/or virtual machines working together in concert in order to process all of the data in a reasonable span of time.

    Special programming techniques are needed to get programs on multiple machines to work together efficiently, so that each program knows which portion of the data to process and the results from all of the machines can be combined to make sense of a large pool of data. Since it is typically much faster for programs to access data stored locally than over a network, the distribution of data across a cluster, and how those machines are networked together, are also important considerations in big data problems.

    What kind of datasets are considered big data?

    The uses of big data are almost as varied as they are large. Prominent examples you’re probably already familiar with include social media networks analyzing their members’ data to learn more about them and connect them with content and advertising relevant to their interests, or search engines looking at the relationship between queries and results to give better answers to users’ questions.

    But the potential uses go much further! Two of the largest sources of data in large quantities are transactional data, including everything from stock prices to bank data to individual merchants’ purchase histories; and sensor data, much of it coming from what is commonly referred to as the Internet of Things (IoT). This sensor data might be anything from measurements taken from robots on the manufacturing line of an auto maker, to location data on a cell phone network, to instantaneous electrical usage in homes and businesses, to passenger boarding information taken on a transit system.

    By analyzing this data, organizations are able to learn trends about the data they are measuring, as well as about the people generating it. The hope for this big data analysis is to provide more customized service and increased efficiency in whatever industry the data is collected from.

    How is big data analyzed?

    One of the best known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. It serves as a model for how to program, and the name is often used to refer to actual implementations of this model.

    In essence, MapReduce consists of two parts. The Map function does sorting and filtering, taking data and placing it inside of categories so that it can be analyzed. The Reduce function provides a summary of this data by combining it all together. While largely credited to research which took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
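The two parts above can be illustrated with a minimal, single-process sketch of the MapReduce model using word count, the canonical example. Real frameworks distribute the map and reduce tasks across many machines; this only shows the shape of the two phases.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: sort/filter raw input into (key, value) pairs -- here,
    # one (word, 1) pair per word occurrence.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group values by key; Reduce: combine each group into a summary.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(map_phase(["big data", "big ideas"]))
print(counts["big"])  # 2
```

In a distributed implementation the pairs emitted by `map_phase` would be partitioned by key across machines before the reduce step, which is exactly the "shuffle" that makes network layout matter.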

    What tools are used to analyze big data?

    Perhaps the most influential and established tool for analyzing big data is known as Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use within an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

    • The Hadoop Distributed File System (HDFS), which is a distributed
      file system designed for very high aggregate bandwidth;
    • YARN, a platform for managing Hadoop’s resources and scheduling
      programs which will run on the Hadoop infrastructure;
    • MapReduce, as described above, a model for doing big data processing;
    • And a common set of libraries for other modules to use.

    To learn more about Hadoop, see our Introduction to Apache Hadoop for big data.

    Other tools are out there too. One which has been receiving a lot of attention recently is Apache Spark. The main selling point of Spark is that it stores much of the data for processing in memory, as opposed to on disk, which for certain kinds of analysis can be much faster. Depending on the operation, analysts may see results a hundred times faster or more. Spark can use the Hadoop Distributed File System, but it is also capable of working with other data stores, like Apache Cassandra or OpenStack Swift. It’s also fairly easy to run Spark on a single local machine, making testing and development easier.

    For more on Apache Spark, see our collection of articles on the topic.

    Of course, these aren’t the only two tools out there. There are countless open source solutions for working with big data, many of them specialized to provide optimal features and performance for a specific niche or for specific hardware configurations. And as big data continues to grow in size and importance, the list of open source tools for working with it will certainly continue to grow alongside.

  • what is Bigdata

    2021-06-18 11:18:27

    Bigdata definition:
    huge data, i.e. data at a scale so large that it cannot be captured, managed, processed, and organized into information that humans can interpret within a reasonable time.

    The 4V characteristics

  • The second chapter explores what makes big data special and, in doing so, leads us to a more specific definition. In Chapter 3, we discuss the problems related to storing and managing big data. Most ...
  • What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, like Hive, Avro, Kafka and so on. So the book is self-...
  • big data now

    2017-11-12 07:36:45
    Mike Loukides kicked things off in June 2010 with “What is data science?” and from there we’ve pursued the various threads and themes that naturally emerged. Now, roughly a year later, we can look...
  • Big Data Architect’s Handbook is for you if you are an aspiring data professional, developer, or IT enthusiast who aims to be an all-round architect in big data. This book is your one-stop solution ...
  • Discovering Big Data’s fundamental concepts and what makes it different from previous forms of data analysis and data science Understanding the business motivations and drivers behind Big Data ...
  • This is a book about what, if any, “home field advantage” the discipline of geography might hold with “big data” given its history of dealing with large, heterogeneous sets of spatial information. ...
  • But what will set you apart from the rest is actually knowing how to USE big data to get solid, real-world business results – and putting that in place to improve performance.Big Data will give you ...
  • Understanding Big Data

    2014-03-22 23:03:41
    Chapter 1 What Is Big Data? Hint: You’re a Part of it Every Day Chapter 2 Why is Big Data Important? Chapter 3 Why IBM for Big Data? Part II: Big Data: From the Technology Perspective Chapter 4 All ...
  • The best-selling author of Big Data is back, this time with a unique and in-depth insight into how specific companies use big data. Big data is on the tip of everyone's tongue. Everyone understands ...
  • Big Data Bootcamp explains what big data is and how you can use it in your company to become one of tomorrow's market leaders. Along the way, it explains the very latest technologies, companies, and ...
  • Your business generates reams of data, but what do you do with it? Reporting is only the beginning. Your data holds the key to innovation and growth – you just need the proper analytics. In Big Data,...
  • Big Data Analytics Using Splunk opens the door to an exciting world of real-time operational intelligence.Built around hands-on projects Shows how to mine social media Opens the door to real-time ...
  • This series aims to capture new developments and summarize what is known over the entire spectrum of mathematical and computational biology and medicine. It seeks to encourage the integration of ...
  • Examine the problem of maintaining the quality of big data and discover novel solutions. You will learn the four V’s of big data, including veracity, and study the problem from various angles. The ...
  • High Fidelity Data Reduction for Big Data Security Dependency Analyses (CCF A). This is the most clearly structured paper I have read!

    High Fidelity Data Reduction for Big Data Security Dependency Analyses(CCF A)



    1. We found that some events have identical dependency impact scope, and therefore they can be safely aggregated (events with the same dependency impact scope can be merged), which motivates the proposed method, Causality Preserved Reduction (CPR).
    2. Furthermore, we found that certain popular system behaviors lead to densely connected dependency graphs of objects and their related neighbors, which motivates the proposed Process-centric Causality Approximation Reduction (PCAR).
    3. The evaluation shows that, while preserving accuracy, the approach improves the efficiency of forensic analysis by 5.6 times.


    2.1 System Dependency Analysis



    2.1.2 POI

    The paper introduces the term Point-of-Interest (POI), which it uses heavily. As I understand it, a POI is a problematic event: the point at which trouble is first observed, i.e., a detection point. Backward analysis starts from this point and tries to find every event and entity that influenced it, constructing a dependency graph.
    The figure above gives an example: eAD-1 is the POI event; the events marked with red lines influence the POI, while the events marked with black dashed lines do not.

    2.2 Data Characteristics

    Logs are collected with kernel audit on Linux and ETW on Windows. The authors observe that an ordinary desktop computer generates more than one million events per day, and a server may generate 10 to 100 times that volume.
    Conclusion: Reducing the data volume is key to solving the scalability problem.

    2.3 Data Reduction Insights

    For each “key event” there exists a series of “shadowed events” whose causal relations to other events are negligible in the presence of the “key event”. Therefore, we can significantly reduce the data volume while keeping the causal dependencies intact, by merging or summarizing the information in “shadowed events” into “key events” while preserving the causally relevant information in the latter.
    (Each key event has many shadowed events; once the key event is present, the shadowed events can be ignored, so they can be merged to reduce the data volume.)


    2.3.1 Low-loss Dependency Approximation

    With further study of our data, we discover that several applications (mostly system daemons) tend to produce intense bursts of events that appear semantically similar, but are not reducible by perfect causality-preserving reduction, due to interleaved dependencies.

    For example, process “pcscd” repeatedly interleaves read/write access to a set of files. We name this type of workload “iBurst.” (Some programs emit intense bursts of semantically similar events in a short time; for such programs, data reduction cannot perfectly preserve causality.)

    With iBurst, data reduction is only possible with a certain level of loss in causality. (So the authors accept reduction at the cost of a small causality loss.)

    As shown in the figure above, we thus devise a method to detect an iBurst and apply a well-controlled dependency approximation reduction, which ensures that causality loss only impacts events within the iBurst. (The loss of causal information is confined to the burst itself.)

    By ignoring the causal relationship among all events within the iBurst, event eCA−2 is considered approximately shadowed by eCA−1, even though they are interleaved by eAD−1. However, eAB−1 and eAB−2 must be kept as independent key events because they are interleaved by eBF−1, which does not belong to the iBurst.
    (All causal information inside the burst is ignored when deciding key/shadow relations: events interleaved with an event outside the burst are kept; events interleaved only by events inside the burst are merged away.)


    3.1 Definitions


    3.2 CPR

    Causality Preserved Reduction (CPR). The core idea is to aggregate all those events that are aggregable and share the same trackability.
    For two events e1 and e2 from entity u to entity v, they have the same backward trackability if none of the time windows of the incoming events of u overlaps with the time window [te(e1), te(e2)]. (If no incoming event of u overlaps the span from e1's end time to e2's end time, the two events can be merged.)
    They have the same forward trackability only if none of the outgoing events of entity v has a time window overlapping the window between the start times of e1 and e2. (If no outgoing event of v overlaps the span from e1's start time to e2's start time, the two events share the same forward trackability and can be merged.)

    If two events can be aggregated, we aggregate the event with the later start time (the later event) into the event with the earlier start time (the former event) by extending the end time of the former event to the end time of the later event, and then discard the later event. (When two events merge, the earlier one is kept, the later one is discarded, and the kept event's time window runs from the earlier start time to the later end time.)
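The trackability check and the merge rule can be sketched in a few lines. The names (`Event`, `same_trackability`, `cpr_merge`) are illustrative, not from the paper, and the full algorithm's bookkeeping over the whole event stream is elided.

```python
from dataclasses import dataclass

@dataclass
class Event:
    src: str      # source entity u
    dst: str      # destination entity v
    start: float  # ts(e)
    end: float    # te(e)

def overlaps(window, events):
    # True if any event's [start, end] span intersects the (lo, hi) span.
    lo, hi = window
    return any(e.start < hi and e.end > lo for e in events)

def same_trackability(e1, e2, incoming_u, outgoing_v):
    # Backward: no incoming event of u may overlap [te(e1), te(e2)].
    # Forward:  no outgoing event of v may overlap [ts(e1), ts(e2)].
    return (not overlaps((e1.end, e2.end), incoming_u)
            and not overlaps((e1.start, e2.start), outgoing_v))

def cpr_merge(e1, e2):
    # Keep the earlier event; extend its window to the later event's end.
    e1.end = e2.end
    return e1

e1 = Event("u", "v", 0.0, 1.0)
e2 = Event("u", "v", 2.0, 3.0)
if same_trackability(e1, e2, incoming_u=[], outgoing_v=[]):
    merged = cpr_merge(e1, e2)  # one event covering [0.0, 3.0]
```

An incoming event of u whose window intersects [te(e1), te(e2)] would make the two events backward-distinguishable, and the check above would then refuse the merge.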


    Before an attack is revealed, we cannot know which events will be selected as POI events; therefore, the approach must work equally well on all events, whether or not they will later be selected as POIs. (The aggregation algorithm must not depend on knowing the POIs in advance.)


    3.3 Process-centric Causality Approximation Reduction

    Process-centric Causality Approximation Reduction (PCAR) aims to reduce data from intensive bursts of events with interleaved dependencies (i.e., the iBurst workloads described earlier).


    1. Define hot processes: a hot process can be detected with a simple statistical calculation over a sliding time window. If the number of events related to a process within a time window exceeds a certain threshold, the process is marked as hot.
    2. Once a hot process is detected, we collect all objects involved in the interactions and form a neighbour set N(u), where u is the hot process. We name this set the ego-net.
    3. We then check trackability only against the information flow into and out of the neighbour set N(u). The checking procedure is the same time-window overlap check as in CPR. This ensures that as long as no event inside the ego-net is selected as a POI event, we can achieve high-quality tracking results. (The neighbourhood is treated as a single unit; only the information flowing into and out of it is examined.)
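Step 1 above (hot-process detection with a sliding window) can be sketched as follows; the window length and threshold are illustrative values, not the paper's.

```python
from collections import deque

def detect_hot(timestamps, window=1.0, threshold=100):
    # Slide a time window over one process's event timestamps; the process
    # is "hot" once the event count inside any window exceeds the threshold.
    recent = deque()
    for t in sorted(timestamps):
        recent.append(t)
        while recent[0] < t - window:  # evict events outside the window
            recent.popleft()
        if len(recent) > threshold:
            return True
    return False

# A burst of 200 events within 0.2 seconds trips the detector:
print(detect_hot([i * 0.001 for i in range(200)]))  # True
```

Fifty events spread over fifty seconds would never put more than a couple of events in any one-second window, so that process stays cold.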

    3.4 Domain Knowledge Reduction Extension

    3.4.1 Special files

    In a Linux system, many system entities are treated as files and interact with processes via file reads/writes.

    Plain files: files whose data is stored on disk (i.e., ordinary file storage).

    Special files: files that are only abstractions of system resources.

    1. By contrast, the interactions between processes and special files may involve more complex behaviours and implicit information flows. The reason is that the files under /proc are mappings of kernel information, and writing to or reading from them involves complex kernel activities that cannot be treated as a simple read/write, resulting in no explicit information flow. (Process operations on special files are complex kernel activities rather than plain reads/writes, so they produce no explicit information flow.)

    2. Based on such domain knowledge, we further integrate a special-files filter into our approach to remove events related to those special files that will not introduce any explicit information flow. (A filter removes events on special files that produce no information flow.)

    3.4.2 Temporary files

    A temporary file is defined as a file that is touched by only one process during its lifetime. Since a temporary file exchanges information with only one process, it does not introduce any explicit information flow in attack forensics either, and therefore all events on temporary files can be removed from the data. (A file that only ever interacts with a single process produces no information flow between entities, so its events can be dropped.)
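The temporary-file filter amounts to a simple two-pass scan. The event representation (bare `(process, file)` pairs) and the function name are illustrative, not from the paper.

```python
from collections import defaultdict

def drop_temporary_file_events(events):
    # events: (process, file) pairs. A file touched by exactly one process
    # over the whole trace is "temporary", and all its events are dropped.
    touched_by = defaultdict(set)
    for proc, f in events:
        touched_by[f].add(proc)
    return [(p, f) for p, f in events if len(touched_by[f]) > 1]

events = [("bash", "/tmp/x"), ("bash", "/etc/passwd"), ("sshd", "/etc/passwd")]
kept = drop_temporary_file_events(events)  # the /tmp/x event is removed
```

Note the first pass must see the file's entire lifetime before any event is dropped, which is why this filter is applied over a complete trace rather than online.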


    The authors conduct a series of experiments on real-world data collected from a corporate R&D laboratory to demonstrate the effectiveness of the data reduction approach in terms of data processing, storage capacity improvement, and support for forensic analysis.

    4.1 Data collection

    The data is collected from a corporate research laboratory; one month of data is selected for the study.
    The logs are collected from more than 30 machines with various server models and operating systems, and all collected data is stored on a daily basis.

    4.2 Data reduction

    4.2.1 Overall effectiveness

    1. In the first phase, we only apply CPR.
    2. In the second phase, we further apply PCAR.
    3. Finally, in the third phase, we apply the domain knowledge reduction extension.


    In total we collected data from 31 hosts, 18 of which have Linux as operating systems and 13 of which have Windows as operating systems.


    1. On average, CPR achieves a reduction ratio of 56% (i.e., it reduces the data size by 56%), increasing the data processing and storage capacities by 2.27 times (2.63 on Linux and 1.61 on Windows, respectively).
    2. Applying CPR+PCAR raises the overall reduction ratio to 70%, a 3.33-times growth in data processing and storage capacities (4 on Linux and 2.44 on Windows, respectively).
    3. Finally, after applying the domain knowledge reduction extension, the reduction ratio reaches 77%, which increases the data processing and storage capacities by 4.35 times (5.26 on Linux and 2.56 on Windows, respectively).
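The capacity figures quoted above follow directly from the reduction ratios: removing a fraction r of the data lets the same storage and processing budget cover 1/(1−r) times as much raw data.

```python
def capacity_factor(r):
    # A reduction ratio r means the same storage/processing budget
    # now covers 1 / (1 - r) times as much raw data.
    return 1 / (1 - r)

print(round(capacity_factor(0.56), 2))  # 2.27  (CPR alone)
print(round(capacity_factor(0.70), 2))  # 3.33  (CPR + PCAR)
print(round(capacity_factor(0.77), 2))  # 4.35  (+ domain knowledge)
```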

    The proposed approach is clearly effective at reducing data logs and improving data processing and storage capacities.

    The benefit gained varies across hosts, because different hosts run different workloads and the system's effectiveness depends on the workload.

    4.2.2 Break-down analysis (meaning a per-workload breakdown, not failure analysis)

    Since the effectiveness of our system is sensitive to different workloads, we need to conduct a break-down analysis to scrutinize how our system works on different workloads.


    1. From the table we can see that CPR works well on most workloads. However, it works poorly on system daemons, for the reason explained earlier: daemon processes generally perform tasks that generate intensively interleaved events (iBurst-like behavior). That is also why PCAR works very well on this type of workload.
    2. Office workloads generate considerable numbers of temporary files, which is why the domain knowledge reduction extension works best on them.
    3. File sharing generates many interactions with temporary files (logs), so domain knowledge filtering significantly improves the reduction ratio.
    4. The workload on which the approach is least effective is communication applications; however, these contribute little to the overall workload in an enterprise environment.
    5. Databases, on the other hand, can be one of the majority workloads in certain circumstances, and the approach works less effectively on them. Still, the system achieves a reduction ratio of more than 40% and increases the data processing and storage capacities by 1.67 times.

    4.2.3 Naïve aggregation

    This experiment is interesting. It rests on a heuristic observation: events that appear within a short period of time between two system entities tend to share similar behaviours. Thus, in this naive approach, events are blindly aggregated within a fixed time window, without considering any dependency.

    Such a naive approach provides a baseline of reduction power to compare against. (Though not principled, it serves as a yardstick for how useful the proposed method is.)

    1. First, we fix the time window to 10 seconds.
    2. Second, we set the time window to unlimited (i.e., we aggregate all aggregatable events in the data). This second naive aggregation is an upper bound on the data reduction power of any reduction approach that removes no entities and uses the same event-aggregability definition as ours. (The unlimited-window case is the ceiling for such reduction techniques: everything aggregatable is merged, with no regard for the dependency loss this incurs.)
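The fixed-window case (1) might look like the following sketch, with illustrative names; passing an infinite window reproduces case (2), where every later event for a pair is absorbed into the first.

```python
def naive_aggregate(events, window=10.0):
    # events: (src, dst, start_time) tuples sorted by start_time.
    # Merge an event into the previously kept event for the same
    # (src, dst) pair whenever their start times are within `window`,
    # ignoring dependencies entirely.
    merged, last = [], {}
    for src, dst, t in events:
        key = (src, dst)
        if key in last and t - merged[last[key]][2] <= window:
            continue  # absorbed into the earlier event
        last[key] = len(merged)
        merged.append((src, dst, t))
    return merged

events = [("a", "b", 0.0), ("c", "d", 1.0), ("a", "b", 5.0), ("a", "b", 12.0)]
kept = naive_aggregate(events)  # (a, b, 5.0) absorbed; (a, b, 12.0) survives
```

Unlike CPR, nothing here checks the incoming/outgoing events of the entities, which is exactly why this baseline can destroy causal dependencies.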


    4.3 Support of forensic analysis

    This section evaluates the quality of forensic analysis results. Since forward-tracking is the opposite of backtracking, the evaluation focuses on backtracking.

    The connectivity reflects the quality of the backtracking. Multiple aggregatable events contribute only one connectivity (aggregatable events collapse into a single connection).


    An enlarged result graph increases the difficulty of tracing the root cause of an anomaly. However, as the false-positive rate of the approach is very low, the impact on forensic analysis is negligible.

    To further investigate the impact on attack forensics and compare with the naive aggregation, 20,000 POI events are randomly selected from the data and backtracking is applied, as shown in the figure below.


    4.4 Runtime Performance

    1. The memory consumption remains under 2GB, which is a small overhead for any commercial server.
    2. As for CPU consumption, a single 2.5 GHz core can easily handle such a data rate, so the CPU overhead is also minor.


    5.1 Possibility of evasion

    1. Since CPR guarantees that no causal dependency is lost after data reduction, an attacker simply cannot exploit CPR to distort reality by any means.
    2. Although PCAR does incur dependency loss during data reduction, it is very difficult to exploit that loss to cover any meaningful malicious activity.

    5.2 Generality

    CPR and PCAR do not alter data characteristics with respect to platform, instrumentation, or high-level semantics, and thus can function as an independent data reduction layer, applied either before or after other data reductions. (They do not conflict with other reduction algorithms.)


    1. LogGC (which I have read before): some system objects (such as temporary files) are isolated, have short lifespans, and have little impact on dependency analysis, so those objects and their events can be garbage-collected to save space. We found that our approach offers comparable reduction to LogGC. Furthermore, the two approaches are orthogonal:
       LogGC focuses on object lifespan, while this work focuses on causal equivalence of events. The two approaches complement one another and can be combined.

    2. ProTracer [34] proposes a new system that aims to reduce log volume for provenance tracing (i.e., it requires deploying a new system), whereas this approach can be applied to existing audit systems without any modification, and the reduced data also retains the potential to be used by applications other than forensic analysis.

    7. Conclusion

    We presented a novel approach that exploits dependency among system events to reduce the size of the data without compromising the accuracy of forensic analysis.

    Evaluated over datasets gathered from a real-world enterprise environment, the results show that the approach improves space capacity by 3.4 times with little or no accuracy compromise. Although the overall space and computational requirements for large-scale data analysis remain challenging, the authors hope this data reduction approach will bring forensic analysis in a big data context closer to becoming practical.

  • This book is about how to integrate full-stack open source big data architecture and how to choose the correct technology―Scala/Spark, Mesos, Akka, Cassandra, and Kafka―in every layer. Big data ...
  • As you advance through the book, you will wrap up by learning how to create a single pane for end-to-end monitoring, which is a key skill in building advanced analytics and big data pipelines. What ...
  • By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics with a feel that no amount of data is too big. Style and approach ...
  • introduces what big data is
  • If you are a Big Data architect, developer, or a programmer who wants to develop applications/frameworks to implement real-time analytics using open source technologies, then this book is for you. ...
  • Big data for dummies

    2019-03-14 10:29:06
    What is the architecture for big data? How can you manage huge volumes of data without causing major disruptions in your data center? ✓ When should you integrate the outcome of your big data ...
  • Utilize this practical and easy-to-follow guide to modernize traditional enterprise data warehouse and business intelligence environments with next-generation big data technologies. Next-Generation ...
  • Beginning Big Data with Power BI and Excel 2013 is designed for anyone who uses data analytics to make business decisions, especially those in small and medium-sized businesses. Table of Contents ...
  • This book shows you many optimization techniques and covers every context where Pig is used in big data analytics. Beginning Apache Pig shows you how Pig is easy to learn and requires relatively ...


