• what is big data?

    2019-09-13 04:15:47
    link: http://opensource.com/resources/big-data

    Big data: everyone seems to be talking about it, but what is big data really? How is it changing the way researchers at companies, non-profits, governments, institutions, and other organizations are learning about the world around them? Where is this data coming from, how is it being processed, and how are the results being used? And why is open source so important to answering these questions?

    In this short primer, learn all about big data and what it means for the changing world we live in.

    What is big data?

    There is no hard and fast rule about exactly how large a database needs to be for the data inside it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools to process it: programs that span multiple physical and/or virtual machines, working together in concert to process all of the data in a reasonable span of time.

    Getting programs on multiple machines to work together efficiently, so that each one knows which components of the data to process, and then combining the results from all of the machines to make sense of a large pool of data, takes special programming techniques. Since it is typically much faster for programs to access data stored locally than over a network, the distribution of data across a cluster, and the way those machines are networked together, are also important considerations in big data problems.
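    As a toy sketch of that idea (plain Python on a single machine, for illustration only, not a real cluster framework), a dataset can be split into chunks, each chunk handed to a separate worker process, and the partial results combined at the end:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Each worker processes only its own slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the data into four chunks, one per worker process.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # Combine the partial results from all workers into one answer.
    print(sum(partial_results))  # 499999500000
```

    A real distributed system must additionally worry about moving the chunks to where the workers live (or, better, moving the computation to where the data already is), which is exactly the data-locality concern described above.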

    What kind of datasets are considered big data?

    The uses of big data are almost as varied as they are large. Prominent examples you're probably already familiar with include social media networks analyzing their members' data to learn more about them and connect them with content and advertising relevant to their interests, or search engines looking at the relationship between queries and results to give better answers to users' questions.

    But the potential uses go much further! Two of the largest sources of data in large quantities are transactional data, including everything from stock prices to bank data to individual merchants’ purchase histories; and sensor data, much of it coming from what is commonly referred to as the Internet of Things (IoT). This sensor data might be anything from measurements taken from robots on the manufacturing line of an auto maker, to location data on a cell phone network, to instantaneous electrical usage in homes and businesses, to passenger boarding information taken on a transit system.

    By analyzing this data, organizations are able to learn trends about the data they are measuring, as well as the people generating this data. The hope for this big data analysis is to provide more customized service and increased efficiency in whatever industry the data is collected from.

    How is big data analyzed?

    One of the best known methods for turning raw data into useful information is what is known as MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. It serves as a model for how to program, and the name is often also used to refer to actual implementations of this model.

    In essence, MapReduce consists of two parts. The Map function does sorting and filtering, taking data and placing it inside of categories so that it can be analyzed. The Reduce function provides a summary of this data by combining it all together. While largely credited to research which took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
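    As a minimal illustration of those two phases (plain single-machine Python, not a distributed implementation), here is a word count expressed in the MapReduce style: the map step emits a (word, 1) pair for every word, and the reduce step combines the pairs into a per-word summary:

```python
def map_phase(documents):
    """Map: emit a (word, 1) pair for every word (the sorting/filtering step)."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: combine the pairs into a summary of counts per word."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

docs = ["Big data is big", "data is everywhere"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

    In a real MapReduce system, many machines run the map function over different slices of the input, the pairs are shuffled so that all pairs for the same key land on the same machine, and the reduce function runs in parallel per key.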

    What tools are used to analyze big data?

    Perhaps the most influential and established tool for analyzing big data is Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

    • The Hadoop Distributed File System (HDFS), which is a distributed
      file system designed for very high aggregate bandwidth;
    • YARN, a platform for managing Hadoop’s resources and scheduling
      programs which will run on the Hadoop infrastructure;
    • MapReduce, as described above, a model for doing big data processing;
    • And a common set of libraries for other modules to use.

    To learn more about Hadoop, see our Introduction to Apache Hadoop for big data.

    Other tools are out there too. One which has been receiving a lot of attention recently is Apache Spark. The main selling point of Spark is that it stores much of the data for processing in memory, as opposed to on disk, which for certain kinds of analysis can be much faster. Depending on the operation, analysts may see results a hundred times faster or more. Spark can use the Hadoop Distributed File System, but it is also capable of working with other data stores, like Apache Cassandra or OpenStack Swift. It's also fairly easy to run Spark on a single local machine, making testing and development easier.

    For more on Apache Spark, see our collection of articles on the topic.

    Of course, these aren’t the only two tools out there. There are countless open source solutions for working with big data, many of them specialized to provide optimal features and performance for a specific niche or for specific hardware configurations. And as big data continues to grow in size and importance, the list of open source tools for working with it will certainly continue to grow alongside.


  • Big data for dummies

    2019-03-14 10:29:06
    What is the architecture for big data? How can you manage huge volumes of data without causing major disruptions in your data center? When should you integrate the outcome of your big data ...
  • Understanding Big Data

    2014-03-22 23:03:41
    Chapter 1 What Is Big Data? Hint: You’re a Part of it Every Day Chapter 2 Why is Big Data Important? Chapter 3 Why IBM for Big Data? Part II: Big Data: From the Technology Perspective Chapter 4 All ...
  • Big Data

    2013-02-28 11:08:51
    What is "Big data"?
    The amount of data, and the rate of creation of data, in the world is increasing at unprecedented levels. These huge, less structured data sets from non-traditional sources are big data.

    Wikipedia defines big data as:
    In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization.

    This infographic from Big Data | Visual.ly should be a good start.

    Why is it important?
    This explosion of data, and the analysis of these large datasets or "big data," has become crucial to innovate, compete, and get an edge over the competition. It can also give great insight into what is happening at very low levels, at resolutions not possible before. On a side note, the big data industry is poised to grow to $25 billion by 2015 and to a $50 billion industry by 2017.

    It is set to affect almost all industries.

    What can analysis of big data do?
    Earlier, enterprises relied mostly on transactional data stored in an orderly fashion; however, this has changed. A lot of data is generated from people-centric sources, which can be anything from email to posts and tweets. Organizations used to discard this data, but with the cost of storage and computing falling, analysis of big data has become affordable, and even mandatory.

    The data also has a short life span, and the difference between good and bad information is often a matter of processing the data stream within seconds.

    For instance, there are 12 terabytes of tweets a day, and after filtering out the noise, this data can give a lot of insight into consumer behavior in multiple areas; in short, it can predict the future to various degrees.

    How different communities interact is shown in the visual above.

    Is it worth the hype?
    There is a lot of hype surrounding big data, and most of it is real and deserved. Almost all the big players have come out with their own solutions and products to tackle these new challenges.

    What are the challenges?
    • Sanitizing the data: which data sources can you trust? Which data is current?
    • Processing the data in a timely manner: the data may not be worth a cent after a few hours, so processing it in time is crucial.
    • Handling the sheer volume and variety of the data.

  • The pictures are from the white board in my office – you might have heard of Henry and Albert, but Rusty is the VP of Enterprise Data Services and Big Data at Regions Bank. He made that statement on a call several months ago, and it's been on my white board ever since. The quotes remind me every day that all companies are facing huge issues from the explosion of Big Data that is breaking traditional architectures. Yet while this disruption presents great opportunity, to get new insights that are transformative to the business, the existing architecture and way of doing things have to change; more of the same has already proven not to be the right answer.

    Syncsort, Tableau and Cloudera delivered a webinar, and Paul Lilford, Matt Brandwein, Jorge Lopez and I worked together for several months building the content; at every meeting it was interesting how all of us were seeing exactly the same disruption from different viewpoints. I began my career in data warehousing 15 years ago, and since then I've helped design, build and implement not just data warehouses but also the products and tools, like ETL, that are used to create them, around the world and across every vertical. But what's completely blown my mind over the last 18 months is the incredible disruption Hadoop has had on traditional data architectures, and from what I can tell that's just a small taste of what's to come.

    Early data warehouses were insanely successful: they exposed the fact that business users have an unquenchable thirst for more information, and once they are hooked they just want more and more. The closest comparison is that trusted data for business users is like a drug addiction. They always want more data, more frequently, of a higher quality and with fewer restrictions, all while dealing with a limited budget.

    Fast forward to today, and data warehouse and business intelligence infrastructure has become a victim of its own success. Like a teenager hitting adolescence, warehouses have experienced massive growing pains, while at the same time the data they provide has become such a critical part of corporate infrastructure that, for example, many quarterly reports are based on data from them.

    If you're delivering your financial results based on the data warehouse, then you have to trust and govern that information, or your executives could go to jail. As a result, business intelligence projects have continued to receive funding despite growing costs; Gartner estimates the average cost of just the data integration component at between $500K and $1M USD.

    So what are the essential elements of the new information architecture that liberates organizations' access to data insights?

    1)  You need an enterprise data hub (EDH), which serves a number of roles:

    1. It provides an active archive (the ultimate enterprise data staging area) for all data and data types, cost-effectively retained in full fidelity for any time period, automatically providing a compliance archive.
    2. It provides a single place to transform data from its raw format into structured formats for consumption by existing systems like data warehouses and marts. With tools like DMX-h, you can bring in legacy sources, like mainframe data that was previously unavailable to the warehouse/reporting tools, and combine that data with other sources in full fidelity.
    3. Business users get agile access to specific data sets to explore and analyze directly, reducing the business intelligence request backlog and freeing capacity on existing systems.
    4. You can bring new analytic workloads to the EDH, including products like Syncsort DMX-h running directly IN it and Tableau running AGAINST it, generating greater insight and more value from data, driving revenue and profit.

    2)  Migrate data staging. ELT (ETL running as SQL in the RDBMS) is an EDW resource hog, consuming up to 80% of storage/workload in some cases. By collecting, processing and distributing data using an EDH (ETL on Hadoop), you can reduce costs, free up resources, and more. Today, warehouse databases have less capacity to run end-user queries and reports because of ELT and dormant data, so instead of keeping the full data volume, "data retention windows" became necessary. In addition, the maintenance nightmare this creates leads to a spaghetti-like architecture that customers call the onion effect, because new layers of ELT SQL are added around the existing ones: nobody wants to (or knows how to) change the existing code without breaking it, and if changes do have to be made, everyone involved ends up crying. TDWI estimates it takes upwards of eight weeks to add a single column to a table, and I've regularly seen six-month wait times for adding a new column or a new data source.

    3)  You need a tool like Tableau to provide self-service connection to, and visualization of, the data, not just for business users doing reporting, but for data scientists who want to explore and analyze data and discover new insights that may even lead to new report requests from the warehouse.

    All these topics are covered in the webinar, but if you take one thing away from reading this, please remember: it's not just Rusty's warehouse that has become like a barge ship, and it's not a bad thing either. Data warehouses are great at solving a specific problem, but to discover new insight we need to do something different. Ultimately, successful customers are deploying an enterprise data hub downstream of the warehouse to enhance the staging area, the dirty secret of every data warehouse. They are definitely not getting rid of the RDBMS, and if anything the capacity it's releasing there is being used to solve new problems.

    This new paradigm delivers a seamless end-to-end solution from data to insight and it’s evolving rapidly to become easier and less expensive. Don’t be held back by “old school” notions of how to solve for the explosion of data. What has worked in the past is not the answer for today. Just like me, every day remind yourself of the wise words of Henry, Albert and Rusty.


    See more at: http://vision.cloudera.com/a-journey-from-data-warehousing-to-big-data-insights-what-i-learnt-from-henry-ford-albert-einstein-and-rusty-sears/
  • type(temp2) returns saspy.sasdata.SASdata. Now I want to convert temp2 to be a Pandas Dataframe. I use temp2_1=temp2.to_df(method='CSV'). It takes a long while. Are there...
  • introduce what is big data
  • big data now

    2017-11-12 07:36:45
    Mike Loukides kicked things off in June 2010 with “What is data science?” and from there we’ve pursued the various threads and themes that naturally emerged. Now, roughly a year later, we can look...
  • What can be done to improve the performance if the JSON input provided is big? Right now it is taking almost 10-15 secs to load the builder. (This question comes from the open-source project: mistic100/jQuery-...)
  • As you advance through the book, you will wrap up by learning how to create a single pane for end-to-end monitoring, which is a key skill in building advanced analytics and big data pipelines. What ...
  • Big Data for Dummies

    2013-05-16 22:30:26
    Big data management is one of the major challenges facing business, industry, and not-for-profit organizations. Data sets such as customer transactions for a mega-retailer, weather patterns monitored ...
  • aws big data

    It is a well-known fact that s3 + Athena is a match made in heaven but since data is in S3 and Athena is serverless, we have to use GLUE crawler to store metadata about what is contained within those S3 locations.


    Even small master tables, metrics tables or daily incremental transactional data with Schema changes must be crawled to create a table on Athena.


    In the beginning, my team and I used to write Python scripts which uploaded the CSV files to S3 and then triggered a Lambda function, which would invoke the relevant Crawler and create/update the table on Athena. This was a troublesome and tedious process. That is when "AWS Data Wrangler" came to the rescue.


    In this article, I want to focus on using data wrangler to read data from Athena → transform the data → create a new table out of the transformed dataframe directly on Athena.


    NOTE: AWS Data Wrangler mirrors pandas but is custom-tailored for AWS.


    NO crawler == NO hassle


    This can be achieved both from your local machine and from a Glue Python shell.


    Before we start the implementation, make sure Data Wrangler is installed using:


    pip install awswrangler

    Also, verify that appropriate S3 bucket and Glue table access has been given to the role or user. Data Wrangler internally uses boto3 to perform actions.


    import awswrangler as wr

    name = wr.sts.get_current_identity_name()
    arn = wr.sts.get_current_identity_arn()

    The sample test code looks something like this:


    # Python lib path for the Glue Python shell job

    # If you are using the Glue Python shell, make sure you add this Python lib path every time you create a new job. The latest release can be found here.


    # AWS Data Wrangler: read data from Athena

    import awswrangler as wr
    import pandas as pd
    import numpy as np
    #databases = wr.catalog.databases()
    df = wr.athena.read_sql_query("""
    SELECT *
    FROM table_name limit 100
    """, database="test_database")
    df = df.melt(id_vars=['col1','col2','col3'])
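    The melt call above reshapes the wide query result into long format. As a minimal, self-contained sketch of what that does (using plain pandas and hypothetical column names, not the real table):

```python
import pandas as pd

# A small wide-format frame standing in for the Athena query result
# (column names here are illustrative only).
df = pd.DataFrame({
    "col1": [1, 2],
    "metric_a": [10, 20],
    "metric_b": [30, 40],
})

# melt turns each metric column into (variable, value) rows,
# keeping col1 as the identifier column.
long_df = df.melt(id_vars=["col1"])
print(long_df)
```

    Each non-id column becomes a row per identifier, so two metric columns over two rows yield four rows with columns col1, variable, value.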

    # AWS Data Wrangler: write data to Athena as a table

    Using Data Wrangler you can read data of any type (CSV, Parquet, Athena query results, etc.) from anywhere (local or Glue) as a pandas dataframe, write it back to S3 as an object, and create a table on Athena at the same time. For instance, the following code snippet does just that:


    res = wr.s3.to_parquet(
        df=df,  # the dataframe to write
        path=f"s3://xxxxxxx/aws-wrangler-test",  # s3 path where you want to dump
        database="test_database",  # <your database name>
        table="table_test_wrangler",  # <your new table name>
        dataset=True,  # store as a dataset so the Glue/Athena table is created
        description="testing data",
    )

    More complex features like partitioning, casting and catalogue integration can be achieved with this to_parquet API. For an exhaustive list of parameters this method supports, click here.


    奖励:雅典娜缓存 (BONUS: Athena cache)

    When calling read_sql_query, instead of just running the query, we can now check whether the query has been run before. This is disabled by default and can be enabled by passing a max_cache_seconds value greater than 0.


    When max_cache_seconds > 0, if the query string matches a query run within the prescribed window, Wrangler will return the cached S3 objects instead of re-running the query, provided they are still available in S3.


    df = wr.athena.read_sql_query(query, database="test_database", max_cache_seconds=900)

    AWS claims this increases performance more than 100x, but it must be used with caution: the string must exactly match a previous query run in the last 900 seconds (15 minutes), per the max_cache_seconds parameter limit set here.


    The detailed approach to implementing the Athena cache can be found here.


    Translated from: https://medium.com/@dheerajsharmainampudi/getting-started-on-aws-data-wrangler-and-athena-7b446c834076


  • Big Data Architect’s Handbook is for you if you are an aspiring data professional, developer, or IT enthusiast who aims to be an all-round architect in big data. This book is your one-stop solution ...
  • But what will set you apart from the rest is actually knowing how to USE big data to get solid, real-world business results – and putting that in place to improve performance.Big Data will give you ...
  • Big Data Analytics Using Splunk opens the door to an exciting world of real-time operational intelligence.Built around hands-on projects Shows how to mine social media Opens the door to real-time ...
  • Big data (Hadoop)

    2011-11-21 12:44:39
    What is the motivation behind Big Data? • Introducing Hadoop
  • This book is about how to integrate full-stack open source big data architecture and how to choose the correct technology―Scala/Spark, Mesos, Akka, Cassandra, and Kafka―in every layer. Big data ...
  • Big Data Analysis and Mining

    2018-05-29 13:05:33
    Chapter 1 Introduction ...1.1 What is Big Data: Answer: used to describe massive structured and unstructured data that is so large that it is difficult to process using traditional database and soft...
  • Discovering Big Data’s fundamental concepts and what makes it different from previous forms of data analysis and data science Understanding the business motivations and drivers behind Big Data ...
  • Working with big data

    2020-12-27 10:46:16
    I'm working with Big Data and I saved numpy arrays in batches to my folder, because I can't store the array fully in my RAM while training. However, as I see, the library needs to load the dataframe...
  • See a Mesos-based big data stack created and the components used. You will use currently ... It is also for anyone interested in Hadoop or big data, and those experiencing problems with data size.
  • Big Data Bootcamp explains what big data is and how you can use it in your company to become one of tomorrow's market leaders. Along the way, it explains the very latest technologies, companies, and ...
  • What Are the Strategy and Business Applications of Big Data ?. Practice Question… Chapter 5… Big Data Platforms and Operating Tools. Practice Questions… Chapter 6… Big Data End User and Accounting...
  • Your business generates reams of data, but what do you do with it? Reporting is only the beginning. Your data holds the key to innovation and growth – you just need the proper analytics. In Big Data,...
  • Timely and bigdata

    2020-12-02 20:57:08
    I have a few weeks ahead of me to try and implement some bigdata/rust toolbox, and I'm trying to assess how timely could help me. I am toying right now with ...
  • The best-selling author of Big Data is back, this time with a unique and in-depth insight into how specific companies use big data.Big data is on the tip of everyone's tongue. Everyone understands ...
  • The data is often too big for localStorage. Sometimes this data can be >10MB and can't easily be split into smaller parts. For example, we may need to send a single image over the wire that...
  • Discussion: Storing Big Data

    2020-12-02 04:16:14
    I've been thinking a bit about the future of big data in Electron Microscopy. For the uninitiated: due to increases in detector resolution and the increasing use of time-resolved EM, 4D STEM and ...
  • What's more, Big Data Analytics with Spark provides an introduction to other big data technologies that are commonly used along with Spark, like Hive, Avro, Kafka and so on. So the book is self-...


