
Introduction to Big Data







About Big Data




What is Big Data?

The modern world has many big problems, and one of them is Big Data. Today, data collection is very important: it is the key to the success of a company. But as the number of users increases day by day, the data keeps growing larger and larger.


Some of the companies that acquire enormous amounts of data on a daily basis are Google, *, Twitter, and Instagram. People all around the world post images and other content every day. For example, * generates 4 petabytes of data per day. See the stats below.


Stats: Activity per Minute

Here are some per-minute activity figures for various social networks and online services:


Snapchat: Over 527,760 photos shared by users

LinkedIn: Over 120 professionals join the network

YouTube: 4,146,600 videos watched

Twitter: 456,000 tweets sent or created

Instagram: 46,740 photos uploaded

Netflix: 69,444 hours of video watched

Giphy: 694,444 GIFs served

Tumblr: 74,220 posts published

Skype: 154,200 calls made by users


So how do these companies manage all this data? The answer is by using a combination of massively parallel systems.


The concept used to handle Big Data is the distributed system. To understand distributed systems, we first need to understand another concept called IOPS.


What is IOPS?

IOPS stands for input/output operations per second. It is a unit for measuring the performance of storage devices: it represents how many read and write operations a given storage device or medium can complete each second. When writing data to disk, we don't write it byte by byte; we write it in blocks. Block sizes differ from system to system. SQL Server, for example, allocates space in 64 KB extents, whereas NTFS on Windows Server uses 4 KB clusters by default. To get a better feel for IOPS, compare SSDs and HDDs. We know that SSDs are faster than HDDs: an SSD typically delivers roughly 3,000 to 40,000 IOPS, whereas an HDD delivers roughly 55 to 80.
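The link between IOPS and block size can be made concrete with a little arithmetic: throughput is roughly IOPS multiplied by block size. Below is a minimal sketch in Python. The function name `throughput_mb_per_s` is made up for illustration, and the inputs are simply the IOPS ranges quoted above with a 4 KB block size, not measurements of any real device.

```python
def throughput_mb_per_s(iops: int, block_size_kb: int) -> float:
    """Approximate throughput in MB/s as IOPS x block size (1 MB = 1024 KB)."""
    return iops * block_size_kb / 1024


# Assumed inputs: the IOPS ranges quoted in the text, with 4 KB blocks.
for device, iops in [("HDD (low)", 55), ("HDD (high)", 80),
                     ("SSD (low)", 3000), ("SSD (high)", 40000)]:
    print(f"{device}: {throughput_mb_per_s(iops, 4):.2f} MB/s")
```

With small blocks, a device limited to a few dozen operations per second moves well under 1 MB/s, which is why IOPS matters as much as raw capacity.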


What is a Distributed System?

A distributed system, also known as distributed computing, is a system with multiple components located on different machines that communicate and coordinate actions in order to appear as a single coherent system to the end-user.

















Let us consider a storage appliance. Suppose we have 40 TB of data, and that writing 40 TB to a single disk takes 40 minutes. If we split the data into 10 TB blocks and write them to 4 disks at the same time, it takes 10 minutes in total. If we instead split the 40 TB into 5 TB blocks and store them on 8 disks at the same time, it takes 5 minutes to copy all the data. From this we can say that spreading the data in smaller pieces across more disks is faster than writing a large amount of data to one large disk. This concept is called parallelisation, and distributed systems are built on it. It applies not only to storage: compute power and many other services use the same concept.
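The arithmetic in this example can be sketched in a few lines of Python. This is an idealised model, not a benchmark: it assumes every disk writes at the same rate (1 minute per TB, derived from the 40 TB in 40 minutes above), that the data splits evenly, and that there is no coordination overhead. The function name `parallel_write_minutes` is hypothetical.

```python
def parallel_write_minutes(total_tb: float, disks: int,
                           minutes_per_tb: float = 1.0) -> float:
    """Time to write total_tb split evenly across disks that write in parallel."""
    if disks < 1:
        raise ValueError("need at least one disk")
    per_disk_tb = total_tb / disks       # each disk receives an equal share
    return per_disk_tb * minutes_per_tb  # all disks finish at the same time


for disks in (1, 4, 8):
    print(f"{disks} disk(s): {parallel_write_minutes(40, disks)} minutes")
```

In a real system the gain is smaller than this model suggests, because splitting the data and coordinating the writes takes time as well.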


Main Benefits of Distributed Systems

Horizontal scalability: Since computing happens independently on each machine, it is easy and generally inexpensive to add more machines and functionality as necessary.

Reliability: Most distributed systems are fault-tolerant, as they can be made up of hundreds of machines that work together. The system generally doesn't experience any disruption if a single machine fails.

Performance: Distributed systems can be extremely efficient because workloads can be broken up and sent to multiple machines.
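The idea of breaking up a workload and sending the pieces to several workers can be sketched as follows. This is only an illustration of the structure: threads in a single Python process stand in for separate machines, so it shows the split-and-combine pattern rather than a real speed-up. The helper names `split` and `distributed_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split a list into `parts` chunks of nearly equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def distributed_sum(numbers, workers=4):
    """Sum each chunk on its own worker, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, split(numbers, workers)))
    return sum(partial_sums)


print(distributed_sum(list(range(1, 101))))  # prints 5050
```

Because each chunk is processed independently, adding workers only means producing more chunks, which is the horizontal scalability described above.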


There are many Big Data technologies, such as Apache Hadoop, Microsoft HDInsight, NoSQL databases, Hive, and Sqoop. Of these, Hadoop is among the most widely used.



The Three Vs of Big Data




Volume: The amount of data. Big data is, first of all, about volume, and those volumes can reach unprecedented heights. It is estimated that 2.5 quintillion bytes of data are created each day.


Velocity: The rate at which data is received and (perhaps) acted on. For example, * users upload more than 900 million photos a day, so * has to handle a tsunami of photographs every day. It has to ingest them all, process them, file them, and somehow be able to retrieve them later.


Variety: The many types of data that are available. For example, you may have noticed that I've talked about photographs, sensor data, tweets, encrypted packets, and so on. Each of these is very different from the others.
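To put the volume and velocity figures above in perspective, they can be converted to per-second rates. This is a small arithmetic sketch in Python, assuming decimal units (1 exabyte = 10^18 bytes, 1 terabyte = 10^12 bytes) and an even spread over the day. The helper name `per_second` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


bytes_per_day = 2.5e18   # 2.5 quintillion bytes, as quoted under Volume
photos_per_day = 900e6   # 900 million photos, as quoted under Velocity

print(f"{bytes_per_day / 1e18:.1f} exabytes per day")
print(f"{per_second(bytes_per_day) / 1e12:.1f} terabytes per second")
print(f"{per_second(photos_per_day):,.0f} photos per second")
```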



What is Hadoop?

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.




Hadoop uses a master-slave architecture. It consists of a single NameNode (the master node), while the other nodes are DataNodes (the slave nodes). All the data is stored on the DataNodes, not on the master node, which keeps only the metadata about where the data lives. We use the master node to manage the DataNodes.
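The division of labour between the master and the DataNodes can be illustrated with a toy model in Python. This is not the Hadoop API: the classes, the 4-byte block size, and the round-robin placement are all assumptions made for the demo, and it leaves out replication and the fact that real clients exchange block data with the DataNodes directly. It only shows that the master keeps metadata while the DataNodes keep the bytes.

```python
class DataNode:
    """Stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanodes, block_size=4):
        self.datanodes = datanodes
        self.block_size = block_size  # tiny on purpose, for the demo
        self.metadata = {}            # filename -> [(block_id, datanode name)]

    def write(self, filename, data: bytes):
        placements = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{filename}#{index}"
            node = self.datanodes[index % len(self.datanodes)]  # round-robin
            node.blocks[block_id] = data[offset:offset + self.block_size]
            placements.append((block_id, node.name))
        self.metadata[filename] = placements

    def read(self, filename) -> bytes:
        by_name = {node.name: node for node in self.datanodes}
        return b"".join(by_name[name].blocks[block_id]
                        for block_id, name in self.metadata[filename])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
master = NameNode(nodes)
master.write("log.txt", b"hello big data")
print(master.metadata["log.txt"])  # block locations only, no file contents
print(master.read("log.txt"))      # reassembled from the DataNodes
```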













Hadoop makes it easier to use all the storage and processing capacity in cluster servers, and to execute distributed processes against huge amounts of data. Hadoop provides the building blocks on which other services and applications can be built.




Hadoop is just one piece of software. There are many other tools for handling big data, and every technology has its own benefits.


Importance of Big Data

There is a famous quote: "Data is the new oil." It means that data is extremely valuable in the present day.


Data improves quality of life.

Data allows organizations to more effectively determine the cause of problems.

Data allows organizations to visualize relationships between what is happening in different locations, departments, and systems.

Data analytics provides solutions for many of the problems we face today.

Data helps you understand performance, and it helps you understand your customers.





Translated from: https://medium.com/@kvs.vishnu23/introduction-to-big-data-8f28a4daa73f


