a) The trend is for every individual's data footprint to grow, but perhaps more significantly, the amount of data generated by machines as part of the Internet of Things will be even greater than that generated by people.
b) Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations' data.
c) Mashups between different information sources make for unexpected and hitherto unimaginable applications.
d) It has been said that "more data usually beats better algorithms."
e) This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
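The arithmetic behind this quote can be sketched in a few lines; the 1 TB dataset size and 100 MB/s per-drive transfer rate are illustrative assumptions in line with the chapter's example, not measured figures.

```python
# Rough arithmetic behind the "100 drives in parallel" claim.
# The 1 TB capacity and 100 MB/s transfer rate are illustrative
# assumptions roughly matching the book's example, not measured values.

data_tb = 1                          # total dataset size in terabytes
transfer_mb_per_s = 100              # sustained read speed of one drive
data_mb = data_tb * 1_000_000        # 1 TB expressed in megabytes

single_drive_s = data_mb / transfer_mb_per_s
parallel_s = single_drive_s / 100    # 100 drives, each holding 1% of the data

print(f"one drive:  {single_drive_s / 3600:.1f} hours")   # ~2.8 hours
print(f"100 drives: {parallel_s / 60:.1f} minutes")       # ~1.7 minutes
```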
f) We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much.
g) The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high.
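As a rough sketch of why failures become routine at scale: if a single drive has some small probability of failing on a given day, the chance that at least one of many drives fails grows quickly with the number of drives. The per-drive failure probability used below is an illustrative assumption, not a vendor figure.

```python
# Probability that at least one of N drives fails, given each drive has
# an assumed independent daily failure probability p.
p = 0.001          # assumed daily failure probability of a single drive
for n in (1, 100, 1000):
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:>5} drives -> P(at least one failure today) = {at_least_one:.1%}")
# 1 drive: 0.1%, 100 drives: ~9.5%, 1000 drives: ~63.2%
```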
h) The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.
i) In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What's more, because it runs on commodity hardware and is open source, Hadoop is affordable.
j) Queries typically take minutes or more, so Hadoop is best suited to offline use, where there isn't a human sitting in the processing loop waiting for results.
k) The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.
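A minimal sketch of HBase's two access patterns, written here with the third-party happybase Python client against an HBase Thrift gateway; this choice is an assumption for illustration only (the book's own examples use the Java API), and the host, table, and column names are hypothetical.

```python
# Sketch of HBase's two access patterns via happybase (assumed setup:
# an HBase Thrift server is reachable; table and column family exist).
import happybase

connection = happybase.Connection('hbase-thrift-host')  # hypothetical host
table = connection.table('webtable')                    # hypothetical table

# Online access: read/write a single row by key.
table.put(b'row-key-1', {b'cf:colA': b'value'})
print(table.row(b'row-key-1'))

# Batch access: buffer many mutations and send them together.
with table.batch(batch_size=1000) as batch:
    for i in range(10_000):
        batch.put(f'row-key-{i}'.encode(), {b'cf:colA': b'value'})
```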
l) The real enabler for new processing models in Hadoop was the introduction of YARN (which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system that allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
m) Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works, since it introduces several concepts that apply more generally (such as input formats, or how a dataset is split into pieces).
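One of those general concepts, splitting an input into pieces, can be sketched as follows; the 128 MB split size mirrors the HDFS default block size, and the 1 GB file length is an illustrative assumption.

```python
# Rough sketch of how an input file is carved into splits for parallel
# processing. 128 MB mirrors the HDFS default block size.
SPLIT_SIZE = 128 * 1024 * 1024        # 128 MB, in bytes

def plan_splits(file_length: int, split_size: int = SPLIT_SIZE):
    """Return (offset, length) pairs covering the file, one per split."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A hypothetical 1 GB input file yields 8 splits, each a candidate for
# its own map task.
print(plan_splits(1 * 1024 ** 3))
```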
n) Why can't we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
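A back-of-the-envelope comparison makes the point: with seek-dominated access, touching even a small fraction of a large dataset can take longer than streaming through all of it. The seek time, transfer rate, and record counts below are illustrative assumptions, not benchmarks.

```python
# Back-of-the-envelope comparison of seek-dominated vs streaming access.
seek_time_s = 0.010            # ~10 ms per random seek (assumed)
transfer_mb_per_s = 100        # sustained sequential speed (assumed)
record_size_b = 100
total_records = 10_000_000_000 # ~1 TB of 100-byte records (assumed)

# Updating 1% of the records one seek at a time:
seek_total_s = 0.01 * total_records * seek_time_s

# Streaming through (and rewriting) the entire dataset sequentially:
stream_total_s = (total_records * record_size_b) / (transfer_mb_per_s * 1_000_000)

print(f"seek-based updates of 1%: {seek_total_s / 86_400:.1f} days")   # ~11.6 days
print(f"full sequential rewrite:  {stream_total_s / 3600:.1f} hours")  # ~2.8 hours
```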
o) In many ways, MapReduce can be seen as a complement to a relational database management system (RDBMS). MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
p) Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate.
q) Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.
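A tiny illustration of why normalization hurts streaming reads: in a normalized store, resolving a single record requires a second lookup elsewhere, whereas a denormalized record is self-contained and can simply be streamed. The field names and data below are hypothetical.

```python
# Normalized: the log row only carries a foreign key, so reading it
# fully means a second (nonlocal) lookup in another "table".
users = {42: {"name": "alice", "host": "example.org"}}
log_entry_normalized = {"user_id": 42, "path": "/index.html"}
user = users[log_entry_normalized["user_id"]]   # extra lookup required

# Denormalized: everything needed travels with the record itself.
log_entry_denormalized = {
    "user": "alice", "host": "example.org", "path": "/index.html",
}
```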
r) Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
s) Processing in Hadoop operates at a higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.
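A minimal word-count sketch in the Hadoop Streaming style shows what "thinking in key-value pairs" looks like; this is a local stand-in for illustration (the book's own examples use the Java MapReduce API), with input splitting, shuffling, and sorting simulated here by a simple sort.

```python
# Minimal word-count mapper and reducer, Hadoop Streaming style.
# The programmer writes only the per-record key-value logic; splitting
# the input, shuffling, and sorting are the framework's job.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Sum the counts for each word; pairs arrive sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local, single-process stand-in for the framework's shuffle phase.
    shuffled = sorted(mapper(sys.stdin))
    for word, total in reducer(shuffled):
        print(f"{word}\t{total}")
```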
t) Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure (when you don't know whether or not a remote process has failed) while still making progress with the overall computation.
u) MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects.
v) Today, Hadoop is widely used in mainstream enterprises. Hadoop's role as a general-purpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way.
Hadoop: The Definitive Guide (4th Edition), Translated Highlights (2): Chapter 1, Meet Hadoop