1. Three Vs of Big Data
- Volume: > dozens of terabytes
- Variety: unstructured / semi-structured / structured data
- Velocity: data arrives fast and often has value for only a limited time, so it may need to be analyzed before it is loaded into a data warehouse.
2. Hadoop
- Hadoop is a framework for storing data on large clusters and running applications against that data.
- Hadoop consists of two main components:
  - MapReduce (running on YARN in Hadoop 2.x): a distributed processing framework that operates on distributed data sets.
    - Data consists of key-value pairs.
    - Computation has only two phases:
      - Map: input data is split into a large number of fragments, each of which is assigned to a map task. Map tasks are distributed across the cluster; each one processes the key-value pairs from its fragment and produces intermediate key-value pairs.
      - Reduce: the intermediate data set is sorted by key and partitioned among reduce tasks, which write their output key-value pairs to HDFS.
  - Hadoop Distributed File System (HDFS).
    - Master service: NameNode (controls access to data files).
    - Slave services: DataNodes (manage storage and serve read/write requests).
- An application that is running on Hadoop gets its work divided among the nodes in the cluster.
- HDFS stores the data that will be processed.
- A Hadoop cluster can span thousands of machines; HDFS stores the data, and MapReduce processes it on or near the nodes that hold it (to keep I/O costs low).
- Hadoop clusters typically consist of a few master nodes and many slave nodes.
  - Master nodes: control the storage and processing systems in Hadoop.
  - Slave nodes: store all the cluster's data and are also where the data gets processed.
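
The map and reduce phases described above can be sketched in plain Python. This is only a single-process illustration of the data flow, not Hadoop's API: the names `map_task`, `partition`, `reduce_task`, and `run_job` are hypothetical, the word-count job is an assumed example, and a merged dictionary stands in for the output files a real job would write to HDFS.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_task(fragment):
    """Map: turn one input fragment into intermediate (word, 1) pairs."""
    for line in fragment:
        for word in line.lower().split():
            yield (word, 1)


def partition(key, num_reducers):
    """Pick the reduce task for a key (a simple deterministic stand-in
    for a hash partitioner; an assumption made for this sketch)."""
    return sum(key.encode("utf-8")) % num_reducers


def reduce_task(pairs):
    """Reduce: sort intermediate pairs by key, then sum the values per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def run_job(lines, num_fragments=2, num_reducers=2):
    # Split the input into fragments, one per map task.
    fragments = [lines[i::num_fragments] for i in range(num_fragments)]

    # Run every map task and route each intermediate pair to a reduce task.
    buckets = defaultdict(list)
    for fragment in fragments:
        for key, value in map_task(fragment):
            buckets[partition(key, num_reducers)].append((key, value))

    # Each reduce task produces output key-value pairs; merge them here
    # in place of the per-reducer output a real job would store in HDFS.
    output = {}
    for r in range(num_reducers):
        output.update(reduce_task(buckets[r]))
    return output


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node data"]))
```

Because every occurrence of a key is routed to the same reduce task, each reducer can total its keys independently of the others, which is what lets the reduce phase run in parallel across the cluster.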
3. Apache Hadoop Ecosystem Components*
4. Releases and Distributions
- Hadoop releases: directly from Apache Software Foundation (1.x, 2.x)
- Distributions:
  - Cloudera: CDH, plus value-added components on top of Hadoop.
  - EMC: SQL processing for Hadoop.
  - Hortonworks: HDP, plus paid support.
  - IBM: InfoSphere BigInsights, PureData System for Hadoop ..
  - Intel: analyzing big data with optimizations for Intel processors, SSD storage, and networking.
  - MapR: an enterprise-grade platform that supports many well-known customers.