Description: Hadoop is a framework written in Java for running applications on large clusters of commodity hardware, incorporating features similar to those of the Google File System (GFS) and the MapReduce computing paradigm. You'll explore a demonstration of using Sqoop and Hive with Hadoop to move and merge data. The demonstration covers preprocessing data, partitioning data, and joining data. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
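
As a taste of the Hive side of that demonstration, here is a minimal Java sketch that creates a partitioned Hive table and joins it with a lookup table over JDBC. It assumes a HiveServer2 instance on localhost:10000; the table names, schema, and partition value are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HivePartitionJoinDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumes HiveServer2 is listening on localhost:10000 (adjust to your cluster).
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // A fact table partitioned by sale date, plus a dimension table (hypothetical schema).
        stmt.execute("CREATE TABLE IF NOT EXISTS sales (id INT, customer_id INT, amount DOUBLE) "
                + "PARTITIONED BY (sale_date STRING)");
        stmt.execute("CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)");

        // Join the partitioned table with the lookup table; the WHERE clause
        // prunes to a single partition so only that slice of data is scanned.
        ResultSet rs = stmt.executeQuery(
                "SELECT c.name, SUM(s.amount) AS total "
                + "FROM sales s JOIN customers c ON s.customer_id = c.id "
                + "WHERE s.sale_date = '2015-01-01' "
                + "GROUP BY c.name");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        conn.close();
    }
}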

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 01:54

Description: The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. This course explains Oozie, a workflow tool used to manage multi-stage tasks in Hadoop. Additionally, you'll learn how to use Hue, a browser-based front-end tool. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
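
For a sense of how Oozie drives a multi-stage workflow programmatically, here is a minimal sketch using the Oozie Java client. The server URL, the HDFS application path, and the job properties are assumptions; adjust them to your cluster.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        // Assumes an Oozie server at localhost:11000 (adjust to your cluster).
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // Job properties; the workflow application path in HDFS is hypothetical,
        // and nameNode/jobTracker are parameters the workflow XML would reference.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/my-workflow");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow, then poll until it leaves the RUNNING state.
        String jobId = client.run(conf);
        while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("Workflow " + jobId + " finished: "
                + client.getJobInfo(jobId).getStatus());
    }
}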

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 02:40

Description: Hadoop is open source software for affordable supercomputing. It provides the distributed file system and the parallel processing required to run a massive computing cluster. This course explains Pig, a data flow scripting tool for interfacing with Hadoop. You'll learn about the installation and configuration of Pig and explore a demonstration of Pig in action. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
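
As a preview of Pig in action, here is a minimal sketch that embeds a short Pig data flow in Java via PigServer. The input file, its schema, and the output path are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        // Run Pig locally; use ExecType.MAPREDUCE to target a cluster instead.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Register a small data flow: load a space-delimited log,
        // group it by IP, and total the bytes per IP.
        pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("totals = FOREACH by_ip GENERATE group AS ip, SUM(logs.bytes) AS total;");

        // Relations are evaluated lazily; storing one triggers the actual job.
        pig.store("totals", "bytes_per_ip");
        pig.shutdown();
    }
}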

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 01:53

Description: Apache Hadoop is a framework for distributed storage and distributed processing of Big Data on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are commonplace and should be handled automatically in software by the framework. In this course, you'll explore Hive as a SQL-like tool for interfacing with Hadoop. The course demonstrates the installation and configuration of Hive, followed by a demonstration of Hive in action. Finally, you'll learn about extracting and loading data between Hive and an RDBMS. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
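
To illustrate the SQL-like interface the course covers, here is a minimal Java sketch that creates a Hive table over delimited text, loads a local file into it, and runs an aggregate query over JDBC. The HiveServer2 address, table name, and file path are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumes HiveServer2 on localhost:10000; credentials depend on your setup.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Define a table over comma-delimited text and load a local file into it
        // (table name and file path are hypothetical).
        stmt.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, dept STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
        stmt.execute("LOAD DATA LOCAL INPATH '/tmp/employees.csv' "
                + "OVERWRITE INTO TABLE employees");

        // Hive translates this SQL-like query into cluster jobs behind the scenes.
        ResultSet rs = stmt.executeQuery(
                "SELECT dept, COUNT(*) FROM employees GROUP BY dept");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}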

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 02:05

Description: The core of Hadoop consists of a storage part, HDFS, and a processing part, MapReduce. Hadoop splits files into large blocks and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop and MapReduce transfer code to nodes that have the required data, which the nodes then process in parallel. This approach takes advantage of data locality to allow the data to be processed faster and more efficiently via distributed processing than by using a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking. In this course, you'll learn about the theory of YARN as a parallel processing framework for Hadoop. You'll also learn about the theory of MapReduce as the backbone of parallel processing jobs. Finally, this course demonstrates MapReduce in action by explaining the pertinent classes and then walking through a MapReduce program step by step. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
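
To make the pertinent classes concrete before the step-by-step walkthrough, here is the classic WordCount program, a minimal sketch of a complete MapReduce job using the standard Hadoop API. Input and output paths come from the command line; the combiner is an optional optimization that pre-aggregates counts on each node.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The mapper runs on the node holding each input split (data locality)
    // and emits (word, 1) for every token it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reducer receives every count emitted for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}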

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 01:37

Description: Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. This course explains the theory of Sqoop as a tool for the extraction and loading of structured data from an RDBMS. You'll explore Hive SQL statements and see a demonstration of Hive in action. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
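
As a sketch of what a Sqoop import looks like when driven from Java rather than the shell, the snippet below mirrors a typical sqoop import invocation via the Sqoop 1 runTool entry point. The JDBC URL, credentials, table, and target directory are hypothetical.

import org.apache.sqoop.Sqoop;

public class SqoopImportDemo {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` on the command line;
        // every value below is a placeholder for your own database and paths.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/sales",
            "--username", "demo",
            "--password", "demo",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "4" // Sqoop parallelizes the import as map tasks
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}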

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 01:26

Description: Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. In this course, you'll learn about the theory of Flume as a tool for the extraction and loading of unstructured data. You'll explore a detailed explanation of the Flume agents and a demonstration of them in action. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
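
To show how an application hands events to a Flume agent, here is a minimal sketch using Flume's RPC client API. It assumes an agent with an Avro source listening on localhost:41414; the host, port, and event body are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientDemo {
    public static void main(String[] args) throws EventDeliveryException {
        // Assumes a Flume agent with an Avro source on localhost:41414.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Each event flows from the agent's source, through its channel,
            // to its sink (typically HDFS in a Hadoop ingestion pipeline).
            Event event = EventBuilder.withBody(
                    "hello from the app tier", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}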

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 02:00

Description: Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. It relies on an active community of contributors from all over the world for its success. In this course, you'll explore the server architecture for Hadoop and learn about the functions and configuration of the daemons that make up the Hadoop Distributed File System. You'll also learn about the command line interface and common HDFS administration issues facing end users. Finally, you'll explore the theory of HBase as another data repository built alongside or on top of HDFS, as well as basic HBase commands. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
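
To connect the command line interface to the API underneath it, here is a minimal Java sketch that mirrors common hdfs dfs operations (mkdir, put, ls) through Hadoop's FileSystem class. The NameNode URI and the paths are assumptions for a typical pseudo-distributed setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAdminDemo {
    public static void main(String[] args) throws Exception {
        // Normally fs.defaultFS is picked up from core-site.xml on the classpath;
        // the URI below is a common pseudo-distributed default and may differ.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Mirror the CLI: hdfs dfs -mkdir, -put, -ls (paths are hypothetical).
        Path dir = new Path("/user/demo/reports");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("/tmp/report.txt"), new Path(dir, "report.txt"));
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes"
                    + "\treplication=" + status.getReplication());
        }
        fs.close();
    }
}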

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 02:07

Description: Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. In this course, you'll follow step-by-step instructions for installing Hadoop in pseudo-distributed mode and learn how to troubleshoot installation errors. You'll learn where the log files are located and explore the architecture in more detail. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.
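
As a quick post-install sanity check, the minimal sketch below connects to a pseudo-distributed HDFS and prints its capacity report; if the connection fails, the daemon log files (under $HADOOP_HOME/logs by default) point to the cause. The fs.defaultFS value shown is a common default and may differ in your installation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class PseudoModeCheck {
    public static void main(String[] args) throws Exception {
        // In a pseudo-distributed install, all daemons run on one machine;
        // hdfs://localhost:9000 is a common fs.defaultFS value but yours may differ.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // If the NameNode and DataNode are up, this reports cluster capacity.
        FsStatus status = fs.getStatus();
        System.out.println("capacity:  " + status.getCapacity());
        System.out.println("used:      " + status.getUsed());
        System.out.println("remaining: " + status.getRemaining());
        fs.close();
    }
}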

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 02:31

Description: Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. This course examines the Hadoop ecosystem by demonstrating all of the commonly used open source software components. You'll explore a Big Data model to understand how these tools combine to create a supercomputing platform. You'll also learn how the principles of supercomputing apply to Hadoop and how this yields an affordable supercomputing environment. This learning path can be used as part of the preparation for the Cloudera Certified Administrator for Apache Hadoop (CCA-500) exam.

Target Audience: Technical personnel with a background in Linux, SQL, and programming who intend to join a Hadoop engineering team in roles such as Hadoop developer, data architect, or data engineer, or in roles related to technical project management, cluster operations, or data analysis

Duration: 01:35