Description:

Continue to explore Apache Spark, the de facto big data framework for data science, in this Skillsoft Aspire course. You will learn how to analyze a Spark DataFrame by treating it as though it were a relational database table. Learners discover how to create a view from a Spark DataFrame and run SQL queries against it, and how to define windows and use them to explore data. Key concepts in this course include the different stages involved in optimizing any query or method call on the contents of a Spark DataFrame; how to create views out of a Spark DataFrame's contents and run queries against them; and how to trim and clean a DataFrame before a view is created, as a precursor to running SQL queries. Next, learn how to analyze data by running different SQL queries; how to configure a DataFrame with an explicitly defined schema; and what a window is in the context of Spark. Finally, observe how to create and analyze categories of data in a data set by using windows.

Target Audience:

Duration: 00:55

Description:

An open-source cluster-computing framework used for data science, Apache Spark has become the de facto big data framework. In this Skillsoft Aspire course, learners explore how to analyze real data sets by using DataFrame API methods. Discover how to optimize operations with shared variables and combine data from multiple DataFrames using joins. Key concepts covered in this course include the features that make Spark 2.x versions significantly faster than Spark 1.x; how to create a Spark DataFrame from the contents of a CSV file and apply some simple transformations on the DataFrame; and how to apply grouping and aggregation operations on a DataFrame to analyze categories of data in a data set. Then use Matplotlib to visualize the contents of a Spark DataFrame; learn about broadcast variables and how to perform a join operation with a DataFrame; and save the contents of a DataFrame to a text file for archiving or sharing. Finally, learn how to perform different join operations on Spark DataFrames to combine data from multiple sources, and how to analyze data with the DataFrame API.

Target Audience:

Duration: 01:12

Description:

Apache Spark, an open-source cluster-computing framework used for data science, has become the de facto big data framework. In this Skillsoft Aspire course, explore the basics of Apache Spark, an analytics engine for working with big data built on top of Hadoop. Discover how it allows operations on data both with its own library methods and with SQL, while delivering great performance. Key concepts covered here include how Spark fits in with Hadoop; Spark RDDs, their characteristics, and how to distinguish between RDDs and DataFrames; and the components of Spark and the functions of the Spark Session, Master, and Worker nodes. Then observe how to install PySpark and initialize a Spark Context; how to initialize a Spark DataFrame from the contents of an RDD; and how to explore the contents of a DataFrame by using the SQLContext. Next, you will learn how to apply the map() function on an RDD to configure a DataFrame; how to retrieve required data from a DataFrame and how to apply transformations; and how to convert Spark DataFrames to Pandas DataFrames and vice versa.

Target Audience:

Duration: 01:07