Let’s kick off a new series of blog posts today. In this series, we will prepare for the CCA Spark and Hadoop Developer Exam (CCA175 certification). Before starting with the actual tutorials, there are a few things about the exam that we need to know.
Cost: $295 (The cost itself makes it very important that we take this certification exam seriously)
Duration: 2 hours
Passing Criteria: 70% of the total questions asked
The preparation for the CCA175 exam is divided into the following four sections.
- Data Ingest
- Transform, Stage, and Store
- Data Analysis
- Configuration
Each of these categories contains specific tasks that we should be familiar with in order to pass the certification with flying colors. Let us look at the tasks in each category.
DATA INGEST
Bringing data into the Hadoop ecosystem.
Our focus here is to bring/transfer data into the Hadoop ecosystem. The source system can be within our environment or a third party.
This section expects that you have the skills to transfer data between external systems and your cluster, which includes the following:
- Import data from a MySQL database into HDFS using Sqoop
- Export data to a MySQL database from HDFS using Sqoop
- Change the delimiter and file format of data during import using Sqoop
- Ingest real-time and near-real-time streaming data into HDFS
- Process streaming data as it is loaded onto the cluster
- Load data into and out of HDFS using the Hadoop File System commands
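As a small illustration of what “change the delimiter during import” means, here is a pure-Python sketch of re-delimiting records. On the exam this is done with Sqoop’s `--fields-terminated-by` option during an import; the sample rows below are made up for illustration.

```python
import csv
import io

# Sample rows as a comma-delimited MySQL export might look (made-up data).
source = "1,alice,engineer\n2,bob,analyst\n"

# Re-write the records tab-delimited, mimicking what Sqoop's
# --fields-terminated-by '\t' option does during an import.
reader = csv.reader(io.StringIO(source))
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
for row in reader:
    writer.writerow(row)

print(out.getvalue())
```

The same idea extends to file formats: Sqoop can write the imported data as text, Avro, or Parquet instead of merely swapping delimiters.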
The second section focuses on performing transformations on the data that we ingested using the skills mentioned above.
TRANSFORM, STAGE, AND STORE
Transforming the ingested data into a business-defined format.
This section covers converting a set of data values in a given format stored in HDFS into new data values or a new data format, and writing them back into HDFS. It includes the following tasks.
- Load RDD data from HDFS for use in Spark applications
- Write the results from an RDD back into HDFS using Spark
- Read and write files in a variety of file formats
- Perform standard extract, transform, load (ETL) processes on data
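To make the ETL idea concrete, here is a minimal pure-Python sketch of the extract/transform/load steps. In Spark the same pipeline would be chained RDD transformations (roughly `sc.textFile(...).map(...).filter(...).saveAsTextFile(...)`); the records here are invented for illustration.

```python
# Extract: raw CSV lines, as read from HDFS (made-up sample data).
lines = [
    "2017-01-01,store1,250",
    "2017-01-02,store2,0",
    "2017-01-03,store1,125",
]

# Parse each line into (date, store, amount) fields.
records = [line.split(",") for line in lines]

# Transform: keep only non-zero sales and convert amounts to integers,
# mirroring a Spark filter() followed by a map().
sales = [(d, s, int(a)) for d, s, a in records if int(a) > 0]

# Load: serialize back to tab-delimited text, ready to write out.
output = ["\t".join((d, s, str(a))) for d, s, a in sales]
print(output)
```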
The third section focuses on analyzing the data that we transformed in the above section.
DATA ANALYSIS
Analyzing the transformed data.
We are going to use Spark SQL to interact with the metastore programmatically in our applications. The goal of this section is to generate reports by running queries against loaded data. The following tasks are expected to be part of this section.
- Use metastore tables as an input source or an output sink for Spark applications
- Understand the fundamentals of querying datasets in Spark
- Filter data using Spark
- Write queries that calculate aggregate statistics
- Join disparate datasets using Spark
- Produce ranked or sorted data
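The aggregate/join/rank tasks above all boil down to the same relational operations. Here is a pure-Python sketch of what a Spark SQL query with `GROUP BY`, a join, and `ORDER BY ... DESC` computes; the two datasets stand in for metastore tables and are made up for illustration.

```python
from collections import defaultdict

# Two made-up datasets standing in for metastore tables:
# orders:    (order_id, customer_id, amount)
# customers: (customer_id, name)
orders = [("o1", "c1", 40), ("o2", "c2", 75), ("o3", "c1", 10)]
customers = [("c1", "alice"), ("c2", "bob")]

# Aggregate: total amount per customer id
# (what GROUP BY customer_id with SUM(amount) computes).
totals = defaultdict(int)
for _, cust, amount in orders:
    totals[cust] += amount

# Join: attach customer names via a simple hash join.
names = dict(customers)
joined = [(names[c], t) for c, t in totals.items()]

# Rank: sort by total, descending (ORDER BY total DESC).
ranked = sorted(joined, key=lambda pair: pair[1], reverse=True)
print(ranked)
```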
The last section relates to performing configuration tasks, either changing session- or system-level configuration or changing the format in which we receive the output.
CONFIGURATION
Preparing to work with different cluster components.
We must be able to perform tasks such as the following to clear this section.
- Supply command-line options to change your application configuration, such as increasing available memory
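With `spark-submit`, such options are supplied on the command line (for example, `--executor-memory 4G` to increase executor memory). As a rough illustration of how an application can pick up a memory setting from its command line, here is an `argparse` sketch; the option name mirrors Spark’s, but the mini-program itself is hypothetical.

```python
import argparse

# Hypothetical mini-driver showing a memory setting supplied on the
# command line, in the spirit of spark-submit's --executor-memory.
parser = argparse.ArgumentParser(description="runtime configuration demo")
parser.add_argument("--executor-memory", default="1G",
                    help="memory per executor, e.g. 2G or 512M")

# Simulated argv, as if the user had run: app --executor-memory 4G
args = parser.parse_args(["--executor-memory", "4G"])
print(args.executor_memory)
```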
Those are the details regarding the certification exam. Please note that we do get a set of reference documents during the exam, which is very useful, especially when we are a little confused about the syntax of a command or run into some other technical difficulty. I can say this from my own experience. Please do not hesitate to use these resources to your benefit.
I am sure we will learn more things as we embark upon this beautiful journey.
I am hopeful that this will help at least one person with their certification preparation.
You can find more information about CCA175 certification at https://www.cloudera.com/about/training/certification/cca-spark.html
Please let me know if you have suggestions to improve these tutorials.