Reading data from one storage platform, transforming it, and using it elsewhere is one of the everyday use cases in data science and data engineering. Our Dataproc example is based on a data processing pipeline that uses the Apache Spark Python API (PySpark) on Dataproc. We will run a sample pipeline that reads Reddit posts stored in BigQuery and extracts the title, the body (raw text), and the creation timestamp of each post. This data will be converted into CSV format, compressed with gzip, and stored in a Google Cloud Storage bucket:
- For our use case with Dataproc, we must enable three APIs – Dataproc, Compute Engine, and BigQuery Storage. We can do this using the following gcloud command:
gcloud services enable compute.googleapis.com dataproc.googleapis.com bigquerystorage.googleapis.com
- We can set a default Dataproc region by using the gcloud config set dataproc/region YOUR_REGION command, where YOUR_REGION can be any available region. In our case, YOUR_REGION will be europe-west1.
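For example, the following command makes europe-west1 the default region for all subsequent gcloud dataproc commands:
gcloud config set dataproc/region europe-west1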
- To create a Dataproc cluster, we must provide its name and additional configuration:
gcloud beta dataproc clusters create DATAPROC_CLUSTER_NAME --worker-machine-type n1-standard-2 --num-workers 6 --image-version 1.5-debian --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh --metadata 'PIP_PACKAGES=google-cloud-storage' --optional-components=ANACONDA --enable-component-gateway
We need to provide the Dataproc cluster's name, and we can choose a different worker machine type from the available Google Compute Engine instance types. In our case, we changed it to n1-standard-2 so that we don't exceed our CPU quota. We can also specify a different number of worker nodes. After a moment, the Dataproc cluster will be operational and ready to accept incoming jobs:
Figure 10.7 – The Dataproc cluster is operational
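We can also confirm this from Cloud Shell; the following command should list our cluster (in our case, dataproc-cluster-1) with a RUNNING status:
gcloud dataproc clusters list --region europe-west1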
- We will use the Google Cloud Python GitHub repository, which can be cloned with the following command:
git clone https://github.com/GoogleCloudPlatform/cloud-dataproc
- Once we enter the repository folder with the cd ~/cloud-dataproc/codelabs/spark-bigquery command, we can execute the Dataproc job.
- To do so, we can use the following command:
gcloud dataproc jobs submit pyspark --cluster dataproc-cluster-1 --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar --driver-log-levels root=FATAL counts_by_subreddit.py
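The counts_by_subreddit.py script uses the Spark BigQuery connector that we passed in with --jars. Conceptually, it is similar to the following sketch (a simplified approximation, not the exact code from the repository), which loads a month of Reddit posts from BigQuery and counts them per subreddit:
from pyspark.sql import SparkSession

# The BigQuery connector JAR was supplied to the job with --jars
spark = SparkSession.builder.appName("counts-by-subreddit").getOrCreate()

# Read one month of Reddit posts from the public BigQuery dataset
posts = spark.read.format("bigquery") \
    .option("table", "fh-bigquery.reddit_posts.2017_01") \
    .load()

# Count posts per subreddit and display the ten largest ones
posts.groupBy("subreddit").count().orderBy("count", ascending=False).show(10)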
- The job’s status can be viewed in the Google Cloud console or Cloud Shell once it’s been accepted:
Figure 10.8 – Dataproc job visible in Cloud Shell and the Google Cloud console
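The same information is also available from Cloud Shell; for example, the following command lists the jobs submitted to our cluster:
gcloud dataproc jobs list --cluster dataproc-cluster-1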
Dataproc exposes web interfaces for several cluster components: YARN ResourceManager, MapReduce Job History, Spark History Server, HDFS NameNode, YARN Application Timeline, and Tez. These links are available in Cluster details, under the WEB INTERFACES tab:
Figure 10.9 – Cluster details
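Because we created the cluster with --enable-component-gateway, these URLs can also be retrieved from Cloud Shell by describing the cluster; they should appear under config.endpointConfig in the output:
gcloud dataproc clusters describe dataproc-cluster-1 --format="yaml(config.endpointConfig)"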
- After a moment, the job will complete, and we can proceed with data transformation:
Figure 10.10 – Dataproc job results
- In BigQuery Web UI's Query Editor, we run the select * from `fh-bigquery.reddit_posts.2017_01` limit 10; SQL query to view a sample of the data:
Figure 10.11 – SQL query output in BigQuery Web UI
- We will see that different columns are available, but two columns – title and selftext – contain the information we will use later.
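For illustration, a narrower query that keeps only the columns we care about might look like the following (here we assume created_utc is the column holding each post's creation timestamp):
SELECT title, selftext, created_utc
FROM `fh-bigquery.reddit_posts.2017_01`
LIMIT 10;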
- In the cloud-dataproc/codelabs/spark-bigquery folder of the repository we cloned in the previous steps, we run the backfill.sh script, which submits the backfill.py Python script as a Dataproc job. To do so, we can run the following commands:
cd ~/cloud-dataproc/codelabs/spark-bigquery
bash backfill.sh dataproc-cluster-1 wmarusiak-dataproc-bucket-1
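A simplified sketch of the transformation that the backfill job performs might look as follows. This is an approximation under the assumption that it selects the title, body, and creation timestamp for a given subreddit and writes gzip-compressed CSV files to our bucket; it is not the exact code from the repository:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reddit-backfill").getOrCreate()

# Hypothetical parameters - the real script receives its table, bucket,
# and subreddit values as arguments
table = "fh-bigquery.reddit_posts.2017_01"
bucket = "wmarusiak-dataproc-bucket-1"
subreddit = "food"

# Read the monthly Reddit posts table from BigQuery
posts = spark.read.format("bigquery").option("table", table).load()

# Keep only the title, body, and creation timestamp for one subreddit
extracted = posts.where(posts.subreddit == subreddit) \
    .select("title", "selftext", "created_utc")

# Write the selected columns to Cloud Storage as gzip-compressed CSV
# (the real script arranges its output paths and file names differently)
extracted.coalesce(1).write \
    .option("compression", "gzip") \
    .csv("gs://{}/reddit_posts/2017/01/{}".format(bucket, subreddit))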
- After a few minutes, the job will be completed, and the CSV files will be uploaded to the Google Cloud Storage bucket. We can list the bucket’s content with the following command:
gsutil ls gs://YOUR_CLOUD_STORAGE_BUCKET/reddit_posts/*/*/food.csv.gz
We will get the following output:
Figure 10.12 – Listing the compressed CSV files in the Cloud Storage bucket
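To quickly inspect one of the generated files, we can copy it locally and decompress it; the year/month path segments will match what the listing above shows (2017/01 here is just an example):
gsutil cp gs://YOUR_CLOUD_STORAGE_BUCKET/reddit_posts/2017/01/food.csv.gz .
zcat food.csv.gz | head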
- So, what can we do with this data after processing it? One example could be building models on top of it.
- Now that we’ve learned about and implemented Dataproc in Google Cloud, we will move on to the next section, where we will learn about APIs and their use cases and do a hands-on implementation of them.