For years, companies of all sizes have collected and stored vast amounts of data about their customers and business operations to enhance performance, achieve growth, and realize their goals. In 2006, Clive Humby, a renowned British mathematician, coined the phrase "Data is the new oil" to emphasize the growing importance of data in the modern business landscape. Google Cloud has a broad portfolio of data processing products.
In Chapter 7, we introduced some data-related services from the storage perspective: Cloud SQL, BigQuery, Firestore, Cloud Spanner, and Cloud Bigtable. This chapter aims to familiarize us with the data processing services in Google Cloud. To complete the story, we will cover the following topics in this chapter:
- Data processing services overview – Pub/Sub, Dataproc, and Dataflow
- Initializing and loading data – using the command line, API transfer, loading data from Cloud Storage, and streaming data using Pub/Sub
Data processing services overview
Data processing in Google Cloud refers to the various tools and services provided by Google Cloud Platform (GCP) that enable organizations to process, store, and analyze large amounts of data in the cloud.
These services provide organizations with the necessary infrastructure and tools to collect, process, store, analyze, and visualize their data in the cloud, helping them discover meaningful insights and make better-informed business decisions.
Pub/Sub is a publish/subscribe service – a messaging service where the senders of messages are decoupled from the receivers.
Pub/Sub is a messaging service that is both scalable and asynchronous, allowing us to separate the services that produce messages from the services that process them. It allows various Google Cloud services to communicate asynchronously with latencies on the order of 100 milliseconds. The most common use cases for Pub/Sub are streaming analytics and data integration pipelines that ingest and distribute data.
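To make the decoupling idea concrete, here is a minimal in-memory sketch of the publish/subscribe pattern. It is not the Pub/Sub client library: the `InMemoryBroker` class, its methods, and the topic and subscription names are all hypothetical and exist only for illustration. What it shows is that a publisher names only a topic, while each subscription attached to that topic receives its own copy of every message.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a publish/subscribe service (not the Pub/Sub API)."""

    def __init__(self):
        # topic name -> {subscription name -> queue of pending messages}
        self._subscriptions = defaultdict(dict)

    def create_subscription(self, topic, subscription):
        # Attach a new, initially empty subscription to the topic.
        self._subscriptions[topic][subscription] = deque()

    def publish(self, topic, data):
        # The publisher only names the topic; it never references a receiver.
        for pending in self._subscriptions[topic].values():
            pending.append(data)

    def pull(self, topic, subscription):
        # Return the next pending message, or None if nothing is waiting.
        pending = self._subscriptions[topic][subscription]
        return pending.popleft() if pending else None


broker = InMemoryBroker()
broker.create_subscription("orders", "analytics")
broker.create_subscription("orders", "billing")

# The sender publishes without knowing who will consume the messages.
broker.publish("orders", b"order-1")
broker.publish("orders", b"order-2")

# Each subscription consumes independently and at its own pace.
analytics_first = broker.pull("orders", "analytics")
billing_first = broker.pull("orders", "billing")
```

Both subscriptions see `order-1` first because each keeps its own queue. In the managed service, the same separation lets us add or remove consumers without changing the publishing application.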
Here are some other Pub/Sub use cases:
- Ingesting user interaction and server events: From your application or servers, you can forward events to Pub/Sub and process them with stream processing tools such as Dataflow.
- Real-time event distribution: Pub/Sub allows us to distribute events to multiple applications or databases.
- Parallel processing and workflows: When combined with Cloud Functions, Pub/Sub can distribute messages among multiple workers to be used in tasks such as file compression, sending email notifications, or processing images.
- Enterprise event bus: Pub/Sub is well suited to serve as an enterprise-wide bus for sharing data and distributing events.
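The parallel processing use case in the list above can be sketched with plain Python threads. This is a simplified illustration under stated assumptions, not Pub/Sub or Cloud Functions code: the `run_workers` helper, the file names, and the handler are hypothetical. Several workers pull from one shared queue, so each message is handled by exactly one of them.

```python
import queue
import threading


def run_workers(messages, worker_count, handler):
    """Distribute messages among worker threads; each message is handled once."""
    pending = queue.Queue()
    for message in messages:
        pending.put(message)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # Workers compete for messages on the same queue.
                message = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            outcome = handler(message)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical task: "process" four image names with two workers.
processed = run_workers(
    ["a.png", "b.png", "c.png", "d.png"],
    worker_count=2,
    handler=lambda name: name.upper(),
)
```

The order of `processed` depends on thread scheduling, so only its contents should be relied on. Unlike this toy queue, Pub/Sub delivers messages at least once by default, so real handlers should tolerate an occasional duplicate.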
There are two services – Pub/Sub and Pub/Sub Lite.
Pub/Sub is the default choice for most users and applications. Pub/Sub Lite, on the other hand, costs less but offers lower reliability than Pub/Sub. Another difference is that Pub/Sub replicates data synchronously across zones, whereas zonal Pub/Sub Lite topics are stored in only one zone and regional Lite topics are replicated to a second zone asynchronously.
To learn more about the differences between the two services, go to https://cloud.google.com/pubsub/docs/choosing-pubsub-or-lite.