I participated in GCP's Big Data and Machine Learning training "CPB100"

My name is Ito, and I am an infrastructure engineer.
One cloud service that has been growing rapidly recently (or so I personally believe) is
Google Cloud Platform (GCP).
I attended a Google Cloud OnBoard seminar in December 2016, and
apparently there were over 1,000 participants.
Participated in Google Cloud OnBoard | Beyond Inc
The reason for this level of expansion in Japan is likely the emergence of the Tokyo region.
(Tokyo Google Cloud Region | Google Cloud Platform)
I think it launched around October or November 2016.
So, after that long introduction, as the title suggests,
I participated in the GCP training "CPB100".
The training mainly covers big data and machine learning.

Google Cloud Platform Free Training Tour | Topgate Co., Ltd
About Big Data

Although there was no hands-on work in this seminar, we were given a broad overview with a lot of information.
When it comes to Google's big data services, "BigQuery" is the one that comes to mind.
BigQuery - Analytics Data Warehouse | Google Cloud Platform
Of course, this was also discussed at OnBoard
BigQuery can perform regular expression replacements on 10 billion rows in just under 10 seconds.
Internally, BigQuery works by splitting the data and storing each piece on HDDs; when a query is run, it launches a separate container for each piece to read and process it.
Since disk I/O is the bottleneck when running queries, spreading the work across numerous containers enables high-speed analysis.
By the way, I heard that BigQuery doesn't create indexes and instead performs full scans. Apparently,
creating indexes would be harder because the data is so large.
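To make the "split the data and full-scan each piece in parallel" idea a bit more concrete, here is a toy sketch in plain Python. It is not how BigQuery is actually implemented; the function names, the shard count, and the sample rows are all made up for illustration, and Python threads only mimic the structure (they don't give real CPU parallelism).

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(rows, shard_count):
    """Split rows into roughly equal contiguous shards."""
    size = max(1, -(-len(rows) // shard_count))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def scan_shard(shard, pattern, replacement):
    """Full scan of one shard: there is no index, so every row is visited."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, row) for row in shard]


def parallel_regex_replace(rows, pattern, replacement, shard_count=4):
    """Run the same regex replacement on every shard using a worker pool."""
    shards = split_into_shards(rows, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        # pool.map returns results in the same order as the input shards
        results = list(pool.map(
            lambda shard: scan_shard(shard, pattern, replacement), shards))
    return [row for shard in results for row in shard]


# Hypothetical sample data
rows = [f"user-{i}@example.com" for i in range(10)]
masked = parallel_regex_replace(rows, r"@example\.com$", "@masked", shard_count=3)
```

Each worker only touches its own shard, so adding more workers (containers, in BigQuery's case) shortens the time spent on any single piece.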
Other Services
| GCP | AWS | Overview |
|---|---|---|
| Cloud Dataflow | Amazon Elastic MapReduce (EMR) | Managed service for batch processing and similar workloads |
| Cloud Dataproc | Amazon Elastic MapReduce (EMR) | Managed Spark and Hadoop service |
| Cloud Pub/Sub | Amazon Simple Notification Service (SNS) | A simple messaging service |
These are some of the options.
I've included a comparison with AWS services for easier understanding.
I guess the flow is something like this:
- Distribute data per process using Compute Engine
- Store the data in Cloud Storage
- Receive the Cloud Storage data with Pub/Sub and send it to the appropriate location
- Process the data with Dataflow or Dataproc
- Store the processed data in Cloud Storage again
Simple data can be handled entirely with Compute Engine, but
that instance then becomes a single point of failure.
I believe that knowing how to effectively use managed services is key to "making good use of the cloud."
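The flow above (store, notify, process, store again) can be mimicked in a few lines of plain Python. This is only a rough in-memory sketch: `Topic`, `storage`, `upload`, and `count_words` are hypothetical stand-ins I made up, not the real Cloud Storage, Pub/Sub, or Dataflow APIs.

```python
from collections import defaultdict


class Topic:
    """Hypothetical in-memory stand-in for a Pub/Sub topic."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Deliver the message to every subscriber
        for callback in self._subscribers:
            callback(message)


storage = {}     # stand-in for Cloud Storage: object name -> contents
topic = Topic()  # stand-in for a Pub/Sub topic


def upload(name, lines):
    """Store an object, then announce it on the topic."""
    storage[name] = lines
    topic.publish({"name": name})


def count_words(message):
    """Stand-in for a Dataflow/Dataproc job: read, process, write back."""
    counts = defaultdict(int)
    for line in storage[message["name"]]:
        for word in line.split():
            counts[word] += 1
    # Written directly to storage (no publish), so the job is not re-triggered
    storage["processed/" + message["name"]] = dict(counts)


topic.subscribe(count_words)
upload("raw/access.log", ["GET /index", "GET /about", "POST /login"])
```

The point is that the uploader doesn't need to know who processes the data; it only publishes a message, which is what makes it easy to swap each piece for a managed service.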
About Machine Learning

Machine learning is present in everyday situations when you use Google
Take Gmail, for example. While it's currently only available in English,
it uses machine learning to suggest replies based on context.
Computer, respond to this email: Introducing Smart Reply in Inbox by Gmail
Furthermore, Google has successfully reduced the cooling power of its data centers by 40% by using machine learning.
News - Google reduces data center cooling power by 40% using DeepMind's AI: ITpro
Various APIs
Google offers the technology it has cultivated over the years in the form of APIs.
Google Translate, of course, is also available as a machine learning API (Translation API).
For example, this:
Speech API - Speech Recognition | Google Cloud Platform
It's exactly what it sounds like: it transcribes what you say into text.
Google apps and YouTube also have this feature, right?
While image recognition and character recognition already exist, it seems that something like this was announced at Google Cloud Next '17:
Cloud Video Intelligence - Video Content Analysis | Google Cloud Platform
It's a video version of image recognition. It's in public beta, so you'll need to sign up if you want to try it out
Make it yourself
With existing APIs, if you pass a person's image through an image recognition API, it can recognize things like "person" and "male," but it
can't identify the person's name, right?
This is because the APIs already provided by Google have not been trained on individual names.
For cases like that, you need to train a model yourself. A fairly well-known example uses TensorFlow, a machine learning library provided by Google,
to sort cucumbers into categories like "good cucumbers" and "large cucumbers."
Google Cloud Platform Japan Official Blog: Connecting Cucumber Farmers with Deep Learning - TensorFlow
Roughly speaking, this is the kind of flow you'll need,
and you'll have to write quite a lot of the details yourself:
- Prepare the training data, write an algorithm, and create a "trained model"
- Use the trained model
- Continue training to improve accuracy
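Just to get a feel for these three steps, here is a toy sketch in plain Python. It uses a nearest-centroid classifier, which I picked only because it fits in a few lines; it is not deep learning and not what the cucumber farmer actually used. The `CentroidModel` class and the cucumber measurements are made up.

```python
class CentroidModel:
    """Tiny nearest-centroid classifier; a stand-in for a real trained model."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of examples seen

    def train(self, examples):
        """Update the model; can be called again later with new data."""
        for features, label in examples:
            sums = self.sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                sums[i] += value
            self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, features):
        """Return the label whose average example is closest to the input."""
        def distance(label):
            centroid = [s / self.counts[label] for s in self.sums[label]]
            return sum((f - c) ** 2 for f, c in zip(features, centroid))
        return min(self.sums, key=distance)


# Step 1: prepare (made-up) training data as (length_cm, diameter_cm) and train
training_data = [
    ((20.0, 3.0), "good"), ((21.0, 3.2), "good"),
    ((30.0, 4.5), "large"), ((32.0, 5.0), "large"),
]
model = CentroidModel()
model.train(training_data)

# Step 2: use the trained model
prediction = model.predict((29.0, 4.4))

# Step 3: continue training with newly labeled data
model.train([((26.0, 4.0), "large")])
```

Even in this tiny version, the "algorithm" part (`train` and `predict`) is the bulk of the code, and a real image classifier is far more involved.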
However, implementing the algorithm is quite difficult.
That's where TensorFlow comes in.
TensorFlow is a library for implementing deep learning.
As I mentioned earlier, it was developed by Google, which uses it in its own products and has released it as open source.
C++ and Python APIs are available.
Furthermore, machine learning requires a very large amount of resources during the training phase,
mainly GPUs and CPUs (as it performs image recognition, for example).
That's why a "Cloud Machine Learning Engine" is available.
Since machine learning only requires resources during training, it's very well-suited to the cloud.
It enables the use of GPUs, and a large number of GPU-dedicated machines run in the background.
Predictive Analytics - Cloud Machine Learning Engine | Google Cloud Platform
If you're interested in TensorFlow, there's a TensorFlow User Group, and
I recommend checking out their study sessions.
(TensorFlow User Group Tokyo - connpass)
However, it was incredibly popular, with 200 people participating in a study group that was meant to have around 20 people...!!
Summary
A lot more was covered than what I've written here, but that's a privilege reserved for those who attended.
Lunch was provided in the form of a bento box. It was delicious.
Let's not talk about how I should have taken a more appetizing photo.

Oh, GCP is often compared to AWS, but the following part made me think, "I see!"
AWS takes products that are already available as open source (e.g., Memcached and Elasticsearch) and provides them on AWS in an easy-to-use format for users, while
GCP develops its own products, uses them extensively, and then
offers those products to users as GCP services.
For example, MapReduce, developed by Google, evolved into Dremel, which has been released as GCP's "BigQuery," while an open-source implementation of MapReduce is available under the name Hadoop.
GCP basically takes the opposite approach to AWS.
