Participated in GCP's BigData and Machine Learning training "CPB100"
My name is Ito and I am an infrastructure engineer.
Google Cloud Platform (GCP) is one of the cloud technologies that has been growing rapidly of late (in my opinion).
In December 2016, I attended a GCP seminar called Google Cloud OnBoard, and
there seemed to be over 1,000 participants at that time.
Participated in Google Cloud OnBoard | Beyond Inc.
I think it was the launch of the Tokyo region that made GCP so widespread in Japan.
Google Cloud region in Tokyo | Google Cloud Platform
I think it was around October or November 2016.
That was a long introduction, but as the title says,
I participated in the GCP training "CPB100".
The training is mainly about big data and machine learning.
Google Cloud Platform Free Training Tour | Top Gate Co., Ltd.
About BigData
We did not do any hands-on work in this seminar, but we were given an overview of a variety of topics.
When it comes to Google's big data services, the one that comes to mind is definitely "BigQuery".
BigQuery- Analytics Data Warehouse | Google Cloud Platform
Of course, this was also mentioned at OnBoard.
BigQuery can run a regular-expression replacement over 10 billion rows in less than 10 seconds.
So, how does BigQuery work internally?
Data is split up and stored across many HDDs, and when a query is run, a container is created for each piece of data that has to be read.
Disk I/O is the bottleneck when running queries, so the work is divided among many containers to enable high-speed analysis.
By the way, BigQuery does not use any indexes and performs a full scan.
Apparently the data is too large for indexing to be practical.
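To make the "split the data and scan every piece in parallel" idea a little more concrete, here is a minimal sketch in plain Python. It is only an illustration and not how BigQuery is actually implemented: the function names are made up for this example, a thread pool stands in for the containers, and a small list stands in for the data spread across disks.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(rows, shard_count):
    """Divide rows into contiguous shards, like data spread across disks."""
    size = max(1, -(-len(rows) // shard_count))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def scan_shard(shard, pattern, replacement):
    """Full scan of one shard: there is no index, so every row is visited."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, row) for row in shard]


def parallel_regex_replace(rows, pattern, replacement, shard_count=4):
    """Run the same regex replacement on every shard at the same time."""
    shards = split_into_shards(rows, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        # One worker per shard (a stand-in for "one container per piece").
        results = list(
            pool.map(lambda s: scan_shard(s, pattern, replacement), shards)
        )
    # Stitch the per-shard results back together in the original order.
    return [row for shard in results for row in shard]


if __name__ == "__main__":
    rows = ["user-1", "user-22", "admin-3"]
    print(parallel_regex_replace(rows, r"\d+", "N", shard_count=2))
```

Each worker only touches its own shard, so adding more shards and workers shortens the time per worker, which is the reason a full scan without indexes can still be fast.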
Other services
GCP | AWS | Overview |
---|---|---|
Cloud Dataflow | Amazon Elastic MapReduce (EMR) | Managed service for batch processing and similar workloads |
Cloud Dataproc | Amazon Elastic MapReduce (EMR) | Managed service for Spark and Hadoop |
Cloud Pub/Sub | Amazon Simple Notification Service (SNS) | Simple messaging service |
Those are some of the other services.
I have included the comparable AWS services to make them easier to understand.
I suppose the overall flow looks something like this:
- Process and sort the data with Compute Engine
- Store the data in Cloud Storage
- Receive the Cloud Storage data with Pub/Sub and route it to the appropriate place
- Process the data with Dataflow or Dataproc
- Save the processed data to Cloud Storage as well
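The flow above can be sketched with in-memory stand-ins. This is a toy illustration, not real GCP client code: a dict plays the role of a Cloud Storage bucket, a `queue.Queue` plays the role of a Pub/Sub topic, and all function and object names are hypothetical.

```python
import queue


def ingest(raw_records):
    """Step 1: clean and sort the raw data (the Compute Engine role)."""
    return sorted(r.strip().lower() for r in raw_records if r.strip())


def store(bucket, topic, name, records):
    """Steps 2-3: save the data, then announce the new object on the topic."""
    bucket[name] = records
    topic.put(name)


def process_pending(bucket, topic):
    """Steps 4-5: consume notifications, aggregate, and save the results."""
    while not topic.empty():
        name = topic.get()
        counts = {}
        for record in bucket[name]:
            counts[record] = counts.get(record, 0) + 1
        # Processed output goes back into the same storage stand-in.
        bucket["processed/" + name] = counts


if __name__ == "__main__":
    bucket, topic = {}, queue.Queue()
    store(bucket, topic, "raw/day1", ingest([" b ", "A", "a", ""]))
    process_pending(bucket, topic)
    print(bucket["processed/raw/day1"])
```

The point of the queue in the middle is that the producer and the processor never call each other directly, so either side can be replaced or scaled without touching the other.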
For simple data, you can do the whole process on Compute Engine alone, but
then it becomes a single point of failure.
I believe that figuring out how to use managed services well is what leads to "effective use of the cloud."
About MachineLearning
Machine learning shows up in everyday places when you use Google services.
Gmail is one example. The feature is currently limited to English, but
it uses machine learning to suggest replies based on context.
Computer, respond to this email: Introducing Smart Reply in Inbox by Gmail
In addition, Google used machine learning to tune cooling at its data centers and succeeded in reducing the power used for cooling by 40%.
News - Google reduces data center cooling power by 40%, leverages DeepMind's AI: ITpro
Various APIs
Google offers the technology it has built up so far in the form of APIs.
Of course, Google Translate also uses a machine learning API (the Translation API).
Here is one example.
Speech API - Speech Recognition | Google Cloud Platform
It does exactly what the name says: it turns what you say into text.
Google apps and YouTube also have this feature.
There are also image recognition and character recognition APIs, and this one seems to have been announced at Google Cloud Next '17:
Cloud Video Intelligence - Video Content Analysis | Google Cloud Platform
This is a video version of image recognition. It's a public beta, so if you want to try it out, you'll need to sign up for now.
Make it yourself
With the existing APIs, if you pass an image of a person through the image recognition API, it can recognize things like "person" and "male," but
it cannot recognize something like the person's name.
This is because the APIs Google already provides have not been trained on personal names.
A fairly famous example of building your own model is the use of TensorFlow, a machine learning library provided by Google,
to sort "good cucumbers" from "big cucumbers."
Google Cloud Platform Japan Official Blog: TensorFlow connecting cucumber farmers and deep learning
Roughly speaking, the following flow is required.
(I am simplifying a great deal here.)
- Prepare training data, create an algorithm, and create a "trained model"
- Use a trained model
- Keep training on more data to improve accuracy
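As a rough illustration of these three steps, here is a tiny sketch in plain Python. It deliberately does not use TensorFlow: a nearest-centroid classifier stands in for "the algorithm," and the cucumber features (length and thickness) and all numbers are invented for this example.

```python
def train(samples):
    """Step 1: build a 'trained model' (mean feature vector per label)."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Step 2: use the trained model (pick the label with the closest mean)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for features, label in samples if predict(model, features) == label)
    return correct / len(samples)


if __name__ == "__main__":
    # Hypothetical (length, thickness) measurements with hand-assigned labels.
    data = [((20.0, 3.0), "good"), ((21.0, 3.2), "good"),
            ((30.0, 5.0), "big"), ((29.0, 4.8), "big")]
    model = train(data)
    print(predict(model, (22.0, 3.1)))

    # Step 3: retrain with more labeled data as it becomes available.
    more_data = [((19.5, 2.9), "good"), ((31.0, 5.2), "big")]
    model = train(data + more_data)
    print(accuracy(model, data + more_data))
```

Even in this toy version the shape of the work is visible: training is the expensive, repeated part, while using the trained model is a cheap lookup.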
However, the algorithm is quite difficult to implement.
That's where TensorFlow comes into play.
TensorFlow is a library for implementing deep learning.
It is "something developed by Google that is offered through a GCP service and has also been made open source."
C++ and Python APIs are available.
Also, machine learning requires a very large amount of resources during training,
mainly GPU and CPU (because it does things like image recognition).
"Cloud Machine Learning Engine" is available for that purpose.
Machine learning needs those resources only during training, so it is well suited to the cloud.
GPUs are now available, and a large number of GPU-specific machines are being launched behind the scenes.
Predictive Analytics - Cloud Machine Learning Engine | Google Cloud Platform
If you are interested in TensorFlow, there is a TensorFlow User Group, so
I think it would be a good idea to join a study session there.
TensorFlow User Group Tokyo - connpass
However, it is extremely popular: 200 people signed up for a study session with room for about 20...!
Summary
There was a lot more than this, but the rest is a perk for those who attended.
A boxed lunch was provided at noon. It was delicious.
Don't ask me to take better photos.
Ah, GCP is often compared to AWS, but the following part made sense to me as I listened to the talk.
AWS takes "products that already exist as open source (e.g. Memcached, Elasticsearch) and offers them on AWS in an easy-to-use form for users," whereas
GCP takes "products that Google developed itself
and offers them to users as GCP services."
For example, Dremel, which Google developed in-house, has been released as GCP's "BigQuery," and MapReduce, which Google also developed, is now available as open source in the form of Hadoop.
GCP basically takes the opposite approach from AWS.