Perfect for studying during the year-end and New Year holidays! Load Datastore data into GCP BigQuery and analyze it
Hello.
I'm Mandai, in charge of Wild on the development team.
I wanted to analyze Appmill's monitoring data, but it has grown too large to keep comfortably in our DB.
Now, when it comes to analyzing data of that size on GCP, the answer is obvious: this is where BigQuery comes in.
In this article, I will show you how to import data from Datastore on GCP into BigQuery, run queries, and export it for reuse.
Preparing to load data from Datastore
First, please note that you cannot import data directly from Datastore to BigQuery.
So, first move the data from Datastore to Cloud Storage.
To move Datastore data to Cloud Storage, use Datastore's backup feature.
Prepare a bucket in advance to store backups in Cloud Storage.
As I will explain later, we recommend that you create the bucket in the Tokyo region (asia-northeast1).
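If you prefer to script this step rather than use the Console, something like the following should work with the google-cloud-storage Python client. This is a minimal sketch; the project and bucket names are placeholders I made up.

```python
from google.cloud import storage

# Placeholder project and bucket names -- replace with your own.
client = storage.Client(project="my-gcp-project")

# Create the backup bucket in the Tokyo region, matching the
# BigQuery dataset location we will choose later.
bucket = client.create_bucket("my-datastore-backups", location="asia-northeast1")
print(f"Created bucket {bucket.name} in {bucket.location}")
```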
In the Datastore menu there is an item called "Management"; select it.
Next, click the "Open datastore management" button.
Please forgive the heavy blurring in the screenshots; they show real data.
You will be asked for confirmation only the first time you open it; after that, a separate window opens listing the data size of each Kind in the selected project's Datastore.
Check the Kind of the entity you want to extract and click "Backup Entities" to start backing up.
The export will not finish immediately, so please wait for a while.
Try reloading a few times, and if you see a screen like the one above, it's complete.
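Incidentally, the same export can be kicked off without the management UI, using the Datastore Admin API from the google-cloud-datastore Python package. A sketch under assumptions: the project, bucket, and kind names are placeholders, and note that this managed export writes .export_metadata files rather than the .backup_info files produced by the legacy backup UI (BigQuery can load either).

```python
from google.cloud import datastore_admin_v1

# Placeholder project, bucket, and kind names -- replace with your own.
client = datastore_admin_v1.DatastoreAdminClient()

operation = client.export_entities(
    request={
        "project_id": "my-gcp-project",
        "output_url_prefix": "gs://my-datastore-backups",
        "entity_filter": {"kinds": ["MonitoringLog"]},
    }
)

# Exports are long-running operations; block until the job finishes.
response = operation.result()
print(response.output_url)  # gs:// path of the export metadata
```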
Register data in BigQuery
Once you've prepared your data, load it into BigQuery.
The UI has been updated recently, so I will walk through it with plenty of screenshots (again, sorry for all the blurring).
Once you're on the BigQuery screen, click the Add Data link under Resources in the left menu.
Select "Pin project" and select the Cloud Storage project ID where the Datastore data was saved.
It is successful if the specified project is displayed under the search box as shown below.
Next, create a dataset in your project.
Click on the project ID that was added earlier.
The right pane will then switch to become the query editor.
Below the query editor there is a "Create dataset" link; click it and the dataset creation pane will slide in from the right.
Enter a dataset ID that will help you identify the data you are about to input, and select the data location.
The important point about the location is that it must match the region of the Cloud Storage bucket you created at the beginning.
If you created the bucket in the Tokyo region, select asia-northeast1, and so on.
If you choose a multi-region location for your BigQuery dataset, the Cloud Storage bucket can be in any region within that multi-region.
At the time of writing, there are two multi-regions available for BigQuery: US and EU; ASIA is not available.
If you follow this article, you will have created a Cloud Storage bucket in asia-northeast1, so you will also create a BigQuery dataset in asia-northeast1.
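For reference, the same dataset could be created with the BigQuery Python client. A minimal sketch, with a placeholder dataset ID; the point to notice is that the location mirrors the bucket's region.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Placeholder dataset ID; the location must match the bucket's region.
dataset = bigquery.Dataset("my-gcp-project.monitoring")
dataset.location = "asia-northeast1"

dataset = client.create_dataset(dataset)
print(f"Created dataset {dataset.dataset_id} in {dataset.location}")
```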
After creating the dataset, the next step is to create the table.
On the left, the dataset you just created now hangs under the project ID as a tree; this time, click the dataset name.
Then the right pane will change and a "Create table" link will appear; click on it.
There are many input items, but I will explain them one by one!
- Create table from
  - Choose where the data comes from. If you choose Upload, an uploader is displayed; this time, choose Cloud Storage.
- Select file from GCS bucket
  - Enter the path, starting from the Cloud Storage bucket name, of the file whose name ends in ".[Datastore entity Kind].backup_info". You can also drill down through the directory tree in the file picker, but the .backup_info file names are so long that they get cut off, so we recommend copying and pasting the path from the Cloud Storage screen, tedious as that is.
- File format
  - Select "Cloud Datastore Backup".
- Project name
  - Already selected at this step.
- Dataset name
  - Already selected at this step.
- Table name
  - Give it a name that is easy to identify within the dataset.
Other items do not need to be changed unless there are special requirements.
When you press the Create button, a data loading job is created and added to the job history.
Once the job is complete and the table is populated with data, you're ready to run your queries.
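The same load job can also be scripted. A minimal sketch with the Python client, assuming a legacy backup file; the URI and table name are placeholders carried over from the earlier examples.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Placeholder URI: point this at the .backup_info (or .export_metadata)
# file for the kind you exported.
uri = "gs://my-datastore-backups/xxxx.MonitoringLog.backup_info"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.DATASTORE_BACKUP,
)

load_job = client.load_table_from_uri(
    uri, "my-gcp-project.monitoring.monitoring_log", job_config=job_config
)
load_job.result()  # Block until the load job completes.

table = client.get_table("my-gcp-project.monitoring.monitoring_log")
print(f"Loaded {table.num_rows} rows")
```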
Run a query
Before running the query, let's write it in the editor.
In the image above, an error occurs because a dataset or table that does not exist is specified.
Note that the queries you write here are expected to be standard SQL.
Specify the table name after from by connecting the dataset name and table name with a dot.
If you specify an existing dataset or table, it will be as follows.
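For example, against the table loaded earlier, a query might look like this (same placeholder names as above); you can paste the SQL into the web editor or run it through the client library:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# dataset_name.table_name after FROM, joined by a dot.
query = """
SELECT *
FROM `my-gcp-project.monitoring.monitoring_log`
LIMIT 10
"""

for row in client.query(query).result():
    print(dict(row))
```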
What you should pay attention to is the lower right of the editor, which shows how much data the query will process.
The larger this number, the more the query costs to run.
As of 2018/12/19, the price is $5 per TB processed, so I think you can use it without worrying too much.
There is also a free tier of 1TB per month, which I think is enough to give it a try.
Based on the data I tried this time, 4.01GB of data was processed, so I could run it about 250 times for free.
Also, identical queries are served from a 24-hour cache, so re-running one is not recalculated and incurs no charge.
Make full use of the generous free tier and get comfortable with BigQuery, even on small data!
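If you want to know the processed size before spending any quota, a dry run reports it without actually executing the query. A sketch, again with the placeholder names from above:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# A dry run validates the query and estimates the bytes it would
# process, without running it or touching the cache.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT * FROM `my-gcp-project.monitoring.monitoring_log`",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed / 1024**3:.2f} GiB")
```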
BigQuery's standard SQL currently complies with the SQL 2011 standard, so even those who have only used MySQL or PostgreSQL can get started easily.
That said, a few things that work in MySQL do not carry over directly, which I would like to cover in a separate article.
Export the results
If you want to move the completed data somewhere, it is convenient to export the results.
Here we will show you how to export the execution results to CSV and store them in Cloud Storage.
There are several ways to export query execution results, but if the result set is small, exporting it directly to CSV and downloading it is easier and less time-consuming.
JSON is also good if you want to use it in a program, and if your company uses G Suite, you can also use Google Sheets.
However, once the result set grows large, a direct download only gives you the first portion, so you need to export to a BigQuery table instead.
To save large result sets to a BigQuery table, select BigQuery Table from Save results next to the query results title.
A modal will appear to save the results as a table, so select the project name, dataset name, enter the table name to export, and you're done.
It may take some time depending on the data size, but once completed, a new table will be added to the registered dataset.
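Programmatically, saving results to a table just means setting a destination on the query job. A minimal sketch with the same placeholder names:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Placeholder destination table for the query results.
dest = bigquery.TableReference.from_string("my-gcp-project.monitoring.daily_summary")
job_config = bigquery.QueryJobConfig(destination=dest)

query_job = client.query(
    "SELECT * FROM `my-gcp-project.monitoring.monitoring_log`",
    job_config=job_config,
)
query_job.result()  # Wait for the results table to be written.
```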
This table can be queried like any other, so let's export it to Cloud Storage using the table export function.
Note that BigQuery table storage is free only up to 10GB; anything beyond that is charged, so keep this in mind if you are only experimenting.
Next, enter the full path starting with the Cloud Storage export destination bucket name, select the export format and whether to compress or not, and the export will begin.
This is also managed as a job and its status can be checked.
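The export job can also be scripted with extract_table; here CSV with gzip compression, and a wildcard URI in case the output is sharded into multiple files. All names are placeholders from the earlier sketches.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)

# The * lets BigQuery shard large exports into multiple files.
extract_job = client.extract_table(
    "my-gcp-project.monitoring.daily_summary",
    "gs://my-datastore-backups/exports/daily_summary-*.csv.gz",
    job_config=job_config,
)
extract_job.result()  # This too runs as a job whose status you can poll.
```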
Because the result set here was small, the job completed almost instantly, which made this screenshot surprisingly hard to capture.
Summary
This time, I walked through the flow of loading Datastore data into BigQuery and saving query results to Cloud Storage in CSV format.
The key points:
- BigQuery is pay-as-you-go, priced by the amount of data a query processes (flat-rate pricing also exists, but it will put you in the red unless you use it heavily)
- BigQuery's standard SQL is slightly different from that provided by MySQL.
- Cloud Storage makes it easy to exchange data
- Personally, I think BigQuery is easy to use.
- Exporting from Datastore to Cloud Storage takes a surprisingly long time
That about sums it up.
It's a great service with quick query execution times and no stress!
No wonder it's so popular.
There are various export destinations from BigQuery; Google Sheets in particular is suited to sharing with non-engineers, so it is worth remembering when the need arises.
That's it.