[DWH] Snowflake features and architecture [Big Data]
This is Ohara from the technical sales department.
I will describe the features and architecture of the data warehouse (DWH) "Snowflake".
Well-known cloud-based DWHs include "Google BigQuery" on GCP and "Amazon Redshift" on AWS, but recently "Snowflake" has also become well known.
Snowflake lets you choose AWS, GCP, or Azure as the platform, and the Snowflake service then runs on that infrastructure.
*Information as of September 2020.
Features of Snowflake
● A single source for all your data
Snowflake creates a single, ready-to-query source where you can effectively manage all your data, including JSON and XML, with nearly unlimited, low-cost cloud storage. You can also access and provide shared data to your customers and partners through your own private data exchange.
● Fully SQL compatible / multi-cluster
Multi-cluster compute resources provide near-unlimited concurrency, so large numbers of users and queries can run at the same time. Snowflake natively supports fully ANSI-compatible SQL, so you can query semi-structured data directly in SQL and use your favorite analytical and machine learning tools.
● Near zero maintenance
Automatic updates with no planned downtime reduce system management and maintenance to nearly zero. Compute usage scales up and down automatically and is billed per second. Snowflake also supports global data access and cross-cloud data synchronization.
Snowflake architecture
The Snowflake architecture is characterized by a "three-tier design" that separates storage, compute, and cloud services into independent layers.
Compute and storage resources are physically separated but logically integrated into a single data platform system, providing an architecture that allows for non-disruptive scaling.
● Service:
It consists of stateless computing resources running in multiple Availability Zones.
This layer provides a highly available and distributed metadata store for global state management and enables services such as data pruning, data exchange, and cross-cloud data replication.
The service layer provides security and encryption key management and enables all SQL, DML and DDL functions, including:
- Provide authentication and management
- Apply security features
- Compile and optimize queries
- Coordinate all transactions
For example, to perform data pruning, the service layer uses metadata at query compile time to determine which micro-partitions need to be scanned, so that the query completes quickly.
This results in better performance because only the data needed to complete the query is scanned.
In addition, automatic metadata processing is performed by separate integration subsystems that perform statistics collection and other metadata operations without requiring user computing resources.
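The data pruning described in this section can be illustrated with a minimal Python sketch. It is not Snowflake's actual implementation: the `PartitionMeta` class, the `prune` function, and the sample partitions are all hypothetical, and it assumes each micro-partition stores only the min and max of a single integer column.

```python
from dataclasses import dataclass


@dataclass
class PartitionMeta:
    """Hypothetical per-partition metadata: min and max of one column."""
    name: str
    min_value: int
    max_value: int


def prune(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


# Three made-up micro-partitions covering disjoint id ranges.
partitions = [
    PartitionMeta("p1", 1, 100),
    PartitionMeta("p2", 101, 200),
    PartitionMeta("p3", 201, 300),
]

# A filter such as "WHERE id BETWEEN 150 AND 180" can skip p1 and p3,
# because their min/max ranges cannot contain a matching row.
to_scan = prune(partitions, 150, 180)
print([p.name for p in to_scan])
```

The decision uses metadata only, so the partitions that are skipped never have to be read from storage.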
● Compute:
The compute layer is the backbone of Snowflake; all data processing is performed by a compute engine designed to process large amounts of data quickly and efficiently.
- Retrieves from the storage layer only the minimum amount of data needed to satisfy a query, as determined by Snowflake's data pruning algorithms.
- Multiple compute engines, a design unique to Snowflake, operate on the same data simultaneously as isolated workloads, with system-wide transactional integrity and full ACID compliance, ensuring consistent read operations (SELECTs).
(Write operations do not block readers.)
- Caches data and query results locally, significantly improving performance and reducing costs.
(There are no computing charges for cached query results.)
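As a rough illustration of why cached query results avoid compute work, here is a toy result cache in Python. It is only a sketch under simplifying assumptions: the `ResultCache` class and `run_on_compute` function are hypothetical, the cache is keyed on the exact query text, and invalidation is manual rather than tied to data changes.

```python
class ResultCache:
    """Toy query-result cache keyed by exact query text (not Snowflake's API)."""

    def __init__(self, execute):
        self._execute = execute   # function that really runs the query
        self._results = {}        # query text -> cached result
        self.compute_runs = 0     # how many times compute was actually used

    def query(self, sql):
        if sql not in self._results:
            # Cache miss: run the query on compute and remember the result.
            self.compute_runs += 1
            self._results[sql] = self._execute(sql)
        return self._results[sql]

    def invalidate(self):
        """Drop all cached results, e.g. after the underlying data changes."""
        self._results.clear()


def run_on_compute(sql):
    # Stand-in for real query execution on a compute engine.
    return sum(range(1_000_000))


cache = ResultCache(run_on_compute)
first = cache.query("SELECT SUM(n) FROM numbers")
second = cache.query("SELECT SUM(n) FROM numbers")  # served from the cache
print(first == second, cache.compute_runs)
```

The second call returns the stored result without invoking `run_on_compute` again, which is the situation in which no compute charge would apply.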
● Storage:
The storage layer performs data processing in the following ways:
- Split data into micro-partitions; a large table can consist of hundreds of thousands of partitions
- Extract metadata (such as timestamps and min/max values) for efficient query processing
- Compress micro-partitions to save storage space and cost
- Fully encrypt your data using a secure key hierarchy
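The first three steps above (splitting, metadata extraction, compression) can be sketched in Python as follows. This is a simplified illustration, not how Snowflake stores data: the `build_micro_partitions` function is hypothetical, it splits by a fixed row count and serializes rows as JSON, and encryption is left out entirely.

```python
import json
import zlib


def build_micro_partitions(rows, column, rows_per_partition):
    """Split rows into fixed-size chunks, record min/max of `column`, compress."""
    partitions = []
    for start in range(0, len(rows), rows_per_partition):
        chunk = rows[start:start + rows_per_partition]
        values = [row[column] for row in chunk]
        raw = json.dumps(chunk).encode("utf-8")
        partitions.append({
            "min": min(values),           # metadata used later for pruning
            "max": max(values),
            "row_count": len(chunk),
            "raw_bytes": len(raw),        # size before compression
            "data": zlib.compress(raw),   # compressed partition contents
        })
    return partitions


# 1,000 made-up rows split into partitions of 250 rows each.
rows = [{"id": i, "status": "ok"} for i in range(1, 1001)]
parts = build_micro_partitions(rows, "id", rows_per_partition=250)
for p in parts:
    print(p["min"], p["max"], p["row_count"], p["raw_bytes"], len(p["data"]))
```

Each partition carries its own min/max values, so a query filter can be checked against the metadata before any compressed data is decompressed.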
Summary
Snowflake is a service that runs on the infrastructure of a cloud platform.
AWS, GCP, and Azure each already offer their own DWH services, so Snowflake competes with them to some extent.
However, because it is a service dedicated to Snowflake's own data cloud, you can choose between them depending on your purpose. It would also be interesting to try using them for different purposes.