Snowflake - Getting Started Guide
By now you have probably heard of Snowflake Inc. as a new cloud DWH solution.
Whether you have or not, the following is a short introduction and tutorial to Snowflake, including basic code examples:
Snowflake - What is it?
A cloud-oriented data-warehouse solution whose company was founded in 2012 in the US.
It supports SQL (with additional built-in extensions) and APIs in common development languages to load, manage and query data (with support for granular-level security).
Snowflake provides data storage and data analytics capabilities that are simple to use and adopt, with fast data retrieval on large datasets and competitive pricing.
Snowflake runs on and integrates with cloud platforms such as AWS (since 2014) and Azure (since 2018), and is currently in preview on GCP (on-premise deployment is not supported).
Snowflake offers a new architecture for modern cloud-based DWHs as a service, built on shared data and micro-partitions, while taking advantage of cloud functionality (e.g. storage, scaling, etc.).
Deploying Snowflake on AWS:
As Snowflake is a solution external to AWS, you will not find it as a native AWS service or on the EC2 marketplace page.
To deploy Snowflake on AWS, navigate to the following AWS Marketplace web page; after clicking on “Continue to Subscribe” you’ll be redirected to an AWS login screen and then to the Snowflake registration.
That’s it! Once registered and verified via email, you will be able to log in to the Snowflake web application, which provides a graphical interface to manage and query data.
Billing will be according to the instance type selected in the previous step.
Connecting to the Snowflake web application (screenshot):
Note that the URL contains the Snowflake account name, which is a combination of Latin letters and numbers plus the AWS region (red square).
Snowflake Terminology
(Partial list)
Virtual Warehouse
A virtual warehouse is a cluster of compute resources. It provides resources such as memory, temporary storage and CPU.
It is an independent compute resource that can be leveraged at any time for SQL execution and DML operations, and then turned off when it isn’t needed.
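A minimal sketch of creating and suspending a warehouse through the Snowflake Connector for Python (see the Python example further below); the warehouse name demo_wh, its size and the placeholder credentials are assumptions:

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

# An extra-small warehouse that suspends itself after 5 minutes of inactivity
# and resumes automatically when a query needs compute.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS demo_wh
    WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE
""")

# Turn the compute off explicitly once it is no longer needed (assumes it is currently running).
cur.execute("ALTER WAREHOUSE demo_wh SUSPEND")

cur.close()
conn.close()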
Database
All data in Snowflake is maintained in databases.
Each database consists of one or more schemas, which are logical groupings of database objects, such as tables and views.
Snowflake does not place any hard limits on the number of databases, schemas (within a database), or objects (within a schema) you can create.
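As an illustration (demo_db and demo_schema are hypothetical names), a database and a schema inside it can be created with plain SQL through the Python connector:

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS demo_db")            # a new database
cur.execute("CREATE SCHEMA IF NOT EXISTS demo_db.demo_schema")  # a logical grouping inside it
cur.execute("SHOW SCHEMAS IN DATABASE demo_db")                 # list the schemas in the database
print(cur.fetchall())

cur.close()
conn.close()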
Table
All data in Snowflake is stored in database tables, logically structured as collections of columns and rows.
The Snowflake table structure involves several elements such as (see the sketch after this list):
Micro-partitions & Data Clustering
Clustering Keys & Clustered Tables
Automatic Clustering
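A sketch that ties these elements together: the hypothetical events table below declares a clustering key on its date column, which Snowflake uses when maintaining micro-partitions (automatic clustering):

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

# CLUSTER BY defines the clustering key used to co-locate rows with similar
# event_date values in the same micro-partitions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS demo_db.demo_schema.events (
        event_date DATE,
        user_id    NUMBER,
        payload    VARIANT
    )
    CLUSTER BY (event_date)
""")

cur.close()
conn.close()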
View
A view allows the result of a query to be accessed as if it were a table (nice to have).
Materialized View
A Snowflake materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use.
As the view data is pre-computed, querying a materialized view is faster than executing the original query.
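A sketch over the hypothetical events table from the table example above (daily_event_counts is a made-up name; materialized view availability also depends on your Snowflake edition):

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

# The SELECT below is pre-computed and kept up to date by Snowflake, so querying
# the materialized view avoids re-aggregating the base table on every query.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS demo_db.demo_schema.daily_event_counts AS
    SELECT event_date, COUNT(*) AS events
    FROM demo_db.demo_schema.events
    GROUP BY event_date
""")

cur.close()
conn.close()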
Schema
A schema is a logical grouping of database objects (tables, views, etc.).
Each schema belongs to a single Snowflake database.
Stage
Snowflake stages define a mapping between internal data locations (e.g. Snowflake, local OS storage) or external ones (e.g. AWS S3, Azure Blob Storage, GCP Cloud Storage) and a Snowflake table.
This allows executing COPY statements that match a subset of files in order to load data into Snowflake, taking advantage of parallel operations.
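A sketch of an internal named stage (an external stage would instead point at an S3/Blob/GCS location plus credentials); the stage name and local file path are assumptions, and the snowsql step-by-step further below shows the same flow from the CLI:

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

# An internal named stage with a CSV file format attached to it.
cur.execute("""
    CREATE OR REPLACE STAGE demo_db.demo_schema.demo_stage
    FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1)
""")

# PUT uploads local files into the stage (the connector performs the file transfer).
cur.execute("PUT file:///home/SnowflakeFiles/file.csv @demo_db.demo_schema.demo_stage")

cur.close()
conn.close()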
File Format
A named file format can be used for bulk loading data into (and unloading data out of) Snowflake tables.
It can be used with the COPY commands to load data files of a specific format, such as CSV files with “|” as a delimiter, or JSON, AVRO or Parquet files.
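For example (a sketch; demo_csv_format is a hypothetical name and <TableName> is a placeholder), a named file format is created once and then referenced from COPY:

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

# A reusable, named description of pipe-delimited CSV files.
cur.execute("""
    CREATE OR REPLACE FILE FORMAT demo_db.demo_schema.demo_csv_format
    TYPE = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1
""")

# Reference the named format when copying staged files into a table.
cur.execute("""
    COPY INTO <TableName>
    FROM @demo_db.demo_schema.demo_stage
    FILE_FORMAT = (FORMAT_NAME = 'demo_db.demo_schema.demo_csv_format')
""")

cur.close()
conn.close()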
Sequence
Sequences are used to generate unique numbers across sessions and statements, including concurrent statements.
They can be used to generate values for a primary key or any column that requires a unique value.
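A short sketch (demo_seq and the customers table are hypothetical): a sequence can back a surrogate-key column or be queried directly:

import snowflake.connector

# Placeholder credentials - replace with your own account details.
conn = snowflake.connector.connect(user="<UserName>", password="<Password>", account="<AccountName>")
cur = conn.cursor()

# Use the sequence as the default for a surrogate-key column.
cur.execute("CREATE SEQUENCE IF NOT EXISTS demo_db.demo_schema.demo_seq START = 1 INCREMENT = 1")
cur.execute("""
    CREATE TABLE IF NOT EXISTS demo_db.demo_schema.customers (
        id   NUMBER DEFAULT demo_db.demo_schema.demo_seq.NEXTVAL,
        name STRING
    )
""")

# Sequences can also be queried directly; each call returns the next unique number.
cur.execute("SELECT demo_db.demo_schema.demo_seq.NEXTVAL")
print(cur.fetchone()[0])

cur.close()
conn.close()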
More on Snowflake Architecture
Snowflake Clients/APIs Support:
The following are Snowflake's supported connectors and drivers which can be downloaded from here or from the Snowflake web application:
CLI Client - SnowSQL (Linux/Win/Mac)
JDBC Driver
ODBC Driver
Snowflake Python Connector Components:
Snowflake Connector for Python
Snowflake SQLAlchemy
Node.js Driver
Go Driver
Snowflake Connector for Spark
Snowflake Connector for Kafka
Tableau Snowflake Connector
Python Example:
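A minimal end-to-end sketch with the Snowflake Connector for Python (installed via pip install snowflake-connector-python); the account name, credentials, warehouse, database and schema below are placeholders to replace with your own:

import snowflake.connector

# Placeholder connection details - the account name is the one shown in your Snowflake URL.
conn = snowflake.connector.connect(
    user="<UserName>",
    password="<Password>",
    account="<AccountName>",      # e.g. "xx12345.eu-west-1"
    warehouse="<WarehouseName>",
    database="<Database>",
    schema="<SchemaName>",
)

cur = conn.cursor()
try:
    # A simple sanity check: ask Snowflake for its current version.
    cur.execute("SELECT CURRENT_VERSION()")
    print("Snowflake version:", cur.fetchone()[0])

    # Any SQL statement can be executed the same way, e.g. a small context query.
    cur.execute("SELECT CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()")
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()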
Step-By-Step
Loading a CSV File into Snowflake via snowsql:
Download SnowSQL for Linux (latest version):
Install SnowSQL:
This will create a “~/bin” directory under the home directory path (by default), as well as a hidden directory named “~/.snowsql” under the home directory.
$ ./snowsql-1.2.5-linux_x86_64.bash
Example of installing SnowSQL with no prompts:
$ printf '%s\n' ~/bin y | ./snowsql-1.2.5-linux_x86_64.bash
Connection Test:
$ snowsql -a <AccountName> -u <UserName> -d <Database> -s <SchemaName>
Password:
>...
Set up a connection via the SnowSQL configuration file (eliminates the password prompt):
$ vi ~/.snowsql/config
# Modify the following section:
# Set the connection name: [connections.<name>]
[connections.example]
accountname = <>
username = <>
password = <>
Connect with the configuration file:
$ snowsql -c <ConnectionName>
Create a Snowflake Stage:
$ snowsql -c <ConnectionName>
> create or replace stage public.stg file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);
Load (put) a CSV file from a local OS path into the stage:
$ snowsql -c <ConnectionName>
> put file:///home/SnowflakeFiles/file.csv @public.stg
# Put CSVs using wildcard
$ snowsql -c <ConnectionName>
> put file:///home/SnowflakeFiles/*.csv @public.stg
Copy the staged CSV data into a Snowflake table:
$ snowsql -c <ConnectionName>
> copy into <TableName> from @public.stg;