Snowflake - Getting Started Guide

At this point you probably heard about Snowflake Inc as the new DWH solution for cloud technologies.

If so (and if not), the following is a short introduction and a tutorial to Snowflake (including basic code examples):

Snowflake - What is it?

  • A cloud oriented data-warehouse solution which was founded in 2012 in the US.

  • Supports SQL (with additional build-in extensions) and APIs using common development languages to load, manage and query data (with support to granular level security).

  • Snowflake provides data storage and data analytics capabilities with simple usage and adaptation, fast data retrieval on large datasets with competitive pricing.

  • Snowflake can run and integrate with Cloud platforms such as AWS (since 2014) and Azure (since 2018) while it is currently in preview on GCP (on-premise deployment is not supported).

  • Snowflake offers a new Architecture to support modern cloud based DWHs as a service via shared data and micro-partitions architecture while taking advantages of cloud functionalities (e.g. storage, scaling etc.).

Deploying Snowflake on AWS:

  • As Snowflake is an AWS external solution, it would not be found as a native AWS service or in the EC2 marketplace page.

  • To deploy Snowflake on AWS, navigate to the following AWS marketplace web page, after clicking on “Continue to Subscribe” you’ll be redirect to a AWS login screen and then to register with Snowflake.

  • That’s it! Once registered and verified via email, you will be able to login to the Snowflake web application which holds a graphical interface to manage and query data.

  • Billing will be according to the instances type selected in the previous step.

  • Connecting to Snowflake web page application screenshot:

Note that the URL holds the Snowflake Account Name which is the combination of latin letters + numbers + AWS region (red square).

Snowflake Terminology

(Partial list)

  • Warehouses

  • The virtual warehouse is a cluster of compute resources. It provides resources such as: memory, temporary storage and CPU.

  • It is is an independent compute resource that can be leveraged at any time for SQL execution and DML and then turned off when it isn’t needed.

  • Databases

  • All data in Snowflake is maintained in databases.

  • Each database consists of one or more schemas, which are logical groupings of database objects, such as tables and views.

  • Snowflake does not place any hard limits on the number of databases, schemas (within a database), or objects (within a schema) you can create

  • Tables

  • All data in Snowflake is stored in database tables, logically structured as collections of columns and rows.

  • Snowflake table structure holds several elements such as:

  • Micro-partitions & Data Clustering

  • Clustering Keys & Clustered Tables

  • Automatic Clustering

  • Views & Materialized Views

  • Snowflakes allow to store the result of a query to be accessed as if it were a table (nice to have).

  • Snowflake materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use.

  • As the View data is pre-computed, querying a materialized view is faster than executing the original query.

  • Schemas

  • A schema is a logical grouping of database objects (tables, views, etc.).

  • Each schema belongs to a single Snowflake database.

  • Staging Data (Stage)

  • Snowflake stages allows to define a map between internal data paths (e.g. snowflake, OS local storage) or external (e.g. AWS S3, Azure Blob Storage, GCP Cloud Storage) to a Snowflake table.

  • This will allow to execute the COPY statements that match a subset of files in order to upload data into Snowflake - taking advantage of parallel operations.

  • File Formats

  • Creates a named file format that can be used for bulk loading data into (and unloading data out of) Snowflake tables.

  • Can be used with the COPY commands to upload data files holding specific format such as: CSVs files with “|” as a delimiter or JSON, AVRO or Parquet files.

  • Sequences

  • Sequences are used to generate unique numbers across sessions and statements, including concurrent statements.

  • Sequences can be used to generate values for a primary key or any column that requires a unique value.

  • More on Snowflake Architecture

Snowflake Clients/APIs Support:

The following are Snowflake's supported connectors and drivers which can be downloaded from here or from the Snowflake web application:

  • CLI Client - SnowSQL (Linux/Win/Mac)

  • JDBC Driver

  • ODBC Driver

  • Snowflake Python Connector Components:

  • Snowflake Connector for Python

  • Snowflake SQLAlchemy

  • Node.js Driver

  • Go Driver

  • Snowflake Connector for Spark

  • Snowflake Connector for Kafka

  • Tableau Snowflake Connector

Python Example:

Step-By-Step

Loading a CSV File into Snowflake via snowsql:

The following are Snowflake supported connectors and drivers which can be downloaded from here or from the Snowflake web application:

Download Snowflake for Linux (latest version):

$ curl -O https://sfc-repo.snowflakecomputing.com/snowsql/bootstrap/1.2/linux_x86_64/snowsql-1.2.5-linux_x86_64.bash

Install Snowflake:

  • This will Create a “~/bin”

  • directory under the home directory path (by default)

  • Also will create a hidden directory under the home directory path named:

  • “~/.snowsql”

$ ./snowsql-1.2.5-linux_x86_64.bash

Install Snowflake with no prompt example:

$ printf '%s\n' ~/bin y | ./snowsql-1.2.5-linux_x86_64.bash

Connection Test:

$ snowsql -a <AccountName> -u <UserName> -d <Database> -s <SchemaName>

Password:

>...

Set a connection via Snowflake configuration file (eliminate password prompt):

$ vi ~/.snowsql/config

# Modify the following section:

# Set Connection Name [connection.<name>]

[connections.example]

accountname = <>

username = <>

password = <>

Connect with the configuration file:

$ snowsql -c <ConnectionName>

Create a Snowflake Stage:

$ snowsql -c <ConnectionName>

> create or replace stage public.stg file_format = (type = 'CSV' field_delimiter = '|' skip_header = 1);

Load (put) a CSV file into the Stage which is located in a local OS path:

$ snowsql -c <ConnectionName>

> put file:///home/SnowflakeFiles/file.csv @~/stg

# Put CSVs using wildcard

$ snowsql -c <ConnectionName>

> put file:///home/SnowflakeFiles/*.csv @~/stg

Copy CSV file to Snowflake table:

$ snowsql -c <ConnectionName>

> copy into <TableName> from @~/stg;

#DWH #AWS

Featured Posts
Posts are coming soon
Stay tuned...
Recent Posts