Enables loading of files as soon as the data is available in stages. No need to manually load data in the tables once the Snowpipe is configured. This saves time and effort.
The Snowpipe expects a predefined statement — ‘COPY INTO…’ which is executed when a file is ‘detected’. Couple of ways to identify,
1. Using Cloud Messaging
2. Calling Snowpipe REST endpoints.
An event notification informs the pipe to execute the statements for data load.
Difference Between Snowpipe and Bulk Load
Best Practices for Data Load
Consider following,
1. Number of Load Operations SHOULD NOT EXCEED the number of data files to be loaded.
2. File size roughly 100–250 MB compressed.
3. Large data files (100 GB or larger) are NOT RECOMMENDED.
a. Aborting or skipping large file loading can result in wasted credits.
b. Splitting large files by line.
Example — Splitting 10M rows file into 100,000 each of 100 smaller files.
4. Maximum Continuous data upload duration is 24 hours. It could be aborted without any portion being uploaded.
Semi-Structure Data Upload
A cap of 16 MB per row has been set for a VARIANT type of data. A JSON is nothing but an array of multiple documents (or rows) separated by a comma. Snowflake recommends using STRIP_OUTER_ARRAY to remove the array structure and loading records in separate rows.
Continuous Data Loads
Snowpipe is developed to load files within a minute of event notification. This is necessarily not the case for larger files where a lot of computing resources can be used.
Cost is not only associated with resources used for loading the files but also managing the files in the queue. This overhead charge increases with the number of files queued for upload.
It is recommended to have one file per minute to have a good balance between cost and performance.
Columns Creation
When loading data, Snowflake tries to extract data in columnar format as much as possible but there are scenarios that can result in failure, such as,
1. A single ‘null’ value in an element will result in NO columnar extraction. It is not the same for elements with missing values. They will be extracted in a column.
2. Elements with multiple data types will NOT be extracted to a column. Example — [{“name”:” Shahrukh”},{“name”:69}] will result in no columniation.
Columns Creation impacts performance as Snowflake engines scan columns when data is columnized which is better than when the data is not columnized, resulting in the overall scan of the JSON structure.
Using STRIP_NULL_VALUES helps in Columns Creation when ‘null’ values mean they are missing values.
Good To Know
Few things to keep in mind while working with Snowpipe,
- Not necessarily the files will be loaded in an orderly fashion.
2. Metadata is used to avoid data duplication such name of the file, the path of the file, etc.
3. Pipe can be resumed or paused using Snowflake commands and is recommended to optimize the cost.
Automating Snowpipe
With reference to AWS, Snowpipe can load data automatically using Simple Query Service. An integration can be created which can then be referred to as and when the external cloud storage needs to be accessed.
Post creating an integration, the AWS policies are updated with IAM User permissions.
S3 event creation is the next step to be taken post updating the policies. An event is expected to be fired when an action(add, update, delete) from SQS. This event then triggers Snowpipe to execute its commands further.
Snowpipe Rest Endpoints
Snowflake provides a REST endpoint that can be configured to load files in the Databases using key-based authentication.
Recommendation: Create a least privileged role (and user) to ingest files from the pipe.
Snowpipe Error Notifications
The Error notifications can be pushed to cloud messaging services if an error is encountered while loading files. The error notification would have errors per file.
Setting the parameter — ON_ERROR as SKIP_FILE will enable error notifications.
Error Notifications are set up in a different manner for different cloud services.
Snowpipe Cost
Snowpipe is a serverless service by snowflake — Users can use it without thinking about the Virtual warehouse configuration. Resources can shrink or grow as per the load on Snowpipe processes. Accounts are charged on usage only.
Managing Snowpipe
The pipe object does not support the PURGE option, meaning that staged files cannot be deleted automatically when the data is successfully uploaded. These files are needed to be physically removed to delete files.
Loading historical data
Using ALTER PIPE … REFRESH statement executes the fetch of data files staged within the previous 7 days. The load history is captured in the metadata of the pipe object. It will be dropped if the pipe is recreated. Having history details in the pipe’s metadata helps to avoid duplication of data upload into the target tables which might occur post-refresh commands.
Note: This is the Seventh one in a series of articles. Click here for the previous one. Click here for the next.
Credit 1 — https://www.phdata.io/blog/how-to-optimize-snowpipe-data-load/