File Size Considerations for Snowflake

Dibyajyothi Rath

Introduction to Snowflake:

Snowflake is a cloud-based data warehousing platform that has become increasingly popular in recent years for its scalability, performance, and ease of use. One of the key considerations for optimal use of Snowflake is file size, as it affects performance, storage, and cost. This blog aims to provide a comprehensive guide on file size considerations for Snowflake, so you can make the most of this powerful platform. 

The following sections cover the key aspects of file sizing in Snowflake:

Understanding Snowflake:

Snowflake is a unique data warehousing platform that separates storage from compute, so you can store data in bulk at relatively low cost. The cost and performance of querying that data are determined by the amount of compute resources required to process it. Therefore, considering file size is essential for good performance.

File Size Considerations:

The recommended file size for balancing cost and performance in Snowflake can vary depending on several factors, such as the size of your data warehouse, the number of users, the complexity of your queries, and the type of data being worked with.

To enhance the efficiency of parallel load operations, it is recommended to produce compressed data files that are roughly 100-250 MB in size (or larger). Keep in mind that the VARIANT data type (commonly used to hold semi-structured data such as parsed JSON) imposes a 16 MB compressed size limit on individual rows.
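
As an illustrative sketch (the table and stage names here are made up), a large JSON file that contains a single outer array can be loaded with STRIP_OUTER_ARRAY = TRUE so that each array element becomes its own row, which helps keep individual VARIANT values under that limit:

-- Hypothetical table with a single VARIANT column
create or replace table demo_db.public.raw_events (payload variant);

-- Split the outer JSON array so each element is loaded as a separate row
copy into demo_db.public.raw_events
from @demo_db.public.json_stage
file_format = (type = json strip_outer_array = true)
on_error = 'skip_file';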

Loading files that are very large (100 GB or greater) is not recommended. If it is essential to load such a file, carefully consider the ON_ERROR copy option value: aborting or skipping an entire file because of a few minor errors can waste credits and cause delays.

Additionally, if the data loading process takes longer than the maximum permitted duration of 24 hours, it may be canceled without any portion of the file being committed. The COPY command below, which loads from a table stage with ON_ERROR set to skip_file, illustrates these options:

-- Load from the table stage, including file metadata columns alongside the data columns
copy into demo_db.public.emp_basic_local
from (select metadata$filename, metadata$file_row_number, t.$1, t.$2, t.$3, t.$4, t.$5, t.$6
      from @demo_db.public.%emp_basic_local t)
file_format = (type = csv field_optionally_enclosed_by = '"')
on_error = 'skip_file';

Splitting large data sets into smaller chunks reduces load time because the chunks can be loaded in parallel. The ideal chunk size for a large file is no more than about 250 MB; larger chunks degrade data loading performance.

There are several ways to break large files into smaller ones before loading them into Snowflake, and the recommended approach may depend on the specific use case and the operating system being used. Here are a few options:

Linux or macOS: The split utility enables you to split a CSV file into multiple smaller files.

split -l 100000 myfile.csv pages 

This example splits a file named myfile.csv into chunks of 100,000 lines each. Suppose the original file is 8 GB in size and contains 10 million lines: splitting at 100,000 lines produces 100 smaller files (10 million / 100,000 = 100), each roughly 80 MB in size. The split files are named pages<suffix>, for example pagesaa, pagesab, and so on.

Windows: Windows does not have a built-in file-splitting utility, but various external tools and scripts can be used to split large data files. Here, we use a Python script to break a large data file into chunks.

For reference, you can go through the code below:

def split_file(file_path, chunk_size):
    """Split a file into numbered chunks of roughly chunk_size bytes each."""
    with open(file_path, 'rb') as infile:
        chunk_count = 0
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break  # reached end of file
            chunk_count += 1
            # Chunks are named <file>.part001, <file>.part002, ...
            chunk_file_path = f'{file_path}.part{chunk_count:03}'
            with open(chunk_file_path, 'wb') as chunk_file:
                chunk_file.write(chunk)


if __name__ == '__main__':
    file_path = 'path/to/your/file'
    chunk_size = 1024 * 1024  # 1 MB per chunk; use a larger value for Snowflake loads
    split_file(file_path, chunk_size)
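
Once a large file has been split, the chunks can be uploaded to a stage and loaded in parallel. The sketch below is only an assumption-laden example (the local path, wildcard pattern, and PARALLEL value are placeholders); it uses SnowSQL's PUT command, which accepts wildcards, to stage the chunks produced above on the table stage from the earlier COPY example:

-- Upload all chunks to the table stage using up to 8 parallel upload threads (run from SnowSQL)
put file:///tmp/myfile.csv.part* @demo_db.public.%emp_basic_local parallel = 8;

After the files are staged, the COPY command shown earlier picks them up, and Snowflake distributes the staged files across the warehouse's compute resources for parallel loading.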

However, it is important to keep in mind that the recommended file size is just a guideline and may not be appropriate for all use cases. If you are working with large volumes of data, it may be necessary to use larger data files to avoid the overhead associated with processing many small files. 

To determine the optimal file size for your use case, test different file sizes and monitor the performance and cost of your data operations. This helps you settle on the best size for your specific needs and ensures that you are making the most efficient use of your data warehouse and resources.
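
One way to monitor this (a minimal sketch; the table name and time window are assumptions) is to review recent load activity with the INFORMATION_SCHEMA.COPY_HISTORY table function, which reports the size, row count, and status of each file loaded into a table:

-- File sizes and load outcomes for the example table over the last 24 hours
select file_name, file_size, row_count, status, last_load_time
from table(information_schema.copy_history(
    table_name => 'EMP_BASIC_LOCAL',
    start_time => dateadd(hours, -24, current_timestamp())));
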
Compression Techniques:

Snowflake supports several compression formats, including Snappy (for Parquet files), Zlib/Deflate, and Gzip, to help reduce the size of the data being loaded. Each compression technique has its own trade-offs, so it is important to choose the right one for your use case. For example, Snappy favors speed with a lighter compression ratio, while Gzip compresses more aggressively at the cost of slower compression and decompression, which often makes it the better choice for larger files.
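
As an illustrative sketch (the file format name is invented for this example), the compression of incoming CSV files can be declared on a named file format, or left as AUTO so Snowflake detects it from the file itself:

-- Named file format for gzip-compressed CSV files (the name is hypothetical)
create or replace file format demo_db.public.gzip_csv_format
  type = csv
  compression = gzip
  field_optionally_enclosed_by = '"';

-- Reference the named format in a COPY; compression = auto would let Snowflake detect the codec
copy into demo_db.public.emp_basic_local
from @demo_db.public.%emp_basic_local
file_format = (format_name = 'demo_db.public.gzip_csv_format')
on_error = 'skip_file';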

File/Data Loading Techniques: 

Snowflake supports three loading techniques: bulk loading, incremental loading, and streaming loading. Each of these loading techniques has its own impact on file size, so it is important to choose the right one for your use case.  

For example, bulk loading is best suited for loading large data sets into Snowflake, while incremental loading is best suited for updating existing data sets. Streaming loading is best suited for real-time data streaming, where data is loaded into Snowflake as it is generated. 
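
As a rough sketch of the streaming/continuous style (the pipe, stage, and table names are placeholders, and AUTO_INGEST assumes an external stage with cloud storage event notifications configured), a Snowpipe object wraps a COPY statement and loads new files as they arrive:

-- Hypothetical pipe that loads new files from an external stage as they land
create or replace pipe demo_db.public.emp_pipe
  auto_ingest = true
  as
  copy into demo_db.public.emp_basic_local
  from @demo_db.public.my_ext_stage
  file_format = (type = csv field_optionally_enclosed_by = '"');

The file size guidance above applies here as well: continuous loading is generally most efficient when the files it receives fall in roughly the same 100-250 MB compressed range.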

Dibyajyothi Rath

Data Engineer

Boolean Data Systems


Dibyajyothi Rath works as a Data Engineer at Boolean Data Systems and has built many end-to-end Data Engineering solutions. He is a SnowPro Core and Matillion Associate certified engineer with proven skills and expertise in Snowflake, Matillion, and Python, to name a few.

Conclusion:

File size is one of the most important aspects of using Snowflake effectively. By optimizing the size of the files being loaded into Snowflake, you can improve performance, reduce storage costs, and make the most of this powerful platform. By following the best practices mentioned in this blog, you can ensure that your Snowflake implementation is optimized for your use case.

About Boolean Data Systems

Boolean Data Systems is a Snowflake Select Services partner that implements solutions on cloud platforms. We help enterprises make better business decisions with data and solve real-world business analytics and data problems.

Global Headquarters

1255 Peachtree Parkway, Suite #4204, Alpharetta, GA 30041, USA.
Ph. : +1 678-261-8899
Fax : (470) 560-3866