Breaking Down / Splitting Large CSV Files with PowerShell

Handling large CSV files efficiently is a common challenge for data analysts and IT professionals alike. Manipulating such files can be time-consuming and resource-intensive, often resulting in performance issues. In this blog post, we will explore a PowerShell script that helps break down large CSV files into smaller, more manageable chunks. By leveraging PowerShell’s capabilities, you can improve processing times and work with data subsets more effectively.

Suppose you have a massive CSV file containing thousands or millions of records. Reading and manipulating such a file as a single entity can be cumbersome and may exceed system limitations. Furthermore, performing operations on the entire dataset might not be necessary or practical in some scenarios. Our goal is to divide this monolithic file into smaller pieces for better handling and processing. The script was originally written for a customer whose dataset of roughly 2 million lines crashed the Excel desktop app.

The Script

# Load the entire input file into memory.
# Replace '<<InputFilePath>>' with the path to your large CSV file.
$InputFilename = Get-Content '<<InputFilePath>>'
$OutputFilenamePattern = '2021-00-'   # prefix for the generated chunk files
$LineLimit = 50000                    # maximum number of lines per output file
$line = 0    # index of the current line in the input
$i = 0       # lines counted towards the current chunk
$file = 0    # number of the chunk file being written
$start = 0   # index of the first line of the current chunk

while ($line -le $InputFilename.Length) {
    # Flush a chunk when the limit is reached or the input is exhausted
    if ($i -eq $LineLimit -or $line -eq $InputFilename.Length) {
        $file++
        $Filename = "$OutputFilenamePattern$file.csv"
        $InputFilename[$start..($line - 1)] | Out-File $Filename -Force
        $start = $line
        $i = 0
        Write-Host "$Filename"
    }
    $i++
    $line++
}

Let’s dive into it

Let’s break down the script and understand how it solves our problem:

  1. $InputFilename: Despite its name, this variable stores the content (all lines) of the input CSV file, not just a path. Replace '<<InputFilePath>>' with the path to your large CSV file.
  2. $OutputFilenamePattern: This variable defines the prefix for the output filenames. The provided script uses '2021-00-', followed by a numerical identifier for each chunk.
  3. $LineLimit: This variable sets the maximum number of lines per output file. Adjust this value to suit your requirements.
  4. The script initializes the variables $line, $i, $file, and $start, which track the current line index, the line count within the current chunk, the output file number, and the first line of the current chunk.
  5. The while loop runs until $line has passed the length of the input; the -le comparison gives the loop one extra iteration so that the final, partial chunk is flushed when $line equals the input length.
  6. Inside the loop, the script checks whether the per-chunk line count ($i) equals the line limit ($LineLimit) or whether the end of the input file has been reached.
  7. If either condition is true, the script increments the file number ($file), builds a filename from the output filename pattern and the file number, and writes the lines from index $start through $line-1 to that file using the Out-File cmdlet.
  8. After writing a chunk, the script sets $start to the current line index, resets the per-chunk count ($i) to 0, and prints the name of the created file using the Write-Host cmdlet.
  9. This repeats until all lines in the input file have been processed, producing output files named '2021-00-1.csv', '2021-00-2.csv', and so on.
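One caveat of the script above: only the first chunk contains the CSV header row, so the second and later chunks open in Excel without column names. A possible variant that repeats the header in every chunk might look like the sketch below; the function name Split-CsvWithHeader and its parameters are illustrative and not part of the original script.

```powershell
function Split-CsvWithHeader {
    param(
        [string]$Path,           # input CSV file
        [string]$Prefix,         # output filename prefix, e.g. '2021-00-'
        [int]$LineLimit = 50000  # data rows per chunk (header not counted)
    )
    # @( ) forces an array even when the file has a single line
    $Lines = @(Get-Content $Path)
    if ($Lines.Count -lt 2) { return }   # nothing to split below the header

    $Header = $Lines[0]                            # column header row
    $Data   = @($Lines[1..($Lines.Count - 1)])     # everything after it
    $file   = 0
    for ($start = 0; $start -lt $Data.Count; $start += $LineLimit) {
        $file++
        $end = [Math]::Min($start + $LineLimit, $Data.Count) - 1
        $OutPath = "$Prefix$file.csv"
        # Prepend the header to each chunk before writing it out
        @($Header) + $Data[$start..$end] | Out-File $OutPath -Force
        Write-Host $OutPath
    }
}
```

Each resulting file then starts with the same header row, at the cost of a few extra lines in total output.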

Conclusion

By employing the provided PowerShell script, you can divide large CSV files into smaller, more manageable chunks. This approach enables efficient processing of data subsets, reduces system resource usage, and facilitates easier manipulation of data. Feel free to adjust the script according to your specific requirements, such as changing the line limit or modifying the output filename pattern.
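One adjustment worth considering: because Get-Content loads the whole file into memory, truly huge inputs (multiple gigabytes) can exhaust RAM before the split even starts. A streaming sketch using .NET's StreamReader and StreamWriter, which keeps only one line in memory at a time, could look like this; the function name and parameters are again illustrative assumptions, not part of the original script.

```powershell
function Split-CsvStreaming {
    param(
        [string]$Path,           # input CSV file
        [string]$Prefix,         # output filename prefix
        [int]$LineLimit = 50000  # data rows per chunk (header not counted)
    )
    # StreamReader needs an absolute path; relative paths resolve against
    # the process working directory, not PowerShell's current location.
    $reader = [System.IO.StreamReader]::new((Resolve-Path $Path).Path)
    $writer = $null
    try {
        $header = $reader.ReadLine()   # repeated in every chunk
        $file = 0
        $count = 0
        while ($null -ne ($line = $reader.ReadLine())) {
            # Open the next chunk file when none is open yet or the limit is hit
            if ($null -eq $writer -or $count -eq $LineLimit) {
                if ($writer) { $writer.Dispose() }
                $file++
                $outPath = Join-Path (Get-Location) "$Prefix$file.csv"
                $writer = [System.IO.StreamWriter]::new($outPath)
                $writer.WriteLine($header)
                $count = 0
                Write-Host $outPath
            }
            $writer.WriteLine($line)
            $count++
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

Memory use stays flat regardless of input size, since no more than one input line and one open output file are held at any moment.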

Remember, PowerShell offers a wide range of functionalities that can further enhance your data processing workflows. Exploring its features and incorporating them into your scripts can greatly improve your productivity when dealing with large datasets.
