Breaking down large CSV files

Today I received a 45MB CSV file for importing into a database… Needless to say, the application we were importing into didn’t seem to like the size of the file, for whatever reason… So I knocked up a quick bash script to break it into smaller ‘chunks’ of a defined number of lines, to make importing simpler.

I’m sure there are many ways in which it can be simplified, so if you know of any I’d welcome the contributions!

It’s run like this:

$ ./csv-chunk.sh large-data.csv 5000

The first argument is the filename and the second is the maximum number of lines for each ‘chunk’. From that 45MB megalith, 38 files of around 1.2MB each were produced, which didn’t seem to break the other end!
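If you want to sanity-check the result, comparing line counts is enough: the chunk totals should add up to the original count, plus one extra header line per chunk. The chunk-* names here are simply what the script below produces for this particular input file:

$ wc -l large-data.csv chunk-*-large-data.csv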

Here’s the script:

#!/bin/bash

function help {
  echo "Usage:"
  echo "  $0  =5000"
  exit 1
}

if [ $# -eq 0 ]; then
  help
fi

if [ $# -eq 1 ]; then
  chunk=5000
else
  chunk=$2
fi

file=$1

if [ ! -e "$file" ]; then
  echo "File $file not found!"
  exit 5
fi

header=$(head -n 1 "$file")
max=$(wc -l < "$file")
x=1

echo "Breaking down $file ($max lines into $chunk lined files)"

# Step through the data lines $chunk at a time; line 1 is the header,
# which gets re-added to each chunk, so each range starts at line 2.
for (( i=1; i<max; i+=chunk )); do
  chunkfile="chunk-$x-$file"

  if [ -e "$chunkfile" ]; then
    echo "$chunkfile already exists!"
    exit 2
  fi

  # Start the chunk with a copy of the header line
  echo "$header" > "$chunkfile"

  start=$(( i + 1 ))
  end=$(( i + chunk ))

  # Append the next block of data lines (print only lines $start to $end)
  sed -n "${start},${end}p" "$file" >> "$chunkfile"

  x=$(( x + 1 ))
done

echo "Created $x files"
exit 0
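
As an aside, if you have GNU coreutils to hand, a rough equivalent can be had with split. This is only a sketch (the --filter and --additional-suffix options need a reasonably recent GNU split): it strips the header, splits the rest into 5000-line pieces, and prepends the header to each piece as it is written:

$ tail -n +2 large-data.csv | split -l 5000 -d --additional-suffix=.csv \
    --filter='{ head -n 1 large-data.csv; cat; } > $FILE' - chunk-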
