In data analysis, .csv files with hundreds of thousands of records still play a role. You might think:
Hey, why worry about .csv, that’s an ancient format, nobody uses that!
Think again! CSV files, short for comma-separated values (the separator can also be a tab or pretty much any character you want, but never mind that), are still used a lot in data analysis as a raw input format.
Thanks to a former colleague I was tasked with merging 1.5 million records that were split over several files into a big CSV file. Of course they all contained neat little header portions like so:
    # file 1
    Name,Email
    Sirius,email@example.com
    Remus,firstname.lastname@example.org
    Whistler,email@example.com
    Salome,firstname.lastname@example.org

    # file 2
    Name,Email
    Galadriel,AwesomeElvenQueen@loth.lorien
    Saruman,WhiteHand69@dark.tower
So the simple way would be to just run cat file* > output_simple.csv, right? That would repeat the header row once for every input file, though, which is no good.
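To see the problem concretely, here is a small sketch that recreates shortened versions of the two sample files in a scratch directory and counts how often the header shows up after a plain cat. The file names file1.csv and file2.csv are assumptions made for this illustration.

```shell
# Recreate shortened versions of the two sample files (file names are assumed).
workdir=$(mktemp -d)
printf 'Name,Email\nSirius,email@example.com\n' > "$workdir/file1.csv"
printf 'Name,Email\nGaladriel,AwesomeElvenQueen@loth.lorien\n' > "$workdir/file2.csv"

# Naive merge: every input file contributes its own header line.
cat "$workdir"/file*.csv > "$workdir/output_simple.csv"

# Count the header lines in the merged file; there is one per input file.
header_count=$(grep -c '^Name,Email$' "$workdir/output_simple.csv")
echo "header lines: $header_count"   # prints "header lines: 2"
```

With two input files the header appears twice; with hundreds of files it would be scattered all over the merged data.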
CSV: keep only the first header / skip repeated headers
To concatenate without repeating the headers, we can craft a simple (if that term ever applies to bash) script:
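    #!/bin/bash
    # Merge all CSV files whose names start with a prefix, keeping only the first header.
    # Exactly two arguments are required: the file name prefix and the output file.
    if [[ $# -ne 2 ]]; then
        echo 'usage:'
        echo './merge.sh pattern output.csv'
        exit 1
    fi
    output_file=$2
    i=0
    # Let the shell expand the glob instead of parsing the output of ls,
    # so file names containing spaces survive.
    for filename in "$1"*.csv; do
        if [[ $i -eq 0 ]]; then
            # copy the CSV header from the first file (this also truncates the output file)
            echo "first file: $filename"
            head -n 1 "$filename" > "$output_file"
        fi
        echo "$i common part: $filename"
        # copy everything except the header line from every file
        tail -n +2 "$filename" >> "$output_file"
        i=$(( i + 1 ))
    done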
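It’s still fairly simple, because it just uses a loop, a conditional, and head and tail, which read a line-based file from the top or from the bottom respectively. With an offset such as tail -n +2, tail starts printing at the given line instead, which is how the header gets skipped.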
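A minimal sketch of the two commands in isolation, using a scratch file with a header and two of the sample rows. The variable names sample, header and rows are only for this illustration.

```shell
# Scratch file with a header line and two data rows.
sample=$(mktemp)
printf 'Name,Email\nSirius,email@example.com\nRemus,firstname.lastname@example.org\n' > "$sample"

# head -n 1 prints only the first line, which is the header.
header=$(head -n 1 "$sample")

# tail -n +2 prints everything from line 2 onwards, which is all rows without the header.
rows=$(tail -n +2 "$sample")

echo "header: $header"
echo "$rows"
rm -f "$sample"
```

Note the plus sign: tail -n 2 would print the last two lines, while tail -n +2 prints from the second line to the end.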
Run it like this:
    ./merge.sh file output.csv
and the script writes the following to output.csv:
    Name,Email
    Sirius,email@example.com
    Remus,firstname.lastname@example.org
    Whistler,email@example.com
    Salome,firstname.lastname@example.org
    Galadriel,AwesomeElvenQueen@loth.lorien
    Saruman,WhiteHand69@dark.tower
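If you would rather not keep a script around, the same result can probably be had from a single awk call. This is an alternative sketch, not the script above. It assumes that every file has exactly one header line and that the first file is not empty; the file names are again made up for the example.

```shell
# Scratch copies of shortened sample files (file names are assumed).
workdir=$(mktemp -d)
printf 'Name,Email\nSirius,email@example.com\nRemus,firstname.lastname@example.org\n' > "$workdir/file1.csv"
printf 'Name,Email\nGaladriel,AwesomeElvenQueen@loth.lorien\nSaruman,WhiteHand69@dark.tower\n' > "$workdir/file2.csv"

# NR counts lines across all input files, FNR restarts at 1 for each file.
# Keep the very first line overall (NR == 1) and every non-first line of each file (FNR > 1).
awk 'FNR > 1 || NR == 1' "$workdir"/file*.csv > "$workdir/output.csv"

cat "$workdir/output.csv"
```

The output file is called output.csv on purpose: it does not match the file*.csv glob, so it cannot end up as one of its own inputs.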
I hope somebody finds this useful. If you found this post, let me know what you’re doing with it 🙂
Are you in email marketing? Are you a data scientist?