Tuesday, April 10, 2007

csplit — splitting a text file into multiple separate files

Sometimes it's hard to get csv files imported into Excel, especially if you have a long data file that needs to be separated out into different Excel sheets.

If you have data that needs to be in separate documents, and if the different areas in your data file are delimited from each other in some fashion, then you can use csplit to split them into multiple files. csplit means Context Split -- there's a separate command, split, that will just cut the file up by number of lines or bytes, but that's not what we want.

For example, if you had a file called input.txt that contains csv formatted spreadsheets, all beginning with the word Sheet by itself on a line, you can run the following command:

csplit -f sheet input.txt /Sheet/ \{99\}

This will split the file at each line that matches Sheet, so every piece after the first begins with a Sheet line. It will name them sheet00, sheet01, sheet02, etc. The \{99\} repeats the pattern 99 more times. 99 is the max, but you can always run it again on the last file, which would be the remainder of your data after the previous sheets were taken out.

It will fail if the repeat count you specify is greater than the number of Sheet lines left to match, and by default csplit then removes the pieces it already wrote (pass -k to keep them). So do a grep -c Sheet input.txt beforehand, subtract 1, and use that number. Note that grep takes the pattern bare; the slashes are csplit syntax. I played around with all sorts of backticks and things to try to get a version of the command that would do this for you, but it really ain't worth it.
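
The count-then-split steps above can be sketched as a short script. Treat it as a sketch under assumptions: the sample input.txt is made up, and the anchored pattern ^Sheet$ is an assumed tightening of /Sheet/ so that data lines that merely contain the word are not counted. It also assumes at least one Sheet line exists.

```shell
#!/bin/sh
# Build a small sample input.txt (hypothetical data) holding three sheets.
cat > input.txt <<'EOF'
Sheet
a,b
1,2
Sheet
c,d
3,4
Sheet
e,f
5,6
EOF

# Count the marker lines. ^Sheet$ matches only lines that are exactly "Sheet".
count=$(grep -c '^Sheet$' input.txt)

# The pattern itself accounts for one match, so repeat it count-1 more times.
# -s suppresses the byte counts that csplit normally prints.
csplit -s -f sheet input.txt '/^Sheet$/' "{$((count - 1))}"

ls sheet0*
```

With GNU csplit, sheet00 may come out empty when the file starts with a Sheet line, since it holds whatever precedes the first match. GNU csplit also accepts '{*}' as the repeat count, meaning "as many times as possible", which avoids the counting altogether on systems that have it.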

Sure, given a bit of time you could probably whip up a perl script that would do this better and faster, and probably name the files appropriately. And remove the "Sheet" line while it's at it. But this is already there!
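
For what it's worth, the same idea can be roughed out without Perl. The following is a sketch in awk (a swapped-in tool, not the Perl script mentioned above), assuming the marker is a line that is exactly Sheet. The sample data and the tabNN.csv output names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample input with three sheets.
cat > input.txt <<'EOF'
Sheet
a,b
1,2
Sheet
c,d
3,4
Sheet
e,f
5,6
EOF

awk '
/^Sheet$/ {                        # marker line: start a new output file
    if (out != "") close(out)      # close the previous file, if any
    n++
    out = sprintf("tab%02d.csv", n)
    next                           # skip the marker line itself
}
out != "" { print > out }          # copy data lines into the current file
' input.txt

ls tab0*.csv
```

Lines before the first Sheet marker are dropped here, and there is no repeat count to get wrong.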

As an aside, I always put a space before any numeric codes in csv files that shouldn't be treated like numbers. Especially if they have dashes in them. The space keeps Excel from trying to strip leading zeroes or convert them into dates.
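
A minimal sketch of adding that space after the fact, assuming the codes sit in the first column and no field contains a quoted comma (a naive comma split would mangle those). The file names and sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample: the first column holds codes that are not real numbers.
cat > codes.csv <<'EOF'
00123,widget
04-10,gadget
EOF

# Prepend a space to the first field of every row; other fields are untouched.
awk -F, -v OFS=, '{ $1 = " " $1; print }' codes.csv > codes_spaced.csv

cat codes_spaced.csv
```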

grep!

grep has a -q (quiet) mode.

This is GREAT for use in scripts. grep prints nothing in this mode. If the string you're grepping for gets a match, it exits with status 0, and if there's no match it exits with status 1.

No fiddling with comparison operators for your if statements!

#!/bin/sh

# grep -q prints nothing; the if tests its exit status directly.
if grep -q needle haystack.txt
then
    echo "I found a needle!"
else
    echo "No needles here..."
fi
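
To see the exit statuses that the if statement is reacting to, here is a small sketch. The sample haystack.txt contents, the word pitchfork, and the missing file name are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample file so the checks have something to search.
printf 'hay\nneedle\nmore hay\n' > haystack.txt

# Capture each exit status; the variable stays 0 unless grep reports otherwise.
hit=0;  grep -q needle haystack.txt || hit=$?
miss=0; grep -q pitchfork haystack.txt || miss=$?
err=0;  grep -q needle no-such-file.txt 2>/dev/null || err=$?

# 0 means a match, 1 means no match, anything greater than 1 means an error.
echo "hit=$hit miss=$miss err=$err"
```

The error case is worth knowing about: an unreadable or missing file also takes the else branch of an if, so a "not found" message can really mean "could not look".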