2014-08-25

Splitting and Combining Files

It's fairly common to have a file that is larger than any single storage medium you own. You can buy several storage devices, but you then need to split the file into smaller chunks so it can be stored across them. How do you do this? On Linux it's fairly easy using the split utility, and the chunks can be recombined on either Linux or Windows.

The initial split is done using the split command as follows:
split -b 1024m file file.part-
Where -b is the size of each chunk, in this example 1024 MiB (1 GiB).
file is the original file you want to split.
file.part- is the prefix you would like to use for the output files.

The split command names its output files by appending a two-letter suffix to the prefix given above:
file.part-aa
file.part-ab
file.part-ac
and so on.
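
To see the naming in action without needing a multi-gigabyte file, here is a minimal sketch that splits a small throwaway file. The 2500-byte size, the 1000-byte chunk size and the scratch directory are assumptions picked for illustration only.

```shell
#!/bin/sh
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: 2500 zero bytes stand in for the large file.
dd if=/dev/zero of=file bs=1 count=2500 2>/dev/null

# Split into 1000-byte chunks using the same prefix as above.
split -b 1000 file file.part-

# 2500 bytes in 1000-byte chunks gives three parts: 1000, 1000 and 500 bytes.
ls file.part-*
part_count=$(ls file.part-* | wc -l | tr -d ' ')
last_size=$(wc -c < file.part-ac | tr -d ' ')
echo "parts: $part_count, last part bytes: $last_size"
```

The last part is simply whatever is left over, so it is usually smaller than the others.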

To recombine these files you can use cat on Linux to concatenate the parts. The shell expands the wildcard in alphabetical order, so the parts are joined in the right sequence.
cat file.part-* > file
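
A quick way to convince yourself the round trip is lossless is to split a small file, rejoin it under a different name and compare the two with cmp. This is only a sketch: the sample contents, the 3-byte chunk size and the name file.restored are made up for the demonstration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical 8-byte sample file.
printf 'a\nb\nc\nd\n' > file

# Split into 3-byte chunks, then join the parts back together.
split -b 3 file file.part-
cat file.part-* > file.restored

# cmp -s is silent and reports the result through its exit status.
if cmp -s file file.restored; then
    result="identical"
else
    result="different"
fi
echo "$result"
```

Writing the rejoined data to a separate name keeps the original around for the comparison.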

On Windows you can use the copy command with /b (binary mode):
copy /b file.part-aa + file.part-ab + file.part-ac + file.part-ad file

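The copy command does not take regular expressions, but it does accept wildcards (copy /b file.part-* file). The order in which matching parts are appended is not guaranteed to be alphabetical on every filesystem, so listing the parts explicitly is the quick and dirty but safe way to get it done, and for a long list you could always generate the list of files and script it together.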
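
If the parts are visible from a Unix-like shell (on the Linux side, or something like Git Bash on Windows), one way to script the list together is to let the shell build the copy command line for you. This is a rough sketch; the three empty placeholder parts exist only so the loop has something to list.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical placeholder parts, empty files used only for their names.
: > file.part-aa
: > file.part-ab
: > file.part-ac

# Build "copy /b part1 + part2 + ... file" from the sorted glob expansion.
cmd="copy /b"
sep=" "
for part in file.part-*; do
    cmd="$cmd$sep$part"
    sep=" + "
done
cmd="$cmd file"

echo "$cmd"
```

The printed line can then be pasted into a Windows command prompt in the directory that holds the parts.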

You can also pipe the output of gzip into split to compress the data as it is split, but you then pay the extra cost of compressing it and decompressing it again later. If you don't care much about saving space, plain split is probably going to be quicker.
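
As a rough sketch of the gzip variant: gzip -c writes the compressed stream to standard output, and split reads standard input when the file name is given as "-". The sample file, the tiny 32-byte chunk size and the file.gz.part- prefix are assumptions for the demonstration.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: 64 KiB of zeros, which compresses very well.
dd if=/dev/zero of=file bs=1024 count=64 2>/dev/null

# Compress to stdout and split the compressed stream into 32-byte chunks.
gzip -c file | split -b 32 - file.gz.part-

# Restore: join the chunks in order and decompress the combined stream.
cat file.gz.part-* | gunzip -c > file.restored

if cmp -s file file.restored; then
    gz_status="ok"
else
    gz_status="mismatch"
fi
echo "$gz_status"
```

Note that an individual chunk is not a valid gzip file on its own; all the parts have to be joined before decompressing.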

An alternate way of doing this would be to use the dd utility:

Example file: tmp
Contents:
cat tmp
a
b
c
d

Sizing details (in bytes):
ls -ls tmp
0 -rw-r--r--  1 bnold  domusers  8 Aug 25 15:20 tmp

wc -c tmp
       8 tmp

Split (with bs=1, count and skip are measured in bytes):
dd if=tmp of=tmp.part1 bs=1 count=4
dd if=tmp of=tmp.part2 bs=1 count=4 skip=4

Restore (seek=4 starts writing 4 bytes into the output file, right after the first part):
dd if=tmp.part1 of=tmp_new bs=1 count=4
dd if=tmp.part2 of=tmp_new bs=1 count=4 seek=4
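
The two-part example generalizes to any number of parts with a loop. The following is a minimal sketch rather than a polished tool: the chunk variable and the loop are my own additions, it uses bs equal to the chunk size (so skip and seek count chunks instead of bytes), and it adds conv=notrunc so each write leaves the previously restored data alone.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same 8-byte example file as above.
printf 'a\nb\nc\nd\n' > tmp

chunk=4                                   # bytes per part (assumed value)
size=$(wc -c < tmp | tr -d ' ')
count=$(( (size + chunk - 1) / chunk ))   # number of parts, rounded up

# Split: part i+1 is the single block found after skipping i blocks.
i=0
while [ "$i" -lt "$count" ]; do
    dd if=tmp of="tmp.part$((i + 1))" bs="$chunk" count=1 skip="$i" 2>/dev/null
    i=$((i + 1))
done

# Restore: write each part at its own offset in the output file.
rm -f tmp_new
i=0
while [ "$i" -lt "$count" ]; do
    dd if="tmp.part$((i + 1))" of=tmp_new bs="$chunk" seek="$i" conv=notrunc 2>/dev/null
    i=$((i + 1))
done

if cmp -s tmp tmp_new; then
    dd_status="ok"
else
    dd_status="mismatch"
fi
echo "parts: $count, restore: $dd_status"
```

Using a larger block size instead of bs=1 also means dd copies each part in one read and write rather than one byte at a time.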

Contents of restored file:
cat tmp_new
a
b
c
d

Size of restored file:
ls -ls tmp_new
0 -rw-r--r--  1 bnold  domusers  8 Aug 25 15:44 tmp_new

wc -c tmp_new
       8 tmp_new

Validating integrity:
md5 tmp
MD5 (tmp) = 47ece2e49e5c0333677fc34e044d8257
md5 tmp_new
MD5 (tmp_new) = 47ece2e49e5c0333677fc34e044d8257
Hashes match, we're good.
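
The md5 command used here is the BSD/macOS one; most Linux systems ship md5sum instead. A small sketch of a comparison that works with either follows. The hash_of helper is a hypothetical name of mine, and the sample files are recreated only so the snippet runs on its own.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the example file and a stand-in for the restored copy.
printf 'a\nb\nc\nd\n' > tmp
cp tmp tmp_new

# hash_of (hypothetical helper): print only the MD5 digest of a file,
# using md5sum where available and falling back to BSD md5 -q.
hash_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{print $1}'
    else
        md5 -q "$1"
    fi
}

original_hash=$(hash_of tmp)
restored_hash=$(hash_of tmp_new)

if [ "$original_hash" = "$restored_hash" ]; then
    verdict="Hashes match"
else
    verdict="Hashes differ"
fi
echo "$verdict"
```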
