
Large File Support For std::ifstream

This is another one of those "if you aren't a computer programmer, you aren't going to be interested in this" kind of things.

The past couple of days, I had a problem at work. My team has to process a bunch of very large (from 1 to 8 GB) binary files, which contain records of some sort. For a while, we had an application which processed these individually, and was called something like this:

./taskname.o < datafile.1
./taskname.o < datafile.2
etc

The datafiles were being redirected into the application's standard input, and everything was working all hunky-dory. But this is clearly an inefficient way to process data: at the very least there is time taken up in starting and stopping the process, and if there is any caching done as part of the data processing (and in our case, there is), you lose that in-memory cache every time you start a new datafile.

So what do you do? You change the program just the smallest bit so that instead of being handed a single data file, it's called with the name of a file that contains the list of your data files. Very small change.
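Here is a minimal sketch of what that change might look like, assuming the list file holds one data file path per line. The names process_all and process_list_file, and the callback that stands in for the real record processing, are made up for illustration and are not from the actual application.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <istream>
#include <stdexcept>
#include <string>

// Hypothetical callback type: whatever work is done on a single data file.
typedef std::function<void(const std::string&)> FileProcessor;

// Reads one data-file path per line from `list` and hands each path to
// `process`. Returns how many paths were handed over.
std::size_t process_all(std::istream& list, const FileProcessor& process) {
    assert(process && "a processing callback is required");
    std::size_t count = 0;
    std::string path;
    while (std::getline(list, path)) {
        if (path.empty()) continue;  // tolerate blank lines in the list
        process(path);
        ++count;
    }
    return count;
}

// Opens the list file by name and processes every data file it mentions,
// all inside one process so any in-memory cache survives between files.
std::size_t process_list_file(const std::string& list_path,
                              const FileProcessor& process) {
    std::ifstream list(list_path.c_str());
    if (!list) throw std::runtime_error("cannot open list file: " + list_path);
    return process_all(list, process);
}
```

The program would then be started once, with the list file's name as its argument, instead of once per data file.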

However, when you do that, you'll find that some of your datafiles don't process correctly anymore, since the creation of the std::ifstream object is failing. Then you'll note that it's failing on files larger than about 4.2 GB. Ah HA! So clearly what's happening here is that, though you are compiling your code with the correct flags to support large file sizes (which you were perceptive enough to get from running getconf LFS_CFLAGS from your Unix prompt), the std stream classes (for the implementation of the STL that you are using) are not respecting the 64-bit versions of the underlying fopen routines, and are using the 32-bit versions. Clearly these will choke on files whose offsets are larger than what a 32-bit value can hold. Duh.
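To make the arithmetic concrete: 2^32 bytes is 4,294,967,296, which is roughly where the failures started. The sketch below is only an illustration of the limits, with helper names invented for this post. It is not how any particular library stores its offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Can byte offset `pos` be represented in an unsigned 32-bit value?
bool fits_unsigned_32(std::uint64_t pos) {
    return pos <= std::numeric_limits<std::uint32_t>::max();
}

// Can it be represented in a signed 32-bit value?
bool fits_signed_32(std::uint64_t pos) {
    return pos <= static_cast<std::uint64_t>(
                      std::numeric_limits<std::int32_t>::max());
}

// What is left of an offset if only its low 32 bits are kept.
std::uint32_t truncate_to_32(std::uint64_t pos) {
    return static_cast<std::uint32_t>(pos);
}
```

For a 5 GB file, the final byte offsets no longer fit in 32 bits, and truncating 5,000,000,000 to 32 bits silently yields 705,032,704, which points at the wrong place in the file.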

You have a couple of options here:

- Cry (I tried this for a bit. While cathartic, this is ultimately an ineffectual solution).
- Rewrite your code using the 'raw' C routines (fopen and fread instead of the really nice std objects which are awesome. Or at least as awesome as C++ gets)
- Follow the advice of Lance Diduck, who suggests inheriting from std::basic_streambuf, and then implementing xsputn, underflow, overflow, sputc, sgetc (I tried this for a while as well, and it resulted in a repeat of the first option).
- Think about why exactly this worked when you were redirecting the file right into the application's standard input. That doesn't make sense, does it? Wait. Wait! What if we do the following:

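// Open with the raw open call, which honours the large-file compile flags.
int fd = open( dump_files[i].c_str(), O_LARGEFILE|O_RDONLY );
// Hand the descriptor to the stream (relies on a library-specific constructor).
ifstream dump_stream(fd);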

So we use the raw open call (which *does* respect the compile flags for large file support) and pass the resulting file descriptor as the constructor argument for the std::ifstream! Will this work? Probably, as long as we just stream in from the file descriptor. It almost definitely won't work if you call seek, since the position is probably going to be read as negative. So take care not to do that.
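One caveat: a std::ifstream constructor that takes a file descriptor is not part of standard C++, so whether the two lines above compile depends on your library. If yours doesn't have it, the same trick can be combined with the streambuf idea from the list. The sketch below is a minimal, read-only buffer over a descriptor. It assumes a POSIX system, and FdStreambuf and the 64 KB buffer size are my own choices for illustration. It only supports sequential reads (no seeking), and it doesn't retry interrupted reads.

```cpp
#include <cassert>
#include <fcntl.h>
#include <istream>
#include <streambuf>
#include <sys/types.h>
#include <unistd.h>

// Minimal read-only stream buffer over a POSIX file descriptor.
// The caller opens and closes the descriptor; this class only reads from it.
class FdStreambuf : public std::streambuf {
public:
    explicit FdStreambuf(int fd) : fd_(fd) {
        assert(fd >= 0 && "expects an already opened descriptor");
        setg(buf_, buf_, buf_);  // empty get area: first read calls underflow()
    }

protected:
    // Called by the stream whenever the get area has been used up.
    int_type underflow() {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        ssize_t n = ::read(fd_, buf_, sizeof buf_);
        if (n <= 0) {
            return traits_type::eof();  // end of file, or a read error
        }
        setg(buf_, buf_, buf_ + n);  // expose the freshly read bytes
        return traits_type::to_int_type(*gptr());
    }

private:
    int fd_;
    char buf_[64 * 1024];
};
```

With that in place, you open the file with open as before, build an FdStreambuf from the descriptor, and pass its address to a plain std::istream. Reads then go through the descriptor rather than through the library's own file-opening code.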
