Friday, November 11, 2011

Memory Mapped File IO

How often do you have to write a loop to read a file? Do you issue fscanf, fread, or perhaps create a stream via some object-oriented language?

Operating systems also provide support for easily reading in an entire file. A couple of function calls, concluding with mmap() or MapViewOfFileEx(), provide a char* pointer to the file's data. And with that, as far as the program is concerned, the file has been read in. So let's take a few moments to understand what the operating system is doing and use that to identify the advantages (and disadvantages) that memory-mapped file IO provides.

When a program opens a file and performs IO via calls like fscanf, the program is using buffered IO. The operating system opens the file and issues IO operations at the disk's granularity, which is at least 512 bytes and often more. On a call like fscanf(fp, "%d", &value), the program is requesting somewhere between 1 and 16 bytes, which is far less than the disk's block size. The operating system will read in the entire disk block and then copy the appropriate bytes to the program, while retaining the remainder in its file cache. (By the way, the standard libraries also perform buffering, so the previous fscanf call might request 128 bytes from the OS and then process this data before returning to the application itself.) Now, in making this call, how many copies of the data exist in the system? There is one on disk, one in the OS's file cache, one in the standard library, and one in the application.
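As a minimal sketch of the buffered path described above, the following C function reads integers one fscanf call at a time. The function name sum_ints_stdio and its return convention are made up for illustration; only the standard stdio calls are real.

```c
#include <stdio.h>

/* Hypothetical helper: sum whitespace-separated integers using buffered
   stdio. Returns 0 on success, -1 if the file cannot be opened. */
int sum_ints_stdio(const char *path, long *sum_out)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    long sum = 0;
    int value;
    /* Each fscanf call consumes only a few bytes. The C library serves
       them from its own buffer and asks the OS for more data only when
       that buffer is empty. */
    while (fscanf(fp, "%d", &value) == 1)
        sum += value;

    fclose(fp);
    *sum_out = sum;
    return 0;
}
```

Every integer parsed here has passed through the OS file cache and the library's buffer before landing in value, which is the chain of copies counted above.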

If the application instead requests the file via mmap, nothing happens at first except that the OS reserves a range of the application's virtual address space for the file's data. Then, when the application accesses an address in this range, the resulting page fault causes the OS to read the data from disk into the file cache, and the virtual address is mapped to that same cached data.
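The mmap path might look like the sketch below, which assumes a POSIX system (on Windows the equivalent calls end with MapViewOfFileEx). The function name count_newlines_mmap is hypothetical, and error handling is kept to the bare minimum.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file by mapping it.
   Returns the count, or -1 on error. */
long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {      /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    /* This only reserves address space; no file data is read yet. */
    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    if (data == MAP_FAILED)
        return -1;

    long count = 0;
    /* The first access to each page triggers the page fault that
       brings that part of the file in. */
    for (size_t i = 0; i < len; i++)
        if (data[i] == '\n')
            count++;

    munmap(data, len);
    return count;
}
```

Note that the loop treats the file as an ordinary char array; there is no read call and no application-side buffer.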

If your program needs to make multiple passes through the file, or the file has a regular structure (such that a data structure could be defined over it; perhaps a future post on this), then memory-mapped file IO can save time and space. However, if there are many small files, the file data only needs to be read once, or significant processing is required, then using a more "traditional" IO mechanism would be advisable.

Thursday, November 3, 2011

Repost: Are you a Programmer

Consider the question: what percentage of programs are released to the public (either commercial or open source)? Even excluding student assignments, this percentage may be far lower than one might think. My expectations are biased by having many friends at Apple, Facebook, Google, and Microsoft. But thinking further, I spent 10-20% of my time as a salaried employee doing just that: writing little programs. These coding projects were intended to help me do my job faster, which was contributing to a commercial software product.

This is my introduction to a piece of career advice regarding being a Computer Scientist (*cough* programmer): Don't Call Yourself a Programmer.