Discussion:
Binary file IO: Converting imported sequences of chars to desired type
Rune Allnor
2009-10-17 17:39:24 UTC
Hi all.

I have used the method from this page,

http://www.cplusplus.com/reference/iostream/istream/read/

to read some binary data from a file to a char[] buffer.

The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?

The naive C way would be to use memcpy. Is there a
better C++ way?
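
For concreteness, the reading part looks roughly like this
(the file name and buffer size are just placeholders):

#include <fstream>

int main()
{
    std::ifstream file( "data.bin", std::ios::binary );
    char buffer[ 256 ];
    file.read( buffer, sizeof buffer );
    // buffer[0]..buffer[3] now hold the 4 bytes of the float
    return 0;
}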

Rune
Maxim Yegorushkin
2009-10-17 17:47:01 UTC
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.

Another way is to read data directly into the aligned object:

float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
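
For completeness, a minimal sketch of the memcpy() variant, assuming
the bytes have already been read into a char buffer (the file name
here is just a placeholder):

#include <cstring>
#include <fstream>

int main()
{
    std::ifstream stream( "data.bin", std::ios::binary );
    char buffer[4];
    stream.read( buffer, sizeof buffer );

    float f;
    std::memcpy( &f, buffer, sizeof f ); // copies possibly unaligned bytes into an aligned float
    return 0;
}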
--
Max
James Kanze
2009-10-18 09:10:15 UTC
Post by Maxim Yegorushkin
Post by Rune Allnor
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, works, except in very limited cases.

To convert bytes written in a binary byte stream to any internal
format, you have to know the format in the file; if you also
know the internal format, and have only limited portability
concerns, you can generally do the conversion much faster; a
truly portable read requires use of ldexp, etc., but if you are
willing to limit your portability to machines using IEEE
(Windows and mainstream Unix, but not mainframes), and the file
format is IEEE, you can simply read the data as a 32-bit
unsigned int, then use reinterpret_cast (or memcpy).
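
Just to make that shortcut concrete, a rough sketch (the function
name is mine, and it assumes the file stores the value in big-endian
IEEE 754 order and that the host is also IEEE):

#include <cstring>
#include <stdint.h>

float readIeeeFloat( unsigned char const* buffer )
{
    uint32_t bits = uint32_t( buffer[0] ) << 24
                  | uint32_t( buffer[1] ) << 16
                  | uint32_t( buffer[2] ) <<  8
                  | uint32_t( buffer[3] );
    float f;
    std::memcpy( &f, &bits, sizeof f );  // reinterpret the IEEE bit pattern
    return f;
}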

FWIW: the fully portable solution is something like:

class ByteGetter
{
public:
    explicit ByteGetter( ixdrstream& stream )
        : mySentry( stream )
        , myStream( stream )
        , mySB( stream.rdbuf() )
        , myIsFirst( true )
    {
        if ( ! mySentry ) {
            mySB = NULL ;
        }
    }
    uint8_t get()
    {
        int result = 0 ;
        if ( mySB != NULL ) {
            result = mySB->sbumpc() ;   // read and consume one byte
            if ( result == EOF ) {
                result = 0 ;
                myStream.setstate( myIsFirst
                    ? std::ios::failbit | std::ios::eofbit
                    : std::ios::failbit | std::ios::eofbit
                        | std::ios::badbit ) ;
            }
        }
        myIsFirst = false ;
        return result ;
    }

private:
    ixdrstream::sentry  mySentry ;
    ixdrstream&         myStream ;
    std::streambuf*     mySB ;
    bool                myIsFirst ;
} ;

ixdrstream&
ixdrstream::operator>>(
    uint32_t&           dest )
{
    ByteGetter          source( *this ) ;
    uint32_t            tmp = uint32_t( source.get() ) << 24 ;
    tmp |= uint32_t( source.get() ) << 16 ;
    tmp |= uint32_t( source.get() ) <<  8 ;
    tmp |= uint32_t( source.get() ) ;
    if ( *this ) {
        dest = tmp ;
    }
    return *this ;
}

ixdrstream&
ixdrstream::operator>>(
    float&              dest )
{
    uint32_t            tmp ;
    operator>>( tmp ) ;
    if ( *this ) {
        float           f = 0.0 ;
        if ( (tmp & 0x7FFFFFFF) != 0 ) {
            f = ldexp( (tmp & 0x007FFFFF) | 0x00800000,
                       (int)((tmp & 0x7F800000) >> 23) - 126 - 24 ) ;
        }
        if ( (tmp & 0x80000000) != 0 ) {
            f = -f ;
        }
        dest = f ;
    }
    return *this ;
}

The above code still needs work to handle NaNs and Infinity
correctly, but it should give a good idea of what is necessary.

If you aren't concerned about machines which aren't IEEE, of
course, you can just memcpy the tmp after having read it in the
last function above, or use a reinterpret_cast to force the
types.

--
James Kanze
Maxim Yegorushkin
2009-10-18 11:13:13 UTC
Post by James Kanze
Post by Maxim Yegorushkin
Post by Rune Allnor
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, work, except in very limited cases.
The assumption was that the float was written by the same program or a
program with a compatible binary API. Is that the case you meant in
"except in very limited cases"?
--
Max
James Kanze
2009-10-19 09:58:01 UTC
Post by Maxim Yegorushkin
Post by James Kanze
Post by Maxim Yegorushkin
Post by Rune Allnor
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, work, except in very limited cases.
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that the
case you meant in "except in very limited cases"?
More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.

Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses depends
on the version and the options used for compiling. In practice,
it's not something you can count on except for very short lived
data: I wouldn't hesitate about using it for spilling temporary
data to disk, to be reread later by the same process. I can
imagine that it's quite acceptable as well if you have one
program collecting data during e.g. a week, and another
processing all of the data in batch over the week-end, provided
that both programs were compiled with the same compiler, using
the same options. Beyond that, I'd have my doubts (having been
bitten by the problem more than once in the past). As a general
rule, it's better to define a format, and match it. (Even if I
were using a memory dump, I'd first "define" the format, just
ensuring that the definition was compatible with the in-memory
image. That way, if worse comes to worst, at least a
maintenance programmer will know what to expect, and will have a
chance at making it work.)
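
"Defining" the format can be as little as a documented struct plus a
compile-time check that the in-memory image still matches it; the
record layout below is purely hypothetical:

#include <stdint.h>

// Documented dump format: IEEE 754, native byte order, 4-byte id,
// 4 bytes of padding, then an 8-byte double; 16 bytes per record.
struct SampleRecord
{
    uint32_t id;
    double   value;
};

// Fails to compile if padding or type sizes ever change the layout.
typedef char layout_check[ sizeof( SampleRecord ) == 16 ? 1 : -1 ];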

--
James Kanze
Jorgen Grahn
2009-10-23 08:07:40 UTC
...
Post by James Kanze
Post by Maxim Yegorushkin
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that the
case you meant in "except in very limited cases"?
More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.
Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses depends
on the version and the options used for compiling. In practice,
it's not something you can count on except for very short lived
data: I wouldn't hesitate about using it for spilling temporary
data to disk, to be reread later by the same process. I can
imagine that it's quite acceptable as well if you have one
program collecting data during e.g. a week, and another
processing all of the data in batch over the week-end, provided
that both programs were compiled with the same compiler, using
the same options. Beyond that, I'd have my doubts (having been
bit with the problem more than once in the past). As a general
rule, it's better to define a format, and match it. (Even if I
were using a memory dump, I'd first "define" the format, just
ensuring that the definition was compatible to the in memory
image. That way, if worse comes to worse, at least a
maintenance programmer will know what to expect, and will have a
chance at making it work.)
But if you have a choice, it's IMO almost always better to write the
data as text, compressing it first using something like gzip if I/O or
disk space is an issue.

(Loss of precision when printing decimal floats could be a problem in
this case though ...)

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
James Kanze
2009-10-23 09:27:06 UTC
Post by Jorgen Grahn
...
Post by James Kanze
Post by Maxim Yegorushkin
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that
the case you meant in "except in very limited cases"?
More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.
Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses
depends on the version and the options used for compiling.
In practice, it's not something you can count on except for
very short lived data: I wouldn't hesitate about using it
for spilling temporary data to disk, to be reread later by
the same process. I can imagine that it's quite acceptable
as well if you have one program collecting data during e.g.
a week, and another processing all of the data in batch over
the week-end, provided that both programs were compiled with
the same compiler, using the same options. Beyond that, I'd
have my doubts (having been bit with the problem more than
once in the past). As a general rule, it's better to define
a format, and match it. (Even if I were using a memory
dump, I'd first "define" the format, just ensuring that the
definition was compatible to the in memory image. That way,
if worse comes to worse, at least a maintenance programmer
will know what to expect, and will have a chance at making
it work.)
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer, who
can see at a glance what is being written.
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If the writer and the reader
use internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of the
reader, however, you don't really know how many digits to output
when writing.
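
A small round-trip sketch of the "enough digits" case, assuming IEEE 754
double and a library whose conversions round correctly (17 significant
digits is the usual figure):

#include <iostream>
#include <sstream>

int main()
{
    double const original = 1.0 / 3.0;
    std::ostringstream out;
    out.precision( 17 );
    out << std::scientific << original;

    double readBack;
    std::istringstream( out.str() ) >> readBack;
    std::cout << ( readBack == original ? "exact round trip" : "precision lost" )
              << ": " << out.str() << '\n';
    return 0;
}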

--
James Kanze
Jorgen Grahn
2009-10-25 13:25:43 UTC
...
Post by James Kanze
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If writing and reading to
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of the
reader, however, you don't really know how many digits to output
when writing.
Good point; I didn't think of that aspect (i.e. not give a false
impression of precision when the input is e.g. 3.14 and you output
it as 3.14000000).

I was more thinking about reading "0.20000000000000000" but printing
0.20000000000000001. But now that I think of it, it's a loss of
precision in the input; there is no way to avoid it and still use
float/double internally.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
James Kanze
2009-10-25 17:13:55 UTC
Post by Jorgen Grahn
...
Post by James Kanze
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If writing and reading to
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of
the reader, however, you don't really know how many digits
to output when writing.
Good point; I didn't think of that aspect (i.e. not give a
false impression of precision when the input is e.g. 3.14 and
you output it as 3.14000000).
I'm not sure what you're referring to here. We're talking about
the format used for transmitting data from one machine to
another. Given enough digits and the same basic format, it's
always possible to make a round trip, writing, then reading, and
getting the exact value back (even if the value output isn't the
exact value).
Post by Jorgen Grahn
I was more thinking about reading "0.20000000000000000" but
printing 0.20000000000000001.
For data communications, the problem occurs in the opposite
sense. Except that with enough digits (17 for IEEE double, I
think), it won't occur.
Post by Jorgen Grahn
But now that I think of it, it's a loss of precision in the
input; there is no way to avoid it and still use float/double
internally.
But for this application, if you know how many digits are needed
to ensure correct reading, the loss of precision when reading
will exactly offset the error when writing.

The problem only comes up when you don't know the number of
digits in the reader's format. This is particularly an issue
with double, since the second most widely used format (IBM
mainframe double) has more digits of precision than IEEE double,
and 17 digits probably won't be enough; you'll get something
very close, but it might not be the closest possible
representation. Which in this case would be exactly the
starting value---I think that IBM mainframe double precision can
represent all IEEE double values in range exactly. (Warning:
this is all very much off the top of my head. I've not done any
real analysis to verify the actual case of IBM floating point
versus IEEE. The problem can definitely occur, however, and it
wouldn't be difficult to imagine a 128 bit double format where
it did.)

--
James Kanze
Jorgen Grahn
2009-10-26 16:37:41 UTC
Post by James Kanze
Post by Jorgen Grahn
...
Post by James Kanze
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If writing and reading to
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of
the reader, however, you don't really know how many digits
to output when writing.
Good point; I didn't think of that aspect (i.e. not give a
false impression of precision when the input is e.g. 3.14 and
you output it as 3.14000000).
I'm not sure what you're referring to here. We're talking about
the format used for transmitting data from one machine to
another. [...]
I guess I am demonstrating why I try to stay away from
floating-point ;-) It is a tricky area.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Rune Allnor
2009-10-25 14:13:49 UTC
Post by Jorgen Grahn
...
Post by Maxim Yegorushkin
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that
the case you meant in "except in very limited cases"?
More or less.  Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.
Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses
depends on the version and the options used for compiling.
In practice, it's not something you can count on except for
very short lived data: I wouldn't hesitate about using it
for spilling temporary data to disk, to be reread later by
the same process.  I can imagine that it's quite acceptable
as well if you have one program collecting data during e.g.
a week, and another processing all of the data in batch over
the week-end, provided that both programs were compiled with
the same compiler, using the same options.  Beyond that, I'd
have my doubts (having been bit with the problem more than
once in the past).  As a general rule, it's better to define
a format, and match it.  (Even if I were using a memory
dump, I'd first "define" the format, just ensuring that the
definition was compatible to the in memory image.  That way,
if worse comes to worse, at least a maintenance programmer
will know what to expect, and will have a chance at making
it work.)
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer, who
can see at a glance what is being written.
The user might have opinions, though.

File I/O operations with text-formatted floating-point data
take time. A *lot* of time. The rule-of-thumb is 30-60 seconds
per 100 MBytes of text-formatted FP numeric data, compared to
fractions of a second for the same data (natively) binary encoded
(just try it).

In heavy-duty data processing applications one just can not afford
to spend more time than absolutely necessary. Text-formatted data
is not an option.

If there are problems with binary floating point I/O formats, then
that's a question for the C++ standards committee. It ought to be
a simple technical (as opposed to political) matter to specify that
binary FP I/O could be set to comply to some already defined
standard,
like e.g. IEEE 754.

The matter isn't fundamentally different from setting locales and
character encodings with text files.

Rune
James Kanze
2009-10-25 17:47:28 UTC
[...]
Post by Rune Allnor
Post by James Kanze
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what? My experience has always been
that the disk IO is the limiting factor (but my data sets have
generally been very mixed, with a lot of non floating point data
as well). And binary formatting can be more or less expensive
as well---I'd rather deal with text than a BER encoded double.
And Jorgen said very explicitly "if you have a choice".
Sometimes you don't have the choice: you have to conform to an
already defined external format, or the profiler says you don't
have the choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).
Try it on what machine:-). Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentioned).

Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format, and on a
slow medium, that can make a real difference. (In one
application, where we had to transmit tens of kilobytes over a
50 Baud link---and there's no typo there, it was 50 bits, or
about 6 bytes, per second---we didn't even consider using text.
Even though there wasn't any floating point involved.)
Post by Rune Allnor
In heavy-duty data processing applications one just can not
afford to spend more time than absolutely necessary.
Text-formatted data is not an option.
I'm working in such an application at the moment, and our
external format(s) are all text. And the conversions of the
individual values has never been a problem. (One of the formats
is XML. And our disks and network are fast enough that even
that hasn't been a problem.)
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.

--
James Kanze
Rune Allnor
2009-10-25 18:39:45 UTC
    [...]
Post by Rune Allnor
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
 My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the
only one. In this case it's not even the dominant one.
See the example below.
(but my data sets have
generally been very mixed, with a lot of non floating point data
as well).  And binary formatting can be more or less expensive
as well---I'd rather deal with text than a BER encoded double.
And Jorgen said very explicitly "if you have a choice".
Sometimes you don't have the choice: you have to conform to an
already defined external format, or the profiler says you don't
have the choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
 Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on the number of significant
digits and other formatting text glyphs. File sizes alone don't
account for the 50-100x time difference.

Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])

t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])

t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])

t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.

The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc. The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes. The first few lines
in the text file look like

-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001

(one leading whitespace, one negative sign or whitespace,
no trailing spaces) which is not excessive, neither with
respect to the number of significant digits nor the number
of other characters.

The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
and on a
slow medium, that can make a real difference.  (In one
application, where we had to transmit tens of kilobytes over a
50 Baud link---and there's no typo there, it was 50 bits, or
about 6 bytes, per second---we didn't even consider using text.
Even though there wasn't any floating point involved.)
Post by Rune Allnor
In heavy-duty data processing applications one just can not
afford to spend more time than absolutely necessary.
Text-formatted data is not an option.
I'm working in such an application at the moment, and our
external format(s) are all text.  And the conversions of the
individual values has never been a problem.  (One of the formats
is XML.  And our disks and network are fast enough that even
that hasn't been a problem.)
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order of

1e10/1.75e8*42s = 2400s = 40 minutes.

There is no point in even considering using a text format
for these kinds of things.
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many 16-bit
encodings. No one says which one should be used. Only which
ones should be available.

Rune
James Kanze
2009-10-26 17:06:56 UTC
Post by Rune Allnor
Post by James Kanze
[...]
Post by Rune Allnor
Post by James Kanze
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
Post by Rune Allnor
Post by James Kanze
My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.
And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
Post by Rune Allnor
See the example below.
Which will only be for one compiler, on one particular CPU, with
one set of compiler options.

(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering. The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
Post by Rune Allnor
Post by James Kanze
(but my data sets have generally been very mixed, with a lot
of non floating point data as well). And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double. And Jorgen said very
explicitly "if you have a choice". Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures. "Any machine" doesn't make
sense in such cases: I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's. (I've worked on 8 bit machines
which took on average 10 µs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)

The compiler and the library implementation also make a
significant difference. I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15. Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++. The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)

In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB). For
10000000, it was around 45 seconds under Windows (file size 250
MB).

It's interesting to note that the Windows version is clearly IO
dominated. The difference in speed between text and binary is
pretty much the same as the difference in file size.
Post by Rune Allnor
Post by James Kanze
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
time 50-100x difference.
There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
Post by Rune Allnor
The first few lines in the text file look like
-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits. You
won't read back in what you've written.
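
A tiny illustration of that point, writing only 8 significant digits
as in the file above (1.0/3.0 stands in for one of the random values):

#include <iostream>
#include <sstream>

int main()
{
    double const original = 1.0 / 3.0;
    std::ostringstream out;
    out.precision( 7 );                  // about 8 significant digits, as in the file
    out << std::scientific << original;  // "3.3333333e-01"

    double readBack;
    std::istringstream( out.str() ) >> readBack;
    std::cout << ( readBack == original ? "round trip OK" : "value changed" ) << '\n';
    return 0;
}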
Post by Rune Allnor
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't. They're actually very different in two
separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time). The time spent writing the
results, even in XML, is only a small part of the total runtime.
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?

(The big difference is, of course, that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Whereas a raw dump of double doesn't work even between a PC and
a Sparc. Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip. Upgrade your machine, and you lose
your data.)

--
James Kanze
Rune Allnor
2009-10-26 17:55:14 UTC
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
OK. Text-based IO compared to binary IO.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
(but my data sets have generally been very mixed, with a lot
of non floating point data as well).  And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double.  And Jorgen said very
explicitly "if you have a choice".  Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures.
Yep. But as a rule of thumb. My point is not to be accurate
(you have made a very convincing case why that would be
difficult), but to point out what performance costs and
trade-offs are involved when using text-based file formats.
Post by James Kanze
In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB).  For
10000000, it was around 45 seconds under Windows (file size 250
MB).
I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune
what you have better than most users. Or both.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
time 50-100x difference.
There is no 50-100x difference.  There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase);
Again, your assets might not be representative of the
average user.
Post by James Kanze
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlib works.  It might be using unbuffered output
for text, or synchronizing at each double.  And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches.  And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
I'll rephrase: Eliminates *variability* due to file caches.
Whatever happens affects both files in equal amounts. It would
bias results if one file was cached and the other not.
Post by James Kanze
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.
Post by James Kanze
Post by Rune Allnor
The first few lines in the text file look like
 -4.3256481e-001
 -1.6655844e+000
  1.2533231e-001
  2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits.  You
won't read back in what you've written.
I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.
Post by James Kanze
Post by Rune Allnor
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't.  They're actually very different in two
separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling.  My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time).  The time spent writing the
results, even in XML, is only a small part of the total runtime.
The read? The application I am talking about would require
a fair bit of number crunching. If I could process 1 hour's worth
of measurements in 20 minutes, I'd rather cash in the remaining
40 minutes as early results, rather than spend them waiting
for disk IO to complete.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that.  It only gives a
minimum list of characters which must be supported.  But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
Yep. Some formats, like IEEE 754 (and maybe its descendants),
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data in that format.
Post by James Kanze
(The big difference is, of course, is that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Where as a raw dump of double doesn't work even between a PC and
a Sparc.  Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip.  Upgrade your machine, and you loose
your data.)
Exactly. Which is why there ought to be a standardized
binary floating point format that is portable between
platforms.

Rune
James Kanze
2009-10-28 12:40:12 UTC
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
(but my data sets have generally been very mixed, with a lot
of non floating point data as well). And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double. And Jorgen said very
explicitly "if you have a choice". Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures.
Yep. But as rule-of-thumb. My point is not to be accurate (you
have made a very convincing case why that would be difficult),
but to point out what performance costs and trade-offs are
involved when using text-based file fomats.
The problem is that there is no real rule-of-thumb possible.
Machines (and compilers) differ too much today.
Post by James Kanze
In terms of concrete numbers, of course... Using time gave
me values too small to be significant for 10000000 doubles
on the Linux machine (top of the line AMD processor of less
than a year ago); for 100000000 doubles, it was around 85
seconds for text (written in scientific format, with 17
digits precision, each value followed by a new line, total
file size 2.4 GB). For 10000000, it was around 45 seconds
under Windows (file size 250 MB).
I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune what
you have better than most users. Or both.
The code was written very quickly, with no tricks or anything.
It was tested on off the shelf PC's---one admittedly older than
those most people are using, the other fairly recent. The
compilers in question were the version of g++ installed with
Suse Linux, and the free download version of VC++. I don't
think that there's anything in there that can be considered
"funky" (except maybe that most people professionally concerned
with high input have professional class machines to do it, which
are out of my price range), and I certainly didn't tune
anything.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
time 50-100x difference.
There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase);
Again, your assets might not be representative for the
average users.
Well, I'm not sure there's such a thing as an average user. But
my machines are very off the shelf, and I'd consider VC++ and
g++ very "average" as well, in the sense that they're what an
average user is most likely to see.
Post by James Kanze
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlib works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
I'll rephrase: Eliminates *variability* due to file caches.
By choosing the best case, which rarely exists in practice.
Whatever happens affect both files in equal amounts. It would
bias results if one file was cached and the other not.
What is cached depends on what the OS can fit in memory. In
other words, the first file you wrote was far more likely to be
cached than the second.
Post by James Kanze
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000
doubles, on a PC, should be exactly 80 MB.
It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.
Strictly speaking, a KB is exactly 1000 bytes, not 1024:-). But
I know, different programs treat this differently.
Post by James Kanze
Post by Rune Allnor
The first few lines in the text file look like
-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits.
You won't read back in what you've written.
I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.
It was a constraint. Explicitly. At least in this thread, but
more generally: about the only time it won't be a constraint is
when the files are for human consumption, in which case, I think
you'd agree, binary isn't acceptable.
Post by James Kanze
Post by Rune Allnor
The timing numbers (both absolute and relative) would be
of similar orders of magnitude if you repeated the test
with C++.
I did, and they aren't. They're actually very different in
two separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch
through some 10 GBytes of numerical data per hour. Just
reading that amount of data from a text format would
require on the order of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format
for these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't
understand, but they do take a lot of CPU time). The time
spent writing the results, even in XML, is only a small part
of the total runtime.
The read?
I don't know. It's by some other applications, in other
departments, and I have no idea what they do with the data.

You're probably right, however, that to be accurate, I should do
some comparisons including reading. For various reasons (having
to deal with possible errors, etc.), the CPU overhead when
reading is typically higher than when writing.

But I'm really only disputing your order of magnitude
differences, because they don't correspond with my experience
(nor my measurements). There's definitely more overhead with
text format. The only question is whether that overhead is more
expensive than the cost of the alternatives, and that depends
on what you're doing. Obviously, if you can't afford the
overhead (and I've worked on applications which couldn't), then
you use binary, but my experience is that a lot of people jump
to binary far too soon, because the overhead isn't that critical
that often.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm
not sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
Yep. Some formats. like IEEE 754 (and maybe descendants)
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data on the format.
To date, neither C nor C++ has made the slightest gesture in the
direction of standardizing any binary formats. There are other
(conflicting) standards which do: XDR, for example, or BER. I
personally think that adding a second set of streams, supporting
XDR, to the standard, would be a good thing, but I've never had
the time to actually write up such a proposal. And a general
binary format is quite complex to specify; it's one thing to say
you want to output a table of double, but to be standardized,
you also have to define what is output when a large mix of types
are streamed, and how much information is necessary about the
initial data in order to read them.
Post by James Kanze
(The big difference is, of course, is that while the
standard doesn't specify any encoding, there are a number of
different encodings which are supported on a lot of
different machines. Where as a raw dump of double doesn't
work even between a PC and a Sparc. Or between an older
Mac, with a Power PC, and a newer one, with an Intel chip.
Upgrade your machine, and you loose your data.)
Exactly. Which is why there ought to be a standardized binary
floating point format that is portable between platforms.
There are several: I've used both XDR and BER in applications in
the past. One of the reasons C++ doesn't address this issue is
that there are several, and C++ doesn't want to choose one over
the others.

--
James Kanze
Rune Allnor
2009-10-28 14:55:48 UTC
Post by James Kanze
The code was written very quickly, with no tricks or anything.
Just out of curiosity - would it be possible to see your code?
As far as I can tell, you haven't posted it (If you have, I have
missed it).

Rune
James Kanze
2009-10-29 10:00:05 UTC
Post by Rune Allnor
Post by James Kanze
The code was written very quickly, with no tricks or anything.
Just out of curiosity - would it be possible to see your code?
As far as I can tell, you haven't posted it (If you have, I
have missed it).
I haven't posted it because it's on my machine at home (in
France), and I'm currently working in London, and don't have
immediate access to it. Redoing it here (from memory):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

class FileOutput
{
protected:
    std::string     my_type;
    std::ofstream   my_file;
    time_t          my_start;
    time_t          my_end;

public:
    FileOutput( std::string const& type, bool is_binary = true )
        : my_type( type )
        , my_file( ("test_" + type + ".dat").c_str(),
                   is_binary ? std::ios::out | std::ios::binary
                             : std::ios::out )
    {
        my_start = time( NULL );
    }
    ~FileOutput()
    {
        my_end = time( NULL ) ;
        my_file.close();
        std::cout << my_type << ": "
                  << (my_end - my_start) << " sec." << std::endl;
    }

    virtual void output( double d ) = 0;
};

class RawOutput : public FileOutput
{
public:
    RawOutput() : FileOutput( "raw" ) {}
    virtual void output( double d )
    {
        my_file.write( reinterpret_cast< char* >(&d), sizeof(d) );
    }
};

class CookedOutput : public FileOutput
{
public:
    CookedOutput() : FileOutput( "cooked" ) {}
    virtual void output( double d )
    {
        unsigned long long const& tmp
            = reinterpret_cast< unsigned long long const& >(d);
        int shift = 64 ;
        while ( shift > 0 ) {
            shift -= 8 ;
            my_file.put( (tmp >> shift) & 0xFF );
        }
    }
};

class TextOutput : public FileOutput
{
public:
    TextOutput() : FileOutput( "text", false )
    {
        my_file.setf( std::ios::scientific,
                      std::ios::floatfield );
        my_file.precision( 17 );
    }
    virtual void output( double d )
    {
        my_file << d << '\n';
    }
};

template< typename File >
void
test( std::vector< double > const& values )
{
    File dest;
    for ( std::vector< double >::const_iterator iter = values.begin();
          iter != values.end();
          ++ iter ) {
        dest.output( *iter );
    }
}

int
main()
{
    size_t const size = 10000000;
    std::vector< double > v;
    while ( v.size() != size ) {
        v.push_back( (double)( rand() ) / (double)( RAND_MAX ) );
    }
    test< TextOutput >( v );
    test< CookedOutput >( v );
    test< RawOutput >( v );
    return 0;
}

Compiled with "cl /EHs /O2 timefmt.cc". On my local disk here,
I get:
text: 90 sec.
cooked: 31 sec.
raw: 9 sec.
The last is, of course, not significant, except that it is very
small. (I can't run it on the networked disk, where any real
data would normally go, because it would use too much network
bandwidth, possibly interfering with others. Suffice it to say
that the networked disk is about 5 or more times slower, so the
relative differences would be reduced by that amount.) I'm not
sure what's different in the code above (or the environment---I
suspect that the disk bandwidth is higher here, since I'm on a
professional PC, and not a "home computer") compared to my tests
at home (under Windows); at home, there was absolutely no
difference in the times for raw and cooked. (Cooked is, of
course, XDR format, at least on a machine like the PC, which
uses IEEE floating point.)

--
James Kanze
Rune Allnor
2009-10-29 14:02:17 UTC
Permalink
On 29 Okt, 11:00, James Kanze <***@gmail.com> wrote:
...
Compiled with "cl /EHs /O2 timefmt.cc".  On my local disk here,
    text: 90 sec.
    cooked: 31 sec.
    raw: 9 sec.
The last is, of course, not significant, except that it is very
small.  (I can't run it on the networked disk, where any real
data would normally go, because it would use too much network
bandwidth, possibly interfering with others.  Suffice it to say
that the networked disk is about 5 or more times slower, so the
relative differences would be reduced by that amount.)  I'm not
sure what's different in the code above (or the environment---I
suspect that the disk bandwidth is higher here, since I'm on a
professional PC, and not a "home computer") compared to my tests
at home (under Windows); at home, there was absolutely no
difference in the times for raw and cooked.  (Cooked is, of
course, XDR format, at least on a machine like the PC, which
uses IEEE floating point.)
Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?

If so, the raw/cooked binary formats are a bit confusing.
According to this page,

http://publib.boulder.ibm.com/infocenter/systems//index.jsp?topic=/com.ibm.aix.progcomm/doc/progcomc/xdr_datatypes.htm

the XDR data type format uses "the IEEE standard" (I can find no
mention of exactly *which* IEEE standard...) to encode both single-
precision and double-precision floating point numbers.

IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754 format
essentially is a No-Op.

Even if XDR uses some other format than IEEE754, your numbers
show one significant effect:

1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.

Since you run the test with the same amounts of data on the
same local disk with the same delay factors, this factor of ~4
in time spent on handling XDR data must be explained by
something other than mere disk IO.

The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.

The same happens - but on a larger scale - when dealing with
text-based formats:

1) Verify that the next sequence of characters represent a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number

True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-thread system where
one thread reads from disk and another thread converts the
formats while one waits for the next batch of data to arrive
from the disk, one has to do all of this sequentially in
addition to waiting for disk IO.
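
To make the comparison concrete, here is a minimal sketch of the two
decode paths (not the code under test, just an illustration; the
function names are placeholders, and I assume an in-memory buffer and
IEEE 754 doubles -- strtod() does steps 1-6 above in a single call):

#include <cstdlib>   // std::strtod
#include <cstring>   // std::memcpy

// Text: parse one number from a NUL-terminated buffer; *end is set
// to the first character after the parsed number.
double decode_text( char const* p, char** end )
{
    return std::strtod( p, end );
}

// Native binary: the bytes in the buffer already are the number.
double decode_raw( char const* p )
{
    double d;
    std::memcpy( &d, p, sizeof d );
    return d;
}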

Nah, I still think that any additional non-trivial handling
of data will impact overall IO times. In single-thread
environments, at least.

Rune
James Kanze
2009-10-29 16:36:43 UTC
Permalink
Post by Rune Allnor
...
Post by James Kanze
Compiled with "cl /EHs /O2 timefmt.cc". On my local disk here,
text: 90 sec.
cooked: 31 sec.
raw: 9 sec.
The last is, of course, not significant, except that it is
very small. (I can't run it on the networked disk, where
any real data would normally go, because it would use too
much network bandwidth, possibly interfering with others.
Suffice it to say that the networked disk is about 5 or more
times slower, so the relative differences would be reduced
by that amount.) I'm not sure what's different in the code
above (or the environment---I suspect that the disk
bandwidth is higher here, since I'm on a professional PC,
and not a "home computer") compared to my tests at home
(under Windows); at home, there was absolutely no difference
in the times for raw and cooked. (Cooked is, of course, XDR
format, at least on a machine like the PC, which uses IEEE
floating point.)
Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?
More or less. There are also caching effects, which I've not
tried to mask or control, which means that the results should be
taken with a grain of salt. More generally, there are a lot of
variables involved, and I've not made any attempts to control
any of them, which probably explains the differences I'm seeing
from one machine to the next.
Post by Rune Allnor
If so, the raw/cooked binary formats are a bit confusing.
According to this page,
http://publib.boulder.ibm.com/infocenter/systems//index.jsp?topic=/co...
the XDR data type format uses "the IEEE standard" (I can find
no mention of exactly *which* IEEE standard...) to encode both
single- precision and double-precision floating point numbers.
IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754
format essentially is a No-Op.
I'm not sure what you're referring to. My "cooked" format is a
simplified, non-portable implementation of XDR---non portable
because it only works on machines which have 64-bit long longs and
use IEEE floating point.
Post by Rune Allnor
Even if XDR uses some other format than IEEE754, your numbers
1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.
Again, that depends on the machine. On my tests at home, it
didn't. I've not had the occasion to determine where the
difference lies.
Post by Rune Allnor
Since you run the test with the same amopunts of data on the
same local disk with the same delay factors,
I don't know whether the delay factor is the same. A lot
depends on how the system caches disk accesses. A more
significant test would use synchronized writing, but
synchronized at what point?
Post by Rune Allnor
this factor ~4 of longer time spent on handling XDR data must
be explained by something else than mere disk IO.
*IF* there is no optimization, *AND* disk accesses cost nothing,
then a factor of about 4 sounds about right.
Post by Rune Allnor
The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.
The same happens - but on a larger scale - when dealing with
1) Verify that the next sequence of characters represent a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number
That's input, not output. Input is significantly harder for
text, since it has to be able to detect errors. For XDR, the
difference between input and output probably isn't significant,
since the only error that you can really detect is an end of
file in the middle of a value.
Post by Rune Allnor
True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-thread system where one
thread reads from disk and another thread converts the formats
while one waits for the next batch of data to arrive from the
disk, one have to do all of this sequentially in addition to
waiting for disk IO.
Nah, I still think that any additional non-trivial handling of
data will impact IO times of data. In single-thread
environments.
You can always use asynchronous IO:-). And what if your
implementation of filebuf uses memory mapped files?

The issues are extremely complex, and can't easily be
summarized. About the most you can say is that using text I/O
won't increase the time more than about a factor of 10, and may
increase it significantly less. (I wish I could run the tests
on the drives we usually use---I suspect that the difference
between text and binary would be close to negligible, because of
the significantly lower data transfer rates.)

--
James Kanze
Brian
2009-10-26 21:50:57 UTC
Permalink
Post by James Kanze
Post by Rune Allnor
    [...]
Post by Rune Allnor
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
Post by Rune Allnor
 My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.
And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
Post by Rune Allnor
See the example below.
Which will only be for one compiler, on one particular CPU, with
one set of compiler options.
(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering.  The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
Post by Rune Allnor
(but my data sets have generally been very mixed, with a lot
of non floating point data as well).  And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double.  And Jorgen said very
explicitly "if you have a choice".  Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures.  "Any machine" doesn't make
sense in such cases:  I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's.  (I've worked on 8 bit machines
which took on average 10 µs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)
The compiler and the library implementation also make a
significant difference.  I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15.  Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++.  The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)
In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB).  For
10000000, it was around 45 seconds under Windows (file size 250
MB).
It's interesting to note that the Windows version is clearly IO
dominated.  The difference in speed between text and binary is
pretty much the same as the difference in file size.
Post by Rune Allnor
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
50-100x difference in time.
There is no 50-100x difference.  There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlib works.  It might be using unbuffered output
for text, or synchronizing at each double.  And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches.  And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
Post by Rune Allnor
The first few lines in the text file look like
 -4.3256481e-001
 -1.6655844e+000
  1.2533231e-001
  2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits.  You
won't read back in what you've written.
Post by Rune Allnor
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't.  They're actually very different in two
separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling.  My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time).  The time spent writing the
results, even in XML, is only a small part of the total runtime.
Post by Rune Allnor
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that.  It only gives a
minimum list of characters which must be supported.  But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail. With the
portability edge taken away from text, there won't be
much reason to use text.


Brian Wood
http://webEbenezer.net

"All things (e.g. A camel's journey through
A needle's eye) are possible it's true.
But picture how the camel feels, squeezed out
In one long bloody thread from tail to snout."

C. S. Lewis
James Kanze
2009-10-28 12:42:21 UTC
Permalink
Post by Brian
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.
Which binary format? There are quite a few to choose from.
Post by Brian
With the portability edge taken away from text, there won't be
much reason to use text.
The main reason to use text is that it's an order of magnitude
easier to debug. And that's not likely to change.

--
James Kanze
mzdude
2009-10-28 16:53:21 UTC
Permalink
Post by James Kanze
The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
Is that text 8 bit ASCII, 16 bit, wchar_t, MBCS, UNICODE ... :^)
Mick
2009-10-28 18:14:58 UTC
Permalink
Post by mzdude
Post by James Kanze
The main reason to use text is that it's an order of magnitude
easier to debug. And that's not likely to change.
Is that text 8 bit ASCII, 16 bit, wchart_t, MBCS, UNICODE ... :^)
Quill & Parchment.
--
 ------------
< I'm Karmic >
 ------------
        \
         \
          ___
      {~._.~}
       ( Y )
      ()~*~()
      (_)-(_)
Brian
2009-10-28 20:38:47 UTC
Permalink
Post by Brian
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.
Which binary format?  There are quite a few to choose from.
I'm only aware of a few of them. I don't know if
it matters much to me which one is selected. It's
more that there's a standard.
Post by Brian
With the portability edge taken away from text, there won't be
much reason to use text.
The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
I was thinking that having a standard for binary would
help with debugging. I guess it is a tradeoff between
development costs and bandwidth costs.


Brian Wood
http://webEbenezer.net
Brian
2009-10-28 21:19:10 UTC
Permalink
Post by Brian
Post by James Kanze
Post by Brian
With the portability edge taken away from text, there won't be
much reason to use text.
The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
I was thinking that having a standard for binary would
help with debugging.  I guess it is a tradeoff between
development costs and bandwidth costs.
Does this perspective seem accurate? Assuming the order
of magnitude is correct, the question becomes something
like this: language A takes 10 times longer to learn than
language B, but once you learn A you can communicate in
1/3 the time it takes those using B. So those who learn
how to use A have an advantage over those who don't.


Brian Wood
James Kanze
2009-10-29 10:03:38 UTC
Permalink
Post by Brian
Post by James Kanze
Post by Brian
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.
Which binary format? There are quite a few to choose from.
I'm only aware of a few of them. I don't know if
it matters much to me which one is selected. It's
more that there's a standard.
Post by James Kanze
Post by Brian
With the portability edge taken away from text, there
won't be much reason to use text.
The main reason to use text is that it's an order of
magnitude easier to debug. And that's not likely to change.
I was thinking that having a standard for binary would help
with debugging.
It might. It would certainly encourage tools for reading it.
On the other hand: we already have a couple of standards for
binary, and I haven't seen that many tools. Part of the reason
might be because one of the most common standards, XDR, is
basically untyped, so the tools wouldn't really know how to read
it anyway. (There are tools which display certain specific uses
of XDR in human readable format, e.g. tcpdump.)

--
James Kanze
Gerhard Fiedler
2009-10-28 21:23:49 UTC
Permalink
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to demonstrate
[... Matlab code]
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.
In Matlab. This doesn't say much if anything about any other program.
Possibly Matlab has a lousy (in terms of speed) text IO.

Re the precision issue: When writing out text, there isn't really a need
to go decimal, too. Hex or octal numbers are also text. Speeds up the
conversion (probably not by much, but still) and provides a way to write
out the exact value that is in memory (and recreate that exact value --
no matter the involved precisions).
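
One concrete way of doing that is C99's hex-float conversion, assuming
the C library supports it (a sketch only; not every compiler mentioned
in this thread ships "%a", and the function names are just
placeholders):

#include <cstdio>    // std::sprintf
#include <cstdlib>   // std::strtod

// Write the exact bit pattern as text, e.g. 3.0 becomes "0x1.8p+1".
// buf must be large enough (about 32 characters suffices for double).
void write_hex( char* buf, double d )
{
    std::sprintf( buf, "%a", d );
}

// Read it back; a C99 strtod() accepts hex-float input, so the round
// trip is exact regardless of any decimal precision settings.
double read_hex( char const* buf )
{
    return std::strtod( buf, NULL );
}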

Gerhard
James Kanze
2009-10-29 10:09:01 UTC
Permalink
Post by Gerhard Fiedler
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
[... Matlab code]
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.
In Matlab. This doesn't say much if anything about any other
program. Possibly Matlab has a lousy (in terms of speed) text
IO.
Obviously, not possibly. I get a factor of between 3 and 10,
depending on the compiler and the system. I get a signficant
difference simply running what I think is the same program (more
or less) on two different machines, using the same compiler and
having the same architecture---one probably has a much higher
speed IO bus than the other, and that makes the difference.
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much, but
still) and provides a way to write out the exact value that is
in memory (and recreate that exact value -- no matter the
involved precisions).
But it defeats one of the major reasons for using text: human
readability.

--
James Kanze
Gerhard Fiedler
2009-10-29 20:18:02 UTC
Permalink
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text. Speeds
up the conversion (probably not by much, but still) and provides a
way to write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex. The most important issue is that the
fields (mantissa sign, mantissa, exponent sign, exponent, etc.) are
decoded and appropriately presented. Whether the mantissa and the
exponent are then in decimal, octal or hexadecimal IMO doesn't make much
of a difference.

Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.

Gerhard
James Kanze
2009-10-30 08:44:00 UTC
Permalink
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much,
but still) and provides a way to write out the exact value
that is in memory (and recreate that exact value -- no
matter the involved precisions).
human readability.
Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this type
of output (meant to be communication between programs) are
programmers, hence typically reasonably fluent in octal and
hex. The most important issue is that the fields (mantissa
sign, mantissa, exponent sign, exponent, etc.) are decoded and
appropriately presented. Whether the mantissa and the exponent
are then in decimal, octal or hexadecimal IMO doesn't make
much of a difference.
Agreed (sort of): I thought you were talking about outputting a
hex dump of the bytes. Separating out the mantissa and the
exponent is a simple and rapid compromise: it's not anywhere
near as readable as the normal format, but as you say, it should
be sufficient for most uses by a professional in the field.
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant difference.
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge
amounts of data, doing anything more with that data than just
a cursory look at some numbers (which IMO is fine in octal or
hex) generally needs a program anyway.
One would hope that you could start debugging with much smaller
sets of data. And if you do end up one LSB off after reading,
you'll probably want to look at the exact value.

--
James Kanze
Rune Allnor
2009-10-30 09:37:31 UTC
Permalink
Post by James Kanze
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much,
but still) and provides a way to write out the exact value
that is in memory (and recreate that exact value -- no
matter the involved precisions).
human readability.
Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this type
of output (meant to be communication between programs) are
programmers, hence typically reasonably fluent in octal and
hex. The most important issue is that the fields (mantissa
sign, mantissa, exponent sign, exponent, etc.) are decoded and
appropriately presented. Whether the mantissa and the exponent
are then in decimal, octal or hexadecimal IMO doesn't make
much of a difference.
Agreed (sort of): I thought you were talking about outputting a
hex dump of the bytes.  Separating out the mantissa and the
exponent is a simple and rapid compromize: it's not anywhere
near as readable as the normal format, but as you say, it should
be sufficient for most uses by a professional in the field.
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant different.
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge
amounts of data, doing anything more with that data than just
a cursory look at some numbers (which IMO is fine in octal or
hex) generally needs a program anyway.
One would hope that you could start debugging with much smaller
sets of data.  And if you do end up one LSB off after reading,
you'll probably want to look at the exact value.
So what do text-based formats actually buy you?

- Files are several times larger than binary dumps
- IO delays are several times (I'd say orders) slower
for text than for binary
- Human users don't benefit from the text dumps anyway,
since they are too large to be useful
- Human readers would have to make an effort to
convert text dumps to readable format

In the end, text formats require humans to do the same
work converting data to a readable format as would be
required with binary data, AND they add larger file sizes
and longer IO delays as additional nuisances.

Rune
James Kanze
2009-10-30 16:08:12 UTC
Permalink
[...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...

In sum, lower cost.

--
James Kanze
Rune Allnor
2009-10-31 14:19:10 UTC
Permalink
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
As long as you keep two factors in mind:

1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.

Those who want easy, not awfully challenging jobs might be
better off flipping burgers.

Rune
James Kanze
2009-11-02 10:12:18 UTC
Permalink
Post by Rune Allnor
Post by James Kanze
[...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time. Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder. If you're an amateur, doing this for fun, do whatever
amuses you the most. If you're a professional, selling your
services, professional ethics requires providing the best
service possible at the lowest price possible.

--
James Kanze
Brian
2009-11-02 21:05:39 UTC
Permalink
Post by Rune Allnor
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
   bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time.  Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder.  If you're an amateur, doing this for fun, do whatever
amuses you the most.  If you're a professional, selling your
services, professional ontology requires provided the best
service possible at the lowest price possible.
I'm interested in binary in this context as an
alternative to text because I believe markets and
conditions are likely to continue to be volatile for
a while. If I had more confidence in various
officials, B.O. (Obama), Putin, Ahmadinejad, etc.,
I'd be less likely to think things are going to be
volatile. I like what Rabbi Michael Healer said
when he met the governor of Texas -- Rick Perry --
a few years ago: "I didn't vote for you and I
don't trust you." I didn't vote for B.O. and I
don't trust him either.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Brian
2009-11-02 23:39:48 UTC
Permalink
Post by Brian
Post by Rune Allnor
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
   bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time.  Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder.  If you're an amateur, doing this for fun, do whatever
amuses you the most.  If you're a professional, selling your
services, professional ontology requires provided the best
service possible at the lowest price possible.
I'm interested in binary in this context as an
alternative to text because I believe markets and
conditions are likely to continue to be volatile for
a while.  
This is interesting --

http://stackoverflow.com/questions/1058051/boost-serialization-performance-text-vs-binary-format

M. Troyer, who I think is still around the Boost list,
considered using binary to be "essential."

http://lists.boost.org/Archives/boost/2002/11/39601.php

I'm not sure if those participating in this thread
come from a scientific application background as Troyer
does.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Rune Allnor
2009-11-03 19:09:35 UTC
Permalink
Post by Brian
Post by Brian
Post by Rune Allnor
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
   bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time.  Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder.  If you're an amateur, doing this for fun, do whatever
amuses you the most.  If you're a professional, selling your
services, professional ontology requires provided the best
service possible at the lowest price possible.
I'm interested in binary in this context as an
alternative to text because I believe markets and
conditions are likely to continue to be volatile for
a while.  
This is interesting --
http://stackoverflow.com/questions/1058051/boost-serialization-perfor...
M. Troyer, who I think is still around the Boost list,
considered using binary to be "essential."
http://lists.boost.org/Archives/boost/2002/11/39601.php
I'm not sure if those participating in this thread
come from a scientific application background as Troyer
does.
I used to be involved with seismic data processing. About 12
years ago, the company I worked for got the first TByte disk
stack nationwide. Before that time, the guys who went offshore
came back with truckloads of EXAByte tapes. Just loading the
tapes to the disk drives took weeks.

The application I'm working with has to do with bathymetry
map processing. 'Bathymetry' just means 'underwater terrain',
so the end product is a map of the sea floor.

There are huge amounts of data flowing through (I wouldn't
be surprised if present day 'simple' mapping tasks are comparable
to late '80s seismic processing, as far as computational through-put
is concerned), and the job is essentially real-time: A directive
to discontinue present survey activities might be received at any
time (surveying is done from general-purpose vessels), in which
case the vessel and crew need to shut down all activities and
switch focus to whatever assignment is coming up, in a matter
of minutes or hours. At best one might accept a couple of hours'
latency on the processed result after a new batch of survey
data is available, but that's it. Since any survey can go on
for indefinite lengths of time, one needs to be able to process
each data batch faster than it took to measure, or one will
accumulate backlog.

The processing is done in multiple stages, so one just can't
wait for text-based file IO to complete. Those who base their
data flow on text files are not able to complete even the
shortest survey processing within the time it takes to survey
the data - which is the essential aspect of a real-time operation.

Rune
Brian
2009-11-01 06:54:38 UTC
Permalink
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
Since a message using a text format is generally longer than
binary formats, text leaves systems more vulnerable to
network problems caused by storms, cyber attacks, etc.
I won't argue the point about it being easier to use text,
but think it's a little like buying an SUV. If the price of
gas goes way up, many wish they had never bought an SUV.
Using binary might be a way to mitigate the pain caused by
volatile markets/conditions.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Gerhard Fiedler
2009-11-01 20:32:03 UTC
Permalink
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV. If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially unstable
network connection, simple binary is pretty bad. With text encoding
(could be e.g. base64 encoded binary, or pretty much everything else
that's guaranteed not to use all available symbols), you have a few
symbols left that you can use for stream synchronization. This is in
general much more important than a few bytes more to transmit. This may
even be important when storing data on disk: the chances of recovering
data if there's a problem are much higher if you have sync symbols in the
data stream.

There's a point for (simple) binary protocols when all you have is an
8bit microcontroller with 100 bytes of RAM and 1k of Flash. But you
typically don't program these in standard-compliant C++ :)

IMO this has nothing to do with SUVs... more with seat belts, if you
really want an automotive analogy. While they add weight to the vehicle,
and on (very) rare occasions may complicate things if there's a problem,
in most problem cases they can save your face, and more. (Which, back to
programming, may save your job -- and with it the payments for your SUV.
Now here we're back to the SUV :)

Gerhard
Brian
2009-11-02 20:04:28 UTC
Permalink
Post by Gerhard Fiedler
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV.  If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially unstable
network connection, simple binary is pretty bad. With text encoding
(could be e.g. base64 encoded binary, or pretty much everything else
that's guaranteed not to use all available symbols), you have a few
symbols left that you can use for stream synchronization. This is in
general much more important that a few bytes more to transmit. This may
even be important when storing data on disk: the chances of recovering
data if there's a problem is much higher if you have sync symbols in the
data stream.
If it were just a "few bytes more" I wouldn't be saying
anything. Likewise the difference between an SUV and
a fuel efficient vehicle isn't trivial. People wouldn't
be wishing they had never bought an SUV if that were
the case.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Gerhard Fiedler
2009-11-03 10:14:49 UTC
Permalink
Post by Gerhard Fiedler
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV.  If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially
unstable network connection, simple binary is pretty bad. With text
encoding (could be e.g. base64 encoded binary, or pretty much
everything else that's guaranteed not to use all available symbols),
you have a few symbols left that you can use for stream
synchronization. This is in general much more important that a few
bytes more to transmit. This may even be important when storing data
on disk: the chances of recovering data if there's a problem is much
higher if you have sync symbols in the data stream.
If it were just a "few bytes more" I wouldn't be saying anything.
Likewise the difference between an SUV and a fuel efficient vehicle
isn't trivial. People wouldn't be wishing they had never bought an
SUV if that were the case.
It is longer, but you were talking about unreliable networks. And
resyncing a binary stream is by design very problematic. Since you often
don't know beforehand the length of records (think strings), you have
length information encoded in your binary stream. If one length field is
bad and unrecoverable, pretty much the complete rest of the stream is
unreadable because you're out of sync from that point on. This is also
valid for data on disks.

Now, if you used an encoding with a few unused symbols, you can use
those symbols to add synchronization markers (records, whatever), and
even if a length field is bad, you maybe lost a record but not the whole
remainder of the stream.

On unreliable networks, I take that any day over the size advantage of
raw binary. Of course, this is not about text vs binary, this is about
whether raw binary is the best choice for unreliable networks. It isn't.

If you want both (speed and reliability), you'd create a custom encoding
that leaves only a few symbols unused that you then can use for syncing.
But raw binary is not a good choice over unreliable networks.
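
This is roughly what SLIP and HDLC-style framing do. A toy sketch of
the idea (my own illustration, not something anyone here has posted:
reserve one byte value as a frame marker and one as an escape, and
byte-stuff the payload so the marker can never occur inside a frame):

#include <string>

char const MARK = '\x7E';   // frame delimiter, never appears inside a frame
char const ESC  = '\x7D';   // escape byte

// Wrap a raw binary payload in a resynchronizable frame.
std::string frame( std::string const& payload )
{
    std::string out( 1, MARK );
    for ( std::string::size_type i = 0; i < payload.size(); ++ i ) {
        char c = payload[ i ];
        if ( c == MARK || c == ESC ) {
            out += ESC;
            out += static_cast< char >( c ^ 0x20 );   // transpose reserved byte
        } else {
            out += c;
        }
    }
    out += MARK;
    return out;
}

A receiver that loses sync just scans forward to the next MARK byte and
starts parsing again from there; at worst one frame is lost, not the
rest of the stream.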

And I still think that this has nothing to do with SUVs. How many people
do you know that are wishing they never had used a text protocol? How
many are there wishing they never had used raw binary over an unreliable
network link?

Gerhard
Brian
2009-11-03 17:35:47 UTC
Permalink
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV.  If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially
unstable network connection, simple binary is pretty bad. With text
encoding (could be e.g. base64 encoded binary, or pretty much
everything else that's guaranteed not to use all available symbols),
you have a few symbols left that you can use for stream
synchronization. This is in general much more important that a few
bytes more to transmit. This may even be important when storing data
on disk: the chances of recovering data if there's a problem is much
higher if you have sync symbols in the data stream.
If it were just a "few bytes more" I wouldn't be saying anything.
Likewise the difference between an SUV and a fuel efficient vehicle
isn't trivial.  People wouldn't be wishing they had never bought an
SUV if that were the case.
It is longer, but you were talking about unreliable networks. And
resyncing a binary stream is by design very problematic. Since you often
don't know beforehand the length of records (think strings), you have
length information encoded in your binary stream.
Yes.
Post by Gerhard Fiedler
If one length field is
bad and unrecoverable, pretty much the complete rest of the stream is
unreadable because you're out of sync from that point on. This is also
valid for data on disks.
I think there are ways to avoid that. Sentinel values are
often used in binary streams. If you get to the end of a
message and don't find the sentinel, you can scan until
you do find it. It's true that you may find a false
positive with binary, but the whole stream isn't lost.
Additionally, the message length can be embedded two times.
If the two lengths match, then an errant sublength within
the message won't cause any trouble to the whole stream,
but it may make it impossible to interpret one message.
If the two message lengths don't match then you have to
do some checking. If you have a max message length, you
check both values against that. If both are less than
that you would have to proceed with caution.
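A sketch of the kind of header and trailer I have in mind (the field
layout, names and limits are hypothetical, purely to illustrate the
cross-checks; real code would also pick a byte order and write the
fields out explicitly):

#include <stdint.h>

uint32_t const SENTINEL   = 0xDEADBEEF;    // end-of-message marker
uint32_t const MAX_LENGTH = 1024 * 1024;   // agreed maximum message size

struct MessageHeader
{
    uint32_t length1;   // payload length, first copy
    uint32_t length2;   // payload length, second copy
};
// ... 'length1' payload bytes follow, then SENTINEL as a uint32_t ...

// Plausibility check before trusting a header read off the wire.
bool header_looks_sane( MessageHeader const& h )
{
    return h.length1 == h.length2
        && h.length1 <= MAX_LENGTH;
}

// If the lengths disagree, or the sentinel is missing at the expected
// offset, scan forward until a SENTINEL pattern is found and try the
// next header just after it.
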
Post by Gerhard Fiedler
Now, if you used an encoding with a few unused symbols, you can use
those symbols to add synchronization markers (records, whatever), and
even if a length field is bad, you maybe lost a record but not the whole
remainder of the stream.
On unreliable networks, I take that any day over the size advantage of
raw binary. Of course, this is not about text vs binary, this is about
whether raw binary is the best choice for unreliable networks. It isn't.
Just saying "it isn't" doesn't convince me.
Post by Gerhard Fiedler
If you want both (speed and reliability), you'd create a custom encoding
that leaves only a few symbols unused that you then can use for syncing.
But raw binary is not a good choice over unreliable networks.
And I still think that this has nothing to do with SUVs. How many people
do you know that are wishing they never had used a text protocol? How
many are there wishing they never had used raw binary over an unreliable
network link?
I don't know any in either of those two categories.
Some predicted spiking oil prices 10 years ago and
they based their decisions on those predictions.
Something similar may happen with bandwidth prices.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net


I read today of a man who was fired for saying,
"I think homosexuality is bad stuff."
http://www.wnd.com/index.php?fa=PAGE.view&pageId=114779
I agree with him - it is bad stuff.
Gerhard Fiedler
2009-11-03 20:14:12 UTC
Permalink
Post by Brian
Post by Gerhard Fiedler
And I still think that this has nothing to do with SUVs. How many
people do you know that are wishing they never had used a text
protocol? How many are there wishing they never had used raw binary
over an unreliable network link?
I don't know any in either of those two categories.
Wasn't it you who wrote "People wouldn't be wishing they had never
bought an SUV if that were the case", while using the analogy of text
format and SUVs? I thought you'd know at least "people" who wished they
had used binary -- if not, how do you get to the analogy in the first
place?
Post by Brian
Some predicted spiking oil prices 10 years ago and they based their
decisions on those predictions. Something similar may happen with
bandwidth prices.
Right, may. In general, when programming, I don't base my decisions on
such "predictions". If you take all those predictions made, you get
probably more misses than hits. I tend to try to get more hits than
misses when programming... this is better for the near-term financial
situation, and I can know this without making any shaky predictions :)

Gerhard
Gerhard Fiedler
2009-10-30 11:35:57 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text.
Speeds up the conversion (probably not by much, but still) and
provides a way to write out the exact value that is in memory (and
recreate that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant
to be communication between programs) are programmers, hence
typically reasonably fluent in octal and hex. The most important
issue is that the fields (mantissa sign, mantissa, exponent sign,
exponent, etc.) are decoded and appropriately presented. Whether the
mantissa and the exponent are then in decimal, octal or hexadecimal
IMO doesn't make much of a difference.
Agreed (sort of): I thought you were talking about outputting a hex
dump of the bytes. Separating out the mantissa and the exponent is a
simple and rapid compromize: it's not anywhere near as readable as
the normal format, but as you say, it should be sufficient for most
uses by a professional in the field.
I think the biggest advantage of doing it this way is that the text
representation makes it portable between different binary floating point
formats, and that the octal or hex representation avoids any rounding
problems and maintains the exact value, independently of precision and
other details of the binary representation (on both sides).
Having done that, however, I suspect that on most machines, outputting
the different fields in decimal, rather than hex, would probably not
make a significant different.
That may well be. But the rounding aspect is still a problem.
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.
One would hope that you could start debugging with much smaller sets
of data. And if you do end up one LSB off after reading, you'll
probably want to look at the exact value.
Sure. You always can use debug flags for outputting debug values.

Gerhard
James Kanze
2009-10-30 16:16:01 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there
isn't really a need to go decimal, too. Hex or octal
numbers are also text. Speeds up the conversion
(probably not by much, but still) and provides a way to
write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
human readability.
Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this
type of output (meant to be communication between programs)
are programmers, hence typically reasonably fluent in octal
and hex. The most important issue is that the fields
(mantissa sign, mantissa, exponent sign, exponent, etc.)
are decoded and appropriately presented. Whether the
mantissa and the exponent are then in decimal, octal or
hexadecimal IMO doesn't make much of a difference.
Agreed (sort of): I thought you were talking about
outputting a hex dump of the bytes. Separating out the
it's not anywhere near as readable as the normal format, but
as you say, it should be sufficient for most uses by a
professional in the field.
I think the biggest advantage of doing it this way is that the
text representation makes it portable between different binary
floating point formats, and that the octal or hex
representation avoids any rounding problems and maintains the
exact value, independently of precision and other details of
the binary representation (on both sides).
Post by James Kanze
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant different.
That may well be. But the rounding aspect is still a problem.
No. You're basically outputting (and reading) two integers: the
exponent (expressed as a power of two), and the mantissa
(expressed as the actual value times some power of two,
depending on the number of bits). For an IEEE double, for
example, you'd do something like:

MyOStream&
operator<<( MyOStream& dest, double value )
{
unsigned long long const& u
    = reinterpret_cast< unsigned long long const& >( value );
dest << ((u & 0x8000000000000000) != 0 ? '-' : '+')   // sign bit
     << (u & 0x000FFFFFFFFFFFFF) << 'b'               // 52 bit mantissa field
     << ((u >> 52) & 0x7FF);                          // 11 bit biased exponent
return dest;
}
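
A minimal sketch of the matching reader would be something like the
following (untested, and assuming the writer above, IEEE 754 doubles on
both ends, and a hypothetical formatted input stream "MyIStream"
mirroring MyOStream):

MyIStream&
operator>>( MyIStream& src, double& value )
{
    char sign;
    unsigned long long mantissa;
    char marker;
    unsigned long exponent;
    src >> sign >> mantissa >> marker >> exponent;
    if ( src && (sign == '+' || sign == '-') && marker == 'b' ) {
        unsigned long long u
            = mantissa
            | (unsigned long long)( exponent & 0x7FF ) << 52
            | (sign == '-' ? 0x8000000000000000ULL : 0ULL);
        value = reinterpret_cast< double const& >( u );
    }
    return src;
}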

--
James Kanze
Gerhard Fiedler
2009-10-31 11:19:44 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant different.
That may well be. But the rounding aspect is still a problem.
Ah, of course... :) <slap on forehead>

Gerhard
Jorgen Grahn
2009-11-04 21:47:48 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text. Speeds
up the conversion (probably not by much, but still) and provides a
way to write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex.
I disagree there, in two ways:

- I belong to the school that claims protocols should be human-readable,
because, well, it opens them up. They get so much easier to
manipulate, and even talk about. Take HTTP as an example, or SMTP.

- I doubt that programmers are that good with hex. Even if I limit
myself to unsigned int, I can't tell what 0xbabe is. Probably 40000
or so. Or 30000? Who knows? There is a reason decimal is the default
base in pretty much every language I know of ... including assembly
languages.

...
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.
But for the text version of the data, that "program" is often a Unix
pipeline involving tools like grep, sort and uniq, or a Perl one-liner
you make up as you go. Or it can be fed directly into gnuplot or
Excel. If the data is binary, you probably simply won't bother.

I think we have been misled a bit here, too. I haven't read the whole
thread, but it started with something like "dump a huge array of
floats to disk, collect it later". If you take the more common case
"take this huge complex data structure and dump it to disk in a
portable format", you have a completely different situation, where the
non-text format isn't that much smaller or faster.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Brian
2009-11-05 23:36:04 UTC
Permalink
Post by Jorgen Grahn
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text. Speeds
up the conversion (probably not by much, but still) and provides a
way to write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex.
- I belong to the school that claims protocols should be human-readable,
  because, well, it opens them up.  They get so much easier to
  manipulate, and even talk about.  Take HTTP as an example, or SMTP.
- I doubt that programmers are that good with hex.  Even if I limit
  myself to unsigned int, I can't tell what 0xbabe is.  Probably 40000
  or so. Or 30000?  Who knows?  There is a reason decimal is the default
  base in pretty much every language I know of ... including assembly
  languages.
...
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.
But for the text version of the data, that "program" is often a Unix
pipeline involving tools like grep, sort and uniq, or a Perl one-liner
you make up as you go.  Or it can be fed directly into gnuplot or
Excel. If the data is binary, you probably simply won't bother.
I think we have been misled a bit here, too. I haven't read the whole
thread, but it started with something like "dump a huge array of
floats to disk, collect it later".  If you take the more common case
"take this huge complex data structure and dump it to disk in a
portable format", you have a completely different situation, where the
non-text format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved
in those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?

Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs." I don't think anyone has
directly disagreed with that statement yet.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net

"How much better is it to get wisdom than gold! and to
get understanding rather to be chosen than silver!"
Proverbs 16:16
James Kanze
2009-11-06 09:03:33 UTC
Permalink
[...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later". If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case". Certainly, most common
cases do include a lot of text data. On the other hand, the
origin of this thread was dumping doubles: purely numeric data.
And while perhaps less common, they do exist, and aren't really
rare either. (I've encountered them once or twice in my career,
and I'm not a numerics specialist.)
Post by Brian
Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs." I don't think anyone has
directly disagreed with that statement yet.
The original requirement, if I remember correctly, included
rereading the data with no loss of precision. This means 17
digits precision for an IEEE double, with an added sign, decimal
point and four or five characters for the exponent (using
scientific notation). Add a separator, and that's 24 or 25
bytes, rather than 8. So the 20% is off; 33% seems to be the
lower limit. But in a lot of cases, that's a lot; it's
certainly something that has to be considered in some
applications.
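
To put numbers on that, a quick sketch (17 significant digits means
std::scientific with a precision of 16; exactness of the round trip
assumes the library's conversions are correctly rounded, which is the
usual case):

#include <iomanip>
#include <iostream>
#include <sstream>

int main()
{
    double value = -1.0 / 3.0;

    std::ostringstream out;
    // 1 digit before the point + 16 after = 17 significant digits.
    out << std::scientific << std::setprecision( 16 ) << value;

    double back;
    std::istringstream( out.str() ) >> back;

    std::cout << "text: " << out.str() << "\n"
              << "bytes as text (plus one separator): "
              << out.str().size() + 1 << "\n"     // typically 24 or 25
              << "bytes as binary: " << sizeof value << "\n"
              << "lossless round trip: "
              << (back == value ? "yes" : "no") << "\n";
    return 0;
}
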

--
James Kanze
Rune Allnor
2009-11-06 16:51:03 UTC
Permalink
    [...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later".  If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures.  But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case".
As I recall, I started by a purely technical question about
binary typecasts. Others started bringing in text formats.
I have only attempted to explain - in vain, it seems - why
text-based numerical formats are a no-go in technical
applications.
 Certainly, most common
cases do include a lot of text data.
I am not talking about 'common' cases. I am talking about heavy-duty
work. Once you are talking about numeric data in the hundreds of
MBytes
(regardless of the storage format), any amount of accompanying text
is irrelevant. One page of plain text takes about 2 kbytes.

There was, in fact, an 'improvement' to the ancient SEG-Y seismic
data format,

http://en.wikipedia.org/wiki/SEG_Y

the SEG-2,

http://diwww.epfl.ch/lami/detec/seg2.html

where a lot of the auxiliary (numeric) information was specified
to be stored in text format. I first saw the SEG-2 spec about ten
years ago, but I have never heard that it has actually been used.
The speed losses involved with converting data back and forth from
text to binary would fully explain why SEG-2 has not gained
widespread acceptance among the heavy-duty users.

Rune
James Kanze
2009-11-08 14:27:41 UTC
Permalink
Post by Rune Allnor
Post by James Kanze
[...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later". If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case".
As I recall, I started by a purely technical question about
binary typecasts.
Which, of course, raises the question as to why. They're not
very useful unless you're doing exceptionally low level work.
Post by Rune Allnor
Others started bringing in text formats.
The original comment was just that---a parenthetical comment.
Text formats have many advantages, WHEN you can use them. It's
also obvious that they have additional overhead---not nearly as
much as you claimed in terms of CPU, but they aren't free
either, neither in CPU time nor in data size.
Post by Rune Allnor
I have only attempted to explain - in vain, it seems - why
text-based numerical formats is a no-go in technical
applications.
And you blew it by giving exaggerated figures:-). Other than
that: they're not a no-go in technical applications. They do
have too much overhead for some applications (not all), and in
such cases, you have to use a binary format. Depending on other
requirements (portability, external requirements, etc.), you may
need a more or less complicated binary format.
Post by Rune Allnor
Post by James Kanze
Certainly, most common cases do include a lot of text data.
I am not talking about 'common' cases. I am talking about
heavy-duty work. Once you are talking about numeric data in
the hundreds of MBytes (regardless of the storage format), any
amount of accompagnying text is irrelevant. One page of plain
text takes about 2 kbytes.
Yes. I understand that.

In fact, now that you've mentioned seismic data, I agree that a
text format is probably not going to cut it. I've actually
worked on one project in the field, and I know just how much
floating point data they can generate.

--
James Kanze
Rune Allnor
2009-11-08 17:11:23 UTC
Permalink
On 8 Nov, 15:27, James Kanze <***@gmail.com> wrote:

I'm getting tired of reiterating this for people who
are not interested in actually evaluating the numbers.

Look for an upcoming post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.

Rune
Brian Wood
2009-11-08 22:15:48 UTC
Permalink
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
Look for an upcomimg post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.
Rune
I took the liberty of copying your post from clc++m to here
as this newsgroup is faster as far as getting the posts out
there.


Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about a
technicality of binary file IO. Over the course of the discussion, I
discovered to my amazement - and, quite frankly, horror - that there
seems to be a school of thought that text-based storage formats are
universally preferable to binary formats for reasons of portability
and human readability.

The people who presented such ideas appeared not to appreciate two
details that counter any benefits text-based numerical formats might
offer:

1) Binary files are about 20%-70% of the size of the text files,
   depending on the number of significant digits stored in the text
   files and other formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read
   and write than binary formats.

Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.

I have therefore sketched a 'distilled' test (code below) to measure
the overheads involved in formatting numerical data back and forth
between text and binary formats. To eliminate the impact of
peripheral devices, I have used a std::stringstream to store the
data. The binary buffers are represented by vectors, and I have
assumed that a memcpy from the file buffer to the destination memory
location is all that is needed to import the binary format from the
file buffer. (If there are significant run-time overheads associated
with moving NATIVE binary formats to the destination, please let me
know.)

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers for IO
cycles:

Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text:  16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of
160e-9/6e-9 = 26x associated with the text formats.

Add a little bit of other overheads, e.g. caused by the significantly
larger text file sizes in combination with suboptimal buffering
strategies, and the relative numbers easily hit the triple digits.
Not at all insignificant when one works with large amounts of data
under tight deadlines.

So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune

/***************************************************************************/
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
#include <vector>

// Returns the current local time as a string, without the trailing '\n'
// that asctime() appends.
std::string now()
{
    time_t rawtime;
    time( &rawtime );
    std::string message( asctime( localtime( &rawtime ) ) );
    message.erase( message.size() - 1 );
    return message;
}

int main()
{
    const size_t NumElements = 1000000;
    std::vector<double> SourceBuffer;
    std::vector<double> DestinationBuffer;

    for ( size_t n = 0; n < NumElements; ++n )
    {
        SourceBuffer.push_back( n );
        DestinationBuffer.push_back( 0 );
    }

    std::cout << now() << " : Binary IO cycles started" << std::endl;

    // "Binary" transfer: the native representation is copied straight from
    // the source buffer to the destination, as a memcpy from a file buffer
    // would do.
    const size_t NumBinaryIOCycles = 1000;
    for ( size_t n = 0; n < NumBinaryIOCycles; ++n )
    {
        for ( size_t m = 0; m < NumElements; ++m )
        {
            DestinationBuffer[m] = SourceBuffer[m];
        }
    }

    std::cout << now() << " : " << NumBinaryIOCycles
              << " Binary IO cycles completed" << std::endl;

    // Text-format transfer: each value is formatted into a stringstream
    // with enough digits for a lossless round trip, then parsed back out.
    const size_t NumTextFormatIOCycles = 100;

    std::cout << now() << " : Text-format IO cycles started" << std::endl;

    for ( size_t n = 0; n < NumTextFormatIOCycles; ++n )
    {
        std::stringstream ss;
        ss << std::setprecision( 17 );

        for ( size_t m = 0; m < NumElements; ++m )
        {
            ss << SourceBuffer[m] << ' ';   // separator, so the values can be parsed back
        }

        size_t m = 0;
        while ( m < NumElements && ss >> DestinationBuffer[m] )
        {
            ++m;
        }
    }

    std::cout << now() << " : " << NumTextFormatIOCycles
              << " Text-format IO cycles completed" << std::endl;

    return 0;
}


Brian Wood
Brian Wood
2009-11-08 22:44:35 UTC
Permalink
Post by Brian Wood
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
Look for an upcomimg post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.
Rune
I took the liberty of copying your post from clc++m to here
as this newsgroup is faster as far as getting the posts out
there.
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality
about binary file IO. Over the course of the discussion, I discovered
to my
amazement - and, quite frankly, horror - that there seems to be a
school of
thought that text-based storage formats are universally preferable to
binary text
formats for reasons of portability and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?


Brian Wood
James Kanze
2009-11-09 01:14:42 UTC
Permalink
[...]
Post by Brian Wood
Post by Brian Wood
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary text formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
Binary formats are an optimization: you sometimes need this
optimization (and you certainly should be aware of the
possibility of using it), but you don't use them unless timing
or data size constraints make it necessary.

--
James Kanze
Rune Allnor
2009-11-09 10:57:48 UTC
Permalink
    [...]
Post by Brian Wood
Post by Brian Wood
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary text formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed.  Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
*You* are accusing *me* of missing the fine print??!!

Let's see what I have written. From my post

http://groups.google.no/group/comp.lang.c++/msg/1c4004bbac86a046

[RA] > > File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
[JK] > A lot of time compared to what?

[RA] Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.

...
[RA] > > The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).
[JK] > Try it on what machine:-).

[RA] Any machine. The problem is to decode text-formatted numbers
to binary.

...
Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

[matlab code snipped]

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.

...
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
...
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour.

I think these excerpts should be sufficient to sketch what
kind of world I am living and working in.

Do note that I never - unlike some other participants in this
thread - claimed my numbers to be exact. I am fairly certain
my English is good enough that the above would reasonably be
expected to be interpreted by a reader as *representative*
numbers. If you look closely, I also commented that coding
up a program in C++ instead of matlab as I had done, would
result in *different* numbers, but not solve the fundamental
problem.

So I can't see any reason why you attack me for my numbers
being "wrong"; I never stated they were exact.

A few posts further out:

http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6

[RA] So what does text-based formats actually buy you?
[JK] Shorter development times, less expensive development, greater
reliability...

In sum, lower cost.

[RA] As long as you keep two factors in mind:
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.

[JK] The user pays for your time. Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.

This one really pissed me off. Here I had explained to you
what application I am working with, made you aware of the users'
requirements in the operational situation, and you explicitly
state that paying attention to such concerns is 'borderline fraud'!

So I can not interpret this in any other way than that you will
use text-based formats, come hell or high water. Which essentially
invalidates any otherwise relevant arguments you might have presented
throughout the thread.
No, it's not. The selection of file formats is a strategic design
decision on a par with choosing between a binary O(lg N) and a linear
O(N) search, or between an O(N lg N) quick sort and an O(N^2)
bubble sort algorithm.

Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.

True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
you sometimes need this
optimization (and you certainly should be aware of the
possibility of using it), but you don't use them unless timing
or data size constraints make it necessary.
Hypocrite!

This is exactly what I have been arguing for days and weeks already.
What changed?

Rune
James Kanze
2009-11-09 13:37:53 UTC
Permalink
Post by Rune Allnor
Post by James Kanze
[...]
Post by Brian Wood
Post by Brian Wood
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary text formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
*You* are accusing *me* of missing the fine print??!!
[...]
Post by Rune Allnor
I think these excerpts should be sufficient to sketch what
kind of world I am living and working in.
I fully understand what kind of world you're working in. As a
consultant, I've worked on seismic applications too, albeit not
recently.
Post by Rune Allnor
Do note that I never - unlike some other participants in this
thread - claimed my numbers to be exact.
Off by more than an order of magnitude is not just a question of
"exact".
Post by Rune Allnor
I am fairly certain my English is good enough that the above
would reasonably be expected to be interpreted by a reader as
*representative* numbers. If you look closely, I also
commented that coding up a program in C++ instead of matlab as
I had done, would result in *different* numbers, but not solve
the fundamental problem.
So I can't see any reason why you attack me for my numbers
being "wrong"; I never stated they were exact.
First, I didn't "attack" you. On the whole, I understand your
problem. Stating that the difference is some 100 times is
misleading, however.
Post by Rune Allnor
http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6
[RA] So what does text-based formats actually buy you?
[JK] Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something
which
results in a less reliable program, and that he doesn't need,
is
irresponsible, and borders on fraud.
This one really pissed me off. Here I had explained to you
what application I am working with, made you aware of the
users requirements in the operational situation, and you
explicitly state that paying attention to such concerns is
'borderline fraud'!
I didn't say that. I said that ignoring issues of development
time and reliability is fraud. You have to make a trade off; if
text based IO isn't sufficiently fast for the users needs, or
requires too much additional space, then you use binary. But
you consider the cost of doing so, and weigh it against the
other costs.
Post by Rune Allnor
So I can not interpret this in any other way than that you
will use text-based formats, come hell or high water.
How do you read that into anything I've said. I've simply
pointed out that using text does buy you something, or in other
words, using binary has a cost. There's no doubt that using
text has other costs. Engineering is about weighing the
difference costs; if you don't know what text based formats buy
you, then you can't weigh the costs accurately.
Post by Rune Allnor
Which essentially invalidate any otherwise relevant arguments
you might have presented throughout thread.
No, it's not. The selection of file formats is a strategic desing
decision on a par with using binary O(lgN) or linear O(N) search
engines; like choosing betweene a O(NlgN) quick sort or a O(N^2)
bubble sort algorithm.
Which are also optimizations:-).

There are optimizations and optimizations. Sometimes you do
know up front that you'll need the optimization; if you know
that you'll have to deal with millions of elements, you know up
front that a quadratic algorithm won't do the trick.

In the case of choosing binary, the motivation for doing so up
front is a bit different---after all, the difference will never
be other than linear. Partially, the motivation can be
calculated: if you know the number of elements, you can
calculate the disk space needed up front. In many cases,
however, you know that you'll be locked into the format you
choose, so you have to consider performance issues earlier.
Once you start considering performance issues, however, you're
talking about optimization.
Post by Rune Allnor
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
There you go exaggerating again. It's not orders of magnitude
slower. At the most, it's around 10 times slower, and often the
difference is even less. That doesn't mean that it's irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).

--
James Kanze
Brian
2009-11-09 22:31:20 UTC
Permalink
Post by Rune Allnor
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
There you go exagerating again.  It's not orders of magnitude
slower.  At the most, it's around 10 times slower, and often the
difference is even less.  That doesn't mean that its irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).
This Gianni Mariani quote indicates he saw some
differences of more than 10x.

"However, reading and writing binary files can have HUGE
performance gains. I once came across some numerical code
where it would read and write large datasets. These datasets
were 40-100MB. The performance was horrendus. Using mapped
files and binary data made the reading and writing virtually
zero cost and it improved the performance of the product by
nearly 10x times and in some tests over 1000x. Be careful -
this is one application and the bottle neck was clearly
identified. This may not be where your application spends
its time."
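
(Gianni's code isn't shown, but the mapped-file idea itself is simple.
A rough POSIX-only sketch, assuming the file -- here a hypothetical
"data.bin" -- contains nothing but native-format doubles, so "reading"
reduces to pointer arithmetic over the mapping:)

#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <fcntl.h>      // open
#include <unistd.h>     // close
#include <cstddef>
#include <cstdio>

int main()
{
    int fd = open( "data.bin", O_RDONLY );
    if ( fd < 0 ) { std::perror( "open" ); return 1; }

    struct stat info;
    if ( fstat( fd, &info ) != 0 ) { std::perror( "fstat" ); close( fd ); return 1; }

    void* mapping = mmap( 0, info.st_size, PROT_READ, MAP_SHARED, fd, 0 );
    if ( mapping == MAP_FAILED ) { std::perror( "mmap" ); close( fd ); return 1; }

    // No parsing and no copying: the doubles are used straight out of the
    // mapping, and pages are faulted in on demand as the loop touches them.
    const double* values = static_cast< const double* >( mapping );
    std::size_t count = info.st_size / sizeof( double );

    double sum = 0.0;
    for ( std::size_t i = 0; i < count; ++i ) {
        sum += values[i];
    }
    std::printf( "%lu doubles, sum = %g\n", (unsigned long)count, sum );

    munmap( mapping, info.st_size );
    close( fd );
    return 0;
}
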


I hope to beef up the C++ Middleware Writer's support
for writing and reading data more generally. To begin
with I'm going to focus on integral types and assume
8 bit bytes. Currently we don't have support for uint8_t,
uint16_t, etc. I guess those are the types I'll start with.
I'm going through the newsgroup archives to find snippets
that are helpful in this area. If anyone has a link wrt
this, I'm interested.
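
For those integral types, the usual portable trick is to take the value
apart and put it back together arithmetically in a fixed byte order, so
the host's endianness never enters into it. A rough sketch for uint32_t
(assuming 8-bit bytes, as above; writeU32/readU32 are just illustrative
names, and uint16_t follows the same pattern with two bytes):

#include <stdint.h>     // uint32_t (<cstdint> on newer compilers)
#include <istream>
#include <ostream>

// Write the value most-significant byte first, regardless of host byte order.
inline void writeU32( std::ostream& out, uint32_t value )
{
    char bytes[4];
    bytes[0] = static_cast< char >( (value >> 24) & 0xFF );
    bytes[1] = static_cast< char >( (value >> 16) & 0xFF );
    bytes[2] = static_cast< char >( (value >>  8) & 0xFF );
    bytes[3] = static_cast< char >(  value        & 0xFF );
    out.write( bytes, 4 );
}

// Read four bytes and reassemble them; returns false on a short read.
inline bool readU32( std::istream& in, uint32_t& value )
{
    unsigned char bytes[4];
    if ( !in.read( reinterpret_cast< char* >( bytes ), 4 ) ) {
        return false;
    }
    value = (static_cast< uint32_t >( bytes[0] ) << 24)
          | (static_cast< uint32_t >( bytes[1] ) << 16)
          | (static_cast< uint32_t >( bytes[2] ) <<  8)
          |  static_cast< uint32_t >( bytes[3] );
    return true;
}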


Brian Wood
http://www.webEbenezer.net
James Kanze
2009-11-09 22:57:46 UTC
Permalink
Post by Brian
Post by Rune Allnor
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
There you go exagerating again.  It's not orders of magnitude
slower.  At the most, it's around 10 times slower, and often the
difference is even less.  That doesn't mean that its irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).
This Gianni Mariani quote indicates he saw some
differences of more than 10x.
"However, reading and writing binary files can have HUGE
performance gains.  I once came across some numerical code
where it would read and write large datasets. These datasets
were 40-100MB.  The performance was horrendus.  Using mapped
files and binary data made the reading and writing virtually
zero cost and it improved the performance of the product by
nearly 10x times and in some tests over 1000x.  Be careful -
this is one application and the bottle neck was clearly
identified.  This may not be where your application spends
its time."
I hope to beef up the C++ Middleware Writer's support
for writing and reading data more generally.  To begin
with I'm going to focus on integral types and assume
8 bit bytes.  Currently we don't have support for uint8_t,
uint16_t, etc.  I guess those are the types I'll start with.
I'm going through the newsgroup archives to find snippets
that are helpful in this area.  If anyone has a link wrt
this, I'm interested.
Brian Wood
http://www.webEbenezer.net
James Kanze
2009-11-09 01:10:58 UTC
Permalink
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
I actually did some measurements, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implementation to the next.
Post by Rune Allnor
Look for an upcomimg post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with its
moderation policy (as currently practiced).

--
James Kanze
Alf P. Steinbach
2009-11-09 05:06:09 UTC
Permalink
Post by James Kanze
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
I actually did some measures, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implemenation to the next.
Post by Rune Allnor
Look for an upcomimg post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with its
moderation policy (as currently practiced).
Would you care to elaborate on that hinting, please.


Cheers,

- Alf
James Kanze
2009-11-09 13:41:43 UTC
Permalink
Post by Alf P. Steinbach
Post by James Kanze
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
I actually did some measures, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will
vary enormously from one implemenation to the next.
Post by Rune Allnor
Look for an upcomimg post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with
its moderation policy (as currently practiced).
Would you care to elaborate on that hinting, please.
"Not everyone" means "at least me". I stopped participating in
the group because I found the moderation was becoming too heavy
in some cases. Others, I know, aren't bothered with it. To
each his own.

--
James Kanze
Brian
2009-11-06 19:54:01 UTC
Permalink
    [...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later".  If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures.  But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case".  Certainly, most common
cases do include a lot of text data.  On the other hand, the
origine of this thread was dumping doubles: purely numeric data.
And while perhaps less common, they do exist, and aren't really
rare either.  (I've encountered them once or twice in my career,
and I'm not a numerics specialist.)
I've worked on one scientific application for a little over
six months. I hope to work with/on more scientific projects
in the future.
Post by Brian
Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs."  I don't think anyone has
directly disagreed with that statement yet.
The original requirement, if I remember correctly, included
rereading the data with no loss of precision.  This means 17
digits precision for an IEEE double, with an added sign, decimal
point and four or five characters for the exponent (using
scientific notation).  Add a separator, and that's 24 or 25
bytes, rather than 8.  So the 20% is off; 33% seems to be the
lower limit.  But in a lot of cases, that's a lot; it's
certainly something that has to be considered in some
applications.
Yes. I brought it up because I wasn't sure if Grahn was
agreeing with something Fiedler said about it being just a few
more bytes. Even if it were 70% I wouldn't describe that as
a minor difference.


Brian Wood
http://www.webEbenezer.net
Rune Allnor
2009-10-18 10:07:35 UTC
Permalink
Post by Maxim Yegorushkin
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
     float f;
     stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive

std::vector<float> v;
float f;
for (size_t n = 0; n < N; ++n)
{
    file.read(reinterpret_cast<char*>(&f), sizeof f);
    v.push_back(f);
}

doesn't work as expected. Do I need to call 'seekg'
inbetween?

Rune
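
(As an aside: when the element count is known up front, the whole block
can be read in one call instead of one read() per value. A rough
sketch, assuming the file really does hold N floats in the native
representation; readFloats is just an illustrative name:)

#include <cstddef>
#include <fstream>
#include <vector>

std::vector< float > readFloats( const char* filename, std::size_t N )
{
    std::vector< float > v( N );
    std::ifstream file( filename, std::ios::binary );
    if ( N == 0 ) {
        return v;
    }
    // One bulk read straight into the vector's contiguous storage.
    file.read( reinterpret_cast< char* >( &v[0] ),
               static_cast< std::streamsize >( N * sizeof( float ) ) );
    if ( !file ) {
        // Short read: keep only the elements actually filled in.
        v.resize( static_cast< std::size_t >( file.gcount() ) / sizeof( float ) );
    }
    return v;
}
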
Alf P. Steinbach
2009-10-18 10:26:31 UTC
Permalink
Post by Rune Allnor
Post by Maxim Yegorushkin
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive
std::vector<float> v;
for (n=0;n<N;++n)
{
file.read(reinterpret_cast<char*>(&f), sizeof f);
v.push_back(v);
}
doesn't work as expected. Do I need to call 'seekg'
inbetween?
post complete code

cheers & hth

- alf
Rune Allnor
2009-10-18 10:42:41 UTC
Permalink
Post by Alf P. Steinbach
Post by Rune Allnor
Post by Maxim Yegorushkin
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
     float f;
     stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive
std::vector<float> v;
for (n=0;n<N;++n)
{
   file.read(reinterpret_cast<char*>(&f), sizeof f);
   v.push_back(v);
}
doesn't work as expected. Do I need to call 'seekg'
inbetween?
post complete code
Never mind. The project was compiled in 'release mode'
with every optimization flag I could find set to 11.
No reason to expect the source code to have anything
whatsoever to do with what actually goes on.

Once I switched back to debug mode, I was able to
track the progress.

Rune