Discussion:
Binary file IO: Converting imported sequences of chars to desired type
Rune Allnor
2009-10-17 17:39:24 UTC
Hi all.

I have used the method from this page,

http://www.cplusplus.com/reference/iostream/istream/read/

to read some binary data from a file to a char[] buffer.

The first 4 characters constitute the binary encoding of
a float type number. What is the best way to transfer
the chars to a float variable?

The naive C way would be to use memcpy. Is there a
better C++ way?
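
For concreteness, the reading part looks roughly like this
(the file name and buffer size are just placeholders):

#include <fstream>

int main()
{
    std::ifstream file( "data.bin", std::ios::binary );
    char buffer[ 256 ];
    file.read( buffer, sizeof buffer );
    // buffer[0]..buffer[3] now hold the 4 bytes of the float
    return 0;
}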

Rune
Maxim Yegorushkin
2009-10-17 17:47:01 UTC
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.

Another way is to read data directly into the aligned object:

float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
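
For completeness, a minimal sketch of the memcpy() variant, assuming
the bytes have already been read into a char buffer (the file name
here is just a placeholder):

#include <cstring>
#include <fstream>

int main()
{
    std::ifstream stream( "data.bin", std::ios::binary );
    char buffer[4];
    stream.read( buffer, sizeof buffer );

    float f;
    std::memcpy( &f, buffer, sizeof f ); // copies possibly unaligned bytes into an aligned float
    return 0;
}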
--
Max
James Kanze
2009-10-18 09:10:15 UTC
Post by Maxim Yegorushkin
Post by Rune Allnor
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, works, except in very limited cases.

To convert bytes written in a binary byte stream to any internal
format, you have to know the format in the file; if you also
know the internal format, and have only limited portability
concerns, you can generally do the conversion much faster; a
truly portable read requires use of ldexp, etc., but if you are
willing to limit your portability to machines using IEEE
(Windows and mainstream Unix, but not mainframes), and the file
format is IEEE, you can simply read the data as a 32-bit
unsigned int, then use reinterpret_cast (or memcpy).
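
Just to make that shortcut concrete, a rough sketch (the function
name is mine, and it assumes the file stores the value in big-endian
IEEE 754 order and that the host is also IEEE):

#include <cstring>
#include <stdint.h>

float readIeeeFloat( unsigned char const* buffer )
{
    uint32_t bits = uint32_t( buffer[0] ) << 24
                  | uint32_t( buffer[1] ) << 16
                  | uint32_t( buffer[2] ) <<  8
                  | uint32_t( buffer[3] );
    float f;
    std::memcpy( &f, &bits, sizeof f );  // reinterpret the IEEE bit pattern
    return f;
}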

FWIW: the fully portable solution is something like:

class ByteGetter
{
public:
    explicit ByteGetter( ixdrstream& stream )
        : mySentry( stream )
        , myStream( stream )
        , mySB( stream.rdbuf() )
        , myIsFirst( true )
    {
        if ( ! mySentry ) {
            mySB = NULL ;
        }
    }
    uint8_t get()
    {
        int result = 0 ;
        if ( mySB != NULL ) {
            result = mySB->sbumpc() ;   // read and consume one byte
            if ( result == EOF ) {
                result = 0 ;
                myStream.setstate( myIsFirst
                    ? std::ios::failbit | std::ios::eofbit
                    : std::ios::failbit | std::ios::eofbit
                        | std::ios::badbit ) ;
            }
        }
        myIsFirst = false ;
        return result ;
    }

private:
    ixdrstream::sentry  mySentry ;
    ixdrstream&         myStream ;
    std::streambuf*     mySB ;
    bool                myIsFirst ;
} ;

ixdrstream&
ixdrstream::operator>>(
    uint32_t&           dest )
{
    ByteGetter          source( *this ) ;
    uint32_t            tmp = uint32_t( source.get() ) << 24 ;
    tmp |= uint32_t( source.get() ) << 16 ;
    tmp |= uint32_t( source.get() ) <<  8 ;
    tmp |= uint32_t( source.get() ) ;
    if ( *this ) {
        dest = tmp ;
    }
    return *this ;
}

ixdrstream&
ixdrstream::operator>>(
    float&              dest )
{
    uint32_t            tmp ;
    operator>>( tmp ) ;
    if ( *this ) {
        float           f = 0.0 ;
        if ( (tmp & 0x7FFFFFFF) != 0 ) {
            f = ldexp( (tmp & 0x007FFFFF) | 0x00800000,
                       (int)((tmp & 0x7F800000) >> 23) - 126 - 24 ) ;
        }
        if ( (tmp & 0x80000000) != 0 ) {
            f = -f ;
        }
        dest = f ;
    }
    return *this ;
}

The above code still needs work to handle NaNs and Infinity
correctly, but it should give a good idea of what is necessary.

If you aren't concerned about machines which aren't IEEE, of
course, you can just memcpy the tmp after having read it in the
last function above, or use a reinterpret_cast to force the
types.

--
James Kanze
Maxim Yegorushkin
2009-10-18 11:13:13 UTC
Post by James Kanze
Post by Maxim Yegorushkin
Post by Rune Allnor
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, work, except in very limited cases.
The assumption was that the float was written by the same program or a
program with a compatible binary API. Is that the case you meant in
"except in very limited cases"?
--
Max
James Kanze
2009-10-19 09:58:01 UTC
Post by Maxim Yegorushkin
Post by James Kanze
Post by Maxim Yegorushkin
Post by Rune Allnor
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy
unaligned data into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
Neither, of course, work, except in very limited cases.
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that the
case you meant in "except in very limited cases"?
More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.

Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses depends
on the version and the options used for compiling. In practice,
it's not something you can count on except for very short lived
data: I wouldn't hesitate about using it for spilling temporary
data to disk, to be reread later by the same process. I can
imagine that it's quite acceptable as well if you have one
program collecting data during e.g. a week, and another
processing all of the data in batch over the week-end, provided
that both programs were compiled with the same compiler, using
the same options. Beyond that, I'd have my doubts (having been
bitten by the problem more than once in the past). As a general
rule, it's better to define a format, and match it. (Even if I
were using a memory dump, I'd first "define" the format, just
ensuring that the definition was compatible with the in-memory
image. That way, if worse comes to worst, at least a
maintenance programmer will know what to expect, and will have a
chance at making it work.)
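
"Defining" the format can be as little as a documented struct plus a
compile-time check that the in-memory image still matches it; the
record layout below is purely hypothetical:

#include <stdint.h>

// Documented dump format: IEEE 754, native byte order, 4-byte id,
// 4 bytes of padding, then an 8-byte double; 16 bytes per record.
struct SampleRecord
{
    uint32_t id;
    double   value;
};

// Fails to compile if padding or type sizes ever change the layout.
typedef char layout_check[ sizeof( SampleRecord ) == 16 ? 1 : -1 ];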

--
James Kanze
Jorgen Grahn
2009-10-23 08:07:40 UTC
...
Post by James Kanze
Post by Maxim Yegorushkin
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that the
case you meant in "except in very limited cases"?
More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.
Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses depends
on the version and the options used for compiling. In practice,
it's not something you can count on except for very short lived
data: I wouldn't hesitate about using it for spilling temporary
data to disk, to be reread later by the same process. I can
imagine that it's quite acceptable as well if you have one
program collecting data during e.g. a week, and another
processing all of the data in batch over the week-end, provided
that both programs were compiled with the same compiler, using
the same options. Beyond that, I'd have my doubts (having been
bit with the problem more than once in the past). As a general
rule, it's better to define a format, and match it. (Even if I
were using a memory dump, I'd first "define" the format, just
ensuring that the definition was compatible to the in memory
image. That way, if worse comes to worse, at least a
maintenance programmer will know what to expect, and will have a
chance at making it work.)
But if you have a choice, it's IMO almost always better to write the
data as text, compressing it first using something like gzip if I/O or
disk space is an issue.

(Loss of precision when printing decimal floats could be a problem in
this case though ...)

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
James Kanze
2009-10-23 09:27:06 UTC
Post by Jorgen Grahn
...
Post by James Kanze
Post by Maxim Yegorushkin
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that
the case you meant in "except in very limited cases"?
More or less. Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.
Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses
depends on the version and the options used for compiling.
In practice, it's not something you can count on except for
very short lived data: I wouldn't hesitate about using it
for spilling temporary data to disk, to be reread later by
the same process. I can imagine that it's quite acceptable
as well if you have one program collecting data during e.g.
a week, and another processing all of the data in batch over
the week-end, provided that both programs were compiled with
the same compiler, using the same options. Beyond that, I'd
have my doubts (having been bit with the problem more than
once in the past). As a general rule, it's better to define
a format, and match it. (Even if I were using a memory
dump, I'd first "define" the format, just ensuring that the
definition was compatible to the in memory image. That way,
if worse comes to worse, at least a maintenance programmer
will know what to expect, and will have a chance at making
it work.)
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer, who
can see at a glance what is being written.
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If the writer and the reader
use internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of the
reader, however, you don't really know how many digits to output
when writing.
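
A small round-trip sketch of the "enough digits" case, assuming IEEE 754
double and a library whose conversions round correctly (17 significant
digits is the usual figure):

#include <iostream>
#include <sstream>

int main()
{
    double const original = 1.0 / 3.0;
    std::ostringstream out;
    out.precision( 17 );
    out << std::scientific << original;

    double readBack;
    std::istringstream( out.str() ) >> readBack;
    std::cout << ( readBack == original ? "exact round trip" : "precision lost" )
              << ": " << out.str() << '\n';
    return 0;
}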

--
James Kanze
Jorgen Grahn
2009-10-25 13:25:43 UTC
...
Post by James Kanze
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If writing and reading to
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of the
reader, however, you don't really know how many digits to output
when writing.
Good point; I didn't think of that aspect (i.e. not give a false
impression of precision when the input is e.g. 3.14 and you output
it as 3.14000000).

I was more thinking about reading "0.20000000000000000" but printing
0.20000000000000001. But now that I think of it, it's a loss of
precision in the input; there is no way to avoid it and still use
float/double internally.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
James Kanze
2009-10-25 17:13:55 UTC
Post by Jorgen Grahn
...
Post by James Kanze
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If writing and reading to
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of
the reader, however, you don't really know how many digits
to output when writing.
Good point; I didn't think of that aspect (i.e. not give a
false impression of precision when the input is e.g. 3.14 and
you output it as 3.14000000).
I'm not sure what you're referring to here. We're talking about
the format used for transmitting data from one machine to
another. Given enough digits and the same basic format, it's
always possible to make a round trip, writing, then reading, and
getting the exact value back (even if the value output isn't the
exact value).
Post by Jorgen Grahn
I was more thinking about reading "0.20000000000000000" but
printing 0.20000000000000001.
For data communications, the problem occurs in the opposite
sense. Except that with enough digits (17 for IEEE double, I
think), it won't occur.
Post by Jorgen Grahn
But now that I think of it, it's a loss of precision in the
input; there is no way to avoid it and still use float/double
internally.
But for this application, if you know how many digits are needed
to ensure correct reading, the loss of precision when reading
will exactly offset the error when writing.

The problem only comes up when you don't know the number of
digits in the reader's format. This is particularly an issue
with double, since the second most widely used format (IBM
mainframe double) has more digits of precision than IEEE double,
and 17 digits probably won't be enough; you'll get something
very close, but it might not be the closest possible
representation. Which in this case would be exactly the
starting value---I think that IBM mainframe double precision can
represent all IEEE double values in range exactly. (Warning:
this is all very much off the top of my head. I've not done any
real analysis to verify the actual case of IBM floating point
versus IEEE. The problem can definitely occur, however, and it
wouldn't be difficult to imagine a 128 bit double format where
it did.)

--
James Kanze
Jorgen Grahn
2009-10-26 16:37:41 UTC
Post by James Kanze
Post by Jorgen Grahn
...
Post by James Kanze
Post by Jorgen Grahn
(Loss of precision when printing decimal floats could be a
problem in this case though ...)
It's a hard problem in general. If writing and reading to
internal formats with the same precision, it's sufficient to
output enough digits. If you don't know the precision of
the reader, however, you don't really know how many digits
to output when writing.
Good point; I didn't think of that aspect (i.e. not give a
false impression of precision when the input is e.g. 3.14 and
you output it as 3.14000000).
I'm not sure what you're referring to here. We're talking about
the format used for transmitting data from one machine to
another. [...]
I guess I am demonstrating why I try to stay away from
floating-point ;-) It is a tricky area.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Rune Allnor
2009-10-25 14:13:49 UTC
Post by Jorgen Grahn
...
Post by Maxim Yegorushkin
The assumption was that the float was written by the same
program or a program with a compatible binary API. Is that
the case you meant in "except in very limited cases"?
More or less.  Formally, there's no guarantee that the
compatible binary API works, but in practice, it almost
certainly will.
Note, however, that most systems today support several
incompatible binary API's; which one the compiler uses
depends on the version and the options used for compiling.
In practice, it's not something you can count on except for
very short lived data: I wouldn't hesitate about using it
for spilling temporary data to disk, to be reread later by
the same process.  I can imagine that it's quite acceptable
as well if you have one program collecting data during e.g.
a week, and another processing all of the data in batch over
the week-end, provided that both programs were compiled with
the same compiler, using the same options.  Beyond that, I'd
have my doubts (having been bit with the problem more than
once in the past).  As a general rule, it's better to define
a format, and match it.  (Even if I were using a memory
dump, I'd first "define" the format, just ensuring that the
definition was compatible to the in memory image.  That way,
if worse comes to worse, at least a maintenance programmer
will know what to expect, and will have a chance at making
it work.)
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer, who
can see at a glance what is being written.
The user might have opinions, though.

File I/O operations with text-formatted floating-point data
take time. A *lot* of time. The rule-of-thumb is 30-60 seconds
per 100 MBytes of text-formatted FP numeric data, compared to
fractions of a second for the same data (natively) binary encoded
(just try it).

In heavy-duty data processing applications one just can not afford
to spend more time than absolutely necessary. Text-formatted data
is not an option.

If there are problems with binary floating point I/O formats, then
that's a question for the C++ standards committee. It ought to be
a simple technical (as opposed to political) matter to specify that
binary FP I/O could be set to comply to some already defined
standard,
like e.g. IEEE 754.

The matter isn't fundamentally different from setting locales and
character encodings with text files.

Rune
James Kanze
2009-10-25 17:47:28 UTC
[...]
Post by Rune Allnor
Post by James Kanze
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what? My experience has always been
that the disk IO is the limiting factor (but my data sets have
generally been very mixed, with a lot of non floating point data
as well). And binary formatting can be more or less expensive
as well---I'd rather deal with text than a BER encoded double.
And Jorgen said very explicitly "if you have a choice".
Sometimes you don't have the choice: you have to conform to an
already defined external format, or the profiler says you don't
have the choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).
Try it on what machine:-). Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentioned).

Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format, and on a
slow medium, that can make a real difference. (In one
application, where we had to transmit tens of kilobytes over a
50 Baud link---and there's no typo there, it was 50 bits, or
about 6 bytes, per second---we didn't even consider using text.
Even though there wasn't any floating point involved.)
Post by Rune Allnor
In heavy-duty data processing applications one just can not
afford to spend more time than absolutely necessary.
Text-formatted data is not an option.
I'm working in such an application at the moment, and our
external format(s) are all text. And the conversions of the
individual values has never been a problem. (One of the formats
is XML. And our disks and network are fast enough that even
that hasn't been a problem.)
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.

--
James Kanze
Rune Allnor
2009-10-25 18:39:45 UTC
    [...]
Post by Rune Allnor
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
 My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the
only one. In this case it's not even the dominant one.
See the example below.
(but my data sets have
generally been very mixed, with a lot of non floating point data
as well).  And binary formatting can be more or less expensive
as well---I'd rather deal with text than a BER encoded double.
And Jorgen said very explicitly "if you have a choice".
Sometimes you don't have the choice: you have to conform to an
already defined external format, or the profiler says you don't
have the choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
 Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on the number of significant
digits and other formatting text glyphs. File sizes alone don't
account for the 50-100x time difference.

Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])

t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])

t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])

t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.

The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc. The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes. The first few lines
in the text file look like

-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001

(one leading whitespace, one negative sign or whitespace,
no trailing spaces) which is not excessive, neither with
respect to the number of significant digits nor the number
of other characters.

The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
and on a
slow medium, that can make a real difference.  (In one
application, where we had to transmit tens of kilobytes over a
50 Baud link---and there's no typo there, it was 50 bits, or
about 6 bytes, per second---we didn't even consider using text.
Even though there wasn't any floating point involved.)
Post by Rune Allnor
In heavy-duty data processing applications one just can not
afford to spend more time than absolutely necessary.
Text-formatted data is not an option.
I'm working in such an application at the moment, and our
external format(s) are all text.  And the conversions of the
individual values has never been a problem.  (One of the formats
is XML.  And our disks and network are fast enough that even
that hasn't been a problem.)
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order of

1e10/1.75e8*42s = 2400s = 40 minutes.

There is no point in even considering using a text format
for these kinds of things.
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many 16-bit
encodings. No one says which one should be used. Only which
ones should be available.

Rune
James Kanze
2009-10-26 17:06:56 UTC
Post by Rune Allnor
Post by James Kanze
[...]
Post by Rune Allnor
Post by James Kanze
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed. Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
Post by Rune Allnor
Post by James Kanze
My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.
And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
Post by Rune Allnor
See the example below.
Which will only be for one compiler, on one particular CPU, with
one set of compiler options.

(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering. The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
Post by Rune Allnor
Post by James Kanze
(but my data sets have generally been very mixed, with a lot
of non floating point data as well). And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double. And Jorgen said very
explicitly "if you have a choice". Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures. "Any machine" doesn't make
sense in such cases: I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's. (I've worked on 8 bit machines
which took on average 10 µs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)

The compiler and the library implementation also make a
significant difference. I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15. Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++. The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)

In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB). For
10000000, it was around 45 seconds under Windows (file size 250
MB).

It's interesting to note that the Windows version is clearly IO
dominated. The difference in speed between text and binary is
pretty much the same as the difference in file size.
Post by Rune Allnor
Post by James Kanze
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
time 50-100x difference.
There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlab works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
Post by Rune Allnor
The first few lines in the text file look like
-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits. You
won't read back in what you've written.
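
A tiny illustration of that point, writing only 8 significant digits
as in the file above (1.0/3.0 stands in for one of the random values):

#include <iostream>
#include <sstream>

int main()
{
    double const original = 1.0 / 3.0;
    std::ostringstream out;
    out.precision( 7 );                  // about 8 significant digits, as in the file
    out << std::scientific << original;  // "3.3333333e-01"

    double readBack;
    std::istringstream( out.str() ) >> readBack;
    std::cout << ( readBack == original ? "round trip OK" : "value changed" ) << '\n';
    return 0;
}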
Post by Rune Allnor
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't. They're actually very different in two
separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time). The time spent writing the
results, even in XML, is only a small part of the total runtime.
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?

(The big difference is, of course, that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Whereas a raw dump of double doesn't work even between a PC and
a Sparc. Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip. Upgrade your machine, and you lose
your data.)

--
James Kanze
Rune Allnor
2009-10-26 17:55:14 UTC
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
OK. Text-based IO compared to binary IO.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
(but my data sets have generally been very mixed, with a lot
of non floating point data as well).  And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double.  And Jorgen said very
explicitly "if you have a choice".  Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures.
Yep. But as a rule of thumb. My point is not to be accurate
(you have made a very convincing case why that would be
difficult), but to point out what performance costs and
trade-offs are involved when using text-based file formats.
Post by James Kanze
In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB).  For
10000000, it was around 45 seconds under Windows (file size 250
MB).
I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune
what you have better than most users. Or both.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
time 50-100x difference.
There is no 50-100x difference.  There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase);
Again, your assets might not be representative of the
average user.
Post by James Kanze
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlib works.  It might be using unbuffered output
for text, or synchronizing at each double.  And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches.  And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
I'll rephrase: Eliminates *variability* due to file caches.
Whatever happens affects both files in equal amounts. It would
bias results if one file was cached and the other not.
Post by James Kanze
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.
Post by James Kanze
Post by Rune Allnor
The first few lines in the text file look like
 -4.3256481e-001
 -1.6655844e+000
  1.2533231e-001
  2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits.  You
won't read back in what you've written.
I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.
Post by James Kanze
Post by Rune Allnor
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't.  They're actually very different in two
separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling.  My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time).  The time spent writing the
results, even in XML, is only a small part of the total runtime.
The read? The application I am talking about would require
a fair bit of number crunching. If I could process 1 hour's worth
of measurements in 20 minutes, I'd rather cash in the remaining
40 minutes as early results, rather than spend them waiting
for disk IO to complete.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that.  It only gives a
minimum list of characters which must be supported.  But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
Yep. Some formats, like IEEE 754 (and maybe its descendants),
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data in that format.
Post by James Kanze
(The big difference is, of course, is that while the standard
doesn't specify any encoding, there are a number of different
encodings which are supported on a lot of different machines.
Where as a raw dump of double doesn't work even between a PC and
a Sparc.  Or between an older Mac, with a Power PC, and a newer
one, with an Intel chip.  Upgrade your machine, and you loose
your data.)
Exactly. Which is why there ought to be a standardized
binary floating point format that is portable between
platforms.

Rune
James Kanze
2009-10-28 12:40:12 UTC
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
(but my data sets have generally been very mixed, with a lot
of non floating point data as well). And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double. And Jorgen said very
explicitly "if you have a choice". Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures.
Yep. But as rule-of-thumb. My point is not to be accurate (you
have made a very convincing case why that would be difficult),
but to point out what performance costs and trade-offs are
involved when using text-based file fomats.
The problem is that there is no real rule-of-thumb possible.
Machines (and compilers) differ too much today.
Post by James Kanze
In terms of concrete numbers, of course... Using time gave
me values too small to be significant for 10000000 doubles
on the Linux machine (top of the line AMD processor of less
than a year ago); for 100000000 doubles, it was around 85
seconds for text (written in scientific format, with 17
digits precision, each value followed by a new line, total
file size 2.4 GB). For 10000000, it was around 45 seconds
under Windows (file size 250 MB).
I suspect you might either have access to a bit more funky
hardware than most users, or have the skills to fine tune what
you have better than most users. Or both.
The code was written very quickly, with no tricks or anything.
It was tested on off the shelf PC's---one admittedly older than
those most people are using, the other fairly recent. The
compilers in question were the version of g++ installed with
Suse Linux, and the free download version of VC++. I don't
think that there's anything in there that can be considered
"funky" (except maybe that most people professionally concerned
with high input have professional class machines to do it, which
are out of my price range), and I certainly didn't tune
anything.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously. By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
time 50-100x difference.
There is no 50-100x difference. There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase);
Again, your assets might not be representative for the
average users.
Well, I'm not sure there's such a thing as an average user. But
my machines are very off the shelf, and I'd consider VC++ and
g++ very "average" as well, in the sense that they're what an
average user is most likely to see.
Post by James Kanze
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlib works. It might be using unbuffered output
for text, or synchronizing at each double. And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches. And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
I'll rephrase: Eliminates *variability* due to file caches.
By choosing the best case, which rarely exists in practice.
Whatever happens affect both files in equal amounts. It would
bias results if one file was cached and the other not.
What is cached depends on what the OS can fit in memory. In
other words, the first file you wrote was far more likely to be
cached than the second.
Post by James Kanze
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000
doubles, on a PC, should be exactly 80 MB.
It was. The file browser I used reported the file size
in KBytes. Multiply the number by 1024 and you get
exactly 80 Mbytes.
Strictly speaking, a KB is exactly 1000 bytes, not 1024:-). But
I know, different programs treat this differently.
Post by James Kanze
Post by Rune Allnor
The first few lines in the text file look like
-4.3256481e-001
-1.6655844e+000
1.2533231e-001
2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits.
You won't read back in what you've written.
I know. If that was a constraint, file sizes and read/write
times would increase correspondingly.
It was a constraint. Explicitly. At least in this thread, but
more generally: about the only time it won't be a constraint is
when the files are for human consumption, in which case, I think
you'd agree, binary isn't acceptable.
Post by James Kanze
Post by Rune Allnor
The timing numbers (both absolute and relative) would be
of similar orders of magnitude if you repeated the test
with C++.
I did, and they aren't. They're actually very different in
two separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch
through some 10 GBytes of numerical data per hour. Just
reading that amount of data from a text format would
require on the order of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format
for these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling. My applications do
significant calculations (which I'll admit I don't
understand, but they do take a lot of CPU time). The time
spent writing the results, even in XML, is only a small part
of the total runtime.
The read?
I don't know. It's by some other applications, in other
departments, and I have no idea what they do with the data.

You're probably right, however, that to be accurate, I should do
some comparisons including reading. For various reasons (having
to deal with possible errors, etc.), the CPU overhead when
reading is typically higher than when writing.

But I'm really only disputing your order of magnitude
differences, because they don't correspond with my experience
(nor my measurements). There's definitely more overhead with
text format. The only question is whether that overhead is more
expensive than the cost of the alternatives, and that depends
on what you're doing. Obviously, if you can't afford the
overhead (and I've worked on applications which couldn't), then
you use binary, but my experience is that a lot of people jump
to binary far too soon, because the overhead isn't that critical
that often.
Post by James Kanze
Post by Rune Allnor
Post by James Kanze
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms? (Most mainframes still do not use IEEE. Most don't
even use binary: IBM's are base 16, and Unisys's base 8.) And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that. It only gives a
minimum list of characters which must be supported. But I'm
not sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
Yep. Some formats. like IEEE 754 (and maybe descendants)
are fairly universal. No matter what the native formats
look like, it ought to suffice to call a standard method
to dump binary data on the format.
To date, neither C nor C++ has made the slightest gesture in the
direction of standardizing any binary formats. There are other
(conflicting) standards which do: XDR, for example, or BER. I
personally think that adding a second set of streams, supporting
XDR, to the standard, would be a good thing, but I've never had
the time to actually write up such a proposal. And a general
binary format is quite complex to specify; it's one thing to say
you want to output a table of double, but to be standardized,
you also have to define what is output when a large mix of types
are streamed, and how much information is necessary about the
initial data in order to read them.
Post by James Kanze
(The big difference is, of course, is that while the
standard doesn't specify any encoding, there are a number of
different encodings which are supported on a lot of
different machines. Where as a raw dump of double doesn't
work even between a PC and a Sparc. Or between an older
Mac, with a Power PC, and a newer one, with an Intel chip.
Upgrade your machine, and you loose your data.)
Exactly. Which is why there ought to be a standardized binary
floating point format that is portable between platforms.
There are several: I've used both XDR and BER in applications in
the past. One of the reasons C++ doesn't address this issue is
that there are several, and C++ doesn't want to choose one over
the others.

--
James Kanze
Rune Allnor
2009-10-28 14:55:48 UTC
Post by James Kanze
The code was written very quickly, with no tricks or anything.
Just out of curiosity - would it be possible to see your code?
As far as I can tell, you haven't posted it (If you have, I have
missed it).

Rune
James Kanze
2009-10-29 10:00:05 UTC
Post by Rune Allnor
Post by James Kanze
The code was written very quickly, with no tricks or anything.
Just out of curiosity - would it be possible to see your code?
As far as I can tell, you haven't posted it (If you have, I
have missed it).
I haven't posted it because it's on my machine at home (in
France), and I'm currently working in London, and don't have
immediate access to it. Redoing it here (from memory):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

class FileOutput
{
protected:
    std::string     my_type;
    std::ofstream   my_file;
    time_t          my_start;
    time_t          my_end;

public:
    FileOutput( std::string const& type, bool is_binary = true )
        : my_type( type )
        , my_file( ("test_" + type + ".dat").c_str(),
                   is_binary ? std::ios::out | std::ios::binary
                             : std::ios::out )
    {
        my_start = time( NULL );
    }
    ~FileOutput()
    {
        my_end = time( NULL ) ;
        my_file.close();
        std::cout << my_type << ": "
                  << (my_end - my_start) << " sec." << std::endl;
    }

    virtual void output( double d ) = 0;
};

class RawOutput : public FileOutput
{
public:
    RawOutput() : FileOutput( "raw" ) {}
    virtual void output( double d )
    {
        my_file.write( reinterpret_cast< char* >(&d), sizeof(d) );
    }
};

class CookedOutput : public FileOutput
{
public:
    CookedOutput() : FileOutput( "cooked" ) {}
    virtual void output( double d )
    {
        unsigned long long const& tmp
            = reinterpret_cast< unsigned long long const& >(d);
        int shift = 64 ;
        while ( shift > 0 ) {
            shift -= 8 ;
            my_file.put( (tmp >> shift) & 0xFF );
        }
    }
};

class TextOutput : public FileOutput
{
public:
    TextOutput() : FileOutput( "text", false )
    {
        my_file.setf( std::ios::scientific,
                      std::ios::floatfield );
        my_file.precision( 17 );
    }
    virtual void output( double d )
    {
        my_file << d << '\n';
    }
};

template< typename File >
void
test( std::vector< double > const& values )
{
    File dest;
    for ( std::vector< double >::const_iterator iter = values.begin();
          iter != values.end();
          ++ iter ) {
        dest.output( *iter );
    }
}

int
main()
{
    size_t const size = 10000000;
    std::vector< double > v;
    while ( v.size() != size ) {
        v.push_back( (double)( rand() ) / (double)( RAND_MAX ) );
    }
    test< TextOutput >( v );
    test< CookedOutput >( v );
    test< RawOutput >( v );
    return 0;
}

Compiled with "cl /EHs /O2 timefmt.cc". On my local disk here,
I get:
text: 90 sec.
cooked: 31 sec.
raw: 9 sec.
The last is, of course, not significant, except that it is very
small. (I can't run it on the networked disk, where any real
data would normally go, because it would use too much network
bandwidth, possibly interfering with others. Suffice it to say
that the networked disk is about 5 or more times slower, so the
relative differences would be reduced by that amount.) I'm not
sure what's different in the code above (or the environment---I
suspect that the disk bandwidth is higher here, since I'm on a
professional PC, and not a "home computer") compared to my tests
at home (under Windows); at home, there was absolutely no
difference in the times for raw and cooked. (Cooked is, of
course, XDR format, at least on a machine like the PC, which
uses IEEE floating point.)

--
James Kanze
Rune Allnor
2009-10-29 14:02:17 UTC
Permalink
On 29 Okt, 11:00, James Kanze <***@gmail.com> wrote:
...
Compiled with "cl /EHs /O2 timefmt.cc".  On my local disk here,
    text: 90 sec.
    cooked: 31 sec.
    raw: 9 sec.
The last is, of course, not significant, except that it is very
small.  (I can't run it on the networked disk, where any real
data would normally go, because it would use too much network
bandwidth, possibly interfering with others.  Suffice it to say
that the networked disk is about 5 or more times slower, so the
relative differences would be reduced by that amount.)  I'm not
sure what's different in the code above (or the environment---I
suspect that the disk bandwidth is higher here, since I'm on a
professional PC, and not a "home computer") compared to my tests
at home (under Windows); at home, there was absolutely no
difference in the times for raw and cooked.  (Cooked is, of
course, XDR format, at least on a machine like the PC, which
uses IEEE floating point.)
Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?

If so, the raw/cooked binary formats are a bit confusing.
According to this page,

http://publib.boulder.ibm.com/infocenter/systems//index.jsp?topic=/com.ibm.aix.progcomm/doc/progcomc/xdr_datatypes.htm

the XDR data type format uses "the IEEE standard" (I can find no
mention of exactly *which* IEEE standard...) to encode both single-
precision and double-precision floating point numbers.

IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754 format
essentially is a No-Op.

Even if XDR uses some other format than IEEE754, your numbers
show one significant effect:

1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.

Since you run the test with the same amounts of data on the
same local disk with the same delay factors, this factor of ~4
in time spent on handling XDR data must be explained by
something other than mere disk IO.

The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.

The same happens - but on a larger scale - when dealing with
text-based formats:

1) Verify that the next sequence of characters represent a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number

True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-thread system where
one thread reads from disk and another thread converts the
formats while one waits for the next batch of data to arrive
from the disk, one has to do all of this sequentially in
addition to waiting for disk IO.
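
To make the comparison concrete, here is a minimal sketch of the two
decode paths (not the code under test, just an illustration; the
function names are placeholders, and I assume an in-memory buffer and
IEEE 754 doubles -- strtod() does steps 1-6 above in a single call):

#include <cstdlib>   // std::strtod
#include <cstring>   // std::memcpy

// Text: parse one number from a NUL-terminated buffer; *end is set
// to the first character after the parsed number.
double decode_text( char const* p, char** end )
{
    return std::strtod( p, end );
}

// Native binary: the bytes in the buffer already are the number.
double decode_raw( char const* p )
{
    double d;
    std::memcpy( &d, p, sizeof d );
    return d;
}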

Nah, I still think that any additional non-trivial handling
of data will impact overall IO times. In single-thread
environments, at least.

Rune
James Kanze
2009-10-29 16:36:43 UTC
Permalink
Post by Rune Allnor
...
Post by James Kanze
Compiled with "cl /EHs /O2 timefmt.cc". On my local disk here,
text: 90 sec.
cooked: 31 sec.
raw: 9 sec.
The last is, of course, not significant, except that it is
very small. (I can't run it on the networked disk, where
any real data would normally go, because it would use too
much network bandwidth, possibly interfering with others.
Suffice it to say that the networked disk is about 5 or more
times slower, so the relative differences would be reduced
by that amount.) I'm not sure what's different in the code
above (or the environment---I suspect that the disk
bandwidth is higher here, since I'm on a professional PC,
and not a "home computer") compared to my tests at home
(under Windows); at home, there was absolutely no difference
in the times for raw and cooked. (Cooked is, of course, XDR
format, at least on a machine like the PC, which uses IEEE
floating point.)
Hmm.... so everything was done on your local disc? Which means
one would expect that disk I/O delays are proportional to file
sizes?
More or less. There are also caching effects, which I've not
tried to mask or control, which means that the results should be
taken with a grain of salt. More generally, there are a lot of
variables involved, and I've not made any attempts to control
any of them, which probably explains the differences I'm seeing
from one machine to the next.
Post by Rune Allnor
If so, the raw/cooked binary formats are a bit confusing.
According to this page,
http://publib.boulder.ibm.com/infocenter/systems//index.jsp?topic=/co...
the XDR data type format uses "the IEEE standard" (I can find
no mention of exactly *which* IEEE standard...) to encode both
single- precision and double-precision floating point numbers.
IF "the IEEE standard" happens to mean "IEEE 754" there is a
chance that an optimizing compiler might deduce that re-coding
numbers on IEEE 754 format to another number on IEEE 754
format essentially is a No-Op.
I'm not sure what you're referring to. My "cooked" format is a
simplified, non-portable implementation of XDR---non portable
because it only works on machines which have 64-bit long longs and
use IEEE floating point.
Post by Rune Allnor
Even if XDR uses some other format than IEEE754, your numbers
1) Double-precision XDR is of the same size as double-precision
IEEE 754 (64 bits / number).
2) Handling XDR takes significantly longer than handling native
binary formats.
Again, that depends on the machine. On my tests at home, it
didn't. I've not had the occasion to determine where the
difference lies.
Post by Rune Allnor
Since you run the test with the same amopunts of data on the
same local disk with the same delay factors,
I don't know whether the delay factor is the same. A lot
depends on how the system caches disk accesses. A more
significant test would use synchronized writing, but
synchronized at what point?
Post by Rune Allnor
this factor ~4 of longer time spent on handling XDR data must
be explained by something else than mere disk IO.
*IF* there is no optimization, *AND* disk accesses cost nothing,
then a factor of about 4 sounds about right.
Post by Rune Allnor
The obvious suspect is the extra manipulations and recoding of
XDR data. Where native-format binary IO only needs to perform
a memcpy from the file buffer to the destination, the XDR data
first needs to be decoded to an intermediate format, and then
re-encoded to the native binary format before the result can
be piped on to the destination.
The same happens - but on a larger scale - when dealing with
1) Verify that the next sequence of characters represent a
valid number format
2) Decide how many glyphs need to be considered for decoding
3) Decode text characters to digits
4) Scale according to digit placement in number
5) Repeat for exponent
6) Do the math to compute the number
That's input, not output. Input is significantly harder for
text, since it has to be able to detect errors. For XDR, the
difference between input and output probably isn't significant,
since the only error that you can really detect is an end of
file in the middle of a value.
Post by Rune Allnor
True, this takes insignificant amounts of time when compared
to disk IO, but unless you use a multi-thread system where one
thread reads from disk and another thread converts the formats
while one waits for the next batch of data to arrive from the
disk, one have to do all of this sequentially in addition to
waiting for disk IO.
Nah, I still think that any additional non-trivial handling of
data will impact IO times of data. In single-thread
environments.
You can always use asynchronous IO:-). And what if your
implementation of filebuf uses memory mapped files?

The issues are extremely complex, and can't easily be
summarized. About the most you can say is that using text I/O
won't increase the time more than about a factor of 10, and may
increase it significantly less. (I wish I could run the tests
on the drives we usually use---I suspect that the difference
between text and binary would be close to negligible, because of
the significantly lower data transfer rates.)

--
James Kanze
Brian
2009-10-26 21:50:57 UTC
Permalink
Post by James Kanze
Post by Rune Allnor
    [...]
Post by Rune Allnor
Post by Jorgen Grahn
But if you have a choice, it's IMO almost always better to
write the data as text, compressing it first using something
like gzip if I/O or disk space is an issue.
Totally agreed.  Especially for the maintenance programmer,
who can see at a glance what is being written.
The user might have opinions, though.
File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
A lot of time compared to what?
Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.
The only comparison that is relevant is compared to some other
way of doing it.
Post by Rune Allnor
 My experience has always been
that the disk IO is the limiting factor
Disk IO is certainly *a* limiting factor. But not the only
one. In this case it's not even the dominant one.
And that obviously depends on the CPU speed and the disk speed.
Text formatting does take some additional CPU time; if the disk
is slow and the CPU fast, this will be less important than if
the disk is fast and the CPU slow.
Post by Rune Allnor
See the example below.
Which will only be for one compiler, on one particular CPU, with
one set of compiler options.
(Note that it's very, very difficult to measure these things
accurately, because of things like disk buffering.  The order
you run the tests can make a big difference: under Windows, at
least, the first test run always runs considerably faster than
if it is run in some other position, for example.)
Post by Rune Allnor
(but my data sets have generally been very mixed, with a lot
of non floating point data as well).  And binary formatting
can be more or less expensive as well---I'd rather deal with
text than a BER encoded double.  And Jorgen said very
explicitly "if you have a choice".  Sometimes you don't have
the choice: you have to conform to an already defined
external format, or the profiler says you don't have the
choice.
Post by Rune Allnor
The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just
try it).
Try it on what machine:-).
Any machine. The problem is to decode text-formatted numbers
to binary.
You're giving concrete figures.  "Any machine" doesn't make
sense in such cases:  I've seen factors of more than 10 in terms
of disk speed between different hard drives (and if the drive is
remote mounted, over a slow network, the difference can be even
more), and in my time, I've seen at least six or seven orders of
magnitude in speed between CPU's.  (I've worked on 8 bit machines
which took on average 10 µs per machine instruction, with no
hardware multiply and divide, much less floating point
instructions.)
The compiler and the library implementation also make a
significant difference.  I knocked up a quick test (which isn't
very accurate, because it makes no attempt to take into account
disk caching and such), and tried it on the two machines I have
handy: a very old (2002) laptop under Windows, using VC++, and a
very recent, high performance desktop under Linux, using g++.
Under Windows, the difference between text and binary was a
factor of about 3; under Linux, about 15.  Apparently, the
conversion routines in the Microsoft compiler are a lot, lot
better than those in g++.  The difference would be larger if I
had a higher speed disk or data bus; it would be significantly
smaller (close to zero, probably) if I synchronized each write.
(A synchronized disk write is about 10 ms, at least on a top of
the line Sun Sparc.)
In terms of concrete numbers, of course... Using time gave me
values too small to be significant for 10000000 doubles on the
Linux machine (top of the line AMD processor of less than a year
ago); for 100000000 doubles, it was around 85 seconds for text
(written in scientific format, with 17 digits precision, each
value followed by a new line, total file size 2.4 GB).  For
10000000, it was around 45 seconds under Windows (file size 250
MB).
It's interesting to note that the Windows version is clearly IO
dominated.  The difference in speed between text and binary is
pretty much the same as the difference in file size.
Post by Rune Allnor
Obviously, the formatting/parsing
speed will depend on the CPU speed, which varies enormously.  By
a factor of much more than 2 (which is what you've mentionned).
Again, I've no recent measurements, so I can't be sure, but I
suspect that the real difference in speed will come from the
fact that you're writing more bytes with a text format,
This is a factor. Binary files are usually about 20%-70% of the
size of the text file, depending on numbers of significant digits
and other formatting text glyphs. File sizes don't account for the
50-100x difference in time.
There is no 50-100x difference.  There's at most a difference of
15x, on the machines I've tested; the difference would probably
be less if I somehow inhibited the effects of disk caching
(because the disk access times would increase); I won't bother
trying it with synchronized writes, however, because that would
go to the opposite extreme, and you'd probably never use
synchronized writes for each double: when they're needed, it's
for each record.
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
I'm afraid it doesn't demonstrate anything to me, because I have
no idea how Matlib works.  It might be using unbuffered output
for text, or synchronizing at each double.  And in what format?
Post by Rune Allnor
The script first generates ten million random numbers,
and writes them to file on both ASCII and binary double
precision floating point formats. The files are then read
straight back in, hopefully eliminating effects of file
caches etc.
Actually, reading immediately after writing maximizes the
effects of file caches.  And on a modern machine, with say 4GB
main memory, a small file like this will be fully cached.
Post by Rune Allnor
The ASCII file in this test is 175 MBytes, while
the binary file is about 78 MBytes.
If you're dumping raw data, a binary file with 10000000 doubles,
on a PC, should be exactly 80 MB.
Post by Rune Allnor
The first few lines in the text file look like
 -4.3256481e-001
 -1.6655844e+000
  1.2533231e-001
  2.8767642e-001
(one leading whitespace, one negative sign or whitespace, no
trailing spaces) which is not excessive, neither with respect
to the number of significant digits, or the number of other
characters.
It's not sufficient with regards to the number of digits.  You
won't read back in what you've written.
Post by Rune Allnor
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
I did, and they aren't.  They're actually very different in two
separate C++ environments.
Post by Rune Allnor
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. Just reading that
amount of data from a text format would require on the order
of
1e10/1.75e8*42s = 2400s = 40 minutes.
There is no point in even considering using a text format for
these kinds of things.
But it must not be doing much processing on the data, just
copying it and maybe a little scaling.  My applications do
significant calculations (which I'll admit I don't understand,
but they do take a lot of CPU time).  The time spent writing the
results, even in XML, is only a small part of the total runtime.
Post by Rune Allnor
Post by Rune Allnor
If there are problems with binary floating point I/O formats,
then that's a question for the C++ standards committee. It
ought to be a simple technical (as opposed to political)
matter to specify that binary FP I/O could be set to comply to
some already defined standard, like e.g. IEEE 754.
So that the language couldn't be used on some important
platforms?  (Most mainframes still do not use IEEE.  Most don't
even use binary: IBM's are base 16, and Unisys's base 8.)  And
of course, not all IEEE is "binary compatible" either: a file
dumped from the Sparcs I've done most of my work on won't be
readable on the PC's I currently work on.
I can't see how the problem is different from text encoding.
The 7-bit ANSI character set is the baseline. A number of
8-bit ASCII encodings are in use, and who knows how many
16-bit encodings. No one says which one should be used. Only
which ones should be available.
The current standard doesn't even say that.  It only gives a
minimum list of characters which must be supported.  But I'm not
sure what your argument is: you're saying that we should
standardize some binary format more than the text format?
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail. With the
portability edge taken away from text, there won't be
much reason to use text.


Brian Wood
http://webEbenezer.net

"All things (e.g. A camel's journey through
A needle's eye) are possible it's true.
But picture how the camel feels, squeezed out
In one long bloody thread from tail to snout."

C. S. Lewis
James Kanze
2009-10-28 12:42:21 UTC
Permalink
Post by Brian
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.
Which binary format? There are quite a few to choose from.
Post by Brian
With the portability edge taken away from text, there won't be
much reason to use text.
The main reason to use text is that it's an order of magnitude
easier to debug. And that's not likely to change.

--
James Kanze
mzdude
2009-10-28 16:53:21 UTC
Permalink
Post by James Kanze
The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
Is that text 8 bit ASCII, 16 bit, wchar_t, MBCS, UNICODE ... :^)
Mick
2009-10-28 18:14:58 UTC
Permalink
Post by mzdude
Post by James Kanze
The main reason to use text is that it's an order of magnitude
easier to debug. And that's not likely to change.
Is that text 8 bit ASCII, 16 bit, wchart_t, MBCS, UNICODE ... :^)
Quill & Parchment.
--
 ------------
< I'm Karmic >
 ------------
        \
         \
          ___
      {~._.~}
       ( Y )
      ()~*~()
      (_)-(_)
Brian
2009-10-28 20:38:47 UTC
Permalink
Post by Brian
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.
Which binary format?  There are quite a few to choose from.
I'm only aware of a few of them. I don't know if
it matters much to me which one is selected. It's
more that there's a standard.
Post by Brian
With the portability edge taken away from text, there won't be
much reason to use text.
The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
I was thinking that having a standard for binary would
help with debugging. I guess it is a tradeoff between
development costs and bandwidth costs.


Brian Wood
http://webEbenezer.net
Brian
2009-10-28 21:19:10 UTC
Permalink
Post by Brian
Post by James Kanze
Post by Brian
With the portability edge taken away from text, there won't be
much reason to use text.
The main reason to use text is that it's an order of magnitude
easier to debug.  And that's not likely to change.
I was thinking that having a standard for binary would
help with debugging.  I guess it is a tradeoff between
development costs and bandwidth costs.
Does this perspective seem accurate? Assuming the order
of magnitude is correct, the question becomes something
like this: language A takes 10 times longer to learn than
language B, but once you learn A you can communicate in
1/3 the time it takes those using B. So those who learn
how to use A have an advantage over those who don't.


Brian Wood
James Kanze
2009-10-29 10:03:38 UTC
Permalink
Post by Brian
Post by James Kanze
Post by Brian
I haven't invested in text or XML marshalling because
I think binary formats are going to prevail.
Which binary format? There are quite a few to choose from.
I'm only aware of a few of them. I don't know if
it matters much to me which one is selected. It's
more that there's a standard.
Post by James Kanze
Post by Brian
With the portability edge taken away from text, there
won't be much reason to use text.
The main reason to use text is that it's an order of
magnitude easier to debug. And that's not likely to change.
I was thinking that having a standard for binary would help
with debugging.
It might. It would certainly encourage tools for reading it.
On the other hand: we already have a couple of standards for
binary, and I haven't seen that many tools. Part of the reason
might be because one of the most common standards, XDR, is
basically untyped, so the tools wouldn't really know how to read
it anyway. (There are tools which display certain specific uses
of XDR in human readable format, e.g. tcpdump.)

--
James Kanze
Gerhard Fiedler
2009-10-28 21:23:49 UTC
Permalink
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to demonstrate
[... Matlab code]
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.
In Matlab. This doesn't say much if anything about any other program.
Possibly Matlab has a lousy (in terms of speed) text IO.

Re the precision issue: When writing out text, there isn't really a need
to go decimal, too. Hex or octal numbers are also text. Speeds up the
conversion (probably not by much, but still) and provides a way to write
out the exact value that is in memory (and recreate that exact value --
no matter the involved precisions).
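
One concrete way of doing that is C99's hex-float conversion, assuming
the C library supports it (a sketch only; not every compiler mentioned
in this thread ships "%a", and the function names are just
placeholders):

#include <cstdio>    // std::sprintf
#include <cstdlib>   // std::strtod

// Write the exact bit pattern as text, e.g. 3.0 becomes "0x1.8p+1".
// buf must be large enough (about 32 characters suffices for double).
void write_hex( char* buf, double d )
{
    std::sprintf( buf, "%a", d );
}

// Read it back; a C99 strtod() accepts hex-float input, so the round
// trip is exact regardless of any decimal precision settings.
double read_hex( char const* buf )
{
    return std::strtod( buf, NULL );
}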

Gerhard
James Kanze
2009-10-29 10:09:01 UTC
Permalink
Post by Gerhard Fiedler
Post by Rune Allnor
Here is a test I wrote in matlab a few years ago, to
[... Matlab code]
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.
In Matlab. This doesn't say much if anything about any other
program. Possibly Matlab has a lousy (in terms of speed) text
IO.
Obviously, not possibly. I get a factor of between 3 and 10,
depending on the compiler and the system. I get a signficant
difference simply running what I think is the same program (more
or less) on two different machines, using the same compiler and
having the same architecture---one probably has a much higher
speed IO bus than the other, and that makes the difference.
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much, but
still) and provides a way to write out the exact value that is
in memory (and recreate that exact value -- no matter the
involved precisions).
But it defeats one of the major reasons for using text: human
readability.

--
James Kanze
Gerhard Fiedler
2009-10-29 20:18:02 UTC
Permalink
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text. Speeds
up the conversion (probably not by much, but still) and provides a
way to write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex. The most important issue is that the
fields (mantissa sign, mantissa, exponent sign, exponent, etc.) are
decoded and appropriately presented. Whether the mantissa and the
exponent are then in decimal, octal or hexadecimal IMO doesn't make much
of a difference.

Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.

Gerhard
James Kanze
2009-10-30 08:44:00 UTC
Permalink
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much,
but still) and provides a way to write out the exact value
that is in memory (and recreate that exact value -- no
matter the involved precisions).
human readability.
Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this type
of output (meant to be communication between programs) are
programmers, hence typically reasonably fluent in octal and
hex. The most important issue is that the fields (mantissa
sign, mantissa, exponent sign, exponent, etc.) are decoded and
appropriately presented. Whether the mantissa and the exponent
are then in decimal, octal or hexadecimal IMO doesn't make
much of a difference.
Agreed (sort of): I thought you were talking about outputting a
hex dump of the bytes. Separating out the mantissa and the
exponent is a simple and rapid compromise: it's not anywhere
near as readable as the normal format, but as you say, it should
be sufficient for most uses by a professional in the field.
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant difference.
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge
amounts of data, doing anything more with that data than just
a cursory look at some numbers (which IMO is fine in octal or
hex) generally needs a program anyway.
One would hope that you could start debugging with much smaller
sets of data. And if you do end up one LSB off after reading,
you'll probably want to look at the exact value.

--
James Kanze
Rune Allnor
2009-10-30 09:37:31 UTC
Permalink
Post by James Kanze
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't
really a need to go decimal, too. Hex or octal numbers are
also text. Speeds up the conversion (probably not by much,
but still) and provides a way to write out the exact value
that is in memory (and recreate that exact value -- no
matter the involved precisions).
human readability.
Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this type
of output (meant to be communication between programs) are
programmers, hence typically reasonably fluent in octal and
hex. The most important issue is that the fields (mantissa
sign, mantissa, exponent sign, exponent, etc.) are decoded and
appropriately presented. Whether the mantissa and the exponent
are then in decimal, octal or hexadecimal IMO doesn't make
much of a difference.
Agreed (sort of): I thought you were talking about outputting a
hex dump of the bytes.  Separating out the mantissa and the
exponent is a simple and rapid compromize: it's not anywhere
near as readable as the normal format, but as you say, it should
be sufficient for most uses by a professional in the field.
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant different.
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge
amounts of data, doing anything more with that data than just
a cursory look at some numbers (which IMO is fine in octal or
hex) generally needs a program anyway.
One would hope that you could start debugging with much smaller
sets of data.  And if you do end up one LSB off after reading,
you'll probably want to look at the exact value.
So what do text-based formats actually buy you?

- Files are several times larger than binary dumps
- IO delays are several times (I'd say orders) slower
for text than for binary
- Human users don't benefit from the text dumps anyway,
since they are too large to be useful
- Human readers would have to make an effort to
convert text dumps to readable format

In the end, text formats require humans to do the same
work converting data to a readable format as would be
required with binary data, AND they add larger file sizes
and longer IO delays as additional nuisances.

Rune
James Kanze
2009-10-30 16:08:12 UTC
Permalink
[...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...

In sum, lower cost.

--
James Kanze
Rune Allnor
2009-10-31 14:19:10 UTC
Permalink
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
As long as you keep two factors in mind:

1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.

Those who want easy, not awfully challenging jobs might be
better off flipping burgers.

Rune
James Kanze
2009-11-02 10:12:18 UTC
Permalink
Post by Rune Allnor
Post by James Kanze
[...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time. Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder. If you're an amateur, doing this for fun, do whatever
amuses you the most. If you're a professional, selling your
services, professional ethics requires providing the best
service possible at the lowest price possible.

--
James Kanze
Brian
2009-11-02 21:05:39 UTC
Permalink
Post by Rune Allnor
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
   bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time.  Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder.  If you're an amateur, doing this for fun, do whatever
amuses you the most.  If you're a professional, selling your
services, professional ontology requires provided the best
service possible at the lowest price possible.
I'm interested in binary in this context as an
alternative to text because I believe markets and
conditions are likely to continue to be volatile for
a while. If I had more confidence in various
officials, B.O. (Obama), Putin, Ahmadinejad, etc.,
I'd be less likely to think things are going to be
volatile. I like what Rabbi Michael Healer said
when he met the governor of Texas -- Rick Perry --
a few years ago: "I didn't vote for you and I
don't trust you." I didn't vote for B.O. and I
don't trust him either.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Brian
2009-11-02 23:39:48 UTC
Permalink
Post by Brian
Post by Rune Allnor
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
   bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time.  Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder.  If you're an amateur, doing this for fun, do whatever
amuses you the most.  If you're a professional, selling your
services, professional ontology requires provided the best
service possible at the lowest price possible.
I'm interested in binary in this context as an
alternative to text because I believe markets and
conditions are likely to continue to be volatile for
a while.  
This is interesting --

http://stackoverflow.com/questions/1058051/boost-serialization-performance-text-vs-binary-format

M. Troyer, who I think is still around the Boost list,
considered using binary to be "essential."

http://lists.boost.org/Archives/boost/2002/11/39601.php

I'm not sure if those participating in this thread
come from a scientific application background as Troyer
does.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Rune Allnor
2009-11-03 19:09:35 UTC
Permalink
Post by Brian
Post by Brian
Post by Rune Allnor
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
   bandwidth etc) are not yours (the programmer) to waste.
The user pays for your time.  Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.
Post by Rune Allnor
Those who want easy, not awfully challenging jobs might be
better off flipping burgers.
Writing the most reliable programs for the lowest cost is
challenging enough without going out of your way to make it
harder.  If you're an amateur, doing this for fun, do whatever
amuses you the most.  If you're a professional, selling your
services, professional ontology requires provided the best
service possible at the lowest price possible.
I'm interested in binary in this context as an
alternative to text because I believe markets and
conditions are likely to continue to be volatile for
a while.  
This is interesting --
http://stackoverflow.com/questions/1058051/boost-serialization-perfor...
M. Troyer, who I think is still around the Boost list,
considered using binary to be "essential."
http://lists.boost.org/Archives/boost/2002/11/39601.php
I'm not sure if those participating in this thread
come from a scientific application background as Troyer
does.
I used to be involved with seismic data processing. About 12
years ago, the company I worked for got the first TByte disk
stack nationwide. Before that time, the guys who went offshore
came back with truckloads of EXAByte tapes. Just loading the
tapes to the disk drives took weeks.

The application I'm working with has to do with bathymetry
map processing. 'Bathymetry' just means 'underwater terrain',
so the end product is a map of the sea floor.

There are huge amounts of data flowing through (I wouldn't
be surprised if present day 'simple' mapping tasks are comparable
to late '80s seismic processing, as far as computational through-put
is concerned), and the job is essentially real-time: A directive
to discontinue present survey activities might be received at any
time (surveying is done from general-purpose vessels), in which
case the vessel and crew need to shut down all activities and
switch focus to whatever assignment is coming up, in a matter
of minutes or hours. At best one might accept a couple of hours'
latency on the processed result after a new batch of survey
data is available, but that's it. Since any survey can go on
for indefinite lengths of time, one needs to be able to process
each data batch faster than it took to measure, or one will
accumulate backlog.

The processing is done in multiple stages, so one just can't
wait for text-based file IO to complete. Those who base their
data flow on text files are not able to complete even the
shortest survey processing within the time it takes to survey
the data - which is the essential aspect of a real-time operation.

Rune
Brian
2009-11-01 06:54:38 UTC
Permalink
    [...]
Post by Rune Allnor
So what does text-based formats actually buy you?
Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
Since a message using a text format is generally longer than
binary formats, text leaves systems more vulnerable to
network problems caused by storms, cyber attacks, etc.
I won't argue the point about it being easier to use text,
but think it's a little like buying an SUV. If the price of
gas goes way up, many wish they had never bought an SUV.
Using binary might be a way to mitigate the pain caused by
volatile markets/conditions.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Gerhard Fiedler
2009-11-01 20:32:03 UTC
Permalink
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV. If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially unstable
network connection, simple binary is pretty bad. With text encoding
(could be e.g. base64 encoded binary, or pretty much everything else
that's guaranteed not to use all available symbols), you have a few
symbols left that you can use for stream synchronization. This is in
general much more important than a few bytes more to transmit. This may
even be important when storing data on disk: the chances of recovering
data if there's a problem are much higher if you have sync symbols in the
data stream.

There's a point for (simple) binary protocols when all you have is an
8bit microcontroller with 100 bytes of RAM and 1k of Flash. But you
typically don't program these in standard-compliant C++ :)

IMO this has nothing to do with SUVs... more with seat belts, if you
really want an automotive analogy. While they add weight to the vehicle,
and on (very) rare occasions may complicate things if there's a problem,
in most problem cases they can save your face, and more. (Which, back to
programming, may save your job -- and with it the payments for your SUV.
Now here we're back to the SUV :)

Gerhard
Brian
2009-11-02 20:04:28 UTC
Permalink
Post by Gerhard Fiedler
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV.  If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially unstable
network connection, simple binary is pretty bad. With text encoding
(could be e.g. base64 encoded binary, or pretty much everything else
that's guaranteed not to use all available symbols), you have a few
symbols left that you can use for stream synchronization. This is in
general much more important that a few bytes more to transmit. This may
even be important when storing data on disk: the chances of recovering
data if there's a problem is much higher if you have sync symbols in the
data stream.
If it were just a "few bytes more" I wouldn't be saying
anything. Likewise the difference between an SUV and
a fuel efficient vehicle isn't trivial. People wouldn't
be wishing they had never bought an SUV if that were
the case.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net
Gerhard Fiedler
2009-11-03 10:14:49 UTC
Permalink
Post by Gerhard Fiedler
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV.  If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially
unstable network connection, simple binary is pretty bad. With text
encoding (could be e.g. base64 encoded binary, or pretty much
everything else that's guaranteed not to use all available symbols),
you have a few symbols left that you can use for stream
synchronization. This is in general much more important that a few
bytes more to transmit. This may even be important when storing data
on disk: the chances of recovering data if there's a problem is much
higher if you have sync symbols in the data stream.
If it were just a "few bytes more" I wouldn't be saying anything.
Likewise the difference between an SUV and a fuel efficient vehicle
isn't trivial. People wouldn't be wishing they had never bought an
SUV if that were the case.
It is longer, but you were talking about unreliable networks. And
resyncing a binary stream is by design very problematic. Since you often
don't know beforehand the length of records (think strings), you have
length information encoded in your binary stream. If one length field is
bad and unrecoverable, pretty much the complete rest of the stream is
unreadable because you're out of sync from that point on. This is also
valid for data on disks.

Now, if you used an encoding with a few unused symbols, you can use
those symbols to add synchronization markers (records, whatever), and
even if a length field is bad, you maybe lost a record but not the whole
remainder of the stream.

On unreliable networks, I take that any day over the size advantage of
raw binary. Of course, this is not about text vs binary, this is about
whether raw binary is the best choice for unreliable networks. It isn't.

If you want both (speed and reliability), you'd create a custom encoding
that leaves only a few symbols unused that you then can use for syncing.
But raw binary is not a good choice over unreliable networks.
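
This is roughly what SLIP and HDLC-style framing do. A toy sketch of
the idea (my own illustration, not something anyone here has posted:
reserve one byte value as a frame marker and one as an escape, and
byte-stuff the payload so the marker can never occur inside a frame):

#include <string>

char const MARK = '\x7E';   // frame delimiter, never appears inside a frame
char const ESC  = '\x7D';   // escape byte

// Wrap a raw binary payload in a resynchronizable frame.
std::string frame( std::string const& payload )
{
    std::string out( 1, MARK );
    for ( std::string::size_type i = 0; i < payload.size(); ++ i ) {
        char c = payload[ i ];
        if ( c == MARK || c == ESC ) {
            out += ESC;
            out += static_cast< char >( c ^ 0x20 );   // transpose reserved byte
        } else {
            out += c;
        }
    }
    out += MARK;
    return out;
}

A receiver that loses sync just scans forward to the next MARK byte and
starts parsing again from there; at worst one frame is lost, not the
rest of the stream.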

And I still think that this has nothing to do with SUVs. How many people
do you know that are wishing they never had used a text protocol? How
many are there wishing they never had used raw binary over an unreliable
network link?

Gerhard
Brian
2009-11-03 17:35:47 UTC
Permalink
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Since a message using a text format is generally longer than binary
formats, text leaves systems more vulnerable to network problems
caused by storms, cyber attacks, etc. I won't argue the point about
it being easier to use text, but think it's a little like buying an
SUV.  If the price of gas goes way up, many wish they had never
bought an SUV. Using binary might be a way to mitigate the pain
caused by volatile markets/conditions.
If you're talking about sending something over a potentially
unstable network connection, simple binary is pretty bad. With text
encoding (could be e.g. base64 encoded binary, or pretty much
everything else that's guaranteed not to use all available symbols),
you have a few symbols left that you can use for stream
synchronization. This is in general much more important that a few
bytes more to transmit. This may even be important when storing data
on disk: the chances of recovering data if there's a problem is much
higher if you have sync symbols in the data stream.
If it were just a "few bytes more" I wouldn't be saying anything.
Likewise the difference between an SUV and a fuel efficient vehicle
isn't trivial.  People wouldn't be wishing they had never bought an
SUV if that were the case.
It is longer, but you were talking about unreliable networks. And
resyncing a binary stream is by design very problematic. Since you often
don't know beforehand the length of records (think strings), you have
length information encoded in your binary stream.
Yes.
Post by Gerhard Fiedler
If one length field is
bad and unrecoverable, pretty much the complete rest of the stream is
unreadable because you're out of sync from that point on. This is also
valid for data on disks.
I think there are ways to avoid that. Sentinel values are
often used in binary streams. If you get to the end of a
message and don't find the sentinel, you can scan until
you do find it. It's true that you may find a false
positive with binary, but the whole stream isn't lost.
Additionally, the message length can be embedded two times.
If the two lengths match, then an errant sublength within
the message won't cause any trouble to the whole stream,
but it may make it impossible to interpret one message.
If the two message lengths don't match then you have to
do some checking. If you have a max message length, you
check both values against that. If both are less than
that you would have to proceed with caution.
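A sketch of the kind of header and trailer I have in mind (the field
layout, names and limits are hypothetical, purely to illustrate the
cross-checks; real code would also pick a byte order and write the
fields out explicitly):

#include <stdint.h>

uint32_t const SENTINEL   = 0xDEADBEEF;    // end-of-message marker
uint32_t const MAX_LENGTH = 1024 * 1024;   // agreed maximum message size

struct MessageHeader
{
    uint32_t length1;   // payload length, first copy
    uint32_t length2;   // payload length, second copy
};
// ... 'length1' payload bytes follow, then SENTINEL as a uint32_t ...

// Plausibility check before trusting a header read off the wire.
bool header_looks_sane( MessageHeader const& h )
{
    return h.length1 == h.length2
        && h.length1 <= MAX_LENGTH;
}

// If the lengths disagree, or the sentinel is missing at the expected
// offset, scan forward until a SENTINEL pattern is found and try the
// next header just after it.
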
Post by Gerhard Fiedler
Now, if you used an encoding with a few unused symbols, you can use
those symbols to add synchronization markers (records, whatever), and
even if a length field is bad, you maybe lost a record but not the whole
remainder of the stream.
On unreliable networks, I take that any day over the size advantage of
raw binary. Of course, this is not about text vs binary, this is about
whether raw binary is the best choice for unreliable networks. It isn't.
Just saying "it isn't" doesn't convince me.
Post by Gerhard Fiedler
If you want both (speed and reliability), you'd create a custom encoding
that leaves only a few symbols unused that you then can use for syncing.
But raw binary is not a good choice over unreliable networks.
And I still think that this has nothing to do with SUVs. How many people
do you know that are wishing they never had used a text protocol? How
many are there wishing they never had used raw binary over an unreliable
network link?
I don't know any in either of those two categories.
Some predicted spiking oil prices 10 years ago and
they based their decisions on those predictions.
Something similar may happen with bandwidth prices.


Brian Wood
Ebenezer Enterprises
http://webEbenezer.net


I read today of a man who was fired for saying,
"I think homosexuality is bad stuff."
http://www.wnd.com/index.php?fa=PAGE.view&pageId=114779
I agree with him - it is bad stuff.
Gerhard Fiedler
2009-11-03 20:14:12 UTC
Permalink
Post by Brian
Post by Gerhard Fiedler
And I still think that this has nothing to do with SUVs. How many
people do you know that are wishing they never had used a text
protocol? How many are there wishing they never had used raw binary
over an unreliable network link?
I don't know any in either of those two categories.
Wasn't it you who wrote "People wouldn't be wishing they had never
bought an SUV if that were the case", while using the analogy of text
format and SUVs? I thought you'd know at least "people" who wished they
had used binary -- if not, how do you get to the analogy in the first
place?
Post by Brian
Some predicted spiking oil prices 10 years ago and they based their
decisions on those predictions. Something similar may happen with
bandwidth prices.
Right, may. In general, when programming, I don't base my decisions on
such "predictions". If you take all those predictions made, you get
probably more misses than hits. I tend to try to get more hits than
misses when programming... this is better for the near-term financial
situation, and I can know this without making any shaky predictions :)

Gerhard
Gerhard Fiedler
2009-10-30 11:35:57 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text.
Speeds up the conversion (probably not by much, but still) and
provides a way to write out the exact value that is in memory (and
recreate that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant
to be communication between programs) are programmers, hence
typically reasonably fluent in octal and hex. The most important
issue is that the fields (mantissa sign, mantissa, exponent sign,
exponent, etc.) are decoded and appropriately presented. Whether the
mantissa and the exponent are then in decimal, octal or hexadecimal
IMO doesn't make much of a difference.
Agreed (sort of): I thought you were talking about outputting a hex
dump of the bytes. Separating out the mantissa and the exponent is a
simple and rapid compromize: it's not anywhere near as readable as
the normal format, but as you say, it should be sufficient for most
uses by a professional in the field.
I think the biggest advantage of doing it this way is that the text
representation makes it portable between different binary floating point
formats, and that the octal or hex representation avoids any rounding
problems and maintains the exact value, independently of precision and
other details of the binary representation (on both sides).
Having done that, however, I suspect that on most machines, outputting
the different fields in decimal, rather than hex, would probably not
make a significant different.
That may well be. But the rounding aspect is still a problem.
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.
One would hope that you could start debugging with much smaller sets
of data. And if you do end up one LSB off after reading, you'll
probably want to look at the exact value.
Sure. You always can use debug flags for outputting debug values.

Gerhard
James Kanze
2009-10-30 16:16:01 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there
isn't really a need to go decimal, too. Hex or octal
numbers are also text. Speeds up the conversion
(probably not by much, but still) and provides a way to
write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
human readability.
Not that much. For (casual, not precision) reading, a few
digits are usually enough, and most people who read this
type of output (meant to be communication between programs)
are programmers, hence typically reasonably fluent in octal
and hex. The most important issue is that the fields
(mantissa sign, mantissa, exponent sign, exponent, etc.)
are decoded and appropriately presented. Whether the
mantissa and the exponent are then in decimal, octal or
hexadecimal IMO doesn't make much of a difference.
Agreed (sort of): I thought you were talking about
outputting a hex dump of the bytes. Separating out the
it's not anywhere near as readable as the normal format, but
as you say, it should be sufficient for most uses by a
professional in the field.
I think the biggest advantage of doing it this way is that the
text representation makes it portable between different binary
floating point formats, and that the octal or hex
representation avoids any rounding problems and maintains the
exact value, independently of precision and other details of
the binary representation (on both sides).
Post by James Kanze
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant different.
That may well be. But the rounding aspect is still a problem.
No. You're basically outputting (and reading) two integers: the
exponent (expressed as a power of two), and the mantissa
(expressed as the actual value times some power of two,
depending on the number of bits). For an IEEE double, for
example, you'd do something like:

MyOStream&
operator<<( MyOStream& dest, double value )
{
unsigned long long const& u
    = reinterpret_cast< unsigned long long const& >( value );
dest << ((u & 0x8000000000000000) != 0 ? '-' : '+')   // sign bit
     << (u & 0x000FFFFFFFFFFFFF) << 'b'               // 52 bit mantissa field
     << ((u >> 52) & 0x7FF);                          // 11 bit biased exponent
return dest;
}
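
A minimal sketch of the matching reader would be something like the
following (untested, and assuming the writer above, IEEE 754 doubles on
both ends, and a hypothetical formatted input stream "MyIStream"
mirroring MyOStream):

MyIStream&
operator>>( MyIStream& src, double& value )
{
    char sign;
    unsigned long long mantissa;
    char marker;
    unsigned long exponent;
    src >> sign >> mantissa >> marker >> exponent;
    if ( src && (sign == '+' || sign == '-') && marker == 'b' ) {
        unsigned long long u
            = mantissa
            | (unsigned long long)( exponent & 0x7FF ) << 52
            | (sign == '-' ? 0x8000000000000000ULL : 0ULL);
        value = reinterpret_cast< double const& >( u );
    }
    return src;
}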

--
James Kanze
Gerhard Fiedler
2009-10-31 11:19:44 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Having done that, however, I suspect that on most machines,
outputting the different fields in decimal, rather than hex,
would probably not make a significant different.
That may well be. But the rounding aspect is still a problem.
Ah, of course... :) <slap on forehead>

Gerhard
Jorgen Grahn
2009-11-04 21:47:48 UTC
Permalink
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text. Speeds
up the conversion (probably not by much, but still) and provides a
way to write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex.
I disagree there, in two ways:

- I belong to the school that claims protocols should be human-readable,
because, well, it opens them up. They get so much easier to
manipulate, and even talk about. Take HTTP as an example, or SMTP.

- I doubt that programmers are that good with hex. Even if I limit
myself to unsigned int, I can't tell what 0xbabe is. Probably 40000
or so. Or 30000? Who knows? There is a reason decimal is the default
base in pretty much every language I know of ... including assembly
languages.

...
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.
But for the text version of the data, that "program" is often a Unix
pipeline involving tools like grep, sort and uniq, or a Perl one-liner
you make up as you go. Or it can be fed directly into gnuplot or
Excel. If the data is binary, you probably simply won't bother.

I think we have been misled a bit here, too. I haven't read the whole
thread, but it started with something like "dump a huge array of
floats to disk, collect it later". If you take the more common case
"take this huge complex data structure and dump it to disk in a
portable format", you have a completely different situation, where the
non-text format isn't that much smaller or faster.

/Jorgen
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Brian
2009-11-05 23:36:04 UTC
Permalink
Post by Jorgen Grahn
Post by Gerhard Fiedler
Post by James Kanze
Post by Gerhard Fiedler
Re the precision issue: When writing out text, there isn't really a
need to go decimal, too. Hex or octal numbers are also text. Speeds
up the conversion (probably not by much, but still) and provides a
way to write out the exact value that is in memory (and recreate
that exact value -- no matter the involved precisions).
But it defeats one of the major reasons for using text: human
readability.
Not that much. For (casual, not precision) reading, a few digits are
usually enough, and most people who read this type of output (meant to
be communication between programs) are programmers, hence typically
reasonably fluent in octal and hex.
- I belong to the school that claims protocols should be human-readable,
  because, well, it opens them up.  They get so much easier to
  manipulate, and even talk about.  Take HTTP as an example, or SMTP.
- I doubt that programmers are that good with hex.  Even if I limit
  myself to unsigned int, I can't tell what 0xbabe is.  Probably 40000
  or so. Or 30000?  Who knows?  There is a reason decimal is the default
  base in pretty much every language I know of ... including assembly
  languages.
...
Post by Gerhard Fiedler
Since what we're talking about is only relevant for huge amounts of
data, doing anything more with that data than just a cursory look at
some numbers (which IMO is fine in octal or hex) generally needs a
program anyway.
But for the text version of the data, that "program" is often a Unix
pipeline involving tools like grep, sort and uniq, or a Perl one-liner
you make up as you go.  Or it can be fed directly into gnuplot or
Excel. If the data is binary, you probably simply won't bother.
I think we have been misled a bit here, too. I haven't read the whole
thread, but it started with something like "dump a huge array of
floats to disk, collect it later".  If you take the more common case
"take this huge complex data structure and dump it to disk in a
portable format", you have a completely different situation, where the
non-text format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved
in those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?

Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs." I don't think anyone has
directly disagreed with that statement yet.


Brian Wood
Ebenezer Enterprises
www.webEbenezer.net

"How much better is it to get wisdom than gold! and to
get understanding rather to be chosen than silver!"
Proverbs 16:16
James Kanze
2009-11-06 09:03:33 UTC
Permalink
[...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later". If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case". Certainly, most common
cases do include a lot of text data. On the other hand, the
origin of this thread was dumping doubles: purely numeric data.
And while perhaps less common, they do exist, and aren't really
rare either. (I've encountered them once or twice in my career,
and I'm not a numerics specialist.)
Post by Brian
Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs." I don't think anyone has
directly disagreed with that statement yet.
The original requirement, if I remember correctly, included
rereading the data with no loss of precision. This means 17
digits precision for an IEEE double, with an added sign, decimal
point and four or five characters for the exponent (using
scientific notation). Add a separator, and that's 24 or 25
bytes, rather than 8. So the 20% is off; 33% seems to be the
lower limit. But in a lot of cases, that's a lot; it's
certainly something that has to be considered in some
applications.
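
To put numbers on that, a quick sketch (17 significant digits means
std::scientific with a precision of 16; exactness of the round trip
assumes the library's conversions are correctly rounded, which is the
usual case):

#include <iomanip>
#include <iostream>
#include <sstream>

int main()
{
    double value = -1.0 / 3.0;

    std::ostringstream out;
    // 1 digit before the point + 16 after = 17 significant digits.
    out << std::scientific << std::setprecision( 16 ) << value;

    double back;
    std::istringstream( out.str() ) >> back;

    std::cout << "text: " << out.str() << "\n"
              << "bytes as text (plus one separator): "
              << out.str().size() + 1 << "\n"     // typically 24 or 25
              << "bytes as binary: " << sizeof value << "\n"
              << "lossless round trip: "
              << (back == value ? "yes" : "no") << "\n";
    return 0;
}
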

--
James Kanze
Rune Allnor
2009-11-06 16:51:03 UTC
Permalink
    [...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later".  If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures.  But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case".
As I recall, I started by a purely technical question about
binary typecasts. Others started bringing in text formats.
I have only attempted to explain - in vain, it seems - why
text-based numerical formats are a no-go in technical
applications.
 Certainly, most common
cases do include a lot of text data.
I am not talking about 'common' cases. I am talking about heavy-duty
work. Once you are talking about numeric data in the hundreds of
MBytes
(regardless of the storage format), any amount of accompanying text
is irrelevant. One page of plain text takes about 2 kbytes.

There was, in fact, an 'improvement' to the ancient SEG-Y seismic
data format,

http://en.wikipedia.org/wiki/SEG_Y

the SEG-2,

http://diwww.epfl.ch/lami/detec/seg2.html

where a lot of the auxiliary (numeric) information was specified
to be stored in text format. I first saw the SEG-2 spec about ten
years ago, but I have never heard that it has actually been used.
The speed losses involved with converting data back and forth from
text to binary would fully explain why SEG-2 has not gained
widespread acceptance among the heavy-duty users.

Rune
James Kanze
2009-11-08 14:27:41 UTC
Permalink
Post by Rune Allnor
Post by James Kanze
[...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later". If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures. But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case".
As I recall, I started by a purely technical question about
binary typecasts.
Which, of course, raises the question as to why. They're not
very useful unless you're doing exceptionally low level work.
Post by Rune Allnor
Others started bringing in text formats.
The original comment was just that---a parenthetical comment.
Text formats have many advantages, WHEN you can use them. It's
also obvious that they have additional overhead---not nearly as
much as you claimed in terms of CPU, but they aren't free
either, neither in CPU time nor in data size.
Post by Rune Allnor
I have only attempted to explain - in vain, it seems - why
text-based numerical formats is a no-go in technical
applications.
And you blew it by giving exaggerated figures:-). Other than
that: they're not a no-go in technical applications. They do
have too much overhead for some applications (not all), and in
such cases, you have to use a binary format. Depending on other
requirements (portability, external requirements, etc.), you may
need a more or less complicated binary format.
Post by Rune Allnor
Post by James Kanze
Certainly, most common cases do include a lot of text data.
I am not talking about 'common' cases. I am talking about
heavy-duty work. Once you are talking about numeric data in
the hundreds of MBytes (regardless of the storage format), any
amount of accompagnying text is irrelevant. One page of plain
text takes about 2 kbytes.
Yes. I understand that.

In fact, now that you've mentioned seismic data, I agree that a
text format is probably not going to cut it. I've actually
worked on one project in the field, and I know just how much
floating point data they can generate.

--
James Kanze
Rune Allnor
2009-11-08 17:11:23 UTC
Permalink
On 8 Nov, 15:27, James Kanze <***@gmail.com> wrote:

I'm getting tired of reiterating this for people who
are not interested in actually evaluating the numbers.

Look for an upcoming post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.

Rune
Brian Wood
2009-11-08 22:15:48 UTC
Permalink
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
Look for an upcomimg post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.
Rune
I took the liberty of copying your post from clc++m to here
as this newsgroup is faster as far as getting the posts out
there.


Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about a
technicality of binary file IO. Over the course of the discussion, I
discovered to my amazement - and, quite frankly, horror - that there
seems to be a school of thought that text-based storage formats are
universally preferable to binary formats for reasons of portability
and human readability.

The people who presented such ideas appeared not to appreciate two
details that counter any benefits text-based numerical formats might
offer:

1) Binary files are about 20%-70% of the size of the text files,
   depending on the number of significant digits stored in the text
   files and other formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read
   and write than binary formats.

Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.

I have therefore sketched a 'distilled' test (code below) to measure
the overheads involved in formatting numerical data back and forth
between text and binary formats. To eliminate the impact of
peripheral devices, I have used a std::stringstream to store the
data. The binary buffers are represented by vectors, and I have
assumed that a memcpy from the file buffer to the destination memory
location is all that is needed to import the binary format from the
file buffer. (If there are significant run-time overheads associated
with moving NATIVE binary formats to the destination, please let me
know.)

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers for IO
cycles:

Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text:  16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of
160e-9/6e-9 = 26x associated with the text formats.

Add a little bit of other overheads, e.g. caused by the significantly
larger text file sizes in combination with suboptimal buffering
strategies, and the relative numbers easily hit the triple digits.
Not at all insignificant when one works with large amounts of data
under tight deadlines.

So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune

/***************************************************************************/
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
#include <vector>

// Returns the current local time as a string, without the trailing '\n'
// that asctime() appends.
std::string now()
{
    time_t rawtime;
    time( &rawtime );
    std::string message( asctime( localtime( &rawtime ) ) );
    message.erase( message.size() - 1 );
    return message;
}

int main()
{
    const size_t NumElements = 1000000;
    std::vector<double> SourceBuffer;
    std::vector<double> DestinationBuffer;

    for ( size_t n = 0; n < NumElements; ++n )
    {
        SourceBuffer.push_back( n );
        DestinationBuffer.push_back( 0 );
    }

    std::cout << now() << " : Binary IO cycles started" << std::endl;

    // "Binary" transfer: the native representation is copied straight from
    // the source buffer to the destination, as a memcpy from a file buffer
    // would do.
    const size_t NumBinaryIOCycles = 1000;
    for ( size_t n = 0; n < NumBinaryIOCycles; ++n )
    {
        for ( size_t m = 0; m < NumElements; ++m )
        {
            DestinationBuffer[m] = SourceBuffer[m];
        }
    }

    std::cout << now() << " : " << NumBinaryIOCycles
              << " Binary IO cycles completed" << std::endl;

    // Text-format transfer: each value is formatted into a stringstream
    // with enough digits for a lossless round trip, then parsed back out.
    const size_t NumTextFormatIOCycles = 100;

    std::cout << now() << " : Text-format IO cycles started" << std::endl;

    for ( size_t n = 0; n < NumTextFormatIOCycles; ++n )
    {
        std::stringstream ss;
        ss << std::setprecision( 17 );

        for ( size_t m = 0; m < NumElements; ++m )
        {
            ss << SourceBuffer[m] << ' ';   // separator, so the values can be parsed back
        }

        size_t m = 0;
        while ( m < NumElements && ss >> DestinationBuffer[m] )
        {
            ++m;
        }
    }

    std::cout << now() << " : " << NumTextFormatIOCycles
              << " Text-format IO cycles completed" << std::endl;

    return 0;
}


Brian Wood
Brian Wood
2009-11-08 22:44:35 UTC
Permalink
Post by Brian Wood
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
Look for an upcomimg post on comp.lang.c++.moderated, where
I distill the problem statement a bit, as well as present
a C++ test to see what kind of timing ratios I am talking about.
Rune
I took the liberty of copying your post from clc++m to here
as this newsgroup is faster as far as getting the posts out
there.
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality
about binary file IO. Over the course of the discussion, I discovered
to my
amazement - and, quite frankly, horror - that there seems to be a
school of
thought that text-based storage formats are universally preferable to
binary text
formats for reasons of portability and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?


Brian Wood
James Kanze
2009-11-09 01:14:42 UTC
Permalink
[...]
Post by Brian Wood
Post by Brian Wood
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary text formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
Binary formats are an optimization: you sometimes need this
optimization (and you certainly should be aware of the
possibility of using it), but you don't use them unless timing
or data size constraints make it necessary.

--
James Kanze
Rune Allnor
2009-11-09 10:57:48 UTC
Permalink
    [...]
Post by Brian Wood
Post by Brian Wood
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary text formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed.  Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
*You* are accusing *me* of missing the fine print??!!

Let's see what I have written. From my post

http://groups.google.no/group/comp.lang.c++/msg/1c4004bbac86a046

[RA] > > File I/O operations with text-formatted floating-point data
take time. A *lot* of time.
[JK] > A lot of time compared to what?

[RA] Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.

...
[RA] > > The rule-of-thumb is 30-60 seconds per 100 MBytes of
text-formatted FP numeric data, compared to fractions of a
second for the same data (natively) binary encoded (just try
it).
[JK] > Try it on what machine:-).

[RA] Any machine. The problem is to decode text-formatted numbers
to binary.

...
Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

[matlab code snipped]

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text write.
Binary reads are 42.2/0.32 = 130x faster than text read.

...
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
...
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour.

I think these excerpts should be sufficient to sketch what
kind of world I am living and working in.

Do note that I never - unlike some other participants in this
thread - claimed my numbers to be exact. I am fairly certain
my English is good enough that the above would reasonably be
expected to be interpreted by a reader as *representative*
numbers. If you look closely, I also commented that coding
up a program in C++ instead of matlab as I had done, would
result in *different* numbers, but not solve the fundamental
problem.

So I can't see any reason why you attack me for my numbers
being "wrong"; I never stated they were exact.

A few posts further out:

http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6

[RA] So what does text-based formats actually buy you?
[JK] Shorter development times, less expensive development, greater
reliability...

In sum, lower cost.

[RA] As long as you keep two factors in mind:
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.

[JK] The user pays for your time. Spending it to do something which
results in a less reliable program, and that he doesn't need, is
irresponsible, and borders on fraud.

This one really pissed me off. Here I had explained to you
what application I am working with, made you aware of the users'
requirements in the operational situation, and you explicitly
state that paying attention to such concerns is 'borderline fraud'!

So I can not interpret this in any other way than that you will
use text-based formats, come hell or high water. Which essentially
invalidates any otherwise relevant arguments you might have presented
throughout the thread.
No, it's not. The selection of file formats is a strategic design
decision on a par with choosing between a binary O(lg N) and a linear
O(N) search, or between an O(N lg N) quick sort and an O(N^2)
bubble sort algorithm.

Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.

True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
you sometimes need this
optimization (and you certainly should be aware of the
possibility of using it), but you don't use them unless timing
or data size constraints make it necessary.
Hypocrite!

This is exactly what I have been arguing for days and weeks already.
What changed?

Rune
James Kanze
2009-11-09 13:37:53 UTC
Permalink
Post by Rune Allnor
Post by James Kanze
[...]
Post by Brian Wood
Post by Brian Wood
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary text formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
*You* are accusing *me* of missing the fine print??!!
[...]
Post by Rune Allnor
I think these excerpts should be sufficient to sketch what
kind of world I am living and working in.
I fully understand what kind of world you're working in. As a
consultant, I've worked on seismic applications too, albeit not
recently.
Post by Rune Allnor
Do note that I never - unlike some other participants in this
thread - claimed my numbers to be exact.
Off by more than an order of magnitude is not just a question of
"exact".
Post by Rune Allnor
I am fairly certain my English is good enough that the above
would reasonably be expected to be interpreted by a reader as
*representative* numbers. If you look closely, I also
commented that coding up a program in C++ instead of matlab as
I had done, would result in *different* numbers, but not solve
the fundamental problem.
So I can't see any reason why you attack me for my numbers
being "wrong"; I never stated they were exact.
First, I didn't "attack" you. On the whole, I understand your
problem. Stating that the difference is some 100 times is
misleading, however.
Post by Rune Allnor
http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6
[RA] So what does text-based formats actually buy you?
[JK] Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
1) The user's time is not yours (the programmer) to waste.
2) The users's storage facilities (disk space, network
bandwidth etc) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something
which
results in a less reliable program, and that he doesn't need,
is
irresponsible, and borders on fraud.
This one really pissed me off. Here I had explained to you
what application I am working with, made you aware of the
users requirements in the operational situation, and you
explicitly state that paying attention to such concerns is
'borderline fraud'!
I didn't say that. I said that ignoring issues of development
time and reliability is fraud. You have to make a trade off; if
text based IO isn't sufficiently fast for the users needs, or
requires too much additional space, then you use binary. But
you consider the cost of doing so, and weigh it against the
other costs.
Post by Rune Allnor
So I can not interpret this in any other way than that you
will use text-based formats, come hell or high water.
How do you read that into anything I've said. I've simply
pointed out that using text does buy you something, or in other
words, using binary has a cost. There's no doubt that using
text has other costs. Engineering is about weighing the
difference costs; if you don't know what text based formats buy
you, then you can't weigh the costs accurately.
Post by Rune Allnor
Which essentially invalidate any otherwise relevant arguments
you might have presented throughout thread.
No, it's not. The selection of file formats is a strategic desing
decision on a par with using binary O(lgN) or linear O(N) search
engines; like choosing betweene a O(NlgN) quick sort or a O(N^2)
bubble sort algorithm.
Which are also optimizations:-).

There are optimizations and optimizations. Sometimes you do
know up front that you'll need the optimization; if you know
that you'll have to deal with millions of elements, you know up
front that a quadratic algorithm won't do the trick.

In the case of choosing binary, the motivation for doing so up
front is a bit different---after all, the difference will never
be other than linear. Partially, the motivation can be
calculated: if you know the number of elements, you can
calculate the disk space needed up front. In many cases,
however, you know that you'll be locked into the format you
choose, so you have to consider performance issues earlier.
Once you start considering performance issues, however, you're
talking about optimization.
Post by Rune Allnor
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
There you go exaggerating again. It's not orders of magnitude
slower. At the most, it's around 10 times slower, and often the
difference is even less. That doesn't mean that it's irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).

--
James Kanze
Brian
2009-11-09 22:31:20 UTC
Permalink
Post by Rune Allnor
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
There you go exagerating again.  It's not orders of magnitude
slower.  At the most, it's around 10 times slower, and often the
difference is even less.  That doesn't mean that its irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).
This Gianni Mariani quote indicates he saw some
differences of more than 10x.

"However, reading and writing binary files can have HUGE
performance gains. I once came across some numerical code
where it would read and write large datasets. These datasets
were 40-100MB. The performance was horrendus. Using mapped
files and binary data made the reading and writing virtually
zero cost and it improved the performance of the product by
nearly 10x times and in some tests over 1000x. Be careful -
this is one application and the bottle neck was clearly
identified. This may not be where your application spends
its time."
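
(Gianni's code isn't shown, but the mapped-file idea itself is simple.
A rough POSIX-only sketch, assuming the file -- here a hypothetical
"data.bin" -- contains nothing but native-format doubles, so "reading"
reduces to pointer arithmetic over the mapping:)

#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <fcntl.h>      // open
#include <unistd.h>     // close
#include <cstddef>
#include <cstdio>

int main()
{
    int fd = open( "data.bin", O_RDONLY );
    if ( fd < 0 ) { std::perror( "open" ); return 1; }

    struct stat info;
    if ( fstat( fd, &info ) != 0 ) { std::perror( "fstat" ); close( fd ); return 1; }

    void* mapping = mmap( 0, info.st_size, PROT_READ, MAP_SHARED, fd, 0 );
    if ( mapping == MAP_FAILED ) { std::perror( "mmap" ); close( fd ); return 1; }

    // No parsing and no copying: the doubles are used straight out of the
    // mapping, and pages are faulted in on demand as the loop touches them.
    const double* values = static_cast< const double* >( mapping );
    std::size_t count = info.st_size / sizeof( double );

    double sum = 0.0;
    for ( std::size_t i = 0; i < count; ++i ) {
        sum += values[i];
    }
    std::printf( "%lu doubles, sum = %g\n", (unsigned long)count, sum );

    munmap( mapping, info.st_size );
    close( fd );
    return 0;
}
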


I hope to beef up the C++ Middleware Writer's support
for writing and reading data more generally. To begin
with I'm going to focus on integral types and assume
8 bit bytes. Currently we don't have support for uint8_t,
uint16_t, etc. I guess those are the types I'll start with.
I'm going through the newsgroup archives to find snippets
that are helpful in this area. If anyone has a link wrt
this, I'm interested.
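
For those integral types, the usual portable trick is to take the value
apart and put it back together arithmetically in a fixed byte order, so
the host's endianness never enters into it. A rough sketch for uint32_t
(assuming 8-bit bytes, as above; writeU32/readU32 are just illustrative
names, and uint16_t follows the same pattern with two bytes):

#include <stdint.h>     // uint32_t (<cstdint> on newer compilers)
#include <istream>
#include <ostream>

// Write the value most-significant byte first, regardless of host byte order.
inline void writeU32( std::ostream& out, uint32_t value )
{
    char bytes[4];
    bytes[0] = static_cast< char >( (value >> 24) & 0xFF );
    bytes[1] = static_cast< char >( (value >> 16) & 0xFF );
    bytes[2] = static_cast< char >( (value >>  8) & 0xFF );
    bytes[3] = static_cast< char >(  value        & 0xFF );
    out.write( bytes, 4 );
}

// Read four bytes and reassemble them; returns false on a short read.
inline bool readU32( std::istream& in, uint32_t& value )
{
    unsigned char bytes[4];
    if ( !in.read( reinterpret_cast< char* >( bytes ), 4 ) ) {
        return false;
    }
    value = (static_cast< uint32_t >( bytes[0] ) << 24)
          | (static_cast< uint32_t >( bytes[1] ) << 16)
          | (static_cast< uint32_t >( bytes[2] ) <<  8)
          |  static_cast< uint32_t >( bytes[3] );
    return true;
}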


Brian Wood
http://www.webEbenezer.net
James Kanze
2009-11-09 22:57:46 UTC
Permalink
Post by Brian
Post by Rune Allnor
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
There you go exagerating again.  It's not orders of magnitude
slower.  At the most, it's around 10 times slower, and often the
difference is even less.  That doesn't mean that its irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).
This Gianni Mariani quote indicates he saw some
differences of more than 10x.
"However, reading and writing binary files can have HUGE
performance gains.  I once came across some numerical code
where it would read and write large datasets. These datasets
were 40-100MB.  The performance was horrendus.  Using mapped
files and binary data made the reading and writing virtually
zero cost and it improved the performance of the product by
nearly 10x times and in some tests over 1000x.  Be careful -
this is one application and the bottle neck was clearly
identified.  This may not be where your application spends
its time."
I hope to beef up the C++ Middleware Writer's support
for writing and reading data more generally.  To begin
with I'm going to focus on integral types and assume
8 bit bytes.  Currently we don't have support for uint8_t,
uint16_t, etc.  I guess those are the types I'll start with.
I'm going through the newsgroup archives to find snippets
that are helpful in this area.  If anyone has a link wrt
this, I'm interested.
Brian Wood
http://www.webEbenezer.net
James Kanze
2009-11-09 01:10:58 UTC
Permalink
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
I actually did some measurements, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implementation to the next.
Post by Rune Allnor
Look for an upcomimg post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with its
moderation policy (as currently practiced).

--
James Kanze
Alf P. Steinbach
2009-11-09 05:06:09 UTC
Permalink
Post by James Kanze
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
I actually did some measures, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implemenation to the next.
Post by Rune Allnor
Look for an upcomimg post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with its
moderation policy (as currently practiced).
Would you care to elaborate on that hinting, please.


Cheers,

- Alf
James Kanze
2009-11-09 13:41:43 UTC
Permalink
Post by Alf P. Steinbach
Post by James Kanze
Post by Rune Allnor
I'm getting tired with re-iterating this for people who
are not interested in actually evaluating the numbers.
I actually did some measures, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will
vary enormously from one implemenation to the next.
Post by Rune Allnor
Look for an upcomimg post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with
its moderation policy (as currently practiced).
Would you care to elaborate on that hinting, please.
"Not everyone" means "at least me". I stopped participating in
the group because I found the moderation was becoming too heavy
in some cases. Others, I know, aren't bothered with it. To
each his own.

--
James Kanze
Brian
2009-11-06 19:54:01 UTC
Permalink
    [...]
Post by Brian
Post by Jorgen Grahn
I think we have been misled a bit here, too. I haven't read
the whole thread, but it started with something like "dump a
huge array of floats to disk, collect it later".  If you
take the more common case "take this huge complex data
structure and dump it to disk in a portable format", you
have a completely different situation, where the non-text
format isn't that much smaller or faster.
I guess you're saying that the results are closer in some
cases because there's a lot of non-numeric data involved in
those complex data structures.  But aren't you ignoring
scientific applications where the majority of the data is
numeric?
He spoke of the "more common case".  Certainly, most common
cases do include a lot of text data.  On the other hand, the
origine of this thread was dumping doubles: purely numeric data.
And while perhaps less common, they do exist, and aren't really
rare either.  (I've encountered them once or twice in my career,
and I'm not a numerics specialist.)
I've worked on one scientific application for a little over
six months. I hope to work with/on more scientific projects
in the future.
Post by Brian
Much earlier in the thread, Allnor wrote, "Binary files
are usually about 20%-70% of the size of the text file,
depending on numbers of significant digits and other
formatting text glyphs."  I don't think anyone has
directly disagreed with that statement yet.
The original requirement, if I remember correctly, included
rereading the data with no loss of precision.  This means 17
digits precision for an IEEE double, with an added sign, decimal
point and four or five characters for the exponent (using
scientific notation).  Add a separator, and that's 24 or 25
bytes, rather than 8.  So the 20% is off; 33% seems to be the
lower limit.  But in a lot of cases, that's a lot; it's
certainly something that has to be considered in some
applications.
Yes. I brought it up because I wasn't sure if Grahn was
agreeing with something Fiedler said about it being just a few
more bytes. Even if it were 70% I wouldn't describe that as
a minor difference.


Brian Wood
http://www.webEbenezer.net
Rune Allnor
2009-10-18 10:07:35 UTC
Permalink
Post by Maxim Yegorushkin
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
     float f;
     stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive

std::vector<float> v;
float f;
for (size_t n = 0; n < N; ++n)
{
    file.read(reinterpret_cast<char*>(&f), sizeof f);
    v.push_back(f);
}

doesn't work as expected. Do I need to call 'seekg'
inbetween?

Rune
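
(As an aside: when the element count is known up front, the whole block
can be read in one call instead of one read() per value. A rough
sketch, assuming the file really does hold N floats in the native
representation; readFloats is just an illustrative name:)

#include <cstddef>
#include <fstream>
#include <vector>

std::vector< float > readFloats( const char* filename, std::size_t N )
{
    std::vector< float > v( N );
    std::ifstream file( filename, std::ios::binary );
    if ( N == 0 ) {
        return v;
    }
    // One bulk read straight into the vector's contiguous storage.
    file.read( reinterpret_cast< char* >( &v[0] ),
               static_cast< std::streamsize >( N * sizeof( float ) ) );
    if ( !file ) {
        // Short read: keep only the elements actually filled in.
        v.resize( static_cast< std::size_t >( file.gcount() ) / sizeof( float ) );
    }
    return v;
}
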
Alf P. Steinbach
2009-10-18 10:26:31 UTC
Permalink
Post by Rune Allnor
Post by Maxim Yegorushkin
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
float f;
stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive
std::vector<float> v;
for (n=0;n<N;++n)
{
file.read(reinterpret_cast<char*>(&f), sizeof f);
v.push_back(v);
}
doesn't work as expected. Do I need to call 'seekg'
inbetween?
post complete code

cheers & hth

- alf
Rune Allnor
2009-10-18 10:42:41 UTC
Permalink
Post by Alf P. Steinbach
Post by Rune Allnor
Post by Maxim Yegorushkin
Post by Rune Allnor
Hi all.
I have used the method from this page,
http://www.cplusplus.com/reference/iostream/istream/read/
to read some binary data from a file to a char[] buffer.
The 4 first characters constitute the binary encoding of
a float type number. What is the better way to transfer
the chars to a float variable?
The naive C way would be to use memcopy. Is there a
better C++ way?
This is the correct way since memcpy() allows you to copy unaligned data
into an aligned object.
     float f;
     stream.read(reinterpret_cast<char*>(&f), sizeof f);
The naive
std::vector<float> v;
for (n=0;n<N;++n)
{
   file.read(reinterpret_cast<char*>(&f), sizeof f);
   v.push_back(v);
}
doesn't work as expected. Do I need to call 'seekg'
inbetween?
post complete code
Never mind. The project was compiled in 'release mode'
with every optimization flag I could find set to 11.
No reason to expect the source code to have anything
whatsoever to do with what actually goes on.

Once I switched back to debug mode, I was able to
track the progress.

Rune