Discussion:
The tellg bug
Eivind Grimsby Haarr
2004-09-02 17:48:39 UTC
I know that this has been posted before on several other newsgroups, but I
need to make sure I got this right, so I hope you can forgive me for
posting this.

In MSVC 6.0, and also in several Borland C++ compilers from what I can see
from newsgroup postings, ifstream::tellg() alters the position of the file
reading pointer when reading UNIX files (only LF character, not CRLF) in
text mode. I can see why it does this, keeping consistency while treating
CRLF as a single character.

Using subsequent getline(...) calls, no problems arise, but once I need
to save a position with tellg, to be able to seek back to this position
with seekg later, problems arise if the file has accidentally been
converted to UNIX LF format. I know I can solve this by opening the file
in binary mode, but then I have to write my own code handling the
reading of lines and different newline characters.

My questions are:
* Is this compiler-dependent, or a general problem with text-mode file
reading? Does the standard specify anything about this?
* Is it impossible to write a program using only standard library
functions, that handles tellg/seekg positioning with both UNIX/DOS files
in text mode? (Not to mention Mac-files...)

I know I'm not the first one that has encountered this problem, so I would
expect that somewhere someone has solved this before...

Finally, another question: Does anyone know of a good online
tutorial/reference for Windows programming with C++? Or can
someone alternatively tell me which newsgroup I rather should have posted
that question to...


- Eivind Grimsby Haarr

"Trying is the first step towards failure."
- Homer Simpson
Mike Wahler
2004-09-02 18:24:33 UTC
Post by Eivind Grimsby Haarr
* Is this compiler-dependent, or a general problem with text-mode file
reading? Does the standard specify anything about this?
* Is it impossible to write a program using only standard library
functions, that handles tellg/seekg positioning with both UNIX/DOS files
in text mode? (Not to mention Mac-files...)
I know I'm not the first one that has encountered this problem, so I would
expect that somewhere someone has solved this before...
Since I have little experience with 'tellg()', I'll let
someone else address that issue.
Post by Eivind Grimsby Haarr
Finally, another question: Does anyone know of a good online
tutorial/reference for Windows programming with C++?
I like the tutorials at www.relisoft.com
YMMV. In any case, I'd recommend going through the Petzold book
(5th edition) first (which uses C) for learning the fundamentals.
Post by Eivind Grimsby Haarr
Or can
someone alternatively tell me which newsgroup I rather should have posted
that question to...
Good advice r.e. Windows programming is available at newsgroup
comp.os.ms-windows.programmer.win32

-Mike
John Harrison
2004-09-02 19:00:13 UTC
Post by Eivind Grimsby Haarr
* Is this compiler-dependent, or a general problem with text-mode file
reading? Does the standard specify anything about this?
The standard specifies that if you open a file in text mode then only four
versions of seekg are going to work.

1) Seek to the start of a file
2) Seek to the end of a file
3) Seek to the current position
4) Seek to a position previously saved with tellg.

This last one seems to be the one you are interested in. Although I don't
get the bit about 'accidentally converted to UNIX LF-format'. If you're
writing the program you should be able to stop anything being accidentally
converted.

On some systems with some compilers you may get other possibilities to work,
but these are the only ones guaranteed by the standard.
Post by Eivind Grimsby Haarr
* Is it impossible to write a program using only standard library
functions, that handles tellg/seekg positioning with both UNIX/DOS files
in text mode? (Not to mention Mac-files...)
It's perfectly possible provided you stick to the four possibilities above.

john
Eivind Grimsby Haarr
2004-09-02 20:15:28 UTC
I can see I did not explain the problem thoroughly enough in the previous
posting.

The problem arises when reading a UNIX text file, where line feeds are
represented by the line feed character (one byte, '\n' or LF) only. In
DOS text files, the line feeds are represented by two characters ("\r\n",
carriage return and line feed).

An example:

If I have a file in UNIX text format, with line feed represented by a
single character, e.g.:

Line 1 in file\n
Line 2 in file\n
Line 3 in file

Using this code:

--------------

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream fstrm("filename.txt");   // opened in text mode (the default)
    std::ios::pos_type tellg_result(0);
    std::string str;

    // Save position in file before reading the line
    tellg_result = fstrm.tellg();
    std::getline(fstrm, str);
    std::cout << str << std::endl;
    // Save position again
    tellg_result = fstrm.tellg();
    std::getline(fstrm, str);
    std::cout << str << std::endl;
    return 0;
}

--------------

This code would output:
Line 1 in file
ine 2 in file

Without the calls to tellg(), the output would be correct, i.e. identical to
the file. Since the stream expects a line feed to consist of two characters,
tellg() actually moves the internal file pointer one byte when
encountering the UNIX-type single line feed character.

Usually, somewhere internally in the stream classes, the two-character
line-feed in DOS files is converted to the single line feed character '\n'
when writing and reading. I guess this is done for portability, and it
also suggests that it should be possible to enable/disable this feature.

I'm reading a big set of text files that is shared on the net among many
users, and it often occurs that the files are converted to and from UNIX
and DOS formats, some files ending up in UNIX format on my Windows system.
It seems very bothersome to have to write my own binary mode
read-functions, especially since I want my classes to be general-purpose,
accepting only an istream-reference, leaving to the client to open the
file. Without knowing if the istream is an ifstream or something else, it
is impossible to test whether it is opened in binary mode or text mode.
(Or is it?)

I hope this made more sense, and I appreciate feedback of any type.


-eivind
John Harrison
2004-09-03 06:40:17 UTC
Post by Eivind Grimsby Haarr
Without the calls to tellg(), the output would be correct, i.e. identical to
the file. Since the stream expects a line feed to consist of two characters,
tellg() actually moves the internal file pointer one byte when
encountering the UNIX-type single line feed character.
My compiler does not do that. It's smart enough to treat this case correctly.
However, you have a file without correct line endings, which you are trying
to read as if it did have correct line endings, so I think all bets are off
and you shouldn't be too surprised that things don't work. So I'm not sure
I'd call this a bug but I'd certainly call it a deficiency in your library.
Post by Eivind Grimsby Haarr
Usually, somewhere internally in the stream classes, the two-character
line-feed in DOS files is converted to the single line feed character '\n'
when writing and reading. I guess this is done for portability, and it
also suggests that it should be possible to enable/disable this feature.
That's correct (assuming that you are working on a DOS system of course).
And of course you disable it by opening the file in binary mode.
Post by Eivind Grimsby Haarr
I'm reading a big set of text files that is shared on the net among many
users, and it often occurs that the files are converted to and from UNIX
and DOS formats, some files ending up in UNIX format on my Windows system.
It seems very bothersome to have to write my own binary mode
read-functions, especially since I want my classes to be general-purpose,
accepting only an istream-reference, leaving to the client to open the
file. Without knowing if the istream is an ifstream or something else, it
is impossible to test whether it is opened in binary mode or text mode.
(Or is it?)
It is impossible in standard C++.

I think you are going to have to write your own version of a getline routine,
one that can cope with different line ending styles and/or files open in
binary or text mode. It also wouldn't hurt to document to your clients that
they should open files in binary mode. You might also need to use a
different compiler and/or C++ library; I don't like the way yours is
behaving.

john
Jack Klein
2004-09-04 04:50:45 UTC
Post by Eivind Grimsby Haarr
Post by John Harrison
Post by Eivind Grimsby Haarr
I'm reading a big set of text files that is shared on the net among many
users, and it often occurs that the files are converted to and from UNIX
and DOS formats, some files ending up in UNIX format on my Windows system.
It seems very bothersome to have to write my own binary mode
read-functions, especially since I want my classes to be general-purpose,
accepting only an istream-reference, leaving to the client to open the
file. Without knowing if the istream is an ifstream or something else, it
is impossible to test whether it is opened in binary mode or text mode.
(Or is it?)
Post by John Harrison
It is impossible in standard C++.
Nonsense.
Post by John Harrison
I think you are going to have to write your own version of a getline routine.
One that can cope with different line ending styles and/or files open in
binary or text mode. It also wouldn't hurt to document to your clients that
they should open files in binary mode. You might also need to use a
different compiler and/or C++ library, I don't like the way yours is
behaving.
1) those that end each line with CR/LF (standard DOS format)
2) those that end each line with LF (standard Unix format)
If he reads all files in binary mode, each will have an LF at the
end, which is the standard internal line terminator in C/C++
('\n'). Existing getline, etc. will work fine. The only issues I
1) Do any CRs at the end of lines matter, or can they just be carried
along? Worst case is you delete all CRs and hope that no text plays
overstrike games with embedded CRs.
2) Do you want to produce canonical (CR/LF terminated) output from
such arbitrary input? In that case CRs *do* matter and you have to
be sure to write new files in text mode.
No big deal.
I've had to deal with this quite a bit in communications routines in
the old days.

The simplest solution I found was to consider every '\r' as a newline.
Any '\n' immediately preceded by a '\r' is ignored, any '\n'
preceded by any other character is considered a newline.

Works quite well for '\r\n' (was CP/M in those days, MS-DOS wasn't
around yet), '\r' only (Apple and some others, the others mostly
defunct now), and Unix '\n' only. Even handled files produced by a
few perverse utilities on '\r\n' that would skip the '\r' on repeated
blank lines. That is:

line1
line2

line3

...would appear as:

"line1\r\nline2\r\n\nline3\n"

This would not correctly handle something that used '\n\r' to end
lines, but I knew of no such systems and never heard from any users
that ran into one.

In any case, this logic is quite simple to perform on files opened in
binary mode.
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
Owen Jacobson
2004-09-04 07:57:38 UTC
Post by Jack Klein
The simplest solution I found was to consider every '\r' as a newline. Any
'\n' immediately preceded by a '\r' is ignored, any '\n' preceded by any
other character is considered a newline.
Works quite well for '\r\n' (was CP/M in those days, MS-DOS wasn't around
yet), '\r' only (Apple and some others, the others mostly defunct now),
and Unix '\n' only. Even handled files produced by a few perverse
utilities on '\r\n' that would skip the '\r' on repeated blank lines.
That is:
line1
line2

line3
...would appear as:
"line1\r\nline2\r\n\nline3\n"
That's only perverse if you're not familiar with the origins of "carriage
return" versus "line feed". (It is perverse in the modern sense of "line
break" as a separator between lines, but that's newer than ASCII.)
--
Some say the Wired doesn't have political borders like the real world,
but there are far too many nonsense-spouting anarchists or idiots who
think that pranks are a revolution.