Line endings and lazy programming.

Admin

I wrote this a while ago, in my blog, but thought it might be of interest to people here. It's something I've had to bear in mind with the existing client.

Note that I'm not advocating allowing sloppy line endings on your output: only that you should deal with sloppy line endings on your input.

Sure, the RFC relating to the server side requires that we have a certain kind of line ending. But we can't know what kind of bizarre proxies the data's been through in the meantime, and we can't be sure that the server's always going to be compliant.

====

Line endings are hell. So are delete/backspace/erase, the | (solid pipe) vs ¦ (broken pipe) character, tabs, file endings, keyboard layouts, currency characters and other localisation settings, and many other things. But they are grumbles for another time. Today, I rant about line endings.

To begin, let's talk about "carriage return" (also known as CR and \r) and "line feed" (aka \n and LF and "new line" and NL).

Those of you who have used a manual typewriter will remember the lever you used to whack on the side that would push the carriage (the paper roller) along to the right, so the hammers would hit at the left of the carriage, the beginning of the line. That's what a "carriage return" (CR) character theoretically does - moves the insert point to the beginning of the current line.

Once the carriage was at the far right, pressing the lever further would then ratchet the roller, and feed the paper on it up by one line: a "line feed" (LF).

More sophisticated typewriters let you do "double spacing" by flicking a lever which would ratchet it twice as far. Simpler ones, you had to press the lever twice. That would be a CR-LF-LF sequence. More on that later.

To indicate the end of a line, Windows/DOS uses CR-LF, which seems obvious and sensible, except that it generally copes badly with LF-CR.

*nix uses LF only, because it's a bit silly having two characters when one will do.

The Mac (classic Mac OS, at least; Mac OS X went with LF) uses CR. Because different is good, mm-kay?

LF-CR hasn't, to my knowledge, been the standard on any machine since the Teletype.

However, a scary number of utils get it wrong/do it differently, so dealing with all four possible combinations is sensible/safest.

The problem's described well here:
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/x86-files.html

"The problem is that not all non UNIX text files end their line with the carriage return / line feed sequence. Some use carriage returns without line feeds. Others combine several blank lines into a single carriage return followed by several line feeds. And so on."

What the author describes there is a sensible and obvious compression, IF you're interpreting CR and LF from the literal typewriter/terminal/teletype point of view: a line feed is where you move down one line in a straight line, and a carriage return is where you move the carriage to the beginning of the current line.

You only need to return the carriage when there was something on the previous line to displace the carriage from the start of the line. Blank lines don't. So as you saw above, doublespacing can be reduced to CR-LF-LF.

In such a literal editor, both CR-LF and LF-CR would seem equally logical.
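To make the "literal" reading concrete, here's a rough Python sketch of such a typewriter-minded renderer. The function name render_typewriter is made up for illustration; it isn't from any real editor. CR only resets the column, LF only moves down a row, and nothing else is clever.

```python
def render_typewriter(text):
    """Render text literally: CR resets the column, LF moves down one row."""
    rows = [[]]          # each row is a list of characters
    row = col = 0
    for ch in text:
        if ch == "\r":
            col = 0                      # carriage return: back to column 0
        elif ch == "\n":
            row += 1                     # line feed: down one row, same column
            if row == len(rows):
                rows.append([])
        else:
            line = rows[row]
            while len(line) <= col:      # pad if the carriage was never returned
                line.append(" ")
            line[col] = ch               # a real typewriter would overstrike here
            col += 1
    return ["".join(r) for r in rows]


print(render_typewriter("ab\r\ncd"))     # CR-LF
print(render_typewriter("ab\n\rcd"))     # LF-CR, same result
print(render_typewriter("ab\r\n\ncd"))   # CR-LF-LF, one blank line in between
print(render_typewriter("ab\ncd"))       # bare LF, "cd" ends up indented
```

Under these assumptions CR-LF and LF-CR come out identical, CR-LF-LF gives the double spacing, and a bare LF produces the familiar staircase effect.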

Also, other Windows apps, including some of Microsoft's own apps, use CR or LF on their own, to indicate a 'soft linebreak' where the line was wrapped, rather than being forced by the user, as a 'hard linebreak'.

So, what, then?

So applications should be coded to deal with line endings in any format that the user throws at them, including mixing them together in the file. This is particularly important when parsing source code, config files, and so on, which could have been edited in various editors, had bits cut-n-pasted in from anywhere, and so on.

Priority should be given to the double-character ones: CRLF and LFCR should both be interpreted as a single end-of-line indicator. Then \r and \n on their own.
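If you'd rather spell that priority out by hand, a minimal Python sketch might look like this. The name split_lines is my own invention, and dropping a trailing empty line is just one possible choice, not the only right one.

```python
def split_lines(text):
    """Split on CRLF, LFCR, CR or LF, trying the two-character endings first."""
    lines = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # CR then LF, or LF then CR, counts as ONE end-of-line.
            if nxt and nxt in "\r\n" and nxt != ch:
                i += 2
            else:
                i += 1                   # a lone CR or a lone LF
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
            i += 1
    if current:                          # text after the last line ending
        lines.append("".join(current))
    return lines


print(split_lines("a\r\nb\n\rc\rd\ne"))  # all four endings mixed in one string
print(split_lines("a\r\n\nb"))           # CR-LF-LF keeps its blank line
```

The point of checking the pair first is that CR-LF never gets counted as two line endings, while CR-LF-LF still does.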

In Perl-style regexps, this is doable as:

(\r\n?)|(\n\r?)

This works because Perl's matching finds the leftmost match, and then makes that match as long as it can by greedily filling in the optional items.
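For what it's worth, here's how that pattern might be used from Python, whose re module follows the same Perl-style rules. This is only a sketch: the names EOL, split_any_eol and normalise_eol are mine, and I've dropped the capturing parentheses because re.split() would otherwise hand the captured groups back in its result.

```python
import re

# Same idea as the pattern above, without the capturing groups.
EOL = re.compile(r"\r\n?|\n\r?")


def split_any_eol(text):
    """Split on any of CRLF, LFCR, CR or LF."""
    return EOL.split(text)


def normalise_eol(text, eol="\n"):
    """Accept sloppy endings on input, emit one consistent ending on output."""
    return EOL.sub(eol, text)


print(EOL.findall("x\r\ny\n\rz\r\r"))          # pairs are matched whole, then lone CRs
print(split_any_eol("a\r\nb\n\rc\rd\ne"))
print(repr(normalise_eol("a\r\n\nb", "\r\n")))  # tidy CRLF out, blank line kept
```

That last function is the "sloppy in, strict out" idea from the top of this post: read whatever turns up, write only the ending the spec asks for.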

So, let there be less lazy coding out there! Play nice!

====

OK, building on that, as a side-rant, HTML also has <P>, </P>, <BR>.

<P> was like the cr-lf-lf, and <BR> like cr-lf (in most browsers anyway).

Well, back in the olden days, we just used <P> for a paragraph break, and <BR> for a line break. They were terminators, stuff you put at the end.

But then style sheets came in, and we had to start using <P></P> tags to enclose a paragraph, rather than just terminate it.

And now XHTML has come along, so now they need to be all lowercase <p></p> and the <br> needs to be CLOSED! Like: <br />.

So it's even worse than the situation above. In HTML we can't guess what strange weirdness they'll come up with next, we can't plan to avoid it, we can only code reactively to fix the stuff when they change it, and in the meantime try to comply as well as possible with the standards of today.

--Yet another geek.
