What unexpected things can happen if a user runs commands expecting a text file on input lacking a file-final newline?

It is often taught that in Unix/Linux text files should end with newline characters. The reason given (orally) to me by various sources was that "some commands (such as wc) assume or require a newline at the end of a text file". (Other commands I am aware of are sed, read, cron/crontab, and, in terms of its output, sort.) This led me to ask the following question:

Which commands in Unix/Linux assume that text files end in a newline?

It seems that the proper answer to that question is something like this:

all commands that require a "text file" according to the "INPUT FILES" section of their POSIX specification

This is because POSIX-compliant text files should end in a newline character '\n' (U+000A). (The rationale has been discussed at length on Stack Overflow and is not the subject of this question.)

That is, if I understand correctly, if one only works within Unix & Linux, one only rarely encounters files with textual data that don't end in a newline. Furthermore, the statement that "some commands assume or require a newline at the end of a text file" is misleading, because in fact all commands that manipulate text files have such a requirement, simply because having a file-final newline is a POSIX requirement for text files.

With all that said, I would hereby like to ask:

What are some unexpected things that can happen if a Unix/Linux user forgets to check textual input to commands for file-final newlines?

To elaborate on "unexpected", let <lines> denote a sequence of POSIX lines and <def-line> a defective line (defined as the result of subtracting the final newline character from a non-empty POSIX line). In what follows, def-line thus represents the trailing characters after the input's last newline.

Let us assume that the expected behavior is
f(<lines><def-line>) = f(<lines><def-line>\n) or
f(<lines><def-line>) ≅ f(<lines><def-line>\n).
Here, approximate equality (≅) covers the case of "the same" newline being missing from the output on the left-hand side. Unexpected behavior would then be
f(<lines><def-line>) ≠ f(<lines><def-line>\n) or
f(<lines><def-line>) ≇ f(<lines><def-line>\n).
A special case of this is
f(<lines><def-line>) = f(<lines>).
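
One way to probe a command against these definitions is to feed it the same input with and without the final newline and compare the two outputs. The following is only a minimal sketch: the helper names same_strict and same_approx are made up for illustration, and "approximately equal" is taken to mean "equal once trailing newlines are stripped", which is an assumption about what ≅ should cover.

```shell
# Both helpers take the input text (without its final newline) as $1,
# followed by the command to run, and succeed if the two outputs match.
# Note: plain POSIX sh has no local variables, so these set globals.

# Strict equality: f(<lines><def-line>) = f(<lines><def-line>\n).
# The trailing "x" keeps command substitution from stripping newlines.
same_strict() {
    input=$1; shift
    without=$(printf '%s' "$input" | "$@"; printf x)
    with=$(printf '%s\n' "$input" | "$@"; printf x)
    [ "$without" = "$with" ]
}

# Approximate equality: the outputs may differ only in trailing newlines,
# which plain command substitution removes.
same_approx() {
    input=$1; shift
    without=$(printf '%s' "$input" | "$@")
    with=$(printf '%s\n' "$input" | "$@")
    [ "$without" = "$with" ]
}
```

For example, same_approx "$(printf 'hello\nworld')" cat succeeds, while the same call with wc -l fails, because wc -l reports one line fewer for the unterminated input.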

The answer needn't be a complete list; a list of common "gotchas" might be enough to teach a new user to be careful to check for text-file-final newlines. Historical gotchas are welcome; I am aware that commands change over time (and also exist in different implementations).

Here are some realistic sources of files that are POSIX-compliant text files, except for the fact that they are missing a file-final newline:

text files saved with various text editors that don't enforce a file-final newline, such as Sublime Text
code saved with IDEs that don't enforce a file-final newline
files with textual data from governments, used for research purposes
files from other OSes that were transferred in a way that doesn't append missing file-final newlines

Note also that word processors in general don't enforce document-final newlines, so users who are new to handling text files in Unix/Linux might find it surprising that text-file-final newlines are a POSIX requirement. They probably haven't given this issue much thought and will (if questioned) likely guess that a text file is either

lines separated by newline characters or
what POSIX defines, except with the last line's final newline character being optional.

It is easy to point users to the standard and state that behavior on malformed input is undefined, but in reality:

tools don't behave randomly on malformed input (they certainly don't normally erase your hard drive),
most users learn Unix/Linux by example and by trial-and-error, not by reading standards documents, and
many tools make allowances for malformed input, because it occurs in the wild – perhaps tools should emit warnings more often, but that is another topic.

I firmly believe that people should be using commands with properly formatted input, and I am not endorsing usage of POSIX commands in a non-POSIX-conformant way. Perhaps the actual issue is that POSIX conventions aren't taught well. However, a list of pitfalls might help casual users of Unix/Linux learn about what could go (or already might have gone) wrong when input isn't properly formatted.

posix text-processing newlines

posted almost 2 years ago

CC BY-SA 4.0

Lover of Structure‭


3 comment threads

My question is verbose because it was first asked on *Stack Exchange: Unix & Linux*, where it was clo... (5 comments)

Agreed, but too much of a rant (3 comments)

Nothing bad happens, it's a made up problem (1 comment)

3 answers


I would consider your own tools as the central use case. Seriously, do you want to litter every while read (or your language's equivalent) with this pesky corner case handling?

while read -r line || [ -n "${line-}" ]
do
    : something with "$line"
done <"$input_file"
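
To see what the extra || [ -n "${line-}" ] test buys you, here is a small sketch that counts loop iterations with and without it. The function names are hypothetical, and the behavior relied on (read returning a non-zero status at end of input while still assigning the partial line) is what common shells such as bash and dash do.

```shell
# Count iterations of a plain read loop: an unterminated last line is skipped.
count_plain() {
    n=0
    while read -r line; do
        n=$((n + 1))
    done
    echo "$n"
}

# Same loop with the guard: the partial last line is processed too.
count_guarded() {
    n=0
    while read -r line || [ -n "${line-}" ]; do
        n=$((n + 1))
    done
    echo "$n"
}
```

With printf 'foo\nbar\nbaz' piped in, count_plain prints 2 and count_guarded prints 3; once the input ends in a newline, both print 3.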

Having a single well-defined behavior also reduces the need for speculation. With malformed files, what exactly should tail -n 1 return? Which line(s) should the $ address in sed (which matches on the final line in well-formed files) match on?

You might argue what you think the "obvious" answer to these questions should be, but since that's non-standard behavior, you can never rely on these constructs to work portably, and so you are needlessly restricting the usefulness of your scripts.

As noted elsewhere, cat and wc -l on malformed files are another two use cases which commonly fail; but the vast bulk of broken tools will probably be local or informal tools which were never tested for this particular corner case.

posted almost 2 years ago

CC BY-SA 4.0

tripleee‭


I tried to think of some for a while, but couldn't find any good ones. That said, there are plenty of programs I don't know. For reference:

$ echo -en "hello\nworld" | tee 1.txt | bat -A
───────┬─────────────────────────────────────────
       │ STDIN
       │ Size: -
───────┼─────────────────────────────────────────
   1   │ hello␊
   2   │ world
───────┴─────────────────────────────────────────

$ echo -e "hello\nworld" | tee 2.txt | bat -A
───────┬─────────────────────────────────────────
       │ STDIN
       │ Size: -
───────┼─────────────────────────────────────────
   1   │ hello␊
   2   │ world␊
───────┴─────────────────────────────────────────

wc counts newline characters, so wc 1.txt reports a line count of 1 rather than 2 (you've mentioned this already)
sort and uniq behave sensibly
Python's readlines() returns a list whose last item lacks a line ending, but this wouldn't normally break anything, since people usually call strip() on each line (or use str.splitlines(), which drops line endings by default)
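
The difference between counting newline characters and counting lines can be sketched as follows. count_newlines and count_lines are illustrative names, and the second one relies on awk treating an unterminated final record as a record, which common implementations such as gawk and mawk do but which should not be assumed everywhere.

```shell
# Number of newline characters on stdin (what wc -l reports).
count_newlines() {
    wc -l | tr -d ' '    # some wc implementations pad the number with spaces
}

# Number of records awk sees, including an unterminated last one.
count_lines() {
    awk 'END { print NR }'
}
```

For the 1.txt content above (printf 'hello\nworld'), count_newlines prints 1 and count_lines prints 2; for 2.txt both print 2.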

It's hard to think of an example which is not contrived.

From a programmer's perspective, when your program has some specific requirement for input, you would want to either detect it and fail, or automatically correct it. Both are quite easy for missing terminal newlines. Standards are nice, but when you have a few thousand users or more, you simply cannot assume that not a single one would flout them. I would be surprised if there are any real cases where this is a problem, but I'm curious to see what others will post.
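
Both options can be sketched in a few lines of shell. This is an illustration under the assumption that the input is otherwise ordinary text (no NUL bytes); the function names are invented for the example.

```shell
# Succeeds if FILE is non-empty and its last byte is not a newline.
# Command substitution strips a trailing newline, so the result is
# non-empty only when the last byte is something else.
lacks_final_newline() {
    [ -n "$(tail -c 1 "$1")" ]
}

# Option 1: detect and fail.
require_final_newline() {
    if lacks_final_newline "$1"; then
        echo "error: $1 does not end with a newline" >&2
        return 1
    fi
}

# Option 2: correct automatically by appending the missing newline.
fix_final_newline() {
    if lacks_final_newline "$1"; then
        printf '\n' >>"$1"
    fi
}
```

An empty file is left alone by both helpers, since it contains no incomplete line to repair.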

I do know one common case: It can mess up your shell prompt to print such files:

$ bash
$ cat 1.txt 
hello
world$

Of course, this is easily resolved by pressing Enter to get a new prompt. Note also how I had to open bash for this - my normal shell, fish, handles it:

$ cat 1.txt 
hello
world⏎
$

As does zsh:

$ cat 1.txt 
hello
world%
$

And of course, cat behaves a bit surprisingly (well, not really):

$ cat * | bat -A
───────┬─────────────────────────────────────────
       │ STDIN
       │ Size: -
───────┼─────────────────────────────────────────
   1   │ hello␊
   2   │ worldhello␊
   3   │ world␊
───────┴─────────────────────────────────────────
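
If the merged "worldhello" line is a problem, one possible workaround is to concatenate file by file and supply the missing newline where needed. This is a sketch, not a drop-in replacement for cat; cat_lines is a hypothetical name.

```shell
# Concatenate files, emitting a newline after any file that lacks one,
# so the last line of one file never merges with the first of the next.
cat_lines() {
    for f in "$@"; do
        cat "$f" || return
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n'
        fi
    done
}
```

Called as cat_lines 1.txt 2.txt, this produces four separate lines, each terminated by a newline.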

posted almost 2 years ago

CC BY-SA 4.0

matthewsnyder‭


1 comment thread

cat works the way it does because it treats the file as a stream of characters. It is only if you par... (1 comment)

I’ve had some tools drop the last incomplete line on some OSes.

For example:

$ printf 'foo\nbar\nbaz' | sed 's/x/y/'
foo
bar
$ _

I don’t recall which systems exactly these were, but I know I ran into it multiple times over the years, and POSIX allows this (I checked, it’s the application’s/user’s duty to provide the final newline).

posted about 1 year ago

CC0

mirabilos‭
