Comments on Why does the file command fail to recognize non-text files as such?

Post

Why does the file command fail to recognize non-text files as such?

+3
−0

POSIX defines

  • Text file as

    A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.

  • Line as

    A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

  • Character as

    A sequence of one or more bytes representing a single graphic symbol or control code.

Consider then six files, each with two bytes, created with these printf commands (using octal escapes):

printf "\101\012" > file1 # A<newline>
printf "\010\012" > file2 # <backspace><newline>
printf "\101\101" > file3 # AA
printf "\200\012" > file4 # <0x80><newline>
printf "\200\200" > file5 # <0x80><0x80>
printf "\000\012" > file6 # <NUL><newline>

Now, in the UTF-8 encoding, the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic symbol A, 010 (0x08) is the backspace control character and 200 (0x80) is a continuation byte that never occurs as the first byte of a multi-byte sequence, so it does not form a valid character.
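
The claim about 0x80 can be checked with a minimal sketch like the one below. It assumes an iconv that exits non-zero on malformed input (glibc's does); valid_utf8 is a hypothetical helper name, not part of any standard tool.

```shell
# valid_utf8: hypothetical helper; succeeds when stdin decodes cleanly as UTF-8.
# Assumption: iconv exits non-zero on malformed input.
valid_utf8() {
  iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# A lone continuation byte has no lead byte in front of it.
if printf '\200' | valid_utf8; then
  echo 'lone 0x80: decodes'
else
  echo 'lone 0x80: rejected'
fi

# The same byte after the lead byte 0xC2 encodes U+0080.
if printf '\302\200' | valid_utf8; then
  echo '0xC2 0x80: decodes'
else
  echo '0xC2 0x80: rejected'
fi
```

The second case shows why 0x80 cannot be called invalid in general: it is only invalid where a character is expected to start.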

Hence, I would regard files 1 and 2 as text files, but the rest as non-text files: file 3 is not newline terminated, files 4 and 5 contain an invalid character, and file 6 contains a NUL byte.
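
That classification can be written down as a rough checker. This is only a sketch of one strict reading of the definitions quoted above, not how any real tool decides: is_posix_text is a hypothetical helper name, UTF-8 is assumed as the character encoding, and iconv is assumed to exit non-zero on malformed input.

```shell
# is_posix_text: hypothetical helper; returns 0 when the file passes a strict
# reading of the POSIX definitions. Assumptions: UTF-8 encoding, and an iconv
# that exits non-zero on malformed input.
is_posix_text() {
  f=$1
  # An empty file has zero lines, so it qualifies.
  [ -s "$f" ] || return 0
  # No NUL bytes: deleting them must not change the byte count.
  [ "$(wc -c < "$f")" -eq "$(tr -d '\000' < "$f" | wc -c)" ] || return 1
  # The last byte must be a newline (octal 012).
  [ "$(tail -c 1 "$f" | od -An -to1 | tr -d ' \n')" = "012" ] || return 1
  # Every byte sequence must decode as a character.
  iconv -f UTF-8 -t UTF-8 < "$f" > /dev/null 2>&1 || return 1
  # No line, counting its newline, may exceed LINE_MAX bytes.
  LC_ALL=C awk -v max="$(getconf LINE_MAX)" \
    'length($0) + 1 > max { bad = 1 } END { exit bad + 0 }' "$f"
}

# Recreate the six files in a scratch directory and classify them.
dir=$(mktemp -d)
printf '\101\012' > "$dir/file1"
printf '\010\012' > "$dir/file2"
printf '\101\101' > "$dir/file3"
printf '\200\012' > "$dir/file4"
printf '\200\200' > "$dir/file5"
printf '\000\012' > "$dir/file6"

for n in 1 2 3 4 5 6; do
  if is_posix_text "$dir/file$n"; then
    echo "file$n: text"
  else
    echo "file$n: not text"
  fi
done
```

Under those assumptions, only files 1 and 2 should pass. Files 3, 5 and 6 fail the newline or NUL checks, and file 4 fails only at the decoding step.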

However, the file command does not seem to agree with me completely; it lists files 3, 4 and 5 as text files:

$ file --mime-type file*
file1: text/plain
file2: text/plain
file3: text/plain
file4: text/plain
file5: text/plain
file6: application/octet-stream
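
To see which guess lies behind each text/plain verdict, it may help to ask file for its plain-language description and its encoding guess as well. A minimal sketch, assuming a file implementation that offers --mime-encoding next to the --mime-type option used above. The scratch directory only keeps the example self-contained, and no particular output is claimed here.

```shell
# Recreate the three disputed files in a scratch directory.
dir=$(mktemp -d)
printf '\101\101' > "$dir/file3"
printf '\200\012' > "$dir/file4"
printf '\200\200' > "$dir/file5"

if command -v file > /dev/null 2>&1; then
  for f in "$dir"/file3 "$dir"/file4 "$dir"/file5; do
    file "$f"                   # plain-language description
    file --mime-encoding "$f"   # charset guess reported by file
  done
else
  echo 'file is not installed' >&2
fi
```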

Why does the file command fail to identify files 3, 4 and 5 as non-text files (I'm assuming it can't possibly be a bug), even though I use en_US.UTF-8 as my locale? Or else, what have I misunderstood?

1 comment thread

General comments (6 comments)
Moshi‭ wrote over 3 years ago · edited over 3 years ago

Why don't you consider 3, 4, and 5 as text files? 3 fits the definitions given. I'm not quite sure about 4 and 5, but my first guess would be that they just didn't put that much error checking into it (0x80 is a valid continuation byte, so it can appear in valid text files).

Quasímodo‭ wrote over 3 years ago · edited over 3 years ago

@Moshi True, I said 0x80 was straightforwardly invalid, but it is not. Still, it cannot be the first byte of a valid character. It necessarily follows that files 4 and 5 either are not newline terminated or contain an invalid character. File 3 is also not newline terminated (even in ASCII encoding).

Moshi‭ wrote over 3 years ago

They don't have to be newline terminated. A newline termination defines a line, yes, but a text file can have zero lines.

celtschk‭ wrote over 3 years ago

Note that file is not a POSIX utility; the question it answers is not “does this conform to POSIX's idea of what is a text file” but “is this file likely to contain human-readable text”.

Quasímodo‭ wrote over 3 years ago

@Moshi But then any kind of file would be a text-file, since you could say it contained zero lines. Even a file with a NUL would be a text-file. Instead, I interpret that if the file contains non-lines, then it is not a text-file. In that sense, an empty text file would be the only case for which "zero lines" apply.

Quasímodo‭ wrote over 3 years ago

@celtschk Well, file is POSIX specified, so I would suppose it conformed to the POSIX idea of what a text file is.