Why does the file command fail to recognize non-text files as such?
POSIX defines the following terms:
- Text file: A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.
- Line: A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
- Character: A sequence of one or more bytes representing a single graphic symbol or control code.
Consider then six files, each two bytes long, created with these printf commands (using octal escapes):
printf "\101\012" > file1 #A<newline>
printf "\010\012" > file2 #<backspace><newline>
printf "\101\101" > file3 #AA
printf "\200\012" > file4 #<0x80><newline>
printf "\200\200" > file5 #<0x80><0x80>
printf "\000\012" > file6 #<null><newline>
Now, in the UTF-8 encoding, octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic symbol A, 010 (0x08) is the backspace control character, and 200 (0x80) is a continuation byte that never occurs as the first byte of a multi-byte sequence, so it does not form a valid character.
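To rule out differences between printf implementations, the bytes that actually landed in each file can be dumped in octal. This is only a small sketch; dump_octal is a hypothetical helper name, not a standard utility, and it relies on nothing beyond the POSIX od options -An and -to1.

```shell
# dump_octal FILE...: hypothetical helper that prints each file name
# followed by the file's bytes in octal, one file per line.
dump_octal() {
    for f in "$@"; do
        printf '%s:' "$f"
        # -An suppresses the offset column, -to1 prints one octal value per byte.
        od -An -to1 "$f"
    done
}
```

Calling dump_octal file1 file2 file3 file4 file5 file6 should show the same octal values that were passed to printf above.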
Hence, I would regard files 1 and 2 as text files and the remaining four as non-text files: file 3 is not newline-terminated, files 4 and 5 contain an invalid character, and file 6 contains a NUL byte.
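My reading of the definitions can be written down as a shell check, which makes the expectation for each file explicit. This is a minimal sketch of that reading, not of how the file command works: is_posix_text is a hypothetical helper, and it assumes that iconv is available and that UTF-8 is the intended character encoding.

```shell
# is_posix_text FILE: hypothetical helper. Returns 0 if FILE satisfies the
# POSIX text-file definition as interpreted here, non-zero otherwise.
is_posix_text() {
    f=$1
    size=$(wc -c < "$f") || return 2

    # An empty file consists of zero lines, so it qualifies trivially.
    [ "$size" -eq 0 ] && return 0

    # No NUL bytes: deleting them must leave the byte count unchanged.
    [ "$(tr -d '\000' < "$f" | wc -c)" -eq "$size" ] || return 1

    # The last byte must be a newline (octal 012); otherwise the final
    # line is incomplete.
    [ "$(tail -c 1 "$f" | od -An -to1 | tr -d ' \n')" = "012" ] || return 1

    # Every byte sequence must decode as UTF-8 (assumption: UTF-8 locale).
    iconv -f UTF-8 -t UTF-8 < "$f" > /dev/null 2>&1 || return 1

    # No line may exceed LINE_MAX bytes, counting its newline.
    max=$(getconf LINE_MAX)
    LC_ALL=C awk -v max="$max" 'length($0) + 1 > max { exit 1 }' "$f"
}
```

Under this reading, is_posix_text succeeds for file1 and file2 and fails for file3 through file6.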
However, the file command does not seem to completely agree with me; it lists files 3, 4 and 5 as text files:
$ file --mime-type file*
file1: text/plain
file2: text/plain
file3: text/plain
file4: text/plain
file5: text/plain
file6: application/octet-stream
Why does the file command fail to identify files 3, 4 and 5 as non-text files, even though my locale is en_US.UTF-8? I'm assuming it can't possibly be a bug, so what have I misunderstood?