Q&A

Post History

71%
+3 −0
Q&A Why does the file command fail to recognize non-text files as such?

POSIX defines Text file as A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, inclu...

2 answers  ·  posted 3y ago by Quasímodo‭  ·  last activity 3y ago by LawrenceC‭

Question posix file
#3: Post edited by Quasímodo‭ · 2021-05-27T19:57:48Z (almost 3 years ago)
Octal 200 cannot be the first byte of a character
  • [POSIX defines][1]
  • - Text file as
  • > A file that contains **characters** organized into zero or more lines. The
  • lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
  • length, including the \<newline\> character.
  • - Line as
  • > A sequence of zero or more non- \<newline\> characters plus a terminating
  • \<newline\> character.
  • - Character as
  • > A sequence of one or more bytes representing a single graphic symbol or
  • control code.
  • Consider then six files, each with two bytes,
  • created with these `printf` commands (using octal escapes):
  • printf "\101\012" > file1 #A<newline>
  • printf "\010\012" > file2 #<backspace><newline>
  • printf "\101\101" > file3 #AA
  • printf "\200\012" > file4
  • printf "\200\200" > file5
  • printf "\000\012" > file6 #<null><newline>
  • Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
  • symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is a continuation byte that never occurs as the first byte of a multi-byte sequence, so it does not form a valid character.
  • Hence, I would regard files 1 and 2 as text files, but the remaining ones as non-text files, because file 3 is not newline-terminated, files 4 and 5 contain an invalid character and file 6 contains a null byte.
  • However, the `file` command does not seem to completely agree
  • with me; it lists files 3, 4 and 5 as text files,
  • $ file --mime-type file*
  • file1: text/plain
  • file2: text/plain
  • file3: text/plain
  • file4: text/plain
  • file5: text/plain
  • file6: application/octet-stream
  • Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
  • (I'm assuming it can't possibly be a bug) even though I use `en_US.UTF-8` as my locale, or else what did I incorrectly
  • understand?
  • [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
  • [5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
#2: Post edited by Quasímodo‭ · 2021-05-27T19:46:51Z (almost 3 years ago)
Fix link that pointed to an irrelevant section; address Moshi's comment explaining why I think files 3 to 6 are non-text files.
  • [POSIX defines][1]
  • - Text file as
  • > A file that contains **characters** organized into zero or more lines. The
  • lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
  • length, including the \<newline\> character.
  • - Line as
  • > A sequence of zero or more non- \<newline\> characters plus a terminating
  • \<newline\> character.
  • - Character as
  • > A sequence of one or more bytes representing a single graphic symbol or
  • control code.
  • Consider then six files, each with two bytes,
  • created with these `printf` commands (using octal escapes):
  • printf "\101\012" > file1 #A<newline>
  • printf "\010\012" > file2 #<backspace><newline>
  • printf "\101\101" > file3 #AA
  • printf "\200\012" > file4
  • printf "\200\200" > file5
  • printf "\000\012" > file6 #<null><newline>
  • Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
  • symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is a continuation byte.
  • Hence, I would regard files 1 and 2 as text files, but the remaining as non-text files, because files 3, 4 and 5 are not newline terminated and file 6 contains a null byte.
  • However, the `file` command does not seem to completely agree
  • with me; it lists files 3, 4 and 5 as text files,
  • $ file --mime-type file*
  • file1: text/plain
  • file2: text/plain
  • file3: text/plain
  • file4: text/plain
  • file5: text/plain
  • file6: application/octet-stream
  • Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
  • (I'm assuming it can't possibly be a bug), or else what did I incorrectly
  • understand?
  • [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
  • [5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
#1: Initial revision by Quasímodo‭ · 2021-05-27T18:34:13Z (almost 3 years ago)
Why does the file command fail to recognize non-text files as such?
[POSIX defines][1]

- Text file as

  > A file that contains **characters** organized into zero or more lines. The
  lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
  length, including the \<newline\> character.

- Line as

  > A sequence of zero or more non- \<newline\> characters plus a terminating
  \<newline\> character.

- Character as

  > A sequence of one or more bytes representing a single graphic symbol or
  control code.

Consider then six files, each with two bytes,
created with these `printf` commands (using octal escapes):

    printf "\101\012" > file1
    printf "\010\012" > file2
    printf "\101\101" > file3
    printf "\200\012" > file4
    printf "\200\200" > file5
    printf "\000\012" > file6

Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is not a valid
character. 

Hence, I would regard files 1 and 2 as text files and the remaining ones as
non-text files. However, the `file` command does not completely agree
with me; it lists files 3, 4 and 5 as text files:

    $ file --mime-type file*
    file1: text/plain
    file2: text/plain
    file3: text/plain
    file4: text/plain
    file5: text/plain
    file6: application/octet-stream

Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
(I'm assuming it can't possibly be a bug), or else what did I incorrectly
understand?

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_92
[5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
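
For reference, the classification argued above can be checked mechanically. The following is only a minimal sketch and says nothing about how `file` itself works. The helper name `is_posix_text` and its messages are hypothetical. It assumes UTF-8 is the encoding in use and that `iconv` and `mktemp -d` are available. It tests three conditions from the quoted definitions (no NUL bytes, only valid characters, a terminating newline) and does not check the {LINE_MAX} limit.

```shell
#!/bin/sh
# Hypothetical helper: report whether FILE satisfies (part of) the POSIX
# "text file" definition, assuming UTF-8. The {LINE_MAX} limit is NOT checked.
is_posix_text() {
    f=$1

    # An empty file has zero lines, which the definition allows.
    [ -s "$f" ] || { echo "text"; return 0; }

    # NUL bytes: compare the byte count with and without them.
    total=$(($(wc -c < "$f")))
    without_nul=$(($(tr -d '\000' < "$f" | wc -c)))
    if [ "$total" -ne "$without_nul" ]; then
        echo "non-text: contains NUL"
        return 1
    fi

    # Every byte sequence must decode as a character (assumed: UTF-8).
    if ! iconv -f UTF-8 -t UTF-8 < "$f" > /dev/null 2>&1; then
        echo "non-text: invalid UTF-8"
        return 1
    fi

    # The last line must end with <newline> (octal 012).
    last=$(tail -c 1 "$f" | od -An -to1 | tr -d ' \n')
    if [ "$last" != "012" ]; then
        echo "non-text: no trailing newline"
        return 1
    fi

    echo "text"
}

# Recreate the six two-byte files from the question in a scratch directory.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf "\101\012" > file1
printf "\010\012" > file2
printf "\101\101" > file3
printf "\200\012" > file4
printf "\200\200" > file5
printf "\000\012" > file6

for f in file1 file2 file3 file4 file5 file6; do
    printf '%s: %s\n' "$f" "$(is_posix_text "$f")"
done
```

Under these assumptions, only `file1` and `file2` should be reported as text. The checks run in a fixed order, so a file that fails more than one condition (such as `file5`) is reported only for the first one it fails.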