Post History
POSIX defines Text file as A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, inclu...
#3: Post edited
- [POSIX defines][1]
- - Text file as
- > A file that contains **characters** organized into zero or more lines. The
- lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
- length, including the \<newline\> character.
- - Line as
- > A sequence of zero or more non- \<newline\> characters plus a terminating
- \<newline\> character.
- - Character as
- > A sequence of one or more bytes representing a single graphic symbol or
- control code.
- Consider then six files, each with two bytes,
- created with these Printf commands (using octals):
- printf "\101\012" > file1 #A<newline>
- printf "\010\012" > file2 #<backspace><newline>
- printf "\101\101" > file3 #AA
- printf "\200\012" > file4
- printf "\200\200" > file5
- printf "\000\012" > file6 #<null><newline>
- Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
- symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is a continuation byte.
- Hence, I would regard files 1 and 2 as text files, but the remaining as non-text files, because files 3, 4 and 5 are not newline terminated and file 6 contains a null byte.
- However, the `file` command does not seem to completely agree
- with me; it lists files 3, 4 and 5 as text files,
- $ file --mime-type file*
- file1: text/plain
- file2: text/plain
- file3: text/plain
- file4: text/plain
- file5: text/plain
- file6: application/octet-stream
- Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
- (I'm assuming it can't possibly be a bug), or else what did I incorrectly
- understand?
- [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
- [5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
- [POSIX defines][1]
- - Text file as
- > A file that contains **characters** organized into zero or more lines. The
- lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
- length, including the \<newline\> character.
- - Line as
- > A sequence of zero or more non- \<newline\> characters plus a terminating
- \<newline\> character.
- - Character as
- > A sequence of one or more bytes representing a single graphic symbol or
- control code.
- Consider then six files, each with two bytes,
- created with these Printf commands (using octals):
- printf "\101\012" > file1 #A<newline>
- printf "\010\012" > file2 #<backspace><newline>
- printf "\101\101" > file3 #AA
- printf "\200\012" > file4
- printf "\200\200" > file5
- printf "\000\012" > file6 #<null><newline>
- Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
- symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is a continuation byte that never occurs as the first byte of a multi-byte sequence, so it does not form a valid character.
- Hence, I would regard files 1 and 2 as text files, but the remaining as non-text files, because file 3 is not newline terminated, files 4 and 5 have an invalid character and file 6 contains a null byte.
- However, the `file` command does not seem to completely agree
- with me; it lists files 3, 4 and 5 as text files,
- $ file --mime-type file*
- file1: text/plain
- file2: text/plain
- file3: text/plain
- file4: text/plain
- file5: text/plain
- file6: application/octet-stream
- Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
- (I'm assuming it can't possibly be a bug) even though I use `en_US.UTF-8` as my locale, or else what did I incorrectly
- understand?
- [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
- [5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
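The classification argued for in the question can be checked mechanically. The following is only a minimal sketch, not a definitive test: `is_posix_text` is a hypothetical helper name, the file is assumed to be meant as UTF-8, `iconv` is assumed to exit with a non-zero status on malformed input, `mktemp -d` is assumed to be available, and the {LINE_MAX} limit is deliberately not checked.

```shell
#!/bin/sh
# Hypothetical helper: approximate the POSIX "text file" definition for a
# file interpreted as UTF-8. The {LINE_MAX} limit is NOT checked here.
is_posix_text() {
    f=$1
    # An empty file consists of zero lines, so it qualifies.
    [ -s "$f" ] || return 0
    # The last byte must be a <newline> (octal 012).
    last=$(tail -c 1 "$f" | od -An -to1 | tr -d ' \n')
    [ "$last" = "012" ] || return 1
    # No NUL bytes: deleting them must leave the content unchanged.
    tr -d '\000' < "$f" | cmp -s - "$f" || return 1
    # Every byte sequence must decode as UTF-8 (assumes iconv fails otherwise).
    iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1 || return 1
    return 0
}

# Recreate the six two-byte files from the question in a scratch directory.
dir=$(mktemp -d)
printf "\101\012" > "$dir/file1"
printf "\010\012" > "$dir/file2"
printf "\101\101" > "$dir/file3"
printf "\200\012" > "$dir/file4"
printf "\200\200" > "$dir/file5"
printf "\000\012" > "$dir/file6"

for n in 1 2 3 4 5 6; do
    if is_posix_text "$dir/file$n"; then
        echo "file$n: text"
    else
        echo "file$n: not text"
    fi
done
rm -rf "$dir"
```

Under these assumptions the helper should accept only file1 and file2, which corresponds to the reading of the POSIX definitions given in the question rather than to the output of `file`.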
#2: Post edited
- [POSIX defines][1]
- - Text file as
- > A file that contains **characters** organized into zero or more lines. The
- lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
- length, including the \<newline\> character.
- - Line as
- > A sequence of zero or more non- \<newline\> characters plus a terminating
- \<newline\> character.
- - Character as
- > A sequence of one or more bytes representing a single graphic symbol or
- control code.
- Consider then six files, each with two bytes,
- created with these Printf commands (using octals):
- printf "\101\012" > file1
- printf "\010\012" > file2
- printf "\101\101" > file3
- printf "\200\012" > file4
- printf "\200\200" > file5
- printf "\000\012" > file6
- Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
- symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is not a valid
- character.
- Hence, I would regard files 1 and 2 as text files, while the remaining ones as
- non-text files. However, the `file` command does not seem to completely agree
- with me; it lists files 3, 4 and 5 as text files,
- $ file --mime-type file*
- file1: text/plain
- file2: text/plain
- file3: text/plain
- file4: text/plain
- file5: text/plain
- file6: application/octet-stream
- Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
- (I'm assuming it can't possibly be a bug), or else what did I incorrectly
- understand?
- [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_92
- [5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
- [POSIX defines][1]
- - Text file as
- > A file that contains **characters** organized into zero or more lines. The
- lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
- length, including the \<newline\> character.
- - Line as
- > A sequence of zero or more non- \<newline\> characters plus a terminating
- \<newline\> character.
- - Character as
- > A sequence of one or more bytes representing a single graphic symbol or
- control code.
- Consider then six files, each with two bytes,
- created with these Printf commands (using octals):
- printf "\101\012" > file1 #A<newline>
- printf "\010\012" > file2 #<backspace><newline>
- printf "\101\101" > file3 #AA
- printf "\200\012" > file4
- printf "\200\200" > file5
- printf "\000\012" > file6 #<null><newline>
- Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
- symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is a continuation byte.
- Hence, I would regard files 1 and 2 as text files, but the remaining as non-text files, because files 3, 4 and 5 are not newline terminated and file 6 contains a null byte.
- However, the `file` command does not seem to completely agree
- with me; it lists files 3, 4 and 5 as text files,
- $ file --mime-type file*
- file1: text/plain
- file2: text/plain
- file3: text/plain
- file4: text/plain
- file5: text/plain
- file6: application/octet-stream
- Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
- (I'm assuming it can't possibly be a bug), or else what did I incorrectly
- understand?
- [1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
- [5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
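The byte roles named in this revision (newline, `A`, backspace, continuation byte) follow from the UTF-8 bit patterns. A minimal sketch, where `utf8_byte_class` is a hypothetical helper name and the argument is assumed to be an octal number as used in the `printf` commands:

```shell
#!/bin/sh
# Hypothetical helper: classify a single byte, given in octal, by the role
# it can play in a UTF-8 byte stream.
utf8_byte_class() {
    b=$(( 0$1 ))    # the leading 0 makes the shell read the value as octal
    if [ "$b" -lt 128 ]; then
        echo ascii            # 0xxxxxxx: a complete one-byte character
    elif [ "$b" -lt 192 ]; then
        echo continuation     # 10xxxxxx: may only follow a lead byte
    elif [ "$b" -ge 194 ] && [ "$b" -le 244 ]; then
        echo lead             # 0xC2..0xF4: starts a multi-byte sequence
    else
        echo invalid          # 0xC0, 0xC1, 0xF5..0xFF never occur in UTF-8
    fi
}

for o in 012 101 010 200; do
    printf '%s: %s\n' "$o" "$(utf8_byte_class "$o")"
done
```

The loop reports 012, 101 and 010 as `ascii` and 200 as `continuation`, so a file that starts with the byte 200 cannot begin with a complete UTF-8 character.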
#1: Initial revision
Why does the file command fail to recognize non-text files as such?
[POSIX defines][1]
- Text file as
> A file that contains **characters** organized into zero or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
length, including the \<newline\> character.
- Line as
> A sequence of zero or more non- \<newline\> characters plus a terminating
\<newline\> character.
- Character as
> A sequence of one or more bytes representing a single graphic symbol or
control code.
Consider then six files, each with two bytes,
created with these Printf commands (using octals):
printf "\101\012" > file1
printf "\010\012" > file2
printf "\101\101" > file3
printf "\200\012" > file4
printf "\200\200" > file5
printf "\000\012" > file6
Now, in the [UTF-8 encoding][5], the octal 012 (0x0A) is the newline character, 101 (0x41) is the graphic
symbol `A`, 010 (0x08) is the backspace control character and 200 (0x80) is not a valid
character.
Hence, I would regard files 1 and 2 as text files, while the remaining ones as
non-text files. However, the `file` command does not seem to completely agree
with me; it lists files 3, 4 and 5 as text files,
$ file --mime-type file*
file1: text/plain
file2: text/plain
file3: text/plain
file4: text/plain
file5: text/plain
file6: application/octet-stream
Why does the `file` command fail to identify files 3, 4 and 5 as non-text files
(I'm assuming it can't possibly be a bug), or else what did I incorrectly
understand?
[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_92
[5]: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout