From christos at deshaw.com Sat Jun 22 18:19:54 1996
From: christos at deshaw.com (Christos Zoulas)
Date: Sat Mar 5 00:36:54 2005
Subject: File 3.20 is now available
Message-ID:

This version fixes/adds:

- support relative offsets of the form >&
- fix bug with truncating magic strings that contain \n
- file -f - did not read from stdin as documented
- support elf file parsing using our own elf support.
- as always magdir fixes and additions.

Look on ftp.deshaw.com:/pub/file/file-3.20.tar.gz

christos

From christos at deshaw.com Sat Oct 5 14:19:36 1996
From: christos at deshaw.com (Christos Zoulas)
Date: Sat Mar 5 00:36:54 2005
Subject: file 3.21 is now available
Message-ID:

In ftp.deshaw.com:/pub/file/file-3.21.tar.gz.

* Revision 1.21 1996/10/05 18:15:29 christos
* Segregate elf stuff and conditionally enable it with -DBUILTIN_ELF
* More magic fixes

christos

From quinlan at proton.pathname.com Sun Oct 20 23:43:00 1996
From: quinlan at proton.pathname.com (Daniel Quinlan)
Date: Sat Mar 5 00:36:54 2005
Subject: file(1) specification in POSIX 2
Message-ID:

A new version of the POSIX 2 specification, including file(1), is under development. The last draft I saw had some problems and I began writing a critique, but didn't go anywhere with it because of time constraints.

I just mailed hlj@posix.com, the editor of POSIX 2, to ask for an updated version of the draft (of the file(1) section). I'll forward that here when I get it. If I can manage to locate the last draft I had (draft 11), I'll send that here too.

Anyway, I might as well include the critique as it stands. I plan on finishing it up soon and would appreciate any comments.

------- start of cut text --------------

Comments concerning POSIX 1003.2B, draft 11, section 5.14:

1. byte-order
-------------

The current draft has no provision for differentiating between magic numbers originating from big-endian (MSB first or Motorola-order) and little-endian (LSB first or Intel-order) machines.
With the exception of string matches, this places the validity of any match under serious question. It also makes the porting of magic files between big-endian and little-endian architectures impossible. (Note: byte matches are even worse; Darwin file only uses them for subtests.)

Darwin file was extended in 1993 to provide byte-order handling. This was accomplished through the addition of several new magic types: "beshort", "leshort", "belong", and "lelong" (where the "be" and "le" prefixes refer to big-endian and little-endian, respectively). There has been a lot of work to update the Darwin file collection of magic entries to byte-order specific forms, where possible.

2. Strings
----------

According to the current draft, any non-ASCII characters included in string magic must be written with octal escapes. Providing a mechanism to support hexadecimal escape sequences would be beneficial to writers of magic files. A common coding practice is to specify magic in a hexadecimal format; manual conversion from hexadecimal numbers to "hexadecimal strings" is very simple (especially on big-endian machines). Additionally, this is a very simple extension to support in file.

Darwin file supports the use of \x?? or \X?? to specify an escaped character, where '??' is the value of the character in hexadecimal notation.

Without this extension, SGI had to use the following magic, which has the potential to match much more than IRIX vmcore dumps, especially since "belong" isn't being used.

# New style crash dump file
0       long    0x43727368
>4      long    0x44756d70      IRIX vmcore dump of
>36     addr    x               '%s'

Darwin file was able to use:

# New style crash dump file
0       string  \x43\x72\x73\x68\x44\x75\x6d\x70        IRIX vmcore dump of
>36     string  >\0                                     '%s'

Translating from the "long long" 0x4372736844756d70 to the octal string notation is an awkward and time-consuming task.

3. Multiple levels of subtests
------------------------------

Multiple levels of subtests should be supported in magic files.
This is an essential facility for the accurate recognition of files. The Solaris 5.4 magic(4) manual page lists the lack of multiple levels of subtests as a bug.

+----
| BUGS
|      There should be more than one level of subtests, with the
|      level indicated by the number of `>' at the beginning of the
|      line.
+----

Darwin file supports multiple test levels, in the manner proposed by Sun's magic(4) manual page. For example, the following section of magic file, intended for ELF objects, becomes impossible without multiple test levels. Please also note the use of byte-order specific magic.

[insert ELF magic here]

4. Required strings
-------------------

Each required string is completely in lower case, but several of the words are almost always written with some (or all) upper case characters. The left column of the table even lists several items in upper case: "FIFO", "C program text", and "FORTRAN program text". POSIX should allow either upper or lower case output or perhaps it should specify a sensible usage of capital letters.

5. Types
--------

Section 5.14.7, subsection "type", contains very serious problems. The number of bytes for any particular type must not be implementation defined. Leaving it implementation defined would make magic files completely non-portable. Files should be identifiable on any system.

This section of the specification also drastically breaks historical implementations, such as Darwin file, Sun's file, and others.

------- end ----------------------------

From quinlan at proton.pathname.com Tue Oct 22 02:04:00 1996
From: quinlan at proton.pathname.com (Daniel Quinlan)
Date: Sat Mar 5 00:36:54 2005
Subject: current draft of POSIX 1003.2B file
Message-ID:

POSIX 1003.2B is a set of changes that are being made to the base 1003.2 document that was published in 1992. Here I have taken those changes and applied them to the original standard.

There are two areas that readers of this list may want to focus on:

1. Incompatibilities between "our" file and this specification.
2. Problems in the standard itself.

It looks quite similar to the SVR4 version of file. It looks even more similar to the Sun version.

I would like to collect any comments and relay them to the POSIX standards group. Since we can't expect another revision of POSIX.2 for a while, we should really push POSIX to get it right this time.

The first RATIONALE is a "rationale" of the changes. The second is the RATIONALE for the file specification. I can detect no changes since the draft I received a year ago.

==========================================================================

BEGIN_RATIONALE

Rationale: The changes in this clause, except for those related to symbolic links, satisfy the following requirement from ISO/IEC 9945-2:1993 Annex H.1:

   (12) The file utility should allow user-specified algorithms for file type recognition, similar to those used in the historical /etc/magic file.

END_RATIONALE

==========================================================================

5.14 file - Determine file type

5.14.1 Synopsis

file [-dhi] [-M file] [-m file] file ...

5.14.2 Description

The file utility shall perform a series of tests on each specified file in an attempt to classify it.

   (1) If the file is not a regular file, its file type shall be identified. The file types directory, FIFO, block special, and character special shall be identified as such. Other implementation-defined file types may also be identified.

   (2) If the file is a regular file, and

       (a) The file is zero-length, it shall be identified as an empty file.

       (b) The file is not zero-length, file shall examine an initial segment of the file and shall make a guess at identifying its contents or whether it is an executable binary file. (The answer is not guaranteed to be correct.)
If file does not exist, cannot be read, or its file status could not be determined, the output shall indicate that the file was processed, but that its type could not be determined.

If file is a symbolic link, by default the link shall be resolved and file shall test the type of file referenced by the symbolic link.

5.14.3 Options

The file utility shall conform to the utility argument syntax guidelines described in 2.10.2.

The following options shall be supported by the implementation:

   -d        Apply any default system tests to the file.

   -h        When a symbolic link is encountered, identify the file as a symbolic link. If -h is not specified and file is a symbolic link that refers to a nonexistent file, file shall identify the file as a symbolic link, as if -h had been specified.

   -i        If a file is a regular file, do not attempt to classify the type of the file further, but identify the file as specified in 5.14.6.1, using a string that contains the string regular file.

   -M file   Specify the name of a file containing tests that shall be applied to a file in order to classify it (see 5.14.7). No default system tests shall be applied.

   -m file   Specify the name of a file containing tests that shall be applied to a file in order to classify it (see 5.14.7).

If multiple instances of the -m, -d, or -M options are specified, the concatenation of the tests specified, in the order specified, shall be the set of tests that are applied. If a -M option is specified, no tests other than those specified using the -d, -M, and -m options shall be applied to the file. If neither the -d nor -M options are specified, any default system tests shall be applied after any tests specified using the -m option.

5.14.4 Operands

The following operand shall be supported by the implementation:

   file      A pathname of a file to be tested.

5.14.5 External Influences

5.14.5.1 Standard Input

None.

5.14.5.2 Input Files

The file can be any file type.
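
The ordering rule for -d, -M, and -m in 5.14.3 is easy to misread, so here is a minimal Python sketch of one possible reading. It is not part of the draft: the function name ordered_test_sources, the (flag, argument) tuple representation, and the DEFAULT placeholder standing in for "any default system tests" are all assumptions made for illustration.

```python
# Hypothetical placeholder for "any default system tests"; not a name from the draft.
DEFAULT = "<default system tests>"


def ordered_test_sources(options):
    """Return the test sources in the order they would be applied.

    `options` is a list of (flag, argument) pairs in command-line order,
    for example [("-m", "local.magic"), ("-d", None)].
    """
    sources = []
    saw_d = False
    saw_big_m = False
    for flag, arg in options:
        if flag == "-d":
            # -d inserts the default system tests at this position.
            saw_d = True
            sources.append(DEFAULT)
        elif flag == "-M":
            # -M adds a magic file and suppresses the implicit defaults.
            saw_big_m = True
            sources.append(arg)
        elif flag == "-m":
            sources.append(arg)
    if not saw_d and not saw_big_m:
        # Neither -d nor -M: defaults run after any -m tests.
        sources.append(DEFAULT)
    return sources
```

Under this reading, "-m a" yields the tests from a followed by the defaults, "-d -m a" yields the defaults followed by a, and "-M a -m b" yields only a and then b.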
5.14.5.3 Environment Variables

The following environment variables shall affect the execution of file:

   LANG          This variable shall determine the locale to use for the locale categories when both LC_ALL and the corresponding environment variable (beginning with LC_) do not specify a locale. See 2.6.

   LC_ALL        This variable shall determine the locale to be used to override any values for locale categories specified by the settings of LANG or any environment variables beginning with LC_.

   LC_CTYPE      This variable shall determine the interpretation of sequences of bytes of text data as characters (e.g., single- versus multibyte characters in arguments and input files).

   LC_MESSAGES   This variable shall determine the language in which messages should be written.

5.14.5.4 Asynchronous Events

Default.

5.14.6 External Effects

5.14.6.1 Standard Output

In the POSIX Locale, the following format shall be used to identify each file operand specified:

      "%s: %s\n", <file>, <type>

The values for <type> are unspecified, except that in the POSIX Locale, if file is identified as one of the types listed in Table 5-1, <type> shall contain (but is not limited to) the corresponding string. Each space shown in the strings shall be exactly one <space> character.

                     Table 5-1 - file Output Strings
 _________________________________________________________________________
| If file is a                        | <type> shall contain the string  |
|_____________________________________|__________________________________|
| Directory                           | directory                        |
| FIFO                                | fifo                             |
| Block special                       | block special                    |
| Character special                   | character special                |
| Symbolic link                       | symbolic link to                 |
| Executable binary                   | executable                       |
| Empty regular file                  | empty                            |
| ar archive library (see 6.1)        | archive                          |
| Extended cpio format (see Section   | cpio archive                     |
|   10.1.2 of POSIX.1 {8})            |                                  |
| Extended tar format (see Section    | tar archive                      |
|   10.1.1 of POSIX.1 {8})            |                                  |
| Shell script                        | commands text                    |
| C-language source                   | c program text                   |
| FORTRAN source                      | fortran program text             |
| Other text file                     | text                             |
|_____________________________________|__________________________________|

If file is identified as a symbolic link (see -h), the following alternative output format shall be used:

      "%s: %s %s\n", <file>, <type>, <contents of link>

If the file named by the file operand does not exist or cannot be read, the string cannot open shall be included as part of the <type> field, but this shall not be considered an error that affects the exit status. If the type of the file named by the file operand cannot be determined, the string unknown type shall be included as part of the <type> field, but this shall not be considered an error that affects the exit status.

5.14.6.2 Standard Error

Used only for diagnostic messages.

5.14.6.3 Output Files

None.

5.14.7 Extended Description

A file specified as an option-argument to the -m or -M options shall contain one test per line, which shall be applied to the file. If the test succeeds, the message field of the line shall be printed and no further tests shall be applied, with the exception that tests on immediately following lines beginning with a single > character shall be applied.
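
The control flow in the paragraph above (stop at the first successful top-level test, then run only the > lines that immediately follow it) can be sketched in Python. This is an illustration under stated assumptions rather than the draft's algorithm: the function name classify and the (is_subtest, predicate, message) tuple form are hypothetical, and each test is reduced to a ready-made predicate so that only the flow is shown.

```python
def classify(tests, data):
    """Apply magic-style tests to `data` and return the matched messages.

    `tests` is a list of (is_subtest, predicate, message) tuples in file
    order; is_subtest is True for lines whose offset begins with '>'.
    """
    messages = []
    i = 0
    while i < len(tests):
        is_subtest, predicate, message = tests[i]
        if not is_subtest and predicate(data):
            messages.append(message)
            i += 1
            # Only the immediately following '>' lines are applied.
            while i < len(tests) and tests[i][0]:
                if tests[i][1](data):
                    messages.append(tests[i][2])
                i += 1
            return messages  # no further top-level tests are tried
        # Failed top-level test (or a subtest of one): skip it.
        i += 1
    return messages


# Example loosely modelled on the "compressed data" entry (\037\235 is 0x1f 0x9d).
example_tests = [
    (False, lambda d: d[:2] == b"\x1f\x1e", "packed data"),
    (True, lambda d: True, "never reached for this input"),
    (False, lambda d: d[:2] == b"\x1f\x9d", "compressed data"),
    (True, lambda d: len(d) > 2 and (d[2] & 0x80) > 0, "block compressed"),
    (False, lambda d: True, "data"),
]
print(classify(example_tests, b"\x1f\x9d\x90"))
```

The subtest under "packed data" is skipped because its parent test failed, and the final catch-all entry is never consulted once "compressed data" has matched.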

Each line shall be composed of the following four <blank>-separated fields:

   offset    An unsigned number (optionally preceded by a single > character) specifying the offset, in bytes, of the value in the file that is to be compared against the value field of the line. If the file is shorter than the specified offset, the test shall fail.

             If the offset begins with the character >, the test contained in the line shall not be applied to the file unless the test on the last line for which the offset did not begin with a > was successful. By default, the offset shall be interpreted as an unsigned decimal number. With a leading 0x or 0X, the offset shall be interpreted as a hexadecimal number; otherwise, with a leading 0, the offset shall be interpreted as an octal number.

   type      The type of the value in the file to be tested. The type shall consist of the type specification characters c, d, f, s, and u, specifying character, signed decimal, floating point, string, and unsigned decimal, respectively.

             The type string shall be interpreted as the bytes from the file starting at the specified offset and including the same number of bytes specified by the value field. If insufficient bytes remain in the file past the offset to match the value field, the test shall fail.

             The type specification characters d, f, and u can be followed by an optional unsigned decimal integer that specifies the number of bytes represented by the type. The type specification character f can be followed by an optional F, D, or L, indicating that the value is of type float, double, or long double, respectively. The type specification characters d and u can be followed by an optional C, S, I, or L, indicating that the value is of type char, short, int, or long, respectively.

             The default number of bytes represented by the type specifiers d, f, and u shall correspond to their respective C-language types as follows.
             If the system claims conformance to the C-Language Development Utilities Option, those specifiers shall correspond to the default sizes used in the c89 utility. Otherwise, the default sizes shall be implementation defined.

             For the type specifier characters d and u, the default number of bytes shall correspond to the size of the basic integral data type of the implementation. For these specifier characters, the implementation shall support values of the optional number of bytes to be converted corresponding to the number of bytes in the C-language types char, short, int, or long. These numbers can also be specified by an application as the characters C, S, I, and L, respectively. The byte order used when interpreting numeric values is implementation defined, but shall correspond to the order in which a constant of the corresponding type is stored in memory on the system.

             For the type specifier f, the default number of bytes shall correspond to the number of bytes in the basic double precision floating-point data type of the underlying implementation. The implementation shall support values of the optional number of bytes to be converted corresponding to the number of bytes in the C-language types float, double, and long double. These numbers can also be specified by an application as the characters F, D, and L, respectively.

             All type specifiers, except for s, can be followed by a mask specifier of the form &number. The mask value shall be ANDed with the value before the comparison with the value from the file is made. By default, the mask shall be interpreted as an unsigned decimal number. With a leading 0x or 0X, the mask shall be interpreted as an unsigned hexadecimal number; otherwise, with a leading 0, the mask shall be interpreted as an unsigned octal number.

             The strings byte, short, long, and string shall also be supported as type fields, being interpreted as dC, dS, dL, and s, respectively.

   value     The value to be compared with the value from the file.
             Any value that contains a character that is not a digit, other than a leading sign (+ or -) or a leading 0x or 0X, shall be interpreted as a string. The test shall succeed only when a string value exactly matches the bytes from the file.

             If the value is a string, it can contain the following sequences:

             \character   The backslash-escape sequences in Table 2-16 (see 2.12). The results of using any other character, other than an octal digit, following the backslash are unspecified.

             \octal       Octal sequences that can be used to represent characters with specific coded values. An octal sequence shall consist of a backslash followed by the longest sequence of one, two, or three octal-digit characters (01234567). If the size of a byte on the system is greater than 9 b, the valid escape sequence used to represent a byte is implementation defined.

             By default, any value that is not a string shall be interpreted as a signed decimal number. Any such value, with a leading 0x or 0X, shall be interpreted as an unsigned hexadecimal number; otherwise, with a leading zero, the value shall be interpreted as an unsigned octal number.

             If the value is not a string, it can be preceded by a character indicating the comparison to be performed. Permissible characters and the comparisons they specify are as follows:

             =   The test shall succeed if the value from the file equals the value field.

             <   The test shall succeed if the value from the file is less than the value field.

             >   The test shall succeed if the value from the file is greater than the value field.

             &   The test shall succeed if all of the bits in the value field are set in the value from the file.

             ^   The test shall succeed if at least one of the bits in the value field is not set in the value from the file.

             x   The test shall succeed if there is any value in the file.

   message   The message to be printed if the test succeeds. The message shall be interpreted using the notation for the printf formatting specification; see 4.50.7.
             If the value field was a string, then the value from the file shall be the b argument for the printf formatting specification; otherwise, the value from the file shall be the argument.

5.14.8 Exit Status

The file utility shall exit with one of the following values:

    0   Successful completion.

   >0   An error occurred.

5.14.9 Consequences of Errors

Default.

==========================================================================

Editor's Note: The rationale in E.5.14 (IEEE Std 1003.2-1992 pages 987-88, lines 9703-49) will be replaced by the following:

BEGIN_RATIONALE

file Rationale. (This subclause is not a part of P1003.2)

Historical systems have used a ``magic file'' named /etc/magic to help identify file types. Because it is generally useful for users and scripts to be able to identify special file types, the -m flag and a portable format for user-created magic files has been specified. No requirement is made that an implementation of file use this method of identifying files, only that users be permitted to add their own classifying tests.

In addition, three options have been added to historical practice. The -d flag has been added to permit users to cause their tests to follow any default system tests. The -i flag has been added to permit users to test portably for regular files in shell scripts. The -M flag has been added to permit users to ignore any default system tests.

The historical -c option was omitted as not particularly useful to users or portable shell scripts. In addition, a reasonable implementation of the file utility would report any errors found each time the magic file is read.

The historical format of the magic file was the same as that specified by the rationale in the previous version of this standard for the offset, value, and message fields; however, it used less precise type fields than the format specified by the current normative text. The new type field values are a superset of the historical ones.

The following is an example magic file:

0       short           070707                  cpio archive
0       short           0143561                 byte-swapped cpio archive
0       string          070707                  ASCII cpio archive
0       long            0177555                 very old archive
0       short           0177545                 old archive
0       short           017437                  old packed data
0       string          \037\036                packed data
0       string          \377\037                compacted data
0       string          \037\235                compressed data
>2      byte&0x80       >0                      block compressed
>2      byte&0x1f       x                       %d bits
0       string          \032\001                Compiled Terminfo Entry
0       short           0433                    Curses screen image
0       short           0434                    Curses screen image
0       string          <ar>                    System V Release 1 archive
0       string          !<arch>\n__.SYMDEF      archive random library
0       string          !<arch>                 archive
0       string          ARF_BEGARF              PHIGS clear text archive
0       long            0x137A2950              scalable OpenFont binary
0       long            0x137A2951              encrypted scalable OpenFont binary

END_RATIONALE

From newt at pobox.com Tue Oct 22 10:54:51 1996
From: newt at pobox.com (Greg Roelofs)
Date: Sat Mar 5 00:36:54 2005
Subject: current draft of POSIX 1003.2B file
Message-ID:

>I would like to collect any comments and relay them to the POSIX
>standards group. Since we can't expect another revision of POSIX.2
>for a while, we should really push POSIX to get it right this time.

My comments are pretty much restricted to the magic file, so this message goes back out to both lists.

Aside from the older comments Dan collected and posted yesterday, the only missing item I noticed was a comment on the identification of "byte" with "signed character"; this seems counterintuitive to me (i.e., it should be unsigned). Not a big deal, though.

In terms of updating Dan's old comments, it might also be useful to offer a shorthand version of the endian prefixes (they should probably be suffixes now):

    dS -> dSb
    dL -> dLb
    d4 -> d4l
    etc.

Also note the part about an optional integer to indicate the length of the type; this renders Dan's old comment about non-portability at least partially moot (although it will require retrofitting such support both to Darwin file and to its magic data, and it would be *nice* if the older types could be identified with a fixed size to avoid breaking portability with most older implementations--that won't help Crays or Alphas, though). The new POSIX draft doesn't specify whether the integer suffix can be used with the other suffixes (e.g., C, S, I, L are redundant, and note the possibilities/complications for dealing with wide characters; also, how does the integer value affect the interpretation of the floating-point suffixes?). That should be clarified. (Did it say anything about floating-point numbers being in IEEE format? I didn't notice.)

That's all that came to mind offhand. The endianness issue is the critical one, of course.

Greg
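
The byte-order concern running through this thread can be made concrete with a few lines of Python. The sketch below reads the first four bytes of the IRIX crash-dump magic quoted earlier ("CrshDump") as a 32-bit value in each byte order. The helper name read_u32 and the constant name DUMP_MAGIC are made up for this illustration, and nothing here is claimed to reflect how any file implementation actually reads values.

```python
# First eight bytes of the IRIX vmcore magic from the critique: "CrshDump".
DUMP_MAGIC = b"\x43\x72\x73\x68\x44\x75\x6d\x70"


def read_u32(data, offset, byteorder):
    """Interpret 4 bytes at `offset` as an unsigned integer.

    `byteorder` is "big" or "little", as accepted by int.from_bytes.
    """
    return int.from_bytes(data[offset:offset + 4], byteorder)


# A "long" test against 0x43727368 matches only under big-endian reading.
print(hex(read_u32(DUMP_MAGIC, 0, "big")))     # 0x43727368
print(hex(read_u32(DUMP_MAGIC, 0, "little")))  # 0x68737243
```

With an implementation-defined byte order, the same "0 long 0x43727368" line therefore matches on one class of machine and silently fails on the other, which is the portability problem the belong/lelong style of type is meant to remove.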