file/type list/libmagic overhaul
Christos Zoulas
christos at zoulas.com
Tue Aug 19 12:01:19 EEST 2008
On Aug 17, 1:57pm, filemaillist at adaptivetime.com (Gravis) wrote:
-- Subject: file/type list/libmagic overhaul
| in short:
| i'm proposing a massive update to the file type file as well as changing
| its format to something faster.
|
| my story:
| in my programming endeavor i'm dealing with files with no names so
| naturally i turned to using a library to do magic numbers/patterns to
| identify the types. however, my instinct to use something in the LSB
| was sorely met by a rustic and out of date list of file types when i
| found it can't even detect PNG images. my alternatives were to use a
| desktop specific lib that had a MUCH more complete list of types or
| write my own, neither of which are appealing. however, i did notice
| that both desktop specific and libmagic are SLOW specifically because of
| how they are organized.
|
| my proposal:
| 1. an online form in which people can submit types to add which would
| require occasional updates to the list of course (every 3 months?).
That is a good idea, although you'll need a lot of manual fixes because
people usually give weak magic descriptions that match ~everything. I
would also suggest that people submit sample files so that we can write
unit-tests.
| 2. change the file type list to a binary format (like some of the
| alternatives to libmagic)
libmagic already uses binary files. The magic.mgc file is pre-parsed at
compile time and mmapped at runtime by libmagic.
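
A minimal sketch of the "compile once, mmap at runtime" idea described
above. The record layout, the file name toy.mgc, and the helper names
compile_rules and identify are all invented for illustration; this is not
the real magic.mgc format or the libmagic API.

```python
import mmap
import os
import struct
import tempfile

# Toy fixed-size record: offset (uint32), pattern length (uint8),
# pattern (16 bytes, zero padded), description (32 bytes, zero padded).
# Assumption: this layout is made up; it is NOT the magic.mgc format.
RECORD = struct.Struct("<IB16s32s")


def compile_rules(rules, path):
    """Write (offset, pattern, description) rules as packed binary records."""
    with open(path, "wb") as out:
        for offset, pattern, desc in rules:
            out.write(RECORD.pack(offset, len(pattern), pattern, desc.encode()))


def identify(data, path):
    """Map the compiled file read-only and return the first matching description."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for pos in range(0, len(m), RECORD.size):
            offset, n, pattern, desc = RECORD.unpack_from(m, pos)
            if data[offset:offset + n] == pattern[:n]:
                return desc.rstrip(b"\0").decode()
    return "data"


rules = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (0, b"GIF8", "GIF image data"),
]
path = os.path.join(tempfile.mkdtemp(), "toy.mgc")
compile_rules(rules, path)
print(identify(b"\x89PNG\r\n\x1a\n" + b"\0" * 8, path))
```

The text rules are parsed only in compile_rules; identify never parses text,
it only walks packed records in the mapped file.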
| 3. make the file type list more like a specialized database (sorted by
| popularity) for SIGNIFICANTLY faster type lookup times.
This does not work because a popular entry may prematurely match a file
that belongs to a more specialized entry and give you the wrong result.
Magic is sorted by strength. Finally, much of the performance loss is caused
by magic's regex, whitespace, and case-insensitive string tests, not by
load time.
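
A toy illustration of the ordering problem, assuming a first-match-wins
lookup. The Rule type, the popularity numbers, and the use of pattern length
as a stand-in for strength are assumptions made for this sketch; libmagic's
actual strength computation is not reproduced here.

```python
from collections import namedtuple

# Hypothetical rule record; "popularity" is an invented ranking.
Rule = namedtuple("Rule", "pattern description popularity")

RULES = [
    Rule(b"<?xml", "XML document", popularity=90),
    Rule(b'<?xml version="1.0"?>\n<svg', "SVG image", popularity=10),
]


def strength(rule):
    # Assumption: pattern length as a crude proxy for specificity.
    return len(rule.pattern)


def first_match(data, rules):
    """Return the description of the first rule whose pattern prefixes data."""
    for rule in rules:
        if data.startswith(rule.pattern):
            return rule.description
    return "data"


sample = b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"/>'
by_popularity = sorted(RULES, key=lambda r: r.popularity, reverse=True)
by_strength = sorted(RULES, key=strength, reverse=True)

print(first_match(sample, by_popularity))  # XML document
print(first_match(sample, by_strength))    # SVG image
```

With the popular generic rule first, the SVG file is reported as plain XML;
with the more specific rule first, it is identified correctly.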
| 4. try to get desktop environments to use libmagic (which they haven't
| due to its inadequacy)
Many of them do.
| other implementations:
| this is how ROX, Gnome and soon(already?) KDE store file types:
| http://www.termalkristaly.hu/doc/shared-mime-info/shared-mime-info-spec.html/x34.html#AEN214
| more info at:
| http://www.termalkristaly.hu/doc/shared-mime-info/shared-mime-info-spec.html
|
| because this is in the LSB i think that it should include every file
| type under the sun including commercial products, game files and even
| ROM formats for emulators. if this is just too much, an official (RFC
| documented formats) list and an unofficial DB could be used though i
| really hope you would consider having all formats in one.
This is a subset of the functionality that the magic format provides now.
For example, this does not provide indirect or relative offsets.
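
To show what an indirect offset buys you, here is a rough sketch based on
the usual DOS/PE case: the location of the "PE\0\0" signature is not fixed
but is stored as a little-endian 32-bit value at offset 0x3c. The helper
names read_indirect and looks_like_pe are hypothetical, and the sample is
synthetic, not a real executable.

```python
import struct


def read_indirect(data, pointer_offset):
    """Return the little-endian 32-bit value stored at pointer_offset, or None."""
    if pointer_offset + 4 > len(data):
        return None
    return struct.unpack_from("<I", data, pointer_offset)[0]


def looks_like_pe(data):
    """Check "MZ" at offset 0, then "PE\\0\\0" at the offset stored at 0x3c."""
    if not data.startswith(b"MZ"):
        return False
    target = read_indirect(data, 0x3C)
    return target is not None and data[target:target + 4] == b"PE\0\0"


# Synthetic 0x80-byte header whose pointer at 0x3c says "look at 0x80".
header = bytearray(0x80)
header[0:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x80)
sample = bytes(header) + b"PE\0\0"

print(looks_like_pe(sample))  # True
```

A rule list limited to fixed offsets cannot express this test, because the
position of the second signature differs from file to file.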
| i'm willing to do all the programming myself but i really want to get it
| into the LSB so that once and for all the file type lookup can be
| unified for all applications.
Well, file and the magic format specification have a POSIX definition. Most
commercial and non-commercial OSes use this implementation of file, and I
doubt that they would appreciate a change in the magic format.
I appreciate that you want to work on improving file, but please do some
more research and come up with concrete ways of improving it without
breaking compatibility.
christos