Bomstrip is a very simple tool that removes BOMs
(byte order marks) from utf-8 files. Actually, it is a set
of tools that all do the same thing, but - for added
entertainment value - in multiple programming languages
(python, c, java, brainfuck, ook!, perl (twice), sed, postscript,
pascal, unlambda, limbo, haskell, ocaml, php, ruby, c++,
forth, awk). You want to always have this tool within
arm's reach, no matter where you are and which
compilers/interpreters you keep close to you.
Each tool reads from stdin and writes to stdout. It accepts
no options or arguments. It never writes into files directly.
All files are public domain. Bomstrip exists for the purpose of
noting how stupid BOMs in utf-8 files are.
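For the curious, the whole job fits in a few lines. What follows is a minimal sketch in python of the behaviour described above, not the python implementation shipped in the tarball. The names `bomstrip` and `UTF8_BOM` are made up for this illustration. It takes two binary streams, so for real stdin-to-stdout use you would pass `sys.stdin.buffer` and `sys.stdout.buffer`.

```python
import io
import shutil

UTF8_BOM = b"\xef\xbb\xbf"  # the three bytes of the utf-8 BOM


def bomstrip(src, dst):
    """Copy binary stream src to dst, dropping a utf-8 BOM at the very start."""
    head = src.read(len(UTF8_BOM))
    if head != UTF8_BOM:
        # No BOM (or fewer than three bytes of input): pass the bytes through.
        dst.write(head)
    # Everything after the first three bytes is copied untouched.
    shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    # Demo on in-memory streams, so nothing is read from the real stdin here.
    demo_out = io.BytesIO()
    bomstrip(io.BytesIO(UTF8_BOM + b"hello\n"), demo_out)
    print(demo_out.getvalue() == b"hello\n")  # True
```

Working on raw bytes rather than decoded text means the rest of the file is never reinterpreted, whatever it contains.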
Oh, in case you didn't know yet: utf-8 does not have
byte-ordering issues, so there is absolutely no need to
have three bytes (the utf-8-BOM) that do not say anything
about the byte-order (since there is nothing to say).
Wow, you are impatient! But you're lucky! You can have
it! It's free! Get the latest version now: bomstrip-9.tgz. YEAH.
The utf-8 BOM can be found at the start of some files. It
consists of three bytes: EF BB BF. This is the utf-8
encoding of the Unicode character U+FEFF.
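If you do not want to take my word for those three bytes, here is a small python check. It only uses the standard library: `codecs.BOM_UTF8` and the `utf-8-sig` codec are python's own names, and the variable names are just for illustration.

```python
import codecs

# Encoding U+FEFF as utf-8 yields exactly the three BOM bytes.
bom = "\ufeff".encode("utf-8")
print(bom == b"\xef\xbb\xbf")   # True
print(bom == codecs.BOM_UTF8)   # True

# Plain utf-8 decoding keeps the BOM as a leading U+FEFF character,
# while the utf-8-sig codec silently drops it.
data = codecs.BOM_UTF8 + b"hello"
print(data.decode("utf-8") == "\ufeffhello")  # True
print(data.decode("utf-8-sig") == "hello")    # True
```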
- It indicates the file is utf-8 encoded, e.g. for tools like
file(1). This is metadata though, and should not be part
of the contents. There is no similar marking for other
character encodings. And, after all, this is more of a
side effect: the byte order mark exists to mark byte
order.
- It breaks shell scripts (files will not start with #!
but with a BOM). (Note: some Linux distributions
seem to run files without a valid hashbang through the shell,
which seems to be able to ignore that first line. Running
non-shell programs through the shell usually makes for some nice
though undesired special effects.)
- It breaks all kinds of text processing.
- It takes up three whole bytes!
- It looks ugly in your editor. Unless your editor thinks it
should be smart and decides to hide the BOM from you.
- The utf-8 BOM is illegal in ASCII-encoded files. It
breaks compatibility with ASCII.
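Two of the complaints in the list above (the broken hashbang and the broken ASCII compatibility) are easy to check for yourself. A minimal python sketch, where the sample script and the helper names `starts_with_hashbang` and `is_ascii` are invented for this illustration:

```python
BOM = b"\xef\xbb\xbf"

plain_script = b"#!/bin/sh\necho hello\n"
bom_script = BOM + plain_script


def starts_with_hashbang(data: bytes) -> bool:
    """True if the first two bytes are '#!', which is where a hashbang must sit."""
    return data[:2] == b"#!"


def is_ascii(data: bytes) -> bool:
    """True if every byte is valid ASCII (below 0x80)."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(starts_with_hashbang(plain_script))  # True
print(starts_with_hashbang(bom_script))    # False: the file starts with EF BB BF
print(is_ascii(plain_script))              # True
print(is_ascii(bom_script))                # False: BOM bytes are outside ASCII
```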
Honestly, I don't really know. This is one of those
mysteries that might never get solved. Oh, there is one
lead: it seems to be generated mostly (exclusively?) by
Windows systems. Really, who would have thought?
Of course you want to help in the noble quest of removing
all utf-8 BOMs around. WE NEED YOUR HELP! Write bomstrip in
your favorite language and send it to me at
<mechiel@ueber.net> for inclusion in the next version.
We still need implementations in the following languages:
c#, whitespace, prolog, shakespeare, lisp, erlang, lua,
tcl, visual basic and so many more.
I do not guarantee that this program strips BOMs. I do
not guarantee that this program does anything at all. If
this program does or does not do something to you or your files
that you do not or do want, I cannot be held responsible.
Okay, that feels much safer.
- 24-06-2008 - bomstrip-9.tgz
- Fresh implementations by Peter Pentchev! Bomstrip-9 now comes
with another perl implementation, a one-liner. And a c++
implementation, and a forth implementation, and an awk
implementation (well, a crippled one, since it does not run on
the one true awk). Peter Pentchev also contributed some
improvements to the c & python implementations. And changed the
test script to make testing easier. Many thanks!
- 11-02-2008 - bomstrip-8.tgz
- After some time of inactivity, a new version thanks to Andrew
Gerrand! Many thanks indeed for his version of bomstrip
in PHP! As an added bonus, I've thrown in a little ruby
implementation. Keep them coming!
- 18-09-2005 - bomstrip-7.tgz
- Second release today! Just created an ocaml implementation. Enjoy!
- 18-09-2005 - bomstrip-6.tgz
- Wow, we're on a roll. Today brings implementations in limbo
(nice language) and in haskell. Both by yours truly.
- 17-09-2005 - bomstrip-5.tgz
- Now with implementation in unlambda by Matthijs Bomhoff.
Thanks a lot! This is getting more impressive each release.
But remember, we are not there yet. More!
- 10-09-2005 - bomstrip-4.tgz
- New implementations in Postscript and Pascal. Thanks
to Berteun Damman. Great! Keep them coming!
- 07-09-2005 - bomstrip-3.tgz
- New release. The previous java version has been replaced
by one that is more java-style (not the C rewrite it was
in version 2). Thanks go to Ruben Smelik for java-ifying!
- 06-09-2005 - bomstrip-2.tgz
- Second release. Now with implementation in sed (thanks
Andreas Gohr), java (by me), brainfuck (thanks Berteun
Damman; run it with interpreters bff or nbfc or another
interpreter that reads -1 at EOF), perl (thanks Matthijs
Bomhoff) and ook! (thanks Berteun Damman). Enjoy the ride.
- 06-07-2005 - bomstrip-1.tgz
- First beta-pre-alpha release! Unfortunately, I'm too
lazy to make a sourceforge account, CVS repository, mailing
lists, issue trackers, freshmeat announcements, precompiled
binaries for all linux distributions, packages for debian,
gentoo, *bsd and all the others. In short, this project
is not yet as cool as it could and should be.
sha1(bomstrip-9.tgz): 70c8b03df90e66c745fe9b5b5ff6790a0ecd32a1
md5(bomstrip-9.tgz): 93184de71a25831fa03ec49f0bca3e34
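If you want to compare your download against the checksums above, something like the following python sketch would do. It assumes bomstrip-9.tgz sits in the current directory. The helper name `hexdigest_of_file` and the `EXPECTED` table are made up for this illustration, and only the standard `hashlib` module is used.

```python
import hashlib
import os

# Checksums as published above for bomstrip-9.tgz.
EXPECTED = {
    "sha1": "70c8b03df90e66c745fe9b5b5ff6790a0ecd32a1",
    "md5": "93184de71a25831fa03ec49f0bca3e34",
}


def hexdigest_of_file(path, algorithm):
    """Return the hex digest of the file at path, read in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    path = "bomstrip-9.tgz"  # assumption: the tarball is in the current directory
    if os.path.exists(path):
        for algorithm, expected in EXPECTED.items():
            ok = hexdigest_of_file(path, algorithm) == expected
            print(algorithm, "ok" if ok else "MISMATCH")
    else:
        print(path, "not found")
```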