Mar 18 2009
I love character encoding!
Tonight’s goal: Make a simple PHP class.
- Input: a URL pointing to an HTML document.
- Output: a UTF-8 version, regardless of what encoding it’s really in.
Sounds easy, right?
Nope. Because some pages specify encoding via HTTP header, some specify via meta tag, some specify both but they disagree, and some don’t specify at all. Sometimes, the encoding is specified with an unusual variant of its name (e.g. X-GBK, MS939). And often, the specified encoding is wrong.
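To make those cases concrete, here is a minimal sketch of one way to handle them. It is not the class described in this post. Every function name (`normalize_charset`, `charset_from_header`, `charset_from_meta`, `to_utf8`) and the small alias table are hypothetical, it assumes the mbstring extension is loaded, and it takes the already-fetched body and Content-Type header as arguments rather than fetching the URL itself.

```php
<?php
// Illustrative sketch only; all names here are hypothetical.

// Map unusual charset labels to more common names (tiny, non-exhaustive table).
function normalize_charset(string $label): string {
    $label = strtoupper(trim($label, " \t\n\r\"'"));
    $aliases = ['X-GBK' => 'GBK', 'UTF8' => 'UTF-8', 'LATIN1' => 'ISO-8859-1'];
    return $aliases[$label] ?? $label;
}

// Pull the charset out of an HTTP Content-Type header value, if present.
function charset_from_header(?string $contentType): ?string {
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([^\s;"\']+)/i', $contentType, $m)) {
        return normalize_charset($m[1]);
    }
    return null;
}

// Pull the charset out of the first <meta> tag that mentions one.
// Matches both <meta charset="..."> and the http-equiv form.
function charset_from_meta(string $html): ?string {
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s;"\'\/>]+)/i', $html, $m)) {
        return normalize_charset($m[1]);
    }
    return null;
}

// Try each candidate in order and use the first one the bytes are valid in.
function to_utf8(string $body, ?string $contentType = null): string {
    $candidates = array_filter([
        charset_from_header($contentType), // header first
        charset_from_meta($body),          // then the meta tag
        'UTF-8',                           // then a common default
        'ISO-8859-1',                      // last resort: accepts any bytes
    ]);
    foreach ($candidates as $charset) {
        try {
            // PHP 7 warns and returns false for unknown names; PHP 8 throws.
            $valid = @mb_check_encoding($body, $charset);
        } catch (\ValueError $e) {
            $valid = false;
        }
        if ($valid) {
            return mb_convert_encoding($body, 'UTF-8', $charset);
        }
    }
    return mb_convert_encoding($body, 'UTF-8', 'ISO-8859-1');
}
```

Calling `to_utf8($body, 'text/html; charset=UTF-8')` on a body that is really ISO-8859-1 falls through to the last candidate, because the bytes fail the UTF-8 check. The sketch has an obvious blind spot: a single-byte encoding accepts nearly any byte sequence, so a wrong declaration of that kind passes validation unnoticed.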
But I think I got it, finally.
This is so useful, albeit to a relatively narrow range of programmers, that I feel bad not releasing it to the world. Then again, I assume someone else has already done this and I just didn’t bother looking for it. (My experiences with PHP-community code are not good, so I almost always roll my own.) Any interest?
All original content is licensed under the Creative Commons Attribution 3.0 U.S. License except that which is quoted from elsewhere or attributed to others. In short, you may reproduce, reblog, and modify my content, but you must provide proper attribution.