How does a browser determine a page’s encoding?

It is important that a spider “reads” a page the same way an end user does. So the first thing it should do is determine the page’s encoding the way a browser does.

Most popular browsers, such as IE, Firefox, and Safari, first check the HTTP header returned by the web server (which is invisible to the end user). You can configure your web server to always send a default encoding (usually UTF-8) and then make sure all pages on that server are encoded in UTF-8. In Apache, this is done by setting “AddDefaultCharset UTF-8” in the httpd.conf file; some distributions ship with this line already enabled.
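For a spider, the header check might look roughly like the sketch below. It assumes the spider already has the raw Content-Type header value as a string; the function name charset_from_content_type is made up for illustration, and the parsing is delegated to Python’s standard email.message module.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

A result of None here means the server sent no encoding information, so the spider has to look further, as a browser would.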

If you would rather let browsers select the encoding from the page’s HTTP-EQUIV meta tag, you can comment out the “AddDefaultCharset” line. In that case, the HTTP header will carry no encoding information, and the browser will try to read the HTTP-EQUIV meta tag instead.

If neither the HTTP header nor the meta tag supplies an encoding, the browser falls back to its default encoding, such as ISO-8859-1 for IE on English Windows XP or GB2312 for IE on Simplified Chinese Windows XP, or it tries to auto-detect the encoding on its own.
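The order described above (HTTP header first, then meta tag, then a default) can be sketched for a spider as follows. This is only an approximation of what browsers do: the name pick_encoding, the 1024-byte scan window, and the simple regular expression are assumptions made for illustration, and real browsers use a more elaborate prescan and may auto-detect instead of using a fixed default.

```python
import re
from typing import Optional

# Matches charset=... inside a <meta> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def pick_encoding(header_charset: Optional[str], body: bytes,
                  default: str = "iso-8859-1") -> str:
    """Choose an encoding: HTTP header, then meta tag, then a default."""
    # 1. The charset from the HTTP header wins if the server sent one.
    if header_charset:
        return header_charset.lower()
    # 2. Otherwise look for a meta tag near the start of the raw bytes.
    match = META_CHARSET_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Fall back to a default (assumed here; it varies by browser locale).
    return default


if __name__ == "__main__":
    page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"></head></html>'
    print(pick_encoding(None, page))
```

The default argument stands in for the locale-dependent browser default, so a spider aimed at Simplified Chinese sites might pass "gb2312" instead.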
