It is important that a spider “reads” a page the same way an end user's browser does. As a first step, it should determine the page’s encoding as a browser would.
Most popular browsers, such as IE, Firefox, and Safari, first check the HTTP header sent by the web server (which is invisible to the end user). You can configure your web server to always send a default encoding (usually UTF-8), and then make sure all pages on that server are encoded in UTF-8. In Apache, this is done by setting “AddDefaultCharset UTF-8” in the httpd.conf file, which some Apache distributions ship as the default.
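For a spider, the header check might look like the sketch below. It assumes the spider already has the raw Content-Type header value as a string; the function name charset_from_content_type is made up for this illustration and is not part of any library.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper: it reuses the standard library's MIME header
    parser instead of splitting the string by hand.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    # get_param returns None when the parameter is absent; anything that
    # is not a plain string (an RFC 2231 encoded value) is ignored here.
    return charset.lower() if isinstance(charset, str) else None


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

A server with “AddDefaultCharset UTF-8” would produce the first kind of header; the second is what a spider sees when the directive is commented out.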
If you would rather let browsers select the encoding from the page’s HTTP-EQUIV meta tag, you can comment out the “AddDefaultCharset” line. In that case, the HTTP header carries no encoding information, and the browser will try to read the HTTP-EQUIV meta tag instead.
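A spider can imitate the meta tag lookup by scanning the first bytes of the page before decoding it. The following is only a rough sketch: the function name charset_from_meta, the regular expression, and the 1024-byte scan limit are assumptions made for this example, and a real browser's parser is more careful than a regex.

```python
import re

# Matches a charset declaration inside a <meta ...> tag, for example
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
META_CHARSET_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def charset_from_meta(html_bytes, scan_limit=1024):
    """Return the charset declared in a meta tag, or None if none is found.

    Hypothetical helper: works on raw bytes because the encoding is not
    known yet; scan_limit is an assumed cut-off, not a fixed rule.
    """
    match = META_CHARSET_RE.search(html_bytes[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312"></head></html>')
print(charset_from_meta(page))
```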
If no encoding information is available in either the HTTP header or the meta tag, the browser falls back to its default encoding, such as ISO-8859-1 for IE on English-version XP or GB2312 for IE on Simplified Chinese XP, or it tries to auto-detect the encoding in its own way.
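The order of precedence described above (HTTP header, then meta tag, then a default) can be sketched as follows. The names choose_encoding and decode_page are hypothetical, the header and meta charsets are assumed to have been extracted already, and the auto-detection step is left out because each browser does it differently.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def choose_encoding(header_charset=None, meta_charset=None,
                    default="iso-8859-1"):
    """Pick an encoding: HTTP header first, meta tag second, default last.

    The default mimics a browser's locale-dependent fallback; a spider
    aimed at Simplified Chinese sites might pass "gb2312" instead.
    """
    for candidate in (header_charset, meta_charset):
        # Skip missing values and names Python cannot decode with.
        if candidate and is_known_encoding(candidate):
            return candidate
    return default


def decode_page(body, header_charset=None, meta_charset=None,
                default="iso-8859-1"):
    """Decode raw page bytes using the encoding chosen above."""
    encoding = choose_encoding(header_charset, meta_charset, default)
    # errors="replace" keeps the spider running on malformed bytes.
    return body.decode(encoding, errors="replace")


print(choose_encoding("utf-8", "gb2312"))
print(choose_encoding(None, "gb2312"))
print(choose_encoding(None, None))
```

Unknown charset names are skipped rather than trusted, so a misspelled header value does not stop the spider from trying the meta tag or the default.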