Archive for August, 2007

Categories of Web Data Mining

Friday, August 31st, 2007

We have talked about data mining for quite a while. From now on, we will focus on web data mining. Current research falls into three categories:

  1. Web Structure Mining, which includes information retrieval and web search, and hyperlink-based ranking.
  2. Web Content Mining, which includes clustering and classification.
  3. Web Usage Mining.

This categorization comes from Z. Markov and D. Larose’s excellent book, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage.

Solution for web pages that stop loading in the middle

Sunday, August 26th, 2007

After upgrading my Linux box to Fedora 7, I found that some web pages would stop loading partway through, rendering a page with only its header or even an empty page. With Wireshark, I could see that only one packet was returned by the server; the rest were lost.

At first, I suspected an MTU problem, but had no luck after trying hundreds of MTU settings. Finally, I remembered some notes I had taken earlier about a Linux TCP fix, and it did the trick.

Here are the settings in /etc/sysctl.conf:

net.ipv4.tcp_moderate_rcvbuf = 0
net.ipv4.tcp_window_scaling = 0
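
Settings in /etc/sysctl.conf are normally applied at boot or after running sysctl -p, so it can be worth confirming that the running kernel has actually picked them up. The sketch below is only an illustration: it relies on the Linux convention that a sysctl key maps to a file under /proc/sys with the dots replaced by slashes, and the helper names sysctl_path and read_sysctl are made up for this example.

```python
from pathlib import Path
from typing import Optional


def sysctl_path(key: str) -> Path:
    """Map a sysctl key to its file under /proc/sys (dots become slashes)."""
    return Path("/proc/sys") / key.replace(".", "/")


def read_sysctl(key: str) -> Optional[str]:
    """Return the current value as text, or None if it cannot be read
    (for example on a non-Linux system)."""
    try:
        return sysctl_path(key).read_text().strip()
    except OSError:
        return None


# Print the effective values of the two settings from this post.
for key in ("net.ipv4.tcp_moderate_rcvbuf", "net.ipv4.tcp_window_scaling"):
    print(key, "=", read_sysctl(key))
```

If both lines print 0, the settings are in effect; None means the value could not be read on this machine.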

Also, it is not a bad idea to turn off IPv6 support. Add a file containing the following to /etc/modprobe.d/:

alias net-pf-10 off
alias ipv6 off

Browser behavior while sending referrer URL

Saturday, August 25th, 2007

Sometimes we need to determine the referrer URL. For example, an online advertising company wants to display special information based on which page its audience is looking at. A common approach is to check the referrer URL of the script that is embedded in the publisher’s pages.

However, different browsers may behave differently when sending the referrer URL in the HTTP request header. Most browsers except IE escape the path and query parts into percent-encoded (%XX) form, which is what the standard requires. IE converts the query part to UTF-8 before sending it, while other browsers leave it in the encoding currently used to display the page. Firefox is a little more complicated, since its behavior can be configured on the about:config page.

Here are some examples:

  • search.php?q=one two is sent as is by IE, and as search.php?q=one%20two by other browsers.
  • search.php?q=中文 is sent unescaped by IE, with the Chinese characters encoded as UTF-8 no matter what encoding the page uses. Other browsers send search.php?q=%E4%B8%AD%E6%96%87 if search.php is displayed using UTF-8, or search.php?q=%D6%D0%CE%C4 if search.php is displayed using GB2312.
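
The escaped forms in the examples above can be reproduced with a few lines of Python. This is only a sketch of the percent-encoding step, not of any browser’s actual code; the function name escape_query_value is made up for illustration, and the page encoding is passed in explicitly rather than detected.

```python
from urllib.parse import quote


def escape_query_value(value: str, page_encoding: str) -> str:
    """Percent-encode a query value using the bytes of the given page encoding."""
    # safe="" so that every reserved character, including "/", is escaped.
    return quote(value, safe="", encoding=page_encoding)


print(escape_query_value("one two", "utf-8"))  # one%20two
print(escape_query_value("中文", "utf-8"))      # %E4%B8%AD%E6%96%87
print(escape_query_value("中文", "gb2312"))     # %D6%D0%CE%C4
```

The same two characters produce different escaped strings depending on the page encoding, which is why a script reading the referrer has to know (or guess) which encoding the referring page used.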

I will try to give an in-depth explanation of this issue in coming articles. Stay tuned!

How does a browser determine a page’s encoding

Monday, August 20th, 2007

It is important that a spider “reads” a page the same way an end user does. As a first step, it should determine a page’s encoding as a browser does.

Most popular browsers, such as IE, Firefox, and Safari, by default first check the HTTP header returned by the web server (which is invisible to the end user). You can configure your web server to always send a default encoding (usually UTF-8), and then make sure all pages on that server are encoded in UTF-8. This is the default behavior of Apache, i.e. setting “AddDefaultCharset UTF-8” in the httpd.conf file.

If you would rather let browsers select the encoding from the page’s HTTP-EQUIV meta tag, you can comment out the “AddDefaultCharset” line. In that case, there will be no encoding information in the HTTP header, and the browser will try to read the HTTP-EQUIV meta tag.

If no encoding information is available in either the HTTP header or the meta tag, the browser will fall back to its default encoding, such as ISO-8859-1 for IE on English Windows XP or GB2312 for IE on Simplified Chinese Windows XP, or it will auto-detect the encoding in its own way.
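
For a spider, the lookup order described above (HTTP header, then meta tag, then a default) could be sketched roughly as follows. This is a simplified illustration rather than what any browser really does: the function name detect_encoding, the regular expressions, and the 4096-byte scan window are all assumptions made for this example, and the auto-detection step is left out.

```python
import re
from typing import Optional

# Matches "charset=..." in a Content-Type header value.
HEADER_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)
# Matches a charset declared inside a <meta ...> tag in the raw page bytes.
META_CHARSET_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def detect_encoding(content_type: Optional[str], body: bytes,
                    default: str = "ISO-8859-1") -> str:
    """Pick an encoding: HTTP header first, then meta tag, then the default."""
    # 1. Charset from the HTTP Content-Type header, if the server sent one.
    if content_type:
        match = HEADER_CHARSET_RE.search(content_type)
        if match:
            return match.group(1)
    # 2. Charset from a meta tag near the start of the document.
    match = META_CHARSET_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii")
    # 3. Nothing declared: fall back to the configured default.
    return default


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=GB2312"></head></html>'
print(detect_encoding("text/html; charset=UTF-8", page))  # UTF-8
print(detect_encoding("text/html", page))                 # GB2312
print(detect_encoding(None, b"<html></html>"))            # ISO-8859-1
```

The default argument plays the role of the browser’s locale-dependent fallback, so a spider targeting Simplified Chinese pages might pass default="GB2312" instead.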

Hello world!

Wednesday, August 15th, 2007

Hi there! Welcome to my new blog space. After fooling around on a boring summer afternoon, I set up this blog.

In this blog, I am going to share my research on computerized linguistic processing and web crawling. It will cover online advertising technologies, more precisely contextual advertising, which is what my linguistic processing project will be used for. We will also cover programming tools such as C/C++, Perl, Java, PHP, and JavaScript, as well as the OS platforms I am using: Linux and Mac OS X.

Sometimes I will go off-topic and talk about the news and other fun stuff.

Hope you enjoy it!
