Archive for the ‘Spider’ Category

Create an Eclipse Project by Maven

Monday, April 28th, 2008

You will need two plugins: the Eclipse plugin for Maven and m2eclipse (the Maven 2 plugin for Eclipse).

First, create a Maven project layout:

mvn archetype:create -DgroupId=com.langcode \
                     -DartifactId=Test

This creates a folder named Test (the artifactId, which is also the base name of your release jar file).

After that, create the Eclipse project files:

cd Test

mvn eclipse:eclipse

Now start Eclipse and import the project you just created. Then do the following:

  1. Right-click the project and enable Maven package management.
  2. Update the Maven dependencies.
  3. Update the Maven source folders.

You now have an Eclipse development setup. You may want to add some common Maven phases to the Run menu using the “Run As” dialog, or you can run them manually from the command line.

Some common tasks are:

Clean up:

mvn clean

Build Jar:

mvn package

Copy all dependencies into the target folder:

mvn dependency:copy-dependencies

Run tests:

mvn test

Browser behavior while sending referrer URL

Saturday, August 25th, 2007

Sometimes we need to determine the referrer URL. For example, an online advertising company may want to display specific information based on which page its audience is looking at. A common approach is to check the referrer URL sent with the request for the script that is embedded in the publisher’s pages.

However, browsers differ in how they send the referrer URL in the HTTP request header. Most browsers other than IE escape the path and query parts into % form (percent-encoding), which is what the standard requires. IE converts the query part to UTF-8 before sending it, while other browsers keep the encoding currently used to display the page. Firefox is a little more complicated, since its behavior can be changed on the about:config page.

Here are some examples:

  • search.php?q=one two is sent as is by IE, and as search.php?q=one%20two by other browsers.
  • search.php?q=中文 is sent without escaping by IE, but the Chinese characters are encoded as UTF-8 no matter what encoding the page uses. Other browsers send it as search.php?q=%E4%B8%AD%E6%96%87 if search.php is displayed using UTF-8, or as search.php?q=%D6%D0%CE%C4 if search.php is displayed using GB2312.
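
The escaped forms in the examples above can be reproduced with a short Python sketch. It only illustrates how the page encoding changes the percent-encoded bytes; it is not how any browser is implemented. The helper names (escape_query_value, decode_query_value) are made up for this illustration, and the "try UTF-8 first, then fall back to the page encoding" strategy is an assumption about how a server might cope with both forms, not something the browsers guarantee.

```python
from urllib.parse import quote, unquote_to_bytes


def escape_query_value(value: str, page_encoding: str) -> str:
    """Percent-encode a query value using the bytes of the given encoding."""
    return quote(value, safe="", encoding=page_encoding)


def decode_query_value(raw: str, page_encoding: str) -> str:
    """Hypothetical server-side helper: try UTF-8, then the page encoding."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the referring page was displayed in page_encoding.
        return data.decode(page_encoding)


if __name__ == "__main__":
    print(escape_query_value("one two", "utf-8"))   # one%20two
    print(escape_query_value("中文", "utf-8"))       # %E4%B8%AD%E6%96%87
    print(escape_query_value("中文", "gb2312"))      # %D6%D0%CE%C4
    print(decode_query_value("%D6%D0%CE%C4", "gb2312"))
```

The fallback order matters: the GB2312 bytes for these characters are not valid UTF-8, so the UTF-8 attempt fails cleanly and the page encoding is used instead. For other inputs the two encodings could be ambiguous, so treat the result as a best guess.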

I will try to give an in-depth explanation of this issue in upcoming articles. Stay tuned!


How does a browser determine a page’s encoding

Monday, August 20th, 2007

It is important that a spider “reads” a page the same way an end user does. To begin with, it should determine a page’s encoding the way a browser does.

Most popular browsers, such as IE, Firefox, and Safari, first check the HTTP header returned by the web server (which is invisible to the end user). You can configure your web server to always send a default encoding (usually UTF-8), and then make sure all pages on that server are encoded in UTF-8. In Apache, this is done by setting “AddDefaultCharset UTF-8” in the httpd.conf file, which some default configurations already include.

If you would rather let browsers select the encoding from the page’s HTTP-EQUIV meta tag, you can comment out the “AddDefaultCharset” line. In that case, the HTTP header carries no encoding information, and the browser will try to read the HTTP-EQUIV meta tag.

If no encoding information is available in either the HTTP header or the meta tag, the browser falls back to its default encoding, such as ISO-8859-1 for IE on English-language XP or GB2312 for IE on Simplified Chinese XP, or it uses its own method to auto-detect the encoding.
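
For a spider, the lookup order described above (HTTP header, then meta tag, then a default) might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of any browser's logic: the function name detect_encoding, the regular expressions, and the choice to scan only the first 4096 bytes are all assumptions made for the example, and the browser's own auto-detection step is not implemented.

```python
import re
from typing import Optional

# Matches "charset=..." in a Content-Type header value.
HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches a charset declared inside a <meta ...> tag in the raw page bytes.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def detect_encoding(
    content_type: Optional[str],
    body: bytes,
    default: str = "ISO-8859-1",
) -> str:
    """Return the page encoding using header, then meta tag, then default."""
    # 1. Encoding announced by the web server in the HTTP header.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1)
    # 2. Encoding declared in a meta tag near the top of the page
    #    (assumption: only the first 4096 bytes are inspected).
    match = META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii")
    # 3. Nothing declared: fall back to a locale-dependent default.
    return default


if __name__ == "__main__":
    page = b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    print(detect_encoding("text/html; charset=UTF-8", page))  # UTF-8
    print(detect_encoding("text/html", page))                 # gb2312
    print(detect_encoding(None, b"<html></html>"))            # ISO-8859-1
```

The default parameter stands in for the locale-dependent fallback mentioned above; a spider crawling Simplified Chinese sites might pass "GB2312" instead.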
