To do this, I need to strip out the margins and headers of the various images I get from browsershots.org (my preferred way right now of getting pictures of websites).
This isn't easy, but I am working on different approaches, starting off by trying to segment on the color of the surrounding element. My goal is to use as little information as possible about the image to extract what I want.
A problem with the current images is that each image includes the whole browser window, including scrollbars, address bars, and other browser-specific stuff. The goal is to remove as much of this clutter as possible, so we end up with a clean image of just the rendered page.
As a test, I started with four images from four different browsers: Firefox 1.5, Firefox 2.0, Safari, and Opera. Somehow I forgot to include IE in the dataset.
You can find the images here:
http://folk.uio.no/tarjeih/matlab/
For now I've done removal of horizontal components in the image. To do this, I used MATLAB to compute the variance of each horizontal line in the image. Then I set a cutoff point with respect to variance, and restricted the search to lines within 120 pixels of the top or 50 pixels of the bottom, respectively.
For each picture I got two sets of lines: one for the lines near the top with variance below the cutoff point, and one for the lines close to the bottom.
I then used the line closest to the center in each of these sets as the line where I thought the webpage started (or ended).
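The procedure above can be sketched as follows. My own code is in MATLAB (linked below); this is an illustrative NumPy translation, and the function name, variance cutoff, and band widths here are assumptions, not the values from my scripts:

```python
import numpy as np

def find_cut_lines(img, var_cutoff=50.0, top_band=120, bottom_band=50):
    """Find horizontal cut lines in a grayscale image (2-D array).

    Rows with low variance are likely browser chrome (toolbars, uniform
    margins) rather than rendered page content.  Only rows within
    `top_band` pixels of the top and `bottom_band` pixels of the bottom
    are considered; in each band, the low-variance row closest to the
    image center is taken as where the page starts/ends.
    """
    row_var = img.astype(float).var(axis=1)   # variance of each row
    h = img.shape[0]

    top_rows = [r for r in range(min(top_band, h)) if row_var[r] < var_cutoff]
    bottom_rows = [r for r in range(max(h - bottom_band, 0), h)
                   if row_var[r] < var_cutoff]

    # the row closest to the center marks the page boundary in each band
    top_cut = max(top_rows) if top_rows else 0
    bottom_cut = min(bottom_rows) if bottom_rows else h
    return top_cut, bottom_cut

# Synthetic "screenshot": uniform chrome strips around a noisy page area.
rng = np.random.default_rng(0)
img = np.full((300, 200), 128.0)
img[40:270] = rng.integers(0, 256, size=(230, 200))  # "page content"
top, bottom = find_cut_lines(img)
page = img[top:bottom]   # crop between the detected lines
```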
The images in the directory show the results.
The file.png is the original image, and the file cut-.png shows the image with red lines drawn where I found lines with variance below the cutoff point. Files named figure-.png.eps show the two images side by side.
Relevant matlab code is also in the directory:
- cutPic.m is the function that finds the cutoff points.
- getCutPicLines.m is the script that goes through the images and crops them.
- colorLines is a script to color the relevant lines in the image.
- tmp_rmLines.m is a starting attempt at making the script work in both dimensions.
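Making the script work in both dimensions, as tmp_rmLines.m starts on, mostly amounts to repeating the row computation on columns. A hypothetical NumPy sketch (the function name, cutoff, and band width are made up for illustration):

```python
import numpy as np

def find_vertical_cuts(img, var_cutoff=50.0, side_band=40):
    """Column-wise analogue of the horizontal cut detection.

    Columns with low variance near the left or right edge are likely
    margins or browser frame; a scrollbar would show up as a narrow
    band of columns whose variance differs sharply from the page
    content next to it.
    """
    col_var = img.astype(float).var(axis=0)   # variance of each column
    w = img.shape[1]

    left_cols = [c for c in range(min(side_band, w)) if col_var[c] < var_cutoff]
    right_cols = [c for c in range(max(w - side_band, 0), w)
                  if col_var[c] < var_cutoff]

    left_cut = max(left_cols) if left_cols else 0
    right_cut = min(right_cols) if right_cols else w
    return left_cut, right_cut

# Synthetic image: uniform side strips around noisy page content.
rng = np.random.default_rng(1)
img = np.full((100, 300), 200.0)
img[:, 20:280] = rng.integers(0, 256, size=(100, 260))
left, right = find_vertical_cuts(img)
```

Deciding whether a low-variance band is a margin or a scrollbar is exactly the open problem mentioned below; this sketch only finds the candidate columns.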
Questions/Problems/Ideas/Comments:
* It is fairly obvious that this approach works best at the top and bottom.
* I'm wondering if I should use information about which browser this is when cropping the image. I was hoping I didn't have to, but I seem to be wrong.
* I will need an approach to deciding if there is stuff surrounding the image, especially if there is a scrollbar in the image.