Web pages are more complex than ever, which makes it difficult to identify their different elements, such as main content, menus, user comments and advertising. Web page segmentation is the process of dividing a web page into visually and semantically coherent segments, called blocks. Detecting these blocks is a crucial step for many applications, for example content visualization on mobile devices, information retrieval, and change detection between versions in the web archive context.
Web Page Segmentation at a Glance
For a web page (W), the output of the segmentation is the semantic tree of that page (W'). Each node represents a data region in the web page, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its child blocks. All leaf blocks are atomic units and together form a flat segmentation of the web page. Each block is identified by a block-id value (see Figure 1 for an example).
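The tree described above can be sketched in a few lines of Python. This is only an illustration and not pagelyzer's actual data model: the `Block` class, its field names, the block-id format and the rectangle values are all assumptions made for the example. It shows how collecting the leaf blocks yields the flat segmentation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """One node of the segmentation tree W' (hypothetical structure)."""
    block_id: str
    rect: Tuple[int, int, int, int]  # x, y, width, height (assumed layout)
    children: List["Block"] = field(default_factory=list)

    def leaves(self) -> List["Block"]:
        """Return the atomic blocks, i.e. the flat segmentation."""
        if not self.children:
            return [self]
        result: List["Block"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# The root block covers the whole page; inner blocks aggregate their children.
page = Block("B1", (0, 0, 1000, 800), [
    Block("B1.1", (0, 0, 1000, 100)),
    Block("B1.2", (0, 100, 1000, 700), [
        Block("B1.2.1", (0, 100, 700, 700)),
        Block("B1.2.2", (700, 100, 300, 700)),
    ]),
])

flat_ids = [b.block_id for b in page.leaves()]
```

Here `flat_ids` holds the ids of the three leaf blocks in document order, while the inner block `B1.2` only appears through its children.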
An efficient web page segmentation approach is important for several tasks:
- Processing the different parts of a web page according to their type of content.
- Assigning more importance to one region of a web page than to the rest.
- Understanding the structure of a web page.
In this post, I will try to explain what web page segmentation does, especially for pagelyzer, where it provides information about the content of the web page.
Web Page Segmentation Algorithm
We present here the details of the Block-o-Matic web page segmentation algorithm, which pagelyzer uses to perform the segmentation. It is a hybrid between the visual-based approach and the document processing approach.
The segmentation process is divided into three phases: analysis, understanding and reconstruction. It comprises three tasks: filter, mapping and combine. It produces three structures: the DOM structure, the content structure and the logic structure. The main point of the whole process is to produce these structures, where the logic structure represents the final segmentation of the web page.
The DOM tree is obtained from the rendering of a web browser. The result of the analysis phase is the content structure (Wcont), built from the DOM tree with the d2c algorithm. Mapping the content structure into a logic structure (Wlog) is called document understanding. This mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (Rec function):
W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG)).
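The composition of the three phases can be sketched as follows. The function names `d2c`, `c2l` and `rec` and the parameter `pG` come from the formula, but their bodies here are toy placeholders and not the real Block-o-Matic algorithms: the dictionary-based DOM, the `visible` and `area` keys, and the rule "keep a child only if it covers at least a fraction pG of its parent" are all assumptions chosen to make the data flow visible.

```python
from typing import Any, Dict

Node = Dict[str, Any]


def d2c(dom: Node) -> Node:
    """Analysis (toy filter): keep only nodes marked as visible."""
    kept = [d2c(c) for c in dom.get("children", []) if c.get("visible", True)]
    return {"tag": dom["tag"], "area": dom.get("area", 0), "children": kept}


def c2l(content: Node, pG: float) -> Node:
    """Understanding (toy mapping): drop children smaller than pG * parent area."""
    parent_area = content["area"] or 1  # avoid division by zero
    blocks = [
        c2l(c, pG)
        for c in content["children"]
        if c["area"] / parent_area >= pG
    ]
    return {"tag": content["tag"], "area": content["area"], "children": blocks}


def rec(dom: Node, content: Node, logic: Node) -> Dict[str, Node]:
    """Reconstruction (toy combine): gather the three structures."""
    return {"dom": dom, "content": content, "logic": logic}


def segment(dom: Node, pG: float) -> Dict[str, Node]:
    """W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG))."""
    content = d2c(dom)
    logic = c2l(content, pG)
    return rec(dom, content, logic)


# A made-up DOM with one large block, one hidden node and one tiny node.
dom = {"tag": "body", "area": 100, "children": [
    {"tag": "div", "area": 70, "children": []},
    {"tag": "script", "area": 0, "visible": False, "children": []},
    {"tag": "span", "area": 5, "children": []},
]}

result = segment(dom, pG=0.1)
```

In this toy run the hidden `script` node is filtered out during analysis, and the small `span` is dropped during understanding because it falls below the assumed granularity threshold, so only the `div` remains in the logic structure.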
To integrate the segmentation outcome into pagelyzer, an XML representation called ViDIFF is used. It represents hierarchically the blocks, their geometric properties, and the links and text in each block.
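To give an idea of what such a hierarchical XML representation might look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It does not reproduce the actual ViDIFF schema: the element names (`block`, `link`, `text`), the attribute names and the input dictionaries are hypothetical and serve only to show blocks, geometry, links and text nested in one document.

```python
import xml.etree.ElementTree as ET


def block_to_xml(block: dict) -> ET.Element:
    """Serialize one block and its children (hypothetical element names)."""
    x, y, w, h = block["rect"]
    elem = ET.Element("block", {
        "id": block["id"],
        "x": str(x), "y": str(y), "width": str(w), "height": str(h),
    })
    # Links and text found inside this block.
    for url in block.get("links", []):
        ET.SubElement(elem, "link", {"href": url})
    if block.get("text"):
        ET.SubElement(elem, "text").text = block["text"]
    # Child blocks are nested, which preserves the hierarchy.
    for child in block.get("children", []):
        elem.append(block_to_xml(child))
    return elem


# Made-up segmentation result used only for this example.
page = {
    "id": "B1", "rect": (0, 0, 1000, 800),
    "children": [{
        "id": "B1.1", "rect": (0, 0, 1000, 100),
        "links": ["http://example.org/"],
        "text": "Home",
    }],
}

root = block_to_xml(page)
xml_string = ET.tostring(root, encoding="unicode")
```

The resulting `xml_string` contains one outer `block` element for the page with the child block, its link and its text nested inside it.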
Implementation
The Block-o-Matic algorithm is available:
- through pagelyzer itself (https://github.com/openplanets/pagelyzer),
- as a JavaScript library or as a Chrome browser extension (http://www-poleia.lip6.fr/~sanojaa/BOM/).