Content extraction from web pages using Machine Learning

What is web scraping?

Web scraping is the technique of extracting content from websites with a script or program and transforming it so that it can be used in another context.

What are the existing scraping methods?

There are several techniques for retrieving information from web pages, such as XPath queries, CSS selectors, and visual recognition.
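As a minimal sketch of the rule-based approach, the snippet below runs XPath-style queries with Python's standard library. The markup, the class names ad-title and ad-price, and the values are hypothetical, and ElementTree supports only a subset of XPath and requires well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed ad markup (ElementTree needs valid XML).
PAGE = """
<html>
  <body>
    <header><h1 class="ad-title">Bright two-bedroom flat</h1></header>
    <div class="ad-price">250000 EUR</div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# XPath-style queries: locate elements by tag name and attribute value.
title = root.find(".//h1[@class='ad-title']").text
price = root.find(".//div[@class='ad-price']").text

print(title)
print(price)
```

Queries like these are tied to one site's markup: if a class name changes, the rule has to be rewritten by hand.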

Is there a more efficient extraction method based on machine learning algorithms?

Let’s take an example of a real estate ad web page describing a property for sale.

Let’s look at the HTML structure of this ad, that is, its DOM tree. The basic unit of the tree is an HTML tag, which has characteristics such as id, name, class, and textual content.
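To make the notion of a DOM unit concrete, here is a small sketch that walks a page and records each tag with its depth, attributes, and direct text. The sample markup is hypothetical, and the walker assumes well-formed HTML without void elements such as <br> or <img>.

```python
from html.parser import HTMLParser

class DomWalker(HTMLParser):
    """Records every opening tag with its depth, attributes, and direct text."""

    def __init__(self):
        super().__init__()
        self.open = []    # nodes that are currently open, outermost first
        self.nodes = []   # every node, in document order

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "depth": len(self.open),
                "attrs": dict(attrs), "text": ""}
        self.nodes.append(node)
        self.open.append(node)

    def handle_endtag(self, tag):
        if self.open:
            self.open.pop()

    def handle_data(self, data):
        # Attach text to the innermost open tag only.
        if self.open:
            self.open[-1]["text"] += data.strip()

# Hypothetical ad markup.
SAMPLE = ('<body><div id="ad" class="listing">'
          '<h1>Flat for sale</h1><p>Close to the park.</p>'
          '</div></body>')

walker = DomWalker()
walker.feed(SAMPLE)

for node in walker.nodes:
    print("  " * node["depth"] + node["tag"], node["attrs"], repr(node["text"]))
```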

The page contains several elements of interest: a title, a price, a description, and a descriptive list. These are the classes we want to predict.

But is there a relationship between the HTML tags containing these textual elements and the HTML structure in which they are located?

The purpose of this experimental project is to build a supervised classification model for HTML tags.

To do this, we manually labeled the HTML tags at every level of a web page in the following way.

To annotate an HTML document, we kept the document as it was and manually added labels to the class attribute of some descendants of the <body> branch (in addition to their original classes). Elements that do not carry these labels are considered noise.

For example, the price and title annotation will look like this:
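A hypothetical sketch of such an annotation follows. The markup, the original class names (ad-heading, amount, footer-links), and the exact label spellings in LABELS are assumptions made for illustration; only the idea of appending a label to the existing classes comes from the scheme described above. The parser reads the labels back and marks every unlabeled tag as noise.

```python
from html.parser import HTMLParser

# Assumed label names; the actual spellings used in the project are not known.
LABELS = {"title", "price", "description", "list"}

# Hypothetical annotated page: labels sit next to the original classes.
ANNOTATED = """
<body>
  <header><h1 class="ad-heading title">Bright two-bedroom flat</h1></header>
  <span class="amount price">250 000 €</span>
  <div class="footer-links">Contact us</div>
</body>
"""

class LabelReader(HTMLParser):
    """Collects a (tag, label) pair for every opening tag."""

    def __init__(self):
        super().__init__()
        self.labeled = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        found = [c for c in classes if c in LABELS]
        # Tags without a manual label are treated as noise.
        self.labeled.append((tag, found[0] if found else "noise"))

reader = LabelReader()
reader.feed(ANNOTATED)

for tag, label in reader.labeled:
    print(tag, "->", label)
```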

Thus we obtain a manual classification of the tags.

We then selected factors that can predict this classification. After observing the HTML structure and its tags, we noticed, for example, that the title of an ad very often appears in a <header> tag or in one of its descendants. The description is frequently in a <p> tag, and the descriptive list appears as a descendant of a <li>, <ul>, or <table> tag. For the price, we analyzed the inner text of a tag and checked for the presence of a number in a price format. We also measured the length of the inner text of each tag, along with several other factors.
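The factors described above could be computed along the following lines. This is a sketch under stated assumptions, not the project's code: the feature names, the price pattern PRICE_RE, and the sample markup are all hypothetical, and only five of the 26 features are illustrated.

```python
import re
from html.parser import HTMLParser

# Assumed price format: digits (with optional separators) followed by a currency.
PRICE_RE = re.compile(r"\d[\d .,]*\s?(?:€|EUR|\$)")
LIST_TAGS = {"li", "ul", "table"}

class FeatureExtractor(HTMLParser):
    """Collects each tag with its ancestor tags and direct inner text."""

    def __init__(self):
        super().__init__()
        self.open = []    # currently open nodes, outermost first
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag,
                "ancestors": [n["tag"] for n in self.open],
                "text": ""}
        self.nodes.append(node)
        self.open.append(node)

    def handle_endtag(self, tag):
        if self.open:
            self.open.pop()

    def handle_data(self, data):
        if self.open:
            self.open[-1]["text"] += data.strip()

def features(node):
    """Turns one parsed tag into a row of hypothetical features."""
    lineage = set(node["ancestors"]) | {node["tag"]}
    return {
        "text_length": len(node["text"]),
        "has_price": bool(PRICE_RE.search(node["text"])),
        "in_header": "header" in lineage,
        "in_paragraph": "p" in lineage,
        "in_list": bool(lineage & LIST_TAGS),
    }

# Hypothetical ad markup without void elements.
SAMPLE = ("<body><header><h1>Flat for sale</h1></header>"
          "<p>Close to the park.</p><ul><li>3 rooms</li></ul>"
          "<span>250 000 €</span></body>")

extractor = FeatureExtractor()
extractor.feed(SAMPLE)
rows = {n["tag"]: features(n) for n in extractor.nodes}

for tag, row in rows.items():
    print(tag, row)
```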

This allowed us to build a structured dataset containing these factors for each tag in the HTML structure. We then filtered and cleaned the data so that it could be analyzed. The overall dataset contains 3374 tags and 26 features.

The performance of the classification model depends on selecting the variables that are most relevant to, or most correlated with, the classification of the tags. We therefore selected variables using the feature_importance score of a Random Forest model.
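The following sketch shows how such a ranking can be obtained, assuming scikit-learn's RandomForestClassifier and its feature_importances_ attribute. The data is synthetic and deliberately built so that the label depends only on text_length; it does not reproduce the project's dataset or results, and the feature names are hypothetical.

```python
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)
FEATURES = ["text_length", "in_header", "in_list"]  # hypothetical names

# Synthetic stand-in data: the label is driven by text_length alone,
# the two binary features are pure noise.
X, y = [], []
for _ in range(300):
    text_length = random.randint(0, 200)
    X.append([text_length, random.randint(0, 1), random.randint(0, 1)])
    y.append("description" if text_length > 100 else "noise")

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1.
ranking = sorted(zip(FEATURES, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features with scores near zero are candidates for removal before the final model is trained.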

The result clearly shows that the length of the textual content of an HTML tag contributes most to predicting the class of a tag. After that, being a descendant of the tags <h>, <ul>, and <table> also carries some importance.

Finally, sub-sampling the class distribution (title, price, etc.) enabled us to build and improve the Random Forest supervised classification model, which reaches a precision of 73% correct predictions on the HTML tag classification.
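One common way to sub-sample a class distribution is random undersampling, where every class is reduced to the size of the rarest one. The sketch below assumes that interpretation; the helper name undersample and the class counts are hypothetical and are not taken from the project's data.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, labels, seed=0):
    """Randomly keeps, for every class, as many rows as the rarest class has."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced labels: most tags on a page are noise.
labels = ["noise"] * 90 + ["title"] * 4 + ["price"] * 6
rows = list(range(len(labels)))  # stand-ins for feature rows

balanced_rows, balanced_labels = undersample(rows, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

Balancing the classes in this way keeps the dominant noise class from overwhelming the rarer labels during training, at the cost of discarding some data.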