large number of pages from about 7,000 different websites and extract some information.
What is the best way to do this?
It is not technically correct to say that the template structure of each website is different. At its core, every HTML page is represented by the Document Object Model (DOM), and you can walk the nodes of the document recursively. Nodes in the DOM can be broadly classified into two types: containers and contents. A container has attributes that determine how it (or the content inside it) is displayed. A scraper, by definition, is looking for contents, so you need to keep descending into each container until you reach the contents.
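The recursive walk described above can be sketched roughly as follows. This is only an illustration using Python's standard `html.parser` module, not a complete scraper: the names `Node`, `TreeBuilder`, `iter_contents` and `extract` are hypothetical, and the decision to skip `script` and `style` text is an assumption about what counts as "contents".

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: text inside these containers is not content worth scraping.
SKIP_TAGS = {"script", "style"}


class Node:
    """A container: a tag, its attributes, and its children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances (containers) or str (contents)


class TreeBuilder(HTMLParser):
    """Builds a small container/contents tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(text)


def iter_contents(node, path=()):
    """Recursively yield (path of container tags, text) pairs."""
    if node.tag in SKIP_TAGS:
        return
    here = path + (node.tag,)
    for child in node.children:
        if isinstance(child, Node):
            yield from iter_contents(child, here)  # descend into container
        else:
            yield here, child  # reached contents


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return list(iter_contents(builder.root))
```

Calling `extract(page_source)` returns each piece of text together with the chain of containers that encloses it, so the same loop works regardless of how a particular site arranges its templates. Deciding which of those contents are the information you actually want is a separate step that this sketch does not attempt.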