Software / Web Scraping / Harvesting / Data Extraction

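Web scraping (also called web harvesting or web data extraction) is data scraping used to extract data from websites. The scraping software accesses the World Wide Web either directly over the Hypertext Transfer Protocol (HTTP) or through a web browser. The extracted data is typically copied into a central local database or spreadsheet for later retrieval or analysis.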

Web Scraping Implementation
Manual: performed by a software user.
Automated: performed by a bot or web crawler.
Prevention: websites detect bots and disallow them from crawling (viewing) their pages.
How prevention fails: scrapers use DOM parsing, computer vision, and natural language processing to simulate human browsing, which lets them gather web page content for offline parsing.
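
One common way a site states which bots it disallows is a robots.txt file. The notes above do not name this mechanism, so treat the following as an illustrative sketch only. The file contents, the bot names (ExampleBot, OtherBot) and the example.com URLs are all made up, and robots.txt is advisory: it tells well-behaved crawlers what to skip but does not detect or block anything by itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
for agent, url in [
    ("ExampleBot", "https://example.com/index.html"),
    ("OtherBot", "https://example.com/index.html"),
    ("OtherBot", "https://example.com/private/data.html"),
]:
    print(agent, url, rules.can_fetch(agent, url))
```

With these rules, ExampleBot is refused everywhere, while any other bot is refused only under /private/.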


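Web Scraping Steps
1. Fetching / Downloading: the page data is downloaded, typically by web crawling.
2. Extracting: the data of interest is taken from the fetched pages and copied into a store such as a spreadsheet.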
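
The two steps above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not a complete scraper: the function names (fetch, extract_rows, rows_to_csv) are invented for this example, the page is assumed to hold its data in an HTML table, and CSV stands in for the spreadsheet.

```python
import csv
import io
import urllib.request
from html.parser import HTMLParser


def fetch(url: str) -> str:
    """Step 1: download the page body over HTTP."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TableCellParser(HTMLParser):
    """Step 2 helper: collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def extract_rows(html: str) -> list:
    """Step 2: pull table rows out of the fetched HTML."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows) -> str:
    """Serialize rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

A run would chain the pieces as rows_to_csv(extract_rows(fetch(url))) for a URL the user is permitted to scrape. Keeping fetching and extracting in separate functions means the extraction step can be tried on saved HTML without touching the network.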



Web Scraping Techniques
Human copy and paste: a person examines the page manually and copies and pastes the data.
Text pattern matching: the UNIX grep command or regular-expression matching is used to extract information from web pages.
HTTP programming: static and dynamic web pages are retrieved by posting HTTP requests to the remote web server using socket programming.
HTML parsing: semi-structured data query languages, such as XQuery and HTQL, are used to parse HTML pages and to retrieve and transform page content. Programs that do this are known as wrappers.
DOM parsing: web pages are parsed into a Document Object Model (DOM) tree, from which programs can retrieve parts of the pages.
Vertical aggregation: companies with access to large-scale computing power build harvesting platforms that target specific verticals. Some companies run these platforms in the cloud.
Semantic annotation recognizing: annotations are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from that store before scraping the pages.
Computer vision web page analysis: machine learning and computer vision are used to identify and extract information from web pages by interpreting the pages visually, as a human being might.
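
Two of the techniques above, text pattern matching and DOM parsing, can be compared on the same page. The sketch below is only an illustration: the sample page, the price pattern and the function names (grep_prices, dom_links) are assumptions made for this example. It also assumes well-formed markup, because xml.dom.minidom is an XML parser; real-world HTML usually needs a more tolerant parser.

```python
import re
from xml.dom import minidom

# Made-up, well-formed sample page used only for this illustration.
PAGE = """<html><body>
<h1>Price list</h1>
<ul>
  <li class="item"><a href="/widget">Widget</a> costs $9.99</li>
  <li class="item"><a href="/gadget">Gadget</a> costs $24.50</li>
</ul>
</body></html>"""


def grep_prices(text: str) -> list:
    """Text pattern matching: find dollar amounts with a regular expression."""
    return re.findall(r"\$(\d+\.\d{2})", text)


def dom_links(markup: str) -> list:
    """DOM parsing: build a tree, then read each <a> element's href and text."""
    document = minidom.parseString(markup)
    links = []
    for anchor in document.getElementsByTagName("a"):
        label = "".join(
            node.data
            for node in anchor.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        links.append((anchor.getAttribute("href"), label))
    return links


print(grep_prices(PAGE))  # ['9.99', '24.50']
print(dom_links(PAGE))    # [('/widget', 'Widget'), ('/gadget', 'Gadget')]
```

The regular expression works on raw text and knows nothing about page structure, so it is quick to write but breaks if the price format changes. The DOM version addresses elements by tag and attribute, which tends to hold up better when surrounding text changes.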

