Web Scraping for Data Scientists (With No Web Programming Background)

Nowadays, data plays a very important role in industry, and the need to obtain it keeps growing. A big part of this information can be obtained through the internet. Some data is expressed, for example, as comments or hashtags on social networks; other data can be found as plain text on web pages; and some of it sits in tables on “old” websites with no more formatting than simple HTML.

But regardless of the format, what we really care about is the information we can obtain from these sources: information that we can analyze to learn about our customers, for example, or to increase the amount of data in our database for further use. Sometimes sites provide this information through APIs or by making documents available that we can easily download, but sometimes there is no other way to obtain it than by inspecting and copying it directly from the website.

So far everything is happiness, but what happens when the information we need is distributed across hundreds or thousands of different pages or routes, and is also a complete mess?
That’s when automating the inspection and extraction of the data comes onto the scene!

Web Scraping

Basically, scraping is a technique to capture data from a website and store it on a machine.

We could get the whole code of a web page with a GET request and then manually inspect, order and store it in a database, but Python provides us with some useful tools to make this easier. One of these options is the web crawling framework Scrapy. One of its main advantages is that it is built on top of Twisted, which makes it asynchronous and faster. On the other hand we have Beautiful Soup, whose name comes from the expression “tag soup”, which describes messy and unstructured HTML. We will be using this library since it is very handy and also helps a lot when the site we are trying to scrape is a real mess (which is often the case).

Beautiful Soup can take the whole HTML code of a page and store it as an object that is easier to filter by tags, IDs and classes.

But first, you might not know what the DOM is, so let’s start with some HTML basics.

The DOM (Document Object Model) is a W3C (World Wide Web Consortium) standard. It basically models an HTML document as a tree of objects, where each HTML element has its own properties, methods and events. A single element looks like this:

<p></p>

As said before, an element can have attributes like an id, a class or an href, and a really basic HTML page would look like this:

<!DOCTYPE html>
<html>
  <body>
    <h2>Finding HTML Elements by Tag Name</h2>
    <div id="main">
      <p>The DOM is very useful.</p>
      <p>This example demonstrates the <b>getElementsByTagName</b> method.</p>
    </div>
    <p class="demo">demo</p>
  </body>
</html>

 

So for this HTML page the attributes would be:

  • id = “main”
  • class = “demo”

And our elements would be:

  • html
  • body
  • h2
  • div
  • p
  • b

The combination of these elements and their attributes is what will make our searches easier.
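
To make this concrete, here is a minimal sketch (assuming the Beautiful Soup library we install in the next section) that parses the example page above and finds elements by tag name, id and class:

from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
  <body>
    <h2>Finding HTML Elements by Tag Name</h2>
    <div id="main">
      <p>The DOM is very useful.</p>
      <p>This example demonstrates the <b>getElementsByTagName</b> method.</p>
    </div>
    <p class="demo">demo</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find("div", attrs={"id": "main"}))   # the div and everything inside it
print(soup.find("p", attrs={"class": "demo"}))  # the paragraph with the demo class
print(len(soup.find_all("p")))                  # 3 paragraphs in total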

If you want to learn a bit more about HTML you can find a lot of information on the W3Schools website.

Now we are ready to code!

First we need to install the Beautiful Soup library and an HTTP requests library; in this case we’ll use urllib3:

pip install bs4
pip install urllib3

For example purposes we’re going to use the following site for the scraping:
page_url = "http://example.webscraping.com/"

Before we start making requests to a web page, we are going to manually navigate to the root of the site and look for robots.txt (https://example.webscraping.com/robots.txt).

robots.txt is a file that every site should have to let bots know which pages they are allowed to inspect and which they are not. It can also specify how many requests per second a crawler may make on the site. It is ethical to follow these instructions and not to hit the server unnecessarily while scraping it.
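
As a small sketch, Python’s standard library includes urllib.robotparser, which can read a site’s robots.txt and tell us whether a given URL may be fetched (the "*" user agent below is just a generic placeholder):

from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url("http://example.webscraping.com/robots.txt")
robots.read()  # download and parse the robots.txt file
print(robots.can_fetch("*", "http://example.webscraping.com/"))  # True if we are allowed to fetch this URL
print(robots.crawl_delay("*"))  # suggested delay between requests, if the site declares one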

The first thing we need is to get the whole page source code. It is advisable to get it all in a single request, close the connection and then iterate over it as many times as needed to find the information we are looking for. This is not only faster, but also better for the server we are scraping.

from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()  # instantiate the HTTP requests library
response = http.request('GET', page_url)  # make the request to the page
soup = BeautifulSoup(response.data, 'html.parser')  # store the code as a soup object
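
As a quick sanity check (just a sketch, assuming the request above succeeded), we can look at the HTTP status code and the page title before going any further:

print(response.status)        # 200 means the request was successful
print(soup.title.get_text())  # the contents of the <title> element, if the page has one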

This soup object has all of the page’s HTML code, so let’s try doing some searching. To know what to look for, we can inspect the page dynamically by right-clicking on it and selecting “Inspect Element”. Almost all web browsers have this option. There we can see that the whole block of HTML code containing the first countries is a div with a class attribute called span12. Applying that to our code we have:

element_name = "div"
attr_name = "class"
attr_value = "span12"
span12_object_array = soup.find_all(element_name, attrs={attr_name: attr_value})

This code will return an array containing all the divs found with the span12 class, along with all the child elements inside them, which visually are the flags and the names of the countries.
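
For example, a minimal sketch that only looks at the text inside those child elements (so we can see the country names the flags belong to) might be:

for div_item in span12_object_array:
    print(div_item.get_text(separator=" ", strip=True))  # all the visible text inside this div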

If we want to get the links to the flag images, we can apply the following search to our already filtered array of span12 divs:

img_element_name = "img"
img_attr_name = "src"
img_src_array = []
for div_item in span12_object_array:
    img_object_array = div_item.find_all(img_element_name)
    # list comprehension to collect each found image source URL, in order
    img_src_array.extend([current_item.get(img_attr_name) for current_item in img_object_array])

Now we can print this to see what every image link is:

print(img_src_array)

Finally, you can store this information in a database or directly download the images.
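
As an illustration, here is a hedged sketch that downloads the first flag with the same urllib3 connection; the src values on this site may be relative paths, so we join them with the page URL first, and the local file name flag.png is just an example:

from urllib.parse import urljoin

if img_src_array:
    img_url = urljoin(page_url, img_src_array[0])  # turn a possibly relative src into an absolute URL
    img_response = http.request('GET', img_url)    # reuse the PoolManager created earlier
    with open('flag.png', 'wb') as img_file:       # example output file name
        img_file.write(img_response.data)          # write the raw image bytes to disk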

Insights

Well-done scraping won’t cause problems for the server that hosts the information we need, which is important mainly for ethical reasons. Along the same lines, we can optimize our code to perform the heavier operations locally, and to ask for different information in the future with only small changes.
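
For instance, a small reusable helper (just a sketch; the function name and parameters are illustrative, not part of any library) could wrap the request and the search, so that scraping a different page or a different element only requires changing the arguments:

from bs4 import BeautifulSoup
import urllib3

def scrape_elements(url, element_name, attrs=None):
    """Fetch a page once and return all elements matching the given tag and attributes."""
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, 'html.parser')
    return soup.find_all(element_name, attrs=attrs or {})

# Example: the same span12 divs as before
span12_divs = scrape_elements("http://example.webscraping.com/", "div", {"class": "span12"})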

Also, efficient and reusable code will be easier to maintain and will minimize our coding time in the future.

This is a basic approach to scraping; after understanding it, it will be easier to scrape any site, even when the information we need is buried in messy and unordered HTML code.
