Create Your Own Web Scraper in Only 6 Lines of Code with Python 3 - Part 1

Ali Akhtari March 28, 2020

Have you ever wondered how search engines like Google and Bing can show you results from billions of web pages in less than a second? You might think that when you search for a keyword, the search engine goes through every web page looking for it and then shows you the results. That's not how they do it; if it were, every search would take a lifetime!

So how do the search engines work?

All search engines have their own bots, called web crawlers and web scrapers. These bots visit web pages across the internet and index the data into the search engine's database. When we search for a word, the search engine looks through its indexed data and shows us the results. Today I'm going to show you how to create your own web scraper using Python 3 and get familiar with some libraries that are very helpful for writing one.

Differences Between a Web Scraper and a Web Crawler

Before we start, let me explain the difference between a web crawler and a web scraper. We shouldn't be too picky here, because there is no generally agreed definition. The main difference is that a web crawler is a bot that indexes and stores information for search engines, while a web scraper extracts data from a specific target website for a specific purpose. A scraper may also perform actions such as submitting data, so it can act like a real user.

How does Web Scraper work?

Look at the picture below:

[Figure: Flow of requests, how web scrapers work]

It's the simplest flowchart of how we send a request to a website and receive its response.
Every website is served by a web server (host) that handles requests. Our request goes to the web server with some information attached, such as headers that identify us (for example, which browser we are using). If the request is valid, the server sends back a response containing the data that the browser renders and shows to us as a web page.
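To make the request side of that flow more concrete, here is a minimal sketch of the text a client sends for a simple GET request. The function name build_get_request and the User-Agent value my-scraper/0.1 are made up for illustration, and real browsers send many more headers than this.

```python
def build_get_request(host, path="/", user_agent="my-scraper/0.1"):
    """Return the raw text of a minimal HTTP/1.1 GET request.

    `user_agent` is a hypothetical value; it is the header that tells
    the server which client (browser, script, ...) is asking.
    """
    lines = [
        f"GET {path} HTTP/1.1",       # request line: method, path, protocol version
        f"Host: {host}",              # which site on the server we want
        f"User-Agent: {user_agent}",  # identifies the client
        "Accept: text/html",          # what kind of content we would like back
        "",                           # an empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


request_text = build_get_request("aliakhtari.com")
print(request_text)
```

We will not build requests by hand like this in practice; a library does it for us. The sketch only shows what "a request with headers" roughly looks like on the wire.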

First Step to Writing a Web Scraper

As I mentioned, when we want to visit a web page, we send a request to the web server and receive a response. In Python, we can send requests and get responses with or without an actual browser. With the Python Requests library we can send GET and POST requests to the target. So what are we waiting for? Let's install Requests and get to know it.

Python Requests

According to the Requests documentation (where you can read everything you need to know about this module, and which I also used for this post), Requests is a Python HTTP library released under the Apache License 2.0. The goal of the project is to make HTTP requests simpler and more human-friendly. At the time of writing, the current version of Requests is 2.23.0. If you don't have Requests installed on your machine, you can install it from CMD on Windows or from the Terminal on Mac/Linux using this command:

pip install requests

Now we are ready to continue. Let's jump into working with Requests.
As usual, you can import any library with the import keyword in your Script.py. In this case, to import Requests, use:

import requests

Make a GET Request

Making a request with Python Requests is easy. All you have to do is call a function named get:

response = requests.get('https://aliakhtari.com')

Now you have sent a request to my website and my website has returned a response. To check whether the web server responded, you can use this line of code, which shows the response status:

print(response.status_code)
# result should be: 200
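A request can also fail before any status code comes back, for example when the URL is malformed or the server never answers. The following is a minimal sketch of a wrapper that handles those cases. The function name fetch and the 10-second timeout are my own illustrative choices, not something Requests prescribes.

```python
import requests


def fetch(url, timeout=10):
    """Return the Response for `url`, or None if the request failed.

    `fetch` and the 10-second default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception.
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers malformed URLs, connection errors, timeouts and bad statuses.
        print(f"Request failed: {error}")
        return None
    return response


# A malformed URL is rejected before anything is sent over the network.
print(fetch("not-a-url"))
```

With a working URL, fetch('https://aliakhtari.com') would return the same kind of response object as requests.get, so the rest of this post applies to it unchanged.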

Every status code in the 2xx range tells us that the request was successful. You can read about all the different status codes on this web page: List of HTTP Status Codes. Let's add more lines to our code to check the status code more precisely, using the most common response codes:

if response.status_code == 200:
    print("Success!")
elif response.status_code == 400:
    print("Bad Request")
elif response.status_code == 401:
    print("Unauthorized")
elif response.status_code == 403:
    print("Forbidden")
elif response.status_code == 404:
    print("Not Found")
elif response.status_code == 503:
    print("Service Unavailable")

Now we can use the status code to check the outcome of our request.
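The if/elif chain above only knows six codes. As an alternative sketch, Python's standard library already ships the standard names for HTTP status codes in http.HTTPStatus, so we can look them up instead of listing them by hand. The helper name describe_status is hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found' for a status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus only knows the registered codes, not every number.
        return f"{code} Unknown status"


for code in (200, 403, 404, 503):
    print(describe_status(code))
```

In a script you would call it as describe_status(response.status_code).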

response.text

Once we have our response, we can read its content. In this case, we can access it with response.text.
response.text contains the body of the response as text. For a web page, that is the HTML source the web server sent us, including any CSS and JS written inside the .html file.

import requests
response = requests.get('https://aliakhtari.com/robots.txt')
if response.status_code == 200:
    print(response.text)
# This code will print what is inside https://aliakhtari.com/robots.txt which is a .txt file, so output should be:
#    User-agent: *
#    Disallow: /admin_aliakhtari_com_page/
#    Allow: *
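Since the file we just printed is a robots.txt, here is a small side sketch of what a scraper can do with that text: Python's standard library includes urllib.robotparser, which can tell us whether a path is allowed. To keep the example offline, the rules are copied into a string rather than downloaded. Its handling of wildcards is basic, so treat this as an illustration and not a complete robots.txt implementation.

```python
from urllib.robotparser import RobotFileParser

# The rules printed above, as they would appear in response.text.
robots_txt = """\
User-agent: *
Disallow: /admin_aliakhtari_com_page/
Allow: *
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

# "*" means "any bot"; a real scraper would pass its own user agent name.
admin_allowed = robots.can_fetch("*", "https://aliakhtari.com/admin_aliakhtari_com_page/")
home_allowed = robots.can_fetch("*", "https://aliakhtari.com/")
print("admin page allowed:", admin_allowed)
print("home page allowed:", home_allowed)
```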

Great, now we have the source of the web page in our hands, but how can we extract a certain tag from the HTML code? For example, how can we get the title of the web page? To process response.text, we can use one of the most popular packages for parsing HTML and XML documents: Beautiful Soup.

Beautiful Soup

Beautiful Soup is a Python library designed for parsing HTML and XML documents. You can read all about Beautiful Soup on its website: BeautifulSoup Doc. As before, use pip to install Beautiful Soup:

pip install beautifulsoup4

But before using Beautiful Soup, let's change the URL so that we get an actual HTML file rather than a simple text file.

response = requests.get('https://aliakhtari.com')

Beautiful Soup is a package that parses HTML code for us, so the first step is to pass the source code of our web page to it.

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

The first parameter is the source code that we got from the request, and the second parameter is the name of the parser. In this case that's html.parser. There are other parsers out there, but for now we only use html.parser.
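The name html.parser refers to the HTML parser that ships with Python's standard library. To give a rough idea of the work Beautiful Soup saves us, here is a sketch that uses that module directly to pull out a page title. The class name TitleParser and the sample HTML are made up for illustration, and this is not how Beautiful Soup is implemented internally.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only react to the first <title> we see.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append each piece.
        if self.in_title:
            self.title += data


# Hypothetical page source, standing in for response.text.
html = "<html><head><title>Example Page</title></head><body><p>Hello</p></body></html>"

title_parser = TitleParser()
title_parser.feed(html)
print(title_parser.title)
```

That is a lot of code for one tag, which is exactly why a higher-level package like Beautiful Soup is convenient.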

So what will be the output?

Now we have a Beautiful Soup instance that holds our source code. Let's find a certain tag inside it. For instance, we want to get the title of the web page.
All we have to do is tell Beautiful Soup to find a tag named title (<title>).

print(soup.find('title'))
# result: <title>Ali Akhtari Official WebSite</title>
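Keep in mind that find returns a tag object, not just a string, and that it returns None when nothing matches. The following sketch shows both points on a small, made-up HTML string, so it runs without any network access.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, standing in for response.text.
html = "<html><head><title>Example Page</title></head><body><h1>Hi</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.find("title")
print(title_tag)             # the whole tag, markup included
print(title_tag.get_text())  # only the text inside the tag

# find() returns None when no tag matches, so check before using the result.
missing = soup.find("nav")
print(missing is None)
```

If you only need the title text of a real page, calling get_text() on the result of soup.find('title') is usually what you want.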

See? Super easy.

Conclusion

So far we have written a script that gets the source code of a web page using Requests and then parses it using Beautiful Soup.

import requests
from bs4 import BeautifulSoup
response = requests.get('https://aliakhtari.com')
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.find('title'))

Perfect, now we have a web scraper that extracts the title of any web page we want in only 6 lines of code. In this post I tried to show you how easy it can be to write a web scraper using Python😍. If you are interested in learning more about web scraping, follow my blog. I'll post a lot of tutorials and source code; this post was just an introduction and there is a lot more to know. In future posts, I will share how to use these packages to write an advanced web scraper using Python.

Download The Script

Use this link to download the source code that we wrote. WebScraper.py