Have you ever wondered how search engines like Google and Bing show you billions of web pages in less than a second? Do you think that when you search for a keyword, the search engine scans every web page for it and then shows you the result? Well, that's not how they do it; if it were, the search would take a lifetime!
So how do the search engines work?
All search engines have their own robots, called Web Crawlers and Web Scrapers. These bots crawl every existing web page and index the data into databases; when we search for a word, the search engine looks through its indexed data and shows us the result. Today I'm going to show you how to create your own web scraper using Python 3 and get familiar with some of the libraries that are very helpful for writing one.
Differences between a Web Scraper and a Web Crawler
Before we start, let me explain the differences between a Web Crawler and a Web Scraper. We shouldn't be too picky about the distinction, because there is no universally agreed definition, but the main difference is this: a Web Crawler is a bot that indexes and stores information for search engines, while Scraping refers to extracting data from a specific target website for a certain purpose. A scraper may also perform actions like submitting data, so it can act like a real user.
How does a Web Scraper work?
Look at the picture below:
It's the simplest flowchart of how we send a request to a website and receive its response.
Every website has its own web server/host which handles requests. The request is sent to the web server along with some information, like headers that contain our identity (for example, which browser we are using). If our request is valid, the web server sends back a response containing the data that the browser renders and shows to us as a web page.
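To see what "a request with some information like headers" looks like in practice, here is a small sketch using Requests. It builds a request without sending it, so we can inspect what would go over the wire; the User-Agent value and URL are made up for illustration.

```python
import requests

# Build (but don't send) a GET request so we can inspect it.
# The User-Agent header is part of our "identity" to the server;
# this header value and URL are hypothetical examples.
headers = {'User-Agent': 'my-scraper/0.1'}
prepared = requests.Request('GET', 'https://example.com', headers=headers).prepare()

print(prepared.method)                 # GET
print(prepared.headers['User-Agent'])  # my-scraper/0.1
```

A prepared request like this can later be sent with a `requests.Session`, but for now the point is just that every request carries headers alongside the URL.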
First Steps to Writing a Web Scraper
As I mentioned, when we want to visit a web page, we send a request to the web server and receive a response. In Python we can send GET/POST requests and receive responses with or without an actual browser. With the Python Requests library we can send GET and POST requests to the target, so what are we waiting for? Let's install Requests and get to know it.
According to the Requests documentation (where you can read everything you need about this module; I also used it for this post), Requests is a Python HTTP library released under the Apache License 2.0. The goal of the project is to make HTTP requests simpler and more human-friendly. The current version of Requests is 2.23.0. If you don't have Requests installed on your machine, you can install it through CMD for Windows users or the Terminal for Mac/Linux users with this command:
pip install requests
Now we are ready to continue, let's jump into work with Requests.
As usual, you can import any library using the import keyword in your Script.py; in this case, to import Requests, you can use:
import requests
Making a GET Request
Making a request using Python Requests is easy; all you have to do is call a method named get:
response = requests.get('https://aliakhtari.com')
Now you've sent a request to my website, and my website has returned a response. To check whether the web server responded successfully, you can use this line of code, which shows the response status:
print(response.status_code) # result should be: 200
Every response code that starts with 2 (2xx) tells us that the request was successful. You can read about all the different kinds of status codes on this web page: List of HTTP Status Codes. Let's add more lines to our code to check the status code more precisely, using the most common response codes:
if response.status_code == 200:
    print("Success!")
elif response.status_code == 400:
    print("Bad Request")
elif response.status_code == 401:
    print("Unauthorized")
elif response.status_code == 403:
    print("Forbidden")
elif response.status_code == 404:
    print("Not Found")
elif response.status_code == 503:
    print("Service Unavailable")
Now we can manage our response code to check our request status.
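Chaining elif branches works, but Python's standard library already knows the standard reason phrase for every registered status code. As a sketch, a small lookup (the helper name status_message is my own, not part of any library) can replace the hand-written messages:

```python
from http.client import responses

def status_message(code):
    # Look up the standard reason phrase for an HTTP status code;
    # unknown codes fall back to a generic label.
    return responses.get(code, "Unknown Status")

print(status_message(200))  # OK
print(status_message(404))  # Not Found
print(status_message(503))  # Service Unavailable
```

This scales to every registered code without writing a branch per status.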
When we get our response, we can read its content; in this case, we can access it with response.text.
response.text contains the body of the response the web server sent us: for an HTML page, that's the page's full HTML source, including any inline CSS and JavaScript.
import requests

response = requests.get('https://aliakhtari.com/robots.txt')
if response.status_code == 200:
    print(response.text)

# This code prints the contents of https://aliakhtari.com/robots.txt, which is a .txt file, so the output should be:
# User-agent: *
# Disallow: /admin_aliakhtari_com_page/
# Allow: *
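As a quick aside, robots.txt is exactly the file polite scrapers are expected to honor, and the standard library's urllib.robotparser can interpret its rules. Here is a sketch that feeds it the Disallow rule from the output above, inlined so it runs without a network request:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules inline instead of fetching them over the
# network; these lines mirror the Disallow rule shown above.
rules = [
    "User-agent: *",
    "Disallow: /admin_aliakhtari_com_page/",
]
parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "/"))                            # True
print(parser.can_fetch("*", "/admin_aliakhtari_com_page/"))  # False
```

Checking can_fetch before scraping a path is a simple way to keep your bot well-behaved.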
Great, now we have the whole web page in our hands, but how can we extract a certain tag from the HTML code? For example, how can we get the title of the web page? To process response.text we can use the most popular package for parsing HTML and XML documents: Beautiful Soup.
Beautiful Soup is a Python library designed for parsing HTML and XML documents. You can read all about Beautiful Soup on its website: BeautifulSoup Doc. As before, use pip to install Beautiful Soup:
pip install beautifulsoup4
But before using Beautiful Soup, let's change the URL so we get an actual HTML file, not a simple text file.
response = requests.get('https://aliakhtari.com')
Beautiful Soup is a package that parses HTML code for us, so the first step is to pass the source code of our web page to it.
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
The first parameter is the source code we got from the request; the second parameter is the parser, in this case html.parser. There are other parsers out there, but for now we'll only use html.parser.
So what do we have now? An instance of Beautiful Soup which holds our parsed source code.
Let's find a certain tag inside of our source code.
For instance, we want to get the title of the web page.
All we have to do is tell Beautiful Soup to find a tag named title:
print(soup.find('title')) # result: <title>Ali Akhtari Official WebSite</title>
See? Super easy.
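Note that find returns the whole tag; if you only want the text between the tags, the tag's .text attribute strips the markup. A small offline sketch, using a tiny made-up page so it runs without sending any request:

```python
from bs4 import BeautifulSoup

# A tiny inline page, made up for illustration, so this example
# works without any network access.
html = "<html><head><title>My Page</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.find('title')
print(title_tag)       # <title>My Page</title>
print(title_tag.text)  # My Page
```

The same .text trick works on any tag that find returns.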
So far, we've written a script that gets the source code of a web page using Requests and then parses it using Beautiful Soup.
In future posts, I will show how to use these packages to write a more advanced web scraper with Python.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://aliakhtari.com')
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.find('title'))
Perfect, now we have a web scraper that extracts the title of any web page we want, in only 6 lines of code. In this post I tried to show you how easy it can be to write a Web Scraper using Python😍. If you are interested in learning more about web scraping, follow my blog. I'll post a lot of tutorials and source code; this post was just an introduction, and there is a lot more to know.