Introduction to Web Scraping Using Python

Dhwanipanjwani
4 min read · Jul 29, 2021

Scraping Data Of IMDb Top 50 Highest Rated Web-Series.

It is said that by the end of 2020, the entire digital universe will comprise 44 zettabytes of data (one zettabyte is a billion terabytes). But the data available on the internet is not always in a form suitable for analysis. Most of the data displayed on websites is dynamic, i.e. it comes from a server. To fetch this data, we can use a data-extraction method called web scraping. Web scraping is the process of extracting content and data from a website using automated scripts. Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts the underlying HTML code and, with it, data stored in a database. Web scraping is also known as web harvesting or web data extraction. Web scraping can be used for

  • Email address gathering: Most digital marketing companies use web scraping to collect email IDs and send bulk emails.
  • Research and Development: It can be used to collect large data sets (statistics, general information, temperature, etc.) from websites, which are then analyzed and used for R&D.
  • Job listings: Details regarding job openings and interviews are collected from different websites and listed in one place, so that they are easily accessible to the user.
  • Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products.

Python Libraries Used For Web Scraping

There are many libraries available in Python for web scraping, but here we use Requests, BeautifulSoup and Pandas.

  1. Requests: It allows you to send HTTP/1.1 requests with ease, and it does not require you to manually add query strings to your URLs or to form-encode your POST data.
  2. BeautifulSoup: It is used for web scraping purposes, to pull data out of HTML and XML files. It creates a parse tree from the page source code that can be used to extract data in a hierarchical and more readable manner.
  3. Pandas: Pandas is mainly used for data analysis. It allows importing data from various file formats such as comma-separated values (CSV), JSON, SQL, and Microsoft Excel, and supports data manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling.

Web Scraping involves 4 basic steps:

1. Find the URL: Select the website from which you want the data. For example, here I have used the IMDb website to get data on the top 50 highest rated web-series. The URL is: https://www.imdb.com/search/title/?title_type=tv_series&num_votes=100000,&sort=user_rating,desc

2. Inspect the page: The data is usually nested in tags. So, we inspect the page to see under which tag the data we want to scrape is nested. To inspect the page, just right-click on the element and click “Inspect”.

3. Find the data which is to be extracted: In this example I’m going to extract the name of each web-series, its year of release, genre and IMDb rating, which are nested under “div” tags.

4. Write the code: To do this, you can use any Python IDE. Here I have used Jupyter Notebook.

Import the required python libraries:
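For example, the three libraries discussed above can be imported as follows (the `pd` alias is a common convention, not a requirement):

```python
# Import the three libraries used in this walkthrough.
import requests                 # to send HTTP requests to the server
from bs4 import BeautifulSoup   # to parse the returned HTML
import pandas as pd             # to structure and export the scraped data
```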

Create empty variables to store the scraped data:
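A simple way to do this is one empty list per field we plan to scrape (the variable names below are just illustrative choices):

```python
# One empty list per field; each loop iteration will append one value to each.
names = []      # web-series titles
years = []      # years of release
genres = []     # genre labels
ratings = []    # IMDb ratings
```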

Now enter the URL from which you want the data. The Requests library is used to make HTTP requests to the server.
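A sketch of this step is below. Note that some sites reject requests that lack a browser-like User-Agent header, so one is added here as a precaution; the header value is an assumption, not something the site documents.

```python
import requests
from bs4 import BeautifulSoup

url = ("https://www.imdb.com/search/title/"
       "?title_type=tv_series&num_votes=100000,&sort=user_rating,desc")

# A browser-like User-Agent makes the request less likely to be rejected.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

# response.text holds the raw HTML; BeautifulSoup turns it into a parse tree.
soup = BeautifulSoup(response.text, "html.parser")
```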

Using the find and find_all methods in BeautifulSoup, extract the data from the required tags and store it in the variables declared earlier.
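To keep the example self-contained, the sketch below runs on a small inline HTML snippet that mimics the structure of one IMDb result; on the live page you would pass the `soup` object from the previous step, and you should inspect the page to confirm the class names, which may have changed since this was written.

```python
from bs4 import BeautifulSoup

# A minimal snippet mimicking one result block on the IMDb listing page
# (class names are assumptions based on the page layout; verify via Inspect).
html = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">
    <a href="/title/tt0903747/">Breaking Bad</a>
    <span class="lister-item-year">(2008-2013)</span>
  </h3>
  <p class="text-muted"><span class="genre">Crime, Drama, Thriller</span></p>
  <div class="ratings-bar"><strong>9.5</strong></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

names, years, genres, ratings = [], [], [], []

# find_all returns every matching tag; find returns the first match inside it.
for item in soup.find_all("div", class_="lister-item"):
    names.append(item.h3.a.text)
    years.append(item.find("span", class_="lister-item-year").text)
    genres.append(item.find("span", class_="genre").text.strip())
    ratings.append(item.find("div", class_="ratings-bar").strong.text)
```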

Now, using the Pandas library, create a DataFrame in which the data is stored in a structured way, so that you can export it to the desired file format. Here I have exported the data in .csv format.
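A sketch of this final step, using two sample rows as stand-ins for the lists filled in the previous step (the column names and output filename are illustrative choices):

```python
import pandas as pd

# Sample values standing in for the lists populated during scraping.
names = ["Breaking Bad", "Chernobyl"]
years = ["(2008-2013)", "(2019)"]
genres = ["Crime, Drama, Thriller", "Drama, History, Thriller"]
ratings = ["9.5", "9.4"]

# Build a DataFrame: one column per scraped field, one row per web-series.
df = pd.DataFrame({
    "Name": names,
    "Year": years,
    "Genre": genres,
    "Rating": ratings,
})

# index=False omits the row numbers from the exported file.
df.to_csv("top_webseries.csv", index=False)
```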

Running the whole code generates a CSV file containing the scraped data.

This is a basic program to perform web scraping. By performing this, you get to learn how to scrape data from the internet and format it for further analysis.
