Scrape any website using Python in 7 Steps (Stock Market, Social Media, etc…)

Aryan Bajaj
6 min read · Sep 2, 2022


Scrape any website in just 7 steps

If you want to know more about web scraping and how to scrape a website, this article walks you through the terminology and gives step-by-step instructions for automating this kind of repetitive task.

Actual Data & Extracted Data

What is the actual meaning of Web Scraping?

Web scraping is a process of extracting data from websites. It can be used to collect data from online sources such as online retailers or social media platforms. Web scraping can be done manually or automatically. Python is a popular language for web scraping because it is easy to learn and has many libraries that can be used for web scraping.

What is the role of HTML Codes in Web Scraping?

Role of HTML in Web Scraping

If you’re looking to scrape websites using Python, there are a few kinds of HTML code you’ll need to know how to work with. In this blog post, we’ll go over some of the most common HTML elements and how you can use Python to interact with them.

The first type of code you’ll need to know how to work with is HTML tags. These are the codes that tell your browser how to format the text on a web page. For example, the <p> tag tells your browser to start a new paragraph, and the <b> tag tells your browser to make the text bold. You can find a full list of HTML tags and their usage at W3Schools.com.

To interact with HTML tags using Python, you’ll first need to import the BeautifulSoup library. This library allows you to access and parse HTML code using Python. Once you have BeautifulSoup imported, you can use its find() method to locate specific tags on a web page.
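Here is a minimal sketch of that idea, using a small inline HTML snippet (made up for illustration) so it runs without a network call:

```python
from bs4 import BeautifulSoup

# A tiny inline HTML document standing in for a real web page.
html = "<html><body><h1>Trending</h1><p>First <b>bold</b> paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if no tag matches).
heading = soup.find("h1")
bold = soup.find("b")
print(heading.text)  # Trending
print(bold.text)     # bold
```

The `.text` attribute strips the tags away and leaves only the readable text inside them.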

Python is an incredibly powerful tool for web scraping. Once you’re finished with this blog post, you’ll be able to scrape any website you want!

It’s a lot of theory 😄. Now, the most awaited part Web Scraping using Python

Python Codes for Web Scraping

IDE: Visual Studio Code (VSC)

Step-1

The First and Foremost step is to load the necessary libraries:

import requests

Requests makes it exceedingly simple to send HTTP/1.1 requests. There’s no need to manually add query strings to your URLs or to form-encode your PUT and POST data; for JSON payloads, just pass the json argument.

Requests is one of the most popular Python libraries today, with roughly 30 million downloads per week. According to GitHub, Requests is presently used by over 1,000,000 projects, so it is a dependency you can rely on.
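As a quick sketch of how Requests builds query strings for you: preparing a request (no network call needed) shows the encoded URL. The `count` parameter here is hypothetical, just to illustrate the encoding:

```python
import requests

# Requests turns the params dict into a properly encoded query string.
req = requests.Request(
    "GET",
    "https://finance.yahoo.com/trending-tickers",
    params={"count": 25},
)
prepared = req.prepare()
print(prepared.url)  # https://finance.yahoo.com/trending-tickers?count=25
```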

from bs4 import BeautifulSoup

Beautiful Soup is a package that makes it simple to scrape data from websites.

It sits on top of an HTML or XML parser, providing Pythonic idioms for iterating over, searching through, and modifying the parse tree.

import pandas as pd

Pandas is a Python library that provides quick, versatile, and expressive data structures that are intended to make dealing with “relational” or “labeled” data simple and intuitive. It aspires to be the basic high-level building block for conducting realistic, real-world data analysis in Python.

Furthermore, it aspires to be the most powerful and adaptable open source data analysis and manipulation tool accessible in any language. It is already well on its way to accomplishing this aim.
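A minimal sketch of what a labeled pandas structure looks like, with made-up ticker data of the same shape as the table we extract later:

```python
import pandas as pd

# A small labeled table: rows indexed by ticker symbol, columns by name.
df = pd.DataFrame(
    {"Symbol": ["AAPL", "MSFT"], "Last Price": [170.0, 330.0]}
).set_index("Symbol")

# Labeled access: look up a cell by row label and column name.
print(df.loc["AAPL", "Last Price"])  # 170.0
```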

import lxml.html

lxml is a robust Pythonic binding for the libxml2 and libxslt libraries. Using the ElementTree API, it gives secure and simple access to these libraries.

It greatly expands the ElementTree API to include support for XPath, RelaxNG, XML Schema, XSLT, C14N, and many more languages.
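For instance, XPath lets you pull every matching node in one expression — something BeautifulSoup’s find() does not offer directly. A minimal sketch on an inline fragment (the links are made up):

```python
import lxml.html

# Parse an HTML fragment into an element tree.
tree = lxml.html.fromstring("<div><a href='/a'>one</a><a href='/b'>two</a></div>")

# One XPath expression collects the href attribute of every <a> tag.
links = tree.xpath("//a/@href")
print(links)  # ['/a', '/b']
```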

Step-2

After importing the libraries, the next step is to get the URL of a website. For this blog, I will be using Yahoo Finance’s trending-tickers page: https://finance.yahoo.com/trending-tickers

url = input("Enter a website to extract information from: ")
Step-3

After entering the URL, the next step is to check whether the URL is reachable.

To check, use the code below:

r = requests.get(url)
print(r)

This prints the HTTP status code of the response. Different status codes have different meanings.

Output:

Here the output is <Response [200]>. A status code of 200 means the URL is working and we can extract data from it.

Complete HTTP status code List

Our code is 200 which means we are good to go.
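In a script you would check this programmatically rather than by eye. A small sketch (the helper function is my own, not part of Requests):

```python
import requests

def ok_to_scrape(status_code):
    """Proceed only when the server reports success (any 2xx code)."""
    return 200 <= status_code < 300

# Requests ships a lookup table of status codes, so you can avoid magic numbers.
print(requests.codes.ok)   # 200
print(ok_to_scrape(200))   # True
print(ok_to_scrape(404))   # False
```

With a live response `r`, you would call `ok_to_scrape(r.status_code)`, or simply `r.raise_for_status()` to raise an exception on any error code.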

Step-4

Now, we will use html.parser:

soup = BeautifulSoup(r.content, 'html.parser')

Parsing is the process of breaking down a phrase or set of words into independent components and identifying each part’s purpose or form.

The technical meaning is the same idea: html.parser breaks the raw HTML text into a tree of tags that BeautifulSoup can search.

All high-level programming languages rely on parsing.

Step-5

In this example, we want the trending-tickers table. To extract it, we use:

tabela = soup.find(name='table')
Step-6

We have extracted the table, but it is stored in the tabela variable as raw HTML. Next, we use pandas to turn it into a DataFrame so that it can be processed further if needed.

df = pd.read_html(str(tabela))[0].set_index('Name')
df.head()

We use “[0]” because read_html returns a list of DataFrames, one per table found; here the list has a single element, and we need that element, not the list.
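A minimal sketch of that behaviour, on an inline one-row table (the contents are made up). Newer pandas versions expect a file-like object, so the string is wrapped in StringIO:

```python
from io import StringIO

import pandas as pd

# One <table> in the HTML means read_html returns a one-element list.
html = (
    "<table><tr><th>Name</th><th>Price</th></tr>"
    "<tr><td>AAPL</td><td>170</td></tr></table>"
)
tables = pd.read_html(StringIO(html))
print(type(tables))  # <class 'list'>

# [0] picks the single DataFrame out of the list.
df = tables[0].set_index("Name")
print(df.loc["AAPL", "Price"])  # 170
```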

Output:

The actual page contains small chart images in the Intraday High/Low, 52 Week Range, and Day Chart columns. We don’t want any images in the data for now, since we are only extracting information, not analyzing it.

So we’ll drop these columns using the code below:

data = df.drop(['Intraday High/Low', '52 Week Range', 'Day Chart'], axis=1)
Step-7

Now, the most awaited step: showing the results.

We’ll use:

print("Page Title Extracted : ", soup.find('title').text)
print("Page Heading Extracted : ", soup.find('h1').text)
print("Data Table Extracted : ")
data.head(10)

Output:

That is all.

That’s how you can scrape any website and make a downloadable file.
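To actually produce the downloadable file, one option is pandas’ to_csv. A sketch with made-up data in place of the scraped table (the filename is my choice, not from the post):

```python
import pandas as pd

# Stand-in for the scraped table from the steps above.
df = pd.DataFrame({"Name": ["AAPL"], "Last Price": [170.0]}).set_index("Name")

# With no path argument, to_csv returns the CSV text as a string;
# df.to_csv("trending_tickers.csv") would write it to disk instead.
csv_text = df.to_csv()
print(csv_text.splitlines()[0])  # Name,Last Price
```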

Conclusion

This blog has shown you how to scrape a website using Python. Whether you’re looking for data on a specific topic or want to download the content of a page, Python can help you get the job done quickly and easily. So next time you need to scrape a website, don’t hesitate to give Python a try.

In case of questions, leave a Comment or Email me at aryanbajaj104@gmail.com

ABOUT THE AUTHOR

I recently completed BBA (BUSINESS ANALYTICS) from CHRIST University, Lavasa, Pune Campus.

Website — acumenfinalysis.com (CHECK THIS OUT)

CONTACTS:

If you have any questions or suggestions on what my next article should be about, please write to me at aryanbajaj104@gmail.com.

If you want to keep updated with my latest articles and projects, follow me on Medium.

Subscribe to my Medium Account: https://aryanbajaj13.medium.com/subscribe

CONNECT WITH ME VIA:

LinkedIn

