Scrape Webpage with Python

Overview

Web scraping enables you to efficiently fetch and extract data from web pages, opening doors to information that might otherwise remain hidden or difficult to collect.

This workshop provides an introduction to web scraping using Python. Through a hands-on project, participants will learn how to use the Requests and BeautifulSoup Python packages to scrape, parse, and extract specific data from a webpage.
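As a rough preview of the fetch-then-parse workflow described above, here is a minimal sketch. It is not the workshop's actual project code: the sample HTML, the `fetch_html` and `extract_links` helper names, and the page structure are all made up for illustration. The parsing step runs on an inline HTML string so it works without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, used instead of a live website for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Workshop Reading List</h1>
  <ul>
    <li><a href="/books/1">Book One</a></li>
    <li><a href="/books/2">Book Two</a></li>
  </ul>
</body></html>
"""


def fetch_html(url):
    """Download a page with Requests and return its HTML as text."""
    # Imported here so the parsing part below runs without Requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Fail loudly on 4xx/5xx responses.
    return response.text


def extract_links(html):
    """Return (link text, href) pairs for every anchor tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = soup.h1.get_text(strip=True)
links = extract_links(SAMPLE_HTML)
print(title)
print(links)
```

For a real page, `extract_links(fetch_html(url))` would chain the two steps, assuming the site permits automated requests.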

While the workshop is designed with beginners in mind, it also offers valuable learning opportunities for experienced Python users through a separate self-paced web-crawling project using Scrapy (a widely used Python web scraping framework).

Participants will learn:

Web scraping fundamentals.
The art of HTML parsing and extracting data using Beautiful Soup.
Precise data retrieval using regex.
Efficient data extraction using Scrapy (advanced).
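
To illustrate what "precise data retrieval using regex" can look like, the sketch below pulls prices out of a block of text with Python's built-in `re` module. The sample text, the pattern, and the `extract_prices` helper name are assumptions chosen for illustration, not material taken from the workshop.

```python
import re

# Hypothetical text, as it might look after being extracted from a webpage.
SAMPLE_TEXT = "Paperback: $19.99. Hardcover: $34.50. Shipping from $5 per order."

# A dollar sign followed by digits, with an optional two-digit cents part.
# The parentheses capture only the number, not the dollar sign.
PRICE_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(text):
    """Return every dollar amount found in the text as a float."""
    return [float(match) for match in PRICE_PATTERN.findall(text)]


prices = extract_prices(SAMPLE_TEXT)
print(prices)
```

A pattern like this works best on text that has already been isolated with an HTML parser; applying regex directly to raw HTML tends to be fragile.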

Prior Knowledge

To fully appreciate and make the most of the workshop, a basic understanding of Python is beneficial. However, if you have little to no Python experience, the instructor will cover the basics at the beginning of the workshop to ensure everyone can participate.

Software Details

Make sure that Python 3 (version 3.9 or later recommended) and an IDE (Jupyter recommended) are installed.

Refer to Python_IDE_Setup for instructions on installing the required software.

The packages this workshop will use are:

Requests
Pandas
BeautifulSoup
Scrapy (optional)

You can install them with pip3 or conda. Installation will be demonstrated at the beginning of the workshop.
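
After installing, one way to confirm that the packages are available is a short check like the sketch below. The `check_packages` helper and the label-to-module mapping are illustrative assumptions. Note that BeautifulSoup is installed under the name `beautifulsoup4` but imported as `bs4`.

```python
from importlib.util import find_spec

# Maps each package listed above to the module name used in import statements.
# BeautifulSoup is installed as "beautifulsoup4" but imported as "bs4".
WORKSHOP_PACKAGES = {
    "Requests": "requests",
    "Pandas": "pandas",
    "BeautifulSoup": "bs4",
    "Scrapy (optional)": "scrapy",
}


def check_packages(packages):
    """Return {label: True/False} depending on whether each module can be found."""
    return {
        label: find_spec(module) is not None
        for label, module in packages.items()
    }


status = check_packages(WORKSHOP_PACKAGES)
for label, installed in status.items():
    print(f"{label}: {'installed' if installed else 'missing'}")
```

Running this in the same environment (for example, the same Jupyter kernel) that you plan to use for the workshop gives the most reliable answer.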

IMPORTANT INFORMATION

Web scraping is generally legal, provided that the scraper complies with the Terms of Service of the target website and the data is publicly available. It is good practice to check the robots.txt file of the website before scraping it.
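
As one possible way to check robots.txt rules programmatically, the sketch below uses `urllib.robotparser` from the Python standard library. The rules, the `workshop-bot` user agent, and the example.com URLs are hypothetical. The rules are parsed from an inline string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "workshop-bot" is a made-up user agent name for this example.
public_ok = parser.can_fetch("workshop-bot", "https://example.com/articles/1")
private_ok = parser.can_fetch("workshop-bot", "https://example.com/private/data")
print(public_ok, private_ok)
```

For a live site, `RobotFileParser` can also download the file itself through `set_url()` followed by `read()`. Passing the robots.txt check does not replace reading the site's Terms of Service.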

Popular social media websites such as Twitter, Facebook, and Instagram have their own policies, and some offer APIs for direct data access.

In addition, the following are outside the scope of this workshop:

Websites with paywalls, such as WSJ and the New York Times.
Websites protected by CAPTCHA.
Data that requires credentials for access.
Google services, such as Google Search and Google Maps.
Other websites that prohibit automated HTTP requests.
Data rendered by JavaScript.

Workshop Plan and Date

The workshop is planned as a single 1.5-hour session that includes a brief introduction and live demos. The instructor will answer questions and help with debugging during the live coding.

2:30-4:00 pm, Feb. 15, CDS 246 (Hesburgh Library)

