Lucy-Family-Institute/CSSR-Workshop-Scrapy

Scrape Webpage with Python

Author: Yang Xu ([email protected])

Overview

Web scraping enables you to efficiently fetch and extract data from web pages, opening doors to information that might otherwise remain hidden or difficult to collect.

This workshop provides an introduction to web scraping using Python. Through a hands-on project, participants will learn how to use the Requests and BeautifulSoup Python packages to fetch a webpage, parse it, and extract specific data from it.

While the workshop is designed with beginners in mind, it also offers valuable learning opportunities for experienced Python users through a separate self-paced web-crawling project using Scrapy (a widely used Python web scraping framework).

Participants will learn:

  1. Web scraping fundamentals.
  2. The art of HTML parsing and extracting data using Beautiful Soup.
  3. Precise data retrieval using regex.
  4. Efficient data extraction using Scrapy (advanced).
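As a rough preview of items 1 to 3, the sketch below parses a small page with BeautifulSoup and pulls out prices with a regular expression. The sample HTML, the class names (`book`, `title`, `price`), and the helper names `fetch_html` and `extract_books` are all made up for illustration; they are not taken from the workshop material.

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical sample page, used so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h1>Book list</h1>
  <ul>
    <li class="book">
      <span class="title">Python Basics</span>
      <span class="price">Price: $19.99</span>
    </li>
    <li class="book">
      <span class="title">Web Scraping 101</span>
      <span class="price">Price: $24.50</span>
    </li>
  </ul>
</body></html>
"""

# Matches a dollar amount such as "$19.99" and captures the number.
PRICE_PATTERN = re.compile(r"\$(\d+\.\d{2})")


def fetch_html(url, timeout=10):
    """Download a page with Requests and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_books(html):
    """Return a list of {'title', 'price'} dicts parsed from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for item in soup.find_all("li", class_="book"):
        title = item.find("span", class_="title").get_text(strip=True)
        price_text = item.find("span", class_="price").get_text(strip=True)
        match = PRICE_PATTERN.search(price_text)
        price = float(match.group(1)) if match else None
        books.append({"title": title, "price": price})
    return books


if __name__ == "__main__":
    # Parse the inline sample; swap in fetch_html(url) for a live page.
    for book in extract_books(SAMPLE_HTML):
        print(book)
```

For a real page, the HTML string would come from `fetch_html(url)` instead of `SAMPLE_HTML`, and the tag and class names would have to match that page's markup.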

Prior Knowledge

To fully appreciate and make the most of the workshop, a basic understanding of Python is beneficial. However, if you have little to no Python experience, the instructor will cover the basics at the beginning of the workshop to ensure everyone can participate.

Software Details

Make sure Python 3 (version 3.9 or later recommended) and an IDE (Jupyter recommended) are installed.

Refer to Python_IDE_Setup for instructions on installing the required software.

The packages this workshop will use are:

  1. Requests
  2. Pandas
  3. BeautifulSoup
  4. Scrapy (optional)

You can install them with pip3 or conda. Installation will be demonstrated at the beginning of the workshop.
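Once the packages are installed (for example with `pip3 install requests pandas beautifulsoup4 scrapy`), a quick check like the one below can confirm that Python can find them. This is only a sketch: the helper name `missing_packages` and the two dictionaries are illustrative and not part of the workshop code.

```python
import importlib.util

# Maps import name -> pip package name. The two differ for BeautifulSoup,
# which is installed as "beautifulsoup4" but imported as "bs4".
REQUIRED = {"requests": "requests", "pandas": "pandas", "bs4": "beautifulsoup4"}
OPTIONAL = {"scrapy": "scrapy"}


def missing_packages(modules):
    """Return the pip names of the modules that cannot be found."""
    return [
        pip_name
        for module_name, pip_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing required packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")

    if missing_packages(OPTIONAL):
        print("Scrapy is not installed (only needed for the advanced project).")
```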

IMPORTANT INFORMATION

Web scraping is generally legal, provided that the scraping follows the Terms of Service of the target website and the data is public. A good practice is to check the robots.txt file of a website before scraping it.
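The robots.txt check can be automated with `urllib.robotparser` from the Python standard library. The sketch below parses an inline, made-up robots.txt so it runs offline; the user agent name `workshop-bot` and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    for url in (
        "https://example.com/index.html",
        "https://example.com/private/data.html",
    ):
        allowed = parser.can_fetch("workshop-bot", url)
        print(url, "->", "allowed" if allowed else "disallowed")
    print("Requested delay between requests:", parser.crawl_delay("workshop-bot"))
```

Against a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` would replace the inline text. Note that robots.txt only expresses the site owner's crawling preferences; the Terms of Service still need to be read separately.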

Popular social media websites such as Twitter, Facebook, and Instagram have their own policies, and some offer an API for direct data access.

In addition, the websites and data listed below are outside the scope of this workshop:

  1. Websites with a paywall, such as WSJ and The New York Times.
  2. Websites protected by CAPTCHA.
  3. Data that requires credentials for access.
  4. Google services, such as Google Search and Google Maps.
  5. Other websites that prohibit automated HTTP requests.
  6. Data rendered by JavaScript.

Workshop Plan and Date

The workshop is planned as a single 1.5-hour session, including a brief introduction and live demos. The instructor will answer questions and help with debugging during live coding.

2:30-4:00 pm, Feb. 15, CDS 246 (Hesburgh Library)
