Extracting Data from HTML with BeautifulSoup

This course covers the key aspects of scraping websites with Beautiful Soup. You will learn to build, traverse, and modify the parse tree, and to use advanced features such as filters, CSS selectors, and XPath.

What you'll learn

Web scraping is an important technique that is widely used as the first step in many workflows in data mining, information retrieval, and text-based machine learning.

In this course, Extracting Data from HTML with BeautifulSoup, you will gain the ability to build robust, maintainable web scraping solutions using the Beautiful Soup library in Python.

First, you will learn how regular expressions can be used to scrape web content, and where Beautiful Soup improves on that approach. Next, you will discover how Beautiful Soup parses HTML from web content, repairs badly formed tags, and builds a clean, easily traversable parse tree. You will then see how that parse tree can be used to find and retrieve specific patterns.
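As a rough illustration of the contrast between regular expressions and a parse tree, the sketch below runs both over the same small, deliberately inconsistent snippet. The markup, the variable names, and the choice of Python's built-in `html.parser` backend are assumptions made for this example, not material taken from the course.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: mixed quoting, mixed case, and a missing closing tag.
html = """
<div class="price" id="a">10</div>
<div id="b" class='price'>20</div>
<DIV CLASS="price">30
"""

# Regex approach: the pattern only matches the exact spelling it was written for.
regex_items = re.findall(r'<div class="price"[^>]*>(\d+)</div>', html)

# Beautiful Soup approach: parse first, then search the resulting tree.
soup = BeautifulSoup(html, "html.parser")
soup_items = [
    div.get_text(strip=True) for div in soup.find_all("div", class_="price")
]

print(regex_items)  # only the first div fits the pattern
print(soup_items)   # all three price values
```

The regular expression could be extended to cover each variation, but every new quirk in the markup needs another pattern change, whereas the tree search stays the same.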

Finally, you will round out your knowledge by using advanced features of Beautiful Soup, such as working with CSS selectors and XPath. When you’re finished with this course, you will have the skills and knowledge to implement robust web scraping using Beautiful Soup.

Course Overview

2mins

Course Overview 2m

Getting Started with BeautifulSoup

44mins

Navigating the Parse Tree

40mins

Module Overview 1m
Parsing Web Pages with Beautiful Soup 5m
Tags, Attributes, NavigableStrings, Comments 4m
Navigating Using Tags and Contents 4m
Navigating Children, Descendants, and Parents 6m
Navigating Sideways Using Next and Previous Sibling 4m
Navigating Sideways Using Next Element and Previous Element 3m
Filter by Tags and Attributes Using Regular Expressions and Custom Functions 7m
Extracting Absolute and Relative Links from HTML 5m
Module Summary 1m
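
The navigation topics in this module can be sketched roughly as follows. The HTML snippet, the `has_relative_href` helper, and the `base_url` value are hypothetical and chosen only for illustration; the course's own demos may differ.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between tags so that
# the only children of <div> are tags.
html = (
    '<div id="menu">'
    '<a href="/docs">Docs</a>'
    '<a href="https://example.org/blog">Blog</a>'
    '<b>Bold</b>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
menu = soup.div

# Children, siblings, and parents.
child_names = [child.name for child in menu.children]
first_link = menu.a
sibling_text = first_link.next_sibling.get_text()
parent_id = first_link.parent["id"]

# Filter by tag name with a regular expression (names starting with "b").
b_tags = [tag.name for tag in soup.find_all(re.compile(r"^b"))]

# Filter with a custom function (hypothetical helper for this sketch).
def has_relative_href(tag):
    return tag.has_attr("href") and not tag["href"].startswith("http")

relative_links = [tag["href"] for tag in soup.find_all(has_relative_href)]

# Resolve relative links against an assumed base URL.
base_url = "https://example.com/"
absolute_links = [
    urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)
]
print(absolute_links)
```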

Searching for Elements in the Parse Tree

30mins

Module Overview 1m
XML and XPath 4m
Performing Advanced Search on the Parse Tree 7m
Searching Using Variations of Find and Find All 4m
CSS Selectors Using Soup Sieve 7m
Using XPath to Navigate an XML Tree 5m
Module Summary 2m
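
A minimal sketch of the search styles listed in this module, under some assumptions: the table and catalog data are invented, and the XPath part uses the limited XPath subset in Python's standard-library `xml.etree.ElementTree` as a stand-in, since Beautiful Soup itself does not evaluate XPath and the tool used in the course is not stated here.

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Hypothetical HTML table.
html = """
<table id="books">
  <tr class="row"><td class="title">Dune</td><td class="year">1965</td></tr>
  <tr class="row"><td class="title">Emma</td><td class="year">1815</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find / find_all and their relatives.
first_title = soup.find("td", class_="title").get_text()
limited_rows = soup.find_all("tr", limit=1)
year_cell = soup.find("td", class_="title").find_next_sibling("td")
owning_row = year_cell.find_parent("tr")

# CSS selectors (handled by the Soup Sieve package behind select()).
css_titles = [td.get_text() for td in soup.select("#books tr.row > td.title")]
css_first_year = soup.select_one("td.year").get_text()

# XPath-style query on an XML tree, using the standard library's subset.
xml_doc = (
    "<catalog>"
    "<book year='1965'><title>Dune</title></book>"
    "<book year='1815'><title>Emma</title></book>"
    "</catalog>"
)
root = ET.fromstring(xml_doc)
xpath_titles = [t.text for t in root.findall("./book[@year='1965']/title")]
print(css_titles, xpath_titles)
```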

Leveraging Advanced Features of BeautifulSoup

30mins

Module Overview 1m
Modifying the HTML Parse Tree 6m
Exploring Beautiful Soup Functions to Modify the Parse Tree 6m
Miscellaneous Operations Using Beautiful Soup 6m
Working with Different Parsers 4m
Using the Soup Strainer to Parse Parts of a Document 2m
Encodings in Beautiful Soup 3m
Summary and Further Study 2m
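
The sketch below touches three of the topics in this module: modifying the tree, parsing only part of a document with `SoupStrainer`, and decoding bytes in a known encoding. The markup and values are hypothetical, and `html.parser` is assumed as the backend to keep the example dependency-free beyond Beautiful Soup itself.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup for this sketch.
html = '<div><p class="old">Hello</p><a href="/a">A</a><a href="/b">B</a></div>'

# Modifying the parse tree.
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.p
paragraph.string = "Goodbye"     # replace the paragraph's text
paragraph["class"] = "new"       # overwrite an attribute
note = soup.new_tag("span")      # create a new tag and attach it
note.string = "note"
paragraph.append(note)
soup.a.decompose()               # remove the first link entirely

# Parsing only the <a> tags with SoupStrainer.
only_links = SoupStrainer("a")
links_soup = BeautifulSoup(html, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in links_soup.find_all("a")]

# Decoding bytes whose encoding is known in advance.
raw = "<p>café</p>".encode("latin-1")
encoded_soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")
decoded_text = encoded_soup.p.get_text()

print(soup)
print(hrefs, decoded_text)
```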

About the author

Janani Ravi

Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework. After spending years working in tech in the Bay Area, New York, and Singapore at companies such as Microsoft, Google, and Flipkart, Janani decided to combine her love for technology with her passion for teaching. She is now the co-founder of Loonycorn, a content studio focused on providing ...
