2022/06/30

[Python Academy] Beginner Series, Part 11: Basic Operations with BeautifulSoup4

A quick tour of BeautifulSoup4
Loading the BeautifulSoup package
Navigating parts of the page

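Hello, everyone! In the previous article, "[Python Academy] Beginner Series, Part 10: How to Download Web Data with Python", we learned how to download web page data with Python. This article shows how to use the BeautifulSoup4 package to parse a page's structure and pull out just the parts we want. Read on.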

The current learning roadmap is shown below:

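Once the HTML has been downloaded from the internet, the next step is to parse its structure. The downloaded HTML is just one long string, and extracting the useful parts with string operations alone is very difficult.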
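To make that difficulty concrete, here is a minimal sketch that uses only string operations. The two HTML strings are made-up sample data, not taken from a real page.

```python
# A tiny HTML document held as a plain string (made-up sample data).
html = "<html><head><title>A Useful Page</title></head><body></body></html>"

# Pull out the title using only string operations.
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
title_text = html[start:end]
print(title_text)  # A Useful Page

# The same approach breaks as soon as the tag carries an attribute,
# because the literal "<title>" no longer appears in the string.
html_with_attr = '<html><head><title lang="en">A Useful Page</title></head></html>'
position = html_with_attr.find("<title>")
print(position)  # -1, meaning "not found"
```

Every variation in the markup (attributes, whitespace, nesting) needs another special case, which is why a real parser is the safer choice.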

Thankfully, there is BeautifulSoup4. Its main job is to parse an HTML or XML document in full, so that the data you want can be retrieved with a few simple operations.

BeautifulSoup4 is a third-party package, so it must be installed into your project from PyPI (Python Package Index) as follows:

$ pip install beautifulsoup4
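One quick way to confirm that the installation worked is to import the package and parse a one-line document. This is only a sanity-check sketch; the variable name `check` is arbitrary.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version; the exact value depends on your environment.
print(bs4.__version__)

# Parse a trivial document to confirm the package is usable.
check = BeautifulSoup("<p>ok</p>", "html.parser")
print(check.p.string)  # ok
```

If the import fails with ModuleNotFoundError, the package was probably installed into a different Python environment than the one running the notebook.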

Example 1: A quick tour of BeautifulSoup4

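The examples below were run in the Jupyter Notebook extension for VSCode; see the earlier article for installation instructions.

Each code block corresponds to one notebook cell,

as shown in the figure below.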

# Build an HTML string by hand
html_doc = """
<html>
    <head>
        <title>A Useful Page</title>
    </head>
    <body>
        <h1>An Interesting Title</h1>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
            Elsie
        </a>
        ,
        <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
        </a>
        and they lived at the bottom of a well.
        </p>

        <div class='article'>
        Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor 
        incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud 
        exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute 
        irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla 
        pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia 
        deserunt mollit anim id est laborum.
        </div>
    </body>
</html>
"""

Below is the rough structure of this HTML:

Loading the BeautifulSoup package

Use the prettify() instance method of BeautifulSoup to print a neatly indented version of the HTML.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.prettify())

Output:
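To get a feel for what prettify() produces, the sketch below runs it on a much smaller snippet. The `snippet` string is made up for illustration; the point is that prettify() returns an ordinary string with each tag and each piece of text on its own indented line.

```python
from bs4 import BeautifulSoup

# A deliberately small, single-line document (made-up sample data).
snippet = "<div><p>Hello</p><p>World</p></div>"
small_soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns a str; it does not modify the parsed document.
pretty = small_soup.prettify()
print(type(pretty).__name__)
print(pretty)

# Strip the indentation to see the order in which the pieces are emitted.
pieces = [line.strip() for line in pretty.splitlines() if line.strip()]
print(pieces)
```

Because prettify() adds whitespace around text, it is best suited to inspecting a document rather than to producing HTML for further processing.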

Navigating parts of the page

1. Get the title element (its structure is shown below)

soup.title

Output:===========
<title>A Useful Page</title>

2. Get the tag name of the title element

soup.title.name

Output:===========
'title'

3. Get the text content of the title element

soup.title.string

Output:===========
'A Useful Page'
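A caveat about .string is worth a short sketch: it only returns text when the element contains a single string, either directly or through a chain of single children. The HTML below is a shortened, made-up variant of the sample document, and get_text() is shown as one alternative for elements that mix text and tags.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were <a href="http://example.com/elsie">Elsie</a> and others.</p>
"""
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

# Single chain (<p> -> <b> -> text): .string follows it down to the text.
print(first_p.string)  # The Dormouse's story

# Mixed text and tags: there is no single string to return, so the result is None.
print(second_p.string)  # None

# get_text() concatenates every piece of text inside the element instead.
story_text = second_p.get_text()
print(story_text)  # Once upon a time there were Elsie and others.
```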

4. Get the tag name of the title element's parent

soup.title.parent.name

Output:===========
'head'

5. Get the h1 element (its structure is shown below)

soup.h1

Output:===========
<h1>An Interesting Title</h1>

6. Get the first p element (its structure is shown below)

When several elements share the same tag name, only the first one is returned.

soup.p

Output:===========
<p class="title"><b>The Dormouse's story</b></p>

7. Get the value of the first p element's class attribute

soup.p['class']

Output:===========
['title']
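The result above is a list rather than a plain string because class can hold several values at once. The sketch below uses a made-up a element with two classes to show how this differs from single-valued attributes such as id, and how .get() behaves when an attribute is absent.

```python
from bs4 import BeautifulSoup

# A made-up element with two class names, for illustration only.
html = '<a class="sister featured" href="http://example.com/elsie" id="link1">Elsie</a>'
link = BeautifulSoup(html, "html.parser").a

# class is multi-valued, so it comes back as a list of names.
print(link["class"])  # ['sister', 'featured']

# id holds a single value, so it comes back as a plain string.
print(link["id"])  # link1

# .attrs exposes all attributes as a dictionary.
print(link.attrs)

# .get() returns None for a missing attribute, whereas link["title"] would raise KeyError.
missing_title = link.get("title")
print(missing_title)  # None
```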

8. Get the first a element (its structure is shown below)

soup.a

Output:===========
<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
</a>

9. Get all a elements

soup.find_all('a')

Output:===========
[<a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
 </a>,
 <a class="sister" href="http://example.com/lacie" id="link2">
  Lacie
 </a>,
 <a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
 </a>]
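The elements returned above still carry the line breaks and indentation from the source HTML. If only the link text is needed, one option is get_text(strip=True), sketched here on a shortened copy of the sample markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup, keeping the original whitespace.
html = """
<a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like result, so len() and iteration both work.
links = soup.find_all("a")
print(len(links))  # 2

# strip=True removes the surrounding whitespace from each piece of text.
names = [a.get_text(strip=True) for a in links]
print(names)  # ['Elsie', 'Lacie']
```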

10. Get the element whose id attribute is link3

soup.find(id='link3')

Output:===========
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
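Besides id, elements can be looked up by class or with CSS selectors. The sketch below goes slightly beyond what this article covers: it uses the class_ keyword (written with a trailing underscore because class is a reserved word in Python) and the select() / select_one() methods, on a condensed version of the sample markup.

```python
from bs4 import BeautifulSoup

# Condensed version of the sample markup.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and class together.
sisters = soup.find_all("a", class_="sister")
print([a["id"] for a in sisters])  # ['link1', 'link2', 'link3']

# find() returns the first match only.
story = soup.find("p", class_="story")
print(story["class"])  # ['story']

# CSS selectors: "#link3" matches by id, "p.story a" matches a tags inside p.story.
tillie = soup.select_one("#link3")
story_links = soup.select("p.story a")
print(tillie.get_text())  # Tillie
print(len(story_links))  # 3

# When nothing matches, find() returns None instead of raising an error.
missing = soup.find(id="link99")
print(missing)  # None
```

Since find() can return None, it is worth checking the result before accessing attributes on it.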

11. Get the URL of every a tag

for link in soup.find_all('a'):
    print(link.get('href'))

Output:===========
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
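On real pages, some a tags have no href at all and others use relative paths. The sketch below shows one way to handle both cases. The markup and `base_url` are assumptions made up for illustration, and urljoin comes from the standard library rather than from BeautifulSoup4.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the HTML was downloaded from.
base_url = "http://example.com/"

# Made-up markup: one absolute link, one relative link, one anchor without href.
html = """
<a href="http://example.com/elsie">Elsie</a>
<a href="/lacie">Lacie</a>
<a name="top">No link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
for link in soup.find_all("a"):
    href = link.get("href")  # None when the attribute is missing
    if href is None:
        continue
    # urljoin leaves absolute URLs alone and resolves relative ones against base_url.
    urls.append(urljoin(base_url, href))

print(urls)  # ['http://example.com/elsie', 'http://example.com/lacie']
```

This is also why the loop above uses link.get('href') rather than link['href']: the former tolerates tags without the attribute.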

That's all for today's tutorial. Try these techniques yourself and use BeautifulSoup4 to filter out the data you need. See you next time!