Once the goal is clear, the next step is deciding how to get there. The idea behind a web crawler is simple: visit a page, pull down the site's source code, and locate the target strings through XPath or HTML node traversal.
Based on the previous post, [Python][教學] 網路爬蟲(crawler)實務(上)--網頁元件解析分析, our crawling strategy will roughly be:
- Go to the search page > find the shop URLs > go to each shop page > extract the data
Breaking that flow down into logic that maps more directly onto a crawler program:
- Go to the search results page
- The search results span multiple pages, so use a URL parameter to fetch n of them in one pass (n search pages)
- Parse the shop URLs out of each search page; each page yields m shop URLs (m shop pages per search page)
- Visit each shop URL and parse out the fields we need (n * m shop pages in total)
Enough talk; here is the code:
```python
## import the required packages
import requests
from bs4 import BeautifulSoup
import time
from random import randint
import sys
from IPython.display import clear_output

## grab the shop URLs from the search pages
## (the phone numbers on the search page are images, which are hard to scrape)
links = ['http://www.ipeen.com.tw/search/all/000/1-100-0-0/?p=' + str(i + 1) + '&adkw=東區&so=commno' for i in range(10)]
shop_links = []
for link in links:
    res = requests.get(link)
    soup = BeautifulSoup(res.text.encode("utf-8"))
    shop_table = soup.findAll('h3', {'class': 'name'})
    ## pull out the URLs wrapped inside the <a> tags
    for shop_link in shop_table:
        link = 'http://www.ipeen.com.tw' + [tag['href'] for tag in shop_link.findAll('a', {'href': True})][0]
        shop_links.append(link)
    ## take a short nap so we don't get blocked
    time.sleep(1)

## build the header row for the output file
title = "shop" + "," + "category" + "," + "tel" + "," + "addr" + "," + "cost" + "," + "rank" + "," + "counts" + "," + "share" + "," + "collect"
shop_list = open('shop_list.txt', 'w')
## write the header first
shop_list.write(title.encode('utf-8') + "\n")

## visit each shop page and pull out the fields we need
for i in range(len(shop_links)):
    res = requests.get(shop_links[i])
    soup = BeautifulSoup(res.text.encode("utf-8"))
    header = soup.find('div', {'class': 'info'})
    shop = header.h1.string.strip()
    ## exception handling: not every page provides every field
    try:
        category = header.find('p', {'class': 'cate i'}).a.string
    except Exception as e:
        category = ""
    try:
        tel = header.find('p', {'class': 'tel i'}).a.string.replace("-", "")
    except Exception as e:
        tel = ""
    try:
        addr = header.find('p', {'class': 'addr i'}).a.string.strip()
    except Exception as e:
        addr = ""
    try:
        cost = header.find('p', {'class': 'cost i'}).string.split()[1]
    except Exception as e:
        cost = ""
    try:
        rank = header.find('span', {'itemprop': 'average'}).string
    except Exception as e:
        rank = ""
    try:
        counts = header.find_all('em')[0].string.replace(',', '')
    except Exception as e:
        counts = ""
    try:
        share = header.find_all('em')[1].string.replace(',', '')
    except Exception as e:
        share = ""
    try:
        collect = header.find_all('em')[2].string.replace(',', '')
    except Exception as e:
        collect = ""
    ## join the fields with commas (there is probably a better way, but this will do for now)
    result = shop + "," + category + "," + tel + "," + addr + "," + cost + "," + rank + "," + counts + "," + share + "," + collect
    shop_list.write(result.encode('utf-8') + "\n")
    ## sleep for a random interval
    time.sleep(randint(1, 5))
    ## show progress in the notebook (Python 2 print statement)
    clear_output()
    print i
    sys.stdout.flush()

shop_list.close()
```
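Once the run finishes, it is worth spot-checking `shop_list.txt` before doing anything else with it. A minimal sketch (nothing iPeen-specific here; it only assumes the file written above sits in the working directory):

```python
# Print the first few rows of the naive comma-separated output.
# A field that itself contains a comma would break this simple split,
# which is one reason to consider the csv module (see the sketch after the notes below).
with open('shop_list.txt') as f:
    for row in list(f)[:5]:
        print(row.strip().split(','))
```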
The crawl flow this time is simple, but there are still a few things worth noting:
- time.sleep: this run fetches n * m pages in total. Firing off a large number of requests in a short time eats up the target site's resources and can affect its operation, so a well-mannered crawler sets a sleep interval to avoid putting extra load on the other side's server.
- try/except: when you scrape a large number of fields automatically, you have to assume that not every page provides every field you want. Wrapping each extraction in exception handling keeps one missing field from crashing the whole program (see the first sketch after this list).
- More general XPath (selector) settings: if you only scrape one or two pages, almost any XPath will do, and you can even grab a tag simply by counting its position. When you scrape many pages, though, the number of nodes can differ from page to page, so read the source of several pages, find where each tag sits in the structure, and anchor on that fixed position to avoid pulling the wrong field (see the second sketch after this list).
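On the try/except point, and on the "there is probably a better way" remark next to the comma-joining in the script: one common pattern is a small helper that falls back to an empty string when a node is missing, paired with the standard `csv` module so a comma inside a field can no longer corrupt a row. This is only a minimal sketch, not the original script: it assumes Python 3, reuses the `shop_links` list and the class names from the code above, keeps just a few of the fields, and `safe_text` is an illustrative name rather than a library function. Note that `get_text` on the whole `<p>` can also pick up label text, so you may still want to drill into the inner `<a>` as the original does.

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

def safe_text(node, default=""):
    # Return a tag's stripped text, or the default when the tag was not found (None).
    return node.get_text(strip=True) if node is not None else default

with open('shop_list.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['shop', 'category', 'tel', 'addr', 'rank'])
    for url in shop_links:  # the list collected by the script above
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        header = soup.find('div', {'class': 'info'})
        if header is None:  # page without the expected layout: skip it
            continue
        writer.writerow([
            safe_text(header.h1),
            safe_text(header.find('p', {'class': 'cate i'})),
            safe_text(header.find('p', {'class': 'tel i'})).replace('-', ''),
            safe_text(header.find('p', {'class': 'addr i'})),
            safe_text(header.find('span', {'itemprop': 'average'})),
        ])
        time.sleep(1)  # stay polite, as in the main script
```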
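On the selector point: the counters in the script are picked out positionally with `find_all('em')[0]`, `find_all('em')[1]`, and so on, which silently grabs the wrong value whenever a page has a different number of `<em>` tags. Below is a sketch of the more anchored style, assuming (hypothetically; check the real markup on a few shop pages) that each counter's `<em>` sits inside a parent element with its own class:

```python
# Positional: grabs whatever <em> happens to come first; wrong if the count or order differs.
counts = header.find_all('em')[0].string

# Anchored: only matches the <em> inside the element whose class identifies the counter.
# 'span.counter-comment' is a made-up selector for illustration; replace it with the
# stable parent class you actually find when inspecting several shop pages.
counts_tag = header.select_one('span.counter-comment em')
counts = counts_tag.get_text(strip=True) if counts_tag is not None else ""
```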
Source: Bryan的行銷研究及資料分析筆記