Using chromedriver in Google Colaboratory (September 2021 edition)

September 7, 2021 / September 20, 2021

This post follows the article below step by step!

https://python-to.hateblo.jp/entry/2021/06/24/000000

Contents

1. Development environment
2. Opening Google Colaboratory
3. Installing selenium and chromium-chromedriver in Colaboratory
4. Running the scraper
5. Downloading the scraped content as a text file
6. References

Development environment

Windows 10 Pro 1903
Chrome
Google Colaboratory (requires a Google account)

Opening Google Colaboratory

Open https://colab.research.google.com/?hl=ja#create=true and sign in with your Google account.

Installing selenium and chromium-chromedriver in Colaboratory

Enter the following in a cell and run it.

# original code from https://python-to.hateblo.jp/entry/2021/06/24/000000
# Install a Japanese font
!apt install fonts-ipafont-gothic

# Install Selenium
!pip install selenium

# Install chromedriver and copy the binary onto the PATH
!apt update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

This can take a while.

In my case, the installation finished in 46 seconds.
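Once the install cell has finished, a quick check that chromedriver actually ended up on the PATH can save some debugging later. The following is a minimal sketch using only the standard library; `find_driver` is a hypothetical helper name, not something from the original article.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    # Nothing named chromedriver is reachable; the install cell may have failed.
    print("chromedriver not found on PATH; try re-running the install cell")
else:
    print(f"chromedriver found at {path}")
```

If the copy to /usr/bin in the install cell succeeded, the reported path should point there.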

Running the scraper

Enter the following code and run it (Shift+Enter with the cell selected). This time, let's scrape the Google top page.

# original code from https://qiita.com/ftoyoda/items/fe3e2fe9e962e01ac421
# Import Selenium and BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# Launch the browser in headless mode (running in the background), load the website,
# get the generated HTML, and tidy it up with BeautifulSoup.
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)
driver.get("https://www.google.co.jp")
html = driver.page_source.encode('utf-8')
# Close the browser once the page source has been captured
driver.quit()
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

Huh? It finished in 2 seconds.

Scrolling back up through the output shows the HTML of the page, so it looks like it worked.
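The cell above passes the driver path to webdriver.Chrome as the first positional argument, which worked with the Selenium release available when this was written. Selenium 4 moved the driver path into a Service object, and later 4.x releases no longer accept the positional form. The following is a sketch of the same setup in that style, not the code this post actually ran; `build_driver` and `CHROME_FLAGS` are hypothetical names, and the default path assumes the copy to /usr/bin from the install cell succeeded.

```python
# Same Chrome flags as in the scraping cell above.
CHROME_FLAGS = ("--headless", "--no-sandbox", "--disable-dev-shm-usage")


def build_driver(driver_path="/usr/bin/chromedriver"):
    """Create a headless Chrome driver using the Selenium 4 Service API."""
    # Imported inside the function so this cell can be defined
    # before the install cell has been run.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for flag in CHROME_FLAGS:
        options.add_argument(flag)

    # Selenium 4 takes the driver location through Service
    # rather than as a positional argument.
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)
```

With this in place, `driver = build_driver()` would stand in for the `webdriver.Chrome('chromedriver', options=options)` line, and the rest of the cell stays the same.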

Downloading the scraped content as a text file

Run the following code. The content printed earlier is then downloaded as a file named scraping.txt.

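# Save the prettified HTML to a file (UTF-8, closed automatically)
with open("scraping.txt", "w", encoding="utf-8") as f:
    f.write(soup.prettify())

# Download the file from Colaboratory to the local machine
from google.colab import files
files.download("scraping.txt")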

Opening the downloaded scraping.txt confirmed that it contains the scraped HTML.
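To check what was written without leaving the notebook, the file can also be read back before downloading it. This is a small sketch using only the standard library; `save_text`, `preview`, and the file name scraping_check.txt are hypothetical, and the sample string stands in for `soup.prettify()`.

```python
from itertools import islice
from pathlib import Path


def save_text(path, text):
    """Write text as UTF-8 and return the number of characters written."""
    return Path(path).write_text(text, encoding="utf-8")


def preview(path, max_lines=5):
    """Return up to max_lines lines from the start of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, max_lines)]


# Stand-in for soup.prettify(); in the notebook, pass the real string instead.
sample_html = "<html>\n <head>\n </head>\n <body>\n </body>\n</html>\n"
save_text("scraping_check.txt", sample_html)
for line in preview("scraping_check.txt", max_lines=3):
    print(line)
```

Writing with an explicit UTF-8 encoding keeps any Japanese text in the page intact regardless of the platform default.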

References

https://python-to.hateblo.jp/entry/2021/06/24/000000

https://qiita.com/ftoyoda/items/fe3e2fe9e962e01ac421
Selenium works in Colaboratory: easy scraping of pages generated by JavaScript
Updated by @ftoyoda on March 2, 2020

https://enjoy-a-lot.com/google-colaboratory-selenium/
Using Selenium in Google Colaboratory
June 19, 2020

GoogleColaboratory, Python, chromedriver, selenium, web scraping

Posted by twosquirrel