What I did

I used BeautifulSoup to build a CSV list in [1st author, paper title] format from the CVPR2020 Schedule.

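Notebook for this work (google colab)

References

10分で理解する Beautiful Soup - Qiita (Understanding Beautiful Soup in 10 minutes)
【Python】BeautifulSoupを使ってテーブルをスクレイピング - Qiita (Scraping a table with BeautifulSoup)
Most of the code in this post is adapted from the two articles above.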

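Requirements

I want a list that contains only the 1st author and the paper title.
I don't want to do anything complicated, so I'd like to get it done quickly in python.

Setup before doing anything else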

import requests
from bs4 import BeautifulSoup

# Send a request to the target URL and fetch the HTML
res = requests.get('http://cvpr2020.thecvf.com/program/main-conference')
# Build a BeautifulSoup object from the response HTML
soup = BeautifulSoup(res.text, 'html.parser')
title_text = soup.find('title').get_text()
print(title_text)
# Main Conference | CVPR2020
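One thing worth knowing about the setup above: find returns None when nothing matches, so calling get_text() on the result raises an AttributeError if the page has no title tag. Here is a minimal sketch of a guarded version. The helper name page_title and the inline sample HTML are made up for illustration (the sample stands in for res.text, it is not the real page).

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for res.text (hypothetical sample, not the real page)
sample_html = (
    "<html><head><title>Main Conference | CVPR2020</title></head>"
    "<body></body></html>"
)


def page_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")  # find() returns None when nothing matches
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)


print(page_title(sample_html))
print(page_title("<html><body>no title here</body></html>"))
```

The first call prints the title text and the second prints None instead of crashing.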

Structure of the CVPR2020 Schedule

[Figure: cvpr2020 schedule elements]

Looking at the page source, each table with class="table table-bordered" starts with a header row like this:

<tr class="blue-bottom">
<th>Poster #</th>
<th>Video Time 1</th>
<th>Video Time 2</th>
<th>Paper Title</th>
<th>Author(s)</th>
<th>Paper ID</th>
</tr>

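So that is how the data is laid out, and I fetch it as shown below.
I select the table with "class":"table table-bordered" and read the td cells inside each tr tag, following the HTML source as is.
The header row shown above has only th cells, so it yields no td cells and prints nothing.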

table = soup.find_all("table", {"class": "table table-bordered"})[0]
rows = table.find_all("tr")

# Print each cell with its column number
for row in rows[:3]:
    for col_n, cell in enumerate(row.find_all('td')):
        print(col_n, cell.get_text())

The output looks like this. All that's left is to pick out the target data.
Here the target data is columns 3 (paper title) and 4 (authors).

0 1
1 10:00
2 22:00
3 Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild
4 Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi
5 7
0 2
1 10:05
2 22:05
3 Footprints and Free Space From a Single Color Image
4 Jamie Watson, Michael Firman, Aron Monszpart, Gabriel J. Brostow
5 2582
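To see why only two of the first three rows show up in that output, here is a small self-contained sketch on inline HTML. The header row is copied from the source above. The shape of the data row (a plain tr with six td cells) is my assumption based on the printed output, not something copied from the real page.

```python
from bs4 import BeautifulSoup

# Header row as in the page source; the data row layout is an assumption
sample_html = """
<table class="table table-bordered">
<tr class="blue-bottom">
<th>Poster #</th><th>Video Time 1</th><th>Video Time 2</th>
<th>Paper Title</th><th>Author(s)</th><th>Paper ID</th>
</tr>
<tr>
<td>2</td><td>10:05</td><td>22:05</td>
<td>Footprints and Free Space From a Single Color Image</td>
<td>Jamie Watson, Michael Firman, Aron Monszpart, Gabriel J. Brostow</td>
<td>2582</td>
</tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"class": "table table-bordered"})
header_row, data_row = table.find_all("tr")

# The header row has th cells only, so searching it for td gives an empty list
header_td = [cell.get_text() for cell in header_row.find_all("td")]
header_th = [cell.get_text() for cell in header_row.find_all("th")]
data_td = [cell.get_text() for cell in data_row.find_all("td")]

print(header_td)
print(header_th[3], "|", header_th[4])
print(data_td[3], "|", data_td[4])
```

The th texts at indices 3 and 4 are "Paper Title" and "Author(s)", which lines up with the td indices used for the data rows.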

Main script

The check above showed that I only need indices 3 and 4 of each row, so I write it that way.
Also, since I only want the 1st author out of the authors, I use .split(',')[0].

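import csv

tables = soup.find_all("table", {"class": "table table-bordered"})
# newline='' lets the csv module control line endings (avoids blank lines on Windows)
with open("cvpr2020.csv", "w", encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    for table in tables:
        for row in table.find_all("tr"):
            # Header rows contain only th cells, so they give an empty list here
            cells = [cell.get_text() for cell in row.find_all('td')]
            # Skip header rows and any row too short to have columns 3 and 4
            if len(cells) > 4:
                paper_title = cells[3].strip()
                first_author = cells[4].split(',')[0].strip()
                # print(first_author, ',', paper_title)
                writer.writerow([first_author, paper_title])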
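A quick way to check the two pieces of the main script (taking the 1st author, and writing rows with csv.writer) without touching the network is to run them on a couple of sample rows and write into an in-memory buffer. This is only a sketch: the helper first_author and the second row (a title containing a comma) are made up for illustration, while the first row comes from the output shown earlier.

```python
import csv
import io


def first_author(authors):
    """Take the text before the first comma and trim surrounding spaces."""
    return authors.split(",")[0].strip()


rows = [
    # (paper title, authors) taken from the output shown earlier
    (
        "Unsupervised Learning of Probably Symmetric Deformable 3D Objects "
        "From Images in the Wild",
        "Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi",
    ),
    # Hypothetical row: the title itself contains a comma
    ("A Made-Up Title, With a Comma", "Jane Doe, John Roe"),
]

# Write to an in-memory buffer instead of a file
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for title, authors in rows:
    writer.writerow([first_author(authors), title])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that each row still has exactly two fields
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed[1])
```

csv.writer quotes the title that contains a comma, so the row still reads back as two fields. That is the reason to use the csv module here rather than joining strings with ','.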

This writes the csv file, so I download it with the code below (since I was using google colab).

from google.colab import files
files.download('cvpr2020.csv')

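Closing

I had never done web scraping before, but using BeautifulSoup in python I could pull data out of the HTML source quite intuitively.
The schedule layout will differ from conference to conference, but I think the same flow should work for scraping them.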
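As a rough idea of how the same flow could carry over to a differently laid out schedule, here is a sketch that looks up columns by their header text instead of hard-coding indices 3 and 4. Everything in it is an assumption for illustration: the function name extract_columns, the sample table, and the premise that the first row of each table holds th headers.

```python
from bs4 import BeautifulSoup


def extract_columns(html, wanted_headers):
    """Return, for each data row, the cells under the given header names.

    Assumes the first tr of each table holds the th header cells and that
    wanted_headers is a non-empty list.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        if not rows:
            continue
        headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
        # Skip tables that lack any of the requested columns
        if not all(name in headers for name in wanted_headers):
            continue
        indices = [headers.index(name) for name in wanted_headers]
        for row in rows[1:]:
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) > max(indices):
                results.append([cells[i] for i in indices])
    return results


# Hypothetical schedule with a different column order
sample_html = """
<table>
<tr><th>Session</th><th>Authors</th><th>Title</th></tr>
<tr><td>1A</td><td>Jane Doe, John Roe</td><td>A Made-Up Paper</td></tr>
</table>
"""

for authors, title in extract_columns(sample_html, ["Authors", "Title"]):
    print(authors.split(",")[0].strip(), "|", title)
```

For the CVPR2020 page the header names would be "Author(s)" and "Paper Title", as in the header row shown earlier.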

wide and deep

Web scraping with python (BeautifulSoup): making CVPR2020 accepted papers list
