Learning objectives

  1. Learn how to use the selenium module
In [4]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

from bs4 import BeautifulSoup
import time

selenium

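  • A module for automating web page tests
  • Drives a real web browser through a development/test driver, so it behaves as if an actual user were operating the page
  • Things to check before the exercise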

selenium example

  • Go to python.org and run a search automatically
    1. Open the python.org site
    2. Find the input field and send key events to it
In [2]:
from selenium.webdriver.chrome.service import Service

chrome_driver = './chromedriver.exe'
# Selenium 4: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service(chrome_driver))
driver.get('https://www.python.org/')
time.sleep(2)
# find_element_by_id is deprecated; use find_element(By.ID, ...) instead
search = driver.find_element(By.ID, 'id-search-field')
search.clear()
time.sleep(2)
search.send_keys('pytest')
time.sleep(2)
search.send_keys(Keys.RETURN)
# time.sleep(3)
# driver.close()
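The cell above leaves driver.close() commented out, so the browser stays open after the search. One way to make cleanup automatic is a small context manager that always calls quit(), which closes every window and stops the driver process, even when an error interrupts the crawl. The following is a minimal sketch, not part of the original notebook: managed_driver and FakeDriver are hypothetical names, and FakeDriver only stands in for a real browser so the example runs without Chrome.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver created by make_driver() and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() closes every window and stops the driver process
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a real browser so the sketch runs without Chrome."""

    def __init__(self):
        self.visited = []
        self.quit_called = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.quit_called = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        driver.get('https://www.python.org/')
        raise RuntimeError('simulated failure in the middle of the crawl')
except RuntimeError:
    pass

# The driver was quit even though the block raised
print(fake.visited, fake.quit_called)

# With a real browser the call would look like this (same driver path as above):
# with managed_driver(lambda: webdriver.Chrome(service=Service('./chromedriver.exe'))) as driver:
#     driver.get('https://www.python.org/')
```

Passing a factory function rather than a ready-made driver keeps browser start-up inside the helper, so the helper owns the driver from creation to shutdown.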

Crawling the Daum News website with selenium

  • Use the driver object's find_element() method (formerly the find_element_by_* functions)
In [5]:
from selenium.webdriver.chrome.service import Service

chrome_driver = './chromedriver.exe'
# Selenium 4: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service(chrome_driver))
url = 'https://news.v.daum.net/v/20190728165812603'
driver.get(url)

# Give the page time to render before reading its source
time.sleep(2)

src = driver.page_source

# Name the parser explicitly so BeautifulSoup does not have to guess
soup = BeautifulSoup(src, 'html.parser')
comment_area = soup.select_one('span.alex-count-area')
driver.close()
comment_area.get_text()
# comment_area
Out[5]:
'42'
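The cell returns the comment count as a string, and it fails with an AttributeError if select_one() finds nothing, because select_one() returns None when no element matches (for example when the page has not finished rendering). A small helper can cover both cases. This is a sketch under assumptions: count_from_node and FakeTag are hypothetical names, and the handling of thousands separators is a guess about how larger counts might be displayed, not something the source confirms.

```python
def count_from_node(node):
    """Return the integer inside a parsed element, or None if it is unusable."""
    if node is None:  # select_one() returns None when nothing matches
        return None
    # Assumption: larger counts may be shown with thousands separators
    text = node.get_text().strip().replace(',', '')
    try:
        return int(text)
    except ValueError:
        return None


class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup Tag; only get_text() is needed."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


print(count_from_node(FakeTag('42')))       # 42
print(count_from_node(FakeTag(' 1,204 ')))  # 1204
print(count_from_node(None))                # None
```

In the notebook, the call would be count_from_node(comment_area) in place of comment_area.get_text().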

Waiting for a specific element to load with selenium

  • Use a WebDriverWait object to wait until the target element has loaded
  • In practice, this makes it possible to crawl most sites that load their content dynamically
  • WebDriverWait(driver, seconds).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'CSS_RULE')))
In [6]:
from selenium.webdriver.chrome.service import Service

url = 'https://n.news.naver.com/article/094/0000010049?cds=news_media_pc&type=editn'
chrome_driver = './chromedriver.exe'
# Selenium 4: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service(chrome_driver))
driver.get(url)
# Wait up to 10 seconds for the element to load
myElem = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.u_cbox_count')))

src = driver.page_source

# Name the parser explicitly so BeautifulSoup does not have to guess
soup = BeautifulSoup(src, 'html.parser')
comment_area = soup.select_one('.u_cbox_count')
driver.close()
print(comment_area.get_text())
10
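Unlike a fixed time.sleep(2), WebDriverWait(...).until(...) checks a condition repeatedly (every 0.5 seconds by default), returns as soon as the condition yields a truthy value, and raises a TimeoutException if the time limit passes first. The idea can be shown without a browser. This is a conceptual sketch only, not Selenium's actual implementation: wait_until and element_is_present are hypothetical names, and the built-in TimeoutError stands in for Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'condition not met within {timeout} seconds')
        time.sleep(poll_interval)


# Simulate an element that only "appears" on the third check
attempts = {'count': 0}


def element_is_present():
    attempts['count'] += 1
    return 'comment count element' if attempts['count'] >= 3 else None


found = wait_until(element_is_present, timeout=2, poll_interval=0.01)
print(found, attempts['count'])  # comment count element 3
```

The wait ends as soon as the condition succeeds, so a fast page costs little time while a slow page still gets the full timeout.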