A program that downloads HWP files from COMCBT

Python 2017. 7. 25. 21:23

# -*- coding: utf-8 -*-
import urllib.request as req
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from os import mkdir
import os, time, requests

# The COMCBT material we need is spread over 3 listing pages, so this function builds the URL of each page.
def load_index_page(url):
    urls = []

    for n in range(1,4):
        urls.append(url + str(n))
    return urls

# Collect the post links (href) and post names from the listing pages built by the function above.
def load_href_name(url):
    links = []
    names = []
    hrefs = []
    urls = load_index_page(url)
    for url in urls:
        html = req.urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')
        links.append(soup.select('.scElps a'))
    for link in links:
        for idx, a in enumerate(link):
            if idx > 2:
                hrefs.append(a.attrs['href'])
                names.append(str(a.string).split('\t\t\t\t\t\t\t')[1].strip('\t'))
    return hrefs, names

# Follow each link collected above and download the HWP files attached to that post.
def download_file(hrefs, names):
    path = './네트워크관리사'
    abs_href = []
    base_url = 'http://www.comcbt.com'
    for href in hrefs:
        abs_href.append(urljoin(base_url, href))
    # Create the top-level folder if it does not exist yet.
    os.makedirs(path, exist_ok=True)

    for link, name in zip(abs_href, names):
        # One sub-folder per post, named after the post.
        os.makedirs(path + '/' + name, exist_ok=True)
        html = req.urlopen(link).read()
        soup = BeautifulSoup(html, 'html.parser')
        dl_links = soup.select('.scFiles a')
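        for dl_link in dl_links:
            # Keep only the link text up to the 'hwp' extension as the file name.
            dl_name = str(dl_link.text).split('hwp')[0] + 'hwp'
            try:
                dl_path = urljoin(base_url, dl_link.attrs['href'])
                # The download link answers with a 302; requests follows it to the real file.
                res = requests.get(dl_path, allow_redirects=True)
                res.raise_for_status()  # do not save an error page as a .hwp file
                time.sleep(1)  # pause between downloads to go easy on the server
                with open(path + '/' + name + '/' + dl_name, 'wb') as f:
                    f.write(res.content)
                print('download : ', dl_name)
            except (requests.RequestException, OSError, KeyError):
                # Report the failure and continue with the next attachment.
                print('download Failed:', dl_name)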


if __name__ == '__main__':
    url = 'http://www.comcbt.com/xe/index.php?mid=jf&page='
    hrefs, names = load_href_name(url)
    download_file(hrefs, names)
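
The file name logic in download_file, split('hwp')[0] + 'hwp', cuts the link text at the first occurrence of "hwp" anywhere in the string, so a name that itself contains "hwp" would be truncated. The sketch below is one possible stricter variant, not what the program above does: it cuts at the first ".hwp" (with the dot) and returns None when there is no such extension. The helper name extract_hwp_name and the sample link texts are made up for illustration; the real link text on COMCBT may look different.

```python
import re


def extract_hwp_name(link_text):
    """Return link_text up to and including the first '.hwp', or None.

    Hypothetical helper: the pattern assumes the file name comes first in
    the link text and is followed by extra information such as a size.
    """
    text = link_text.strip()
    # Lazy match from the start of the text up to the first '.hwp'.
    match = re.search(r'^(.*?\.hwp)', text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip()


if __name__ == '__main__':
    # Invented sample strings, only to show the behaviour.
    for sample in ['  exam20170409.hwp [File Size:120.5KB]', 'hwp_guide.hwp', 'answer.pdf']:
        print(repr(sample), '->', extract_hwp_name(sample))
```

Returning None for links without a .hwp extension lets the caller skip them instead of saving a file under a guessed name.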

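I ran into some problems while writing this program.

First, clicking a download link does not start the download directly; the request is redirected before the file is served.

In Chrome DevTools, open the Network tab and enable Preserve log to see the intermediate requests that are normally invisible to the user.

Clicking a download link first returns HTTP status 302, and then a second request to a new address returns 200.

Looking more closely at the 302 response, its Location header carries the actual download address.

So I used requests.get(url, allow_redirects=True), which follows the Location header of the redirect to the real file. (allow_redirects=True is already the default for requests.get; it is written out here to make the intent explicit.)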
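
The 302 followed by 200 sequence can be reproduced without touching the real site. The sketch below starts a tiny local server that redirects /download to /files/real.hwp, then requests it twice: once with redirects disabled to expose the 302 and its Location header, and once with redirects followed. It uses only the standard library (urllib) so that it runs offline, which is a substitution for the requests call in the program above; the handler, the paths, and the payload are all invented for the demo. With requests, the same information is available on the response as res.history and res.url.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b'fake hwp bytes'  # stand-in for a real file body (assumption)


class RedirectDemoHandler(BaseHTTPRequestHandler):
    """Local stand-in for a site whose download link redirects."""

    def do_GET(self):
        if self.path == '/download':
            self.send_response(302)
            self.send_header('Location', '/files/real.hwp')
            self.send_header('Content-Length', '0')
            self.end_headers()
        elif self.path == '/files/real.hwp':
            self.send_response(200)
            self.send_header('Content-Length', str(len(PAYLOAD)))
            self.end_headers()
            self.wfile.write(PAYLOAD)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # decline to follow, so urllib raises HTTPError for the 302


def run_demo():
    server = HTTPServer(('127.0.0.1', 0), RedirectDemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = 'http://127.0.0.1:%d' % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly
    try:
        # 1) Redirects disabled: we see the 302 and its Location header.
        try:
            urllib.request.build_opener(no_proxy, NoRedirect).open(base + '/download')
            first = None
        except urllib.error.HTTPError as err:
            first = (err.code, err.headers['Location'])
        # 2) Redirects followed: we end up at the real file with status 200.
        with urllib.request.build_opener(no_proxy).open(base + '/download') as res:
            final = (res.status, res.geturl(), res.read())
    finally:
        server.shutdown()
        server.server_close()
    return first, final, base


if __name__ == '__main__':
    first, final, base = run_demo()
    print('without following:', first)
    print('after following  :', final[0], final[1])
```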

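Second, the way the downloaded file is saved matters.

My first thought was that urlretrieve() would do the job, but it did not download the files properly. At the time I put this down to the function being meant for HTML pages; strictly speaking, urlretrieve() saves whatever response body the URL returns, so the cause probably lay elsewhere.

So I used with open() as f to create a new empty file with the .hwp extension, opened in binary mode, and wrote the bytes held in res (the response downloaded above) into it, and that worked.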
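
The saving step can be isolated into a small helper to show why binary mode matters. This is a minimal sketch under my own naming (save_bytes is not part of the program above), and the byte string in the usage part is arbitrary demo data rather than a real HWP file. Opening with 'wb' writes the bytes exactly as received, with no newline or text-encoding translation.

```python
import os
import tempfile


def save_bytes(content, directory, filename):
    """Write raw bytes to directory/filename and return the full path."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, filename)
    # 'wb' = write + binary: the bytes land on disk unchanged.
    with open(target, 'wb') as f:
        f.write(content)
    return target


if __name__ == '__main__':
    demo_dir = tempfile.mkdtemp()
    # Arbitrary bytes, including '\r\n' and non-text values.
    saved = save_bytes(b'\x00\x01binary\r\n\xff', demo_dir, 'sample.hwp')
    with open(saved, 'rb') as f:
        print(saved, f.read())
```

In the program above, the content argument would be res.content from the requests response.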

License: Attribution, NonCommercial, NoDerivatives


Posted by Config
