[Crawling Basics] JSON and XML Formats

솔라이아 2022. 8. 15. 16:32


JSON Format

What is JSON?

JavaScript Object Notation

A data format used when a server and a client exchange data on the web, for example through a REST API

Format _ { "key" : "value" , "key" : "value" , ... }
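
As a quick illustration of the key/value format, the sketch below parses a small JSON string with the standard json module and shows which Python types come back. The sample string and its keys (name, tags, price, in_stock) are made up for this example.

```python
import json

# Hypothetical sample document, invented for illustration only
sample = '{"name": "cable", "tags": ["usb", "hdmi"], "price": 16500, "in_stock": true}'

parsed = json.loads(sample)

# A JSON object becomes a dict, an array becomes a list,
# numbers become int/float, and true/false become True/False
print(type(parsed).__name__)          # dict
print(type(parsed["tags"]).__name__)  # list
print(parsed["tags"][1])              # hdmi
print(parsed["in_stock"])             # True
```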

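0. Installation ) Nothing to install: json is part of the Python standard library, so pip install is not needed

1. Import the library ) import json

2. Enter the data ) data = """

{

JSON-formatted content

}

"""

3. Parse the data ) json_data = json.loads(data) _ loads(string) : parses a string

4. Read ) print( json_data[key][index] ... )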

import json

# Product list returned by Naver Shopping for a search on the keyword android
data = """
{
    "lastBuildDate": "Sat, 22 Jun 2019 14:57:13 +0900",
    "total": 634151,
    "start": 1,
    "display": 10,
    "items": [
        {
            "title": "MHL 케이블 (아이폰, <b>안드로이드</b> 스마트폰 HDMI TV연결)",
            "link": "https://search.shopping.naver.com/gate.nhn?id=10782444869",
            "image": "https://shopping-phinf.pstatic.net/main_1078244/10782444869.5.jpg",
            "lprice": "16500",
            "hprice": "0",
            "mallName": "투데이샵",
            "productId": "10782444869",
            "productType": "2"
        },
        {
            "title": "파인디지털 파인드라이브 Q300",
            "link": "https://search.shopping.naver.com/gate.nhn?id=19490416717",
            "image": "https://shopping-phinf.pstatic.net/main_1949041/19490416717.20190527115824.jpg",
            "lprice": "227050",
            "hprice": "359000",
            "mallName": "네이버",
            "productId": "19490416717",
            "productType": "1"
        }
    ]
}
"""

json_data = json.loads(data)
print(json_data['items'][0]['title'])
print(json_data['items'][0]['link'])
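
Instead of indexing one entry at a time, a loop can walk every entry in items. The sketch below is self-contained, so it uses a shortened copy of the response with only the fields it needs. Note that lprice arrives as a string and has to be converted with int() before it is compared as a number. The helper name cheapest is an assumption made for this illustration.

```python
import json

# Trimmed copy of the search response (only the fields used here)
data = """
{
    "items": [
        {"title": "MHL 케이블", "lprice": "16500", "mallName": "투데이샵"},
        {"title": "파인디지털 파인드라이브 Q300", "lprice": "227050", "mallName": "네이버"}
    ]
}
"""


def cheapest(items):
    """Return the item with the lowest lprice (hypothetical helper)."""
    return min(items, key=lambda item: int(item["lprice"]))


json_data = json.loads(data)

# enumerate(..., start=1) gives a 1-based rank for each item
for rank, item in enumerate(json_data["items"], start=1):
    print(rank, item["mallName"], int(item["lprice"]))

print(cheapest(json_data["items"])["mallName"])
```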

Crawling with the JSON Format

< Basic form >

import requests

client_id = 'ID'
client_secret = 'secret'
naver_open_api = ' request url '

# Pass the variables themselves, not the quoted strings 'client_id' / 'client_secret'
header_params = {'X-Naver-Client-Id': client_id, 'X-Naver-Client-Secret': client_secret}
res = requests.get(naver_open_api, headers=header_params)

if res.status_code == 200:
    data = res.json()
    # Either print(data) to dump the whole response, or loop over the items
    for item in data['items']:
        print(item['title'])
else:
    print('Error code : ', res.status_code)
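
One way to keep the request code easy to check is to separate out the parts that do not touch the network. In the sketch below, build_headers and extract_titles are hypothetical helper names, not part of requests or of the Naver API. The header names are the ones used in the basic form, and the sample payload is invented to mimic the shape of a response.

```python
def build_headers(client_id, client_secret):
    """Build the two authentication headers used in the basic form."""
    return {
        "X-Naver-Client-Id": client_id,
        "X-Naver-Client-Secret": client_secret,
    }


def extract_titles(payload):
    """Collect the title of every item in a decoded JSON response.

    Uses .get() so that a response without an 'items' key yields an
    empty list instead of raising KeyError.
    """
    return [item["title"] for item in payload.get("items", [])]


# Invented payload with the same shape as the search response
fake_payload = {"items": [{"title": "first"}, {"title": "second"}]}

headers = build_headers("my-id", "my-secret")
print(headers["X-Naver-Client-Id"])   # my-id
print(extract_titles(fake_payload))   # ['first', 'second']
print(extract_titles({}))             # []
```

With a real response, extract_titles(res.json()) could stand in for the loop in the success branch.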

< Fetching 1000 results and saving them to Excel >

import requests
import openpyxl

client_id = 'ID'
client_secret = 'secret'
start, num = 1, 0

excel_file = openpyxl.Workbook()
excel_sheet = excel_file.active
excel_sheet.column_dimensions['B'].width = 100
excel_sheet.column_dimensions['C'].width = 100
excel_sheet.append(['Rank', 'Title', 'Link'])

for index in range(10):
    # start_number runs 1, 101, 201, ..., 901
    start_number = start + (index * 100)
    # Request 100 results per call for the keyword, beginning at start_number
    naver_open_api = ' request url ... ?query=keyword&display=100&start=' + str(start_number)
    header_params = {'X-Naver-Client-Id': client_id, 'X-Naver-Client-Secret': client_secret}
    res = requests.get(naver_open_api, headers=header_params)

    if res.status_code == 200:
        data = res.json()
        for item in data['items']:
            num += 1
            excel_sheet.append([num, item['title'], item['link']])
    else:
        print('Error Code : ', res.status_code)

excel_file.save('filename.xlsx')
excel_file.close()
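
The start arithmetic in the loop is easy to get wrong by one, so it can help to compute the offsets on their own first. This sketch only reproduces the paging math and the row layout. page_starts and to_rows are hypothetical helpers, the items are invented, and no request or Excel file is involved.

```python
def page_starts(pages=10, per_page=100, first=1):
    """Start positions for each request: 1, 101, 201, ..."""
    return [first + index * per_page for index in range(pages)]


def to_rows(items, first_rank=1):
    """Turn response items into [rank, title, link] rows for a sheet."""
    return [
        [rank, item["title"], item["link"]]
        for rank, item in enumerate(items, start=first_rank)
    ]


starts = page_starts()
print(starts[0], starts[1], starts[-1])  # 1 101 901

# Invented items, standing in for data['items'] from one response
rows = to_rows(
    [{"title": "a", "link": "x"}, {"title": "b", "link": "y"}],
    first_rank=101,
)
print(rows)  # [[101, 'a', 'x'], [102, 'b', 'y']]
```

Ten requests of 100 results each, starting at 1, 101, ..., 901, cover ranks 1 through 1000.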

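XML Format

What is XML?

Extensible Markup Language

Like HTML, it is made up of tags, so it can be parsed with BeautifulSoup from bs4

Format ) < tagname attribute='value' > content < /tagname >

Note: an XML response has no CSS classes or ids to select on, so parse it by tag name with find rather than select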
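
To make the tag / attribute / content format concrete, here is a small sketch that parses an XML string by tag name. It swaps in the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without installing anything; the idea of looking elements up by name with find and findall is the same. The XML sample and its tag names are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: a root tag with an attribute and nested item tags with content
xml_text = """
<response status='ok'>
    <item><stationName>Gangnam</stationName></item>
    <item><stationName>Jongno</stationName></item>
</response>
"""

root = ET.fromstring(xml_text.strip())

print(root.tag)            # response
print(root.get("status"))  # ok

# findall returns every direct child named 'item'; find returns the first match.
# ElementTree treats tag names as case-sensitive, so 'stationName' must match exactly.
names = [item.find("stationName").text for item in root.findall("item")]
print(names)               # ['Gangnam', 'Jongno']
```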

Crawling Public Data with the XML Format

import requests
from bs4 import BeautifulSoup

service_key = ' service key '
params = ' &key=value&key=value&...'
# Naver takes its credentials in the request headers,
# whereas the public data API expects the key in the URL itself
open_api = ' request url ?ServiceKey=' + service_key + params

res = requests.get(open_api)
# html.parser lowercases tag names, so search with lowercase names below
soup = BeautifulSoup(res.content, 'html.parser')

data = soup.find_all('item')
for item in data:
    stationname = item.find('stationname')
    print(stationname.get_text())
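
Concatenating the query string by hand works, but urllib.parse.urlencode from the standard library can assemble and escape it instead. In this sketch the base URL, the parameter names other than ServiceKey, and the key value are all placeholders made up for illustration. It also assumes the service key is in its raw, not yet percent-encoded form, since encoding an already encoded key a second time would change it.

```python
from urllib.parse import urlencode

base_url = "https://example.com/api"   # placeholder, not a real endpoint
service_key = "abc+def=="              # placeholder key in raw form

# Hypothetical parameters; the real names depend on the API being called
params = {"ServiceKey": service_key, "pageNo": 1, "numOfRows": 10}

# urlencode joins the pairs with '&' and percent-encodes special characters
open_api = base_url + "?" + urlencode(params)
print(open_api)
# https://example.com/api?ServiceKey=abc%2Bdef%3D%3D&pageNo=1&numOfRows=10
```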

