[Web] Assignment (Genie Music crawling)

This assignment crawls Genie Music and pulls out the chart data.

Start with the basic setup: import the libraries and fetch the page.

import requests
from bs4 import BeautifulSoup

URL = "https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230101"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

data = requests.get(URL, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')
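As an optional hardening step, the fetch can be wrapped so that a failed request is noticed before parsing. This is only a sketch: the helper name fetch_soup and the 10-second timeout are assumptions for illustration, not part of the assignment.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers, timeout=10):
    """Fetch a page and return it parsed (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx/5xx status.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup(URL, headers) would then stand in for the requests.get and BeautifulSoup lines above.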

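Next, scrape the songs ranked 1 to 50.

The goal is to print the rank, song title, and artist.

First, check how the HTML is structured on the Genie Music page.

Each song is a tr element defined inside a tbody.

Select one song in the developer tools, copy its selector, and paste it into the code: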

music_list = soup.select('#body-content > div.newest-list > div > table > tbody > tr:nth-child(1)')

Since we want every tr, delete the part that follows tr, that is, :nth-child(1).
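The effect of dropping :nth-child(1) can be checked on a small fragment. The markup below is a made-up, simplified stand-in for the chart table, not Genie's actual page source.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the chart table.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td class="number">1</td></tr>
    <tr><td class="number">2</td></tr>
    <tr><td class="number">3</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# With :nth-child(1), only the first row matches.
first_only = sample_soup.select("table > tbody > tr:nth-child(1)")

# Without it, every row directly under tbody matches.
all_rows = sample_soup.select("table > tbody > tr")

print(len(first_only), len(all_rows))
```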

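Now that music_list is available, pull each value out by the name assigned to its element. To find those names, click each element in the developer tools and look up its class.

The three fields use the classes number, title, and artist.

Loop over music_list and print the values.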

music_list = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

for music in music_list:
    # print(music)
    rank = music.select_one('.number').text
    title = music.select_one('.title').text
    artist = music.select_one('.artist').text
    print(rank, title, artist)

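The printed text looks messy. To clean it up, use the string operations available on text.

First, remove the unwanted rank-change text from the number field, which holds the ranking.

Because only the top 50 are printed, the rank is at most two digits, so text[0:2] keeps the first two characters and drops everything after them. strip() then removes any whitespace left around the result. strip() is a string method that removes whitespace from both ends of a string.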
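The slice-then-strip step can be tried on plain strings first. The raw strings below are invented examples of what the number cell's text might look like; the real text from the page may differ.

```python
def clean_rank(raw_text):
    """Keep the first two characters, then trim whitespace around them."""
    return raw_text[0:2].strip()


# Hypothetical raw values: a rank followed by rank-change text.
print(clean_rank("7\n        3 up\n    "))   # one-digit rank
print(clean_rank("42\n        same\n    "))  # two-digit rank
```

This shortcut only suits a list that stops before rank 100: a three-digit rank would be cut down to its first two characters.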

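At first it was not obvious why the rank-change text is printed together with the rank even though it has a different class name. The likely reason is that .text returns the text of an element together with the text of all its descendants, so if the rank-change element is nested inside the number cell, its text comes along.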
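This can be reproduced on a made-up fragment, assuming the rank-change label is nested inside the number cell (the markup here is illustrative, not copied from Genie). Reading only the cell's own first text node is one way to skip the nested element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the rank cell contains a nested span.
SAMPLE_ROW = """
<table><tbody><tr>
  <td class="number">7
    <span class="rank">3 up</span>
  </td>
</tr></tbody></table>
"""

cell = BeautifulSoup(SAMPLE_ROW, "html.parser").select_one("td.number")

# .text joins the cell's own text with the text of every descendant.
print(repr(cell.text))

# The first direct text node of the cell excludes the nested span.
own_text = cell.find(string=True, recursive=False).strip()
print(own_text)
```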

Likewise, strip the whitespace from title and artist.

for music in music_list:
    # print(music)
    rank = music.select_one('.number').text[0:2].strip()
    title = music.select_one('.title').text.strip()
    artist = music.select_one('.artist').text.strip()
    print(rank + " / " + title + " / " + artist)

The output now prints cleanly.
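For reuse, the same loop could be wrapped in a function that returns the rows instead of printing them. The sketch below is an assumption-laden variant: parse_chart is a hypothetical name, the sample markup and song names are invented, and rows that lack one of the three elements are skipped rather than raising an error.

```python
from bs4 import BeautifulSoup


def parse_chart(html):
    """Return (rank, title, artist) tuples from chart-like HTML (hypothetical helper)."""
    parsed = BeautifulSoup(html, "html.parser")
    results = []
    for tr in parsed.select("table > tbody > tr"):
        number = tr.select_one(".number")
        title = tr.select_one(".title")
        artist = tr.select_one(".artist")
        # select_one returns None when nothing matches, so skip incomplete rows.
        if number is None or title is None or artist is None:
            continue
        results.append(
            (number.text[0:2].strip(), title.text.strip(), artist.text.strip())
        )
    return results


# Invented sample markup; the second row has no title and is skipped.
SAMPLE_CHART = """
<table><tbody>
  <tr>
    <td class="number">1
      <span class="rank">same</span></td>
    <td class="info">
      <a class="title ellipsis">
        Song A
      </a>
      <a class="artist ellipsis">Artist A</a>
    </td>
  </tr>
  <tr>
    <td class="number">2
      <span class="rank">1 up</span></td>
    <td class="info">
      <a class="artist ellipsis">Artist B</a>
    </td>
  </tr>
  <tr>
    <td class="number">12
      <span class="rank">2 down</span></td>
    <td class="info">
      <a class="title ellipsis">Song C</a>
      <a class="artist ellipsis">Artist C</a>
    </td>
  </tr>
</tbody></table>
"""

for rank, title, artist in parse_chart(SAMPLE_CHART):
    print(rank + " / " + title + " / " + artist)
```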

Answer code

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=M&rtm=N&ymd=20230101', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

trs = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

for tr in trs:
    title = tr.select_one('td.info > a.title.ellipsis').text.strip()
    rank = tr.select_one('td.number').text[0:2].strip()
    artist = tr.select_one('td.info > a.artist.ellipsis').text
    print(rank, title, artist)

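The answer code uses more specific selectors when looping over the list. For pulling out only the values you need, that approach seems more reliable.

Apart from that, the two versions do not differ much.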
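The difference between the loose and the specific selector shows up when another element shares the class name. The fragment below is invented for illustration (the cover cell and its span are assumptions, not Genie's markup).

```python
from bs4 import BeautifulSoup

# Hypothetical row in which two elements carry the class "title".
ROW_HTML = """
<table><tbody><tr>
  <td class="cover"><span class="title">cover image</span></td>
  <td class="info">
    <a class="title ellipsis">Song A</a>
    <a class="artist ellipsis">Artist A</a>
  </td>
</tr></tbody></table>
"""

row = BeautifulSoup(ROW_HTML, "html.parser").select_one("tr")

# The class-only selector returns the first element with class "title".
loose = row.select_one(".title").text.strip()

# The specific selector requires an <a> with both classes directly under td.info.
strict = row.select_one("td.info > a.title.ellipsis").text.strip()

print(loose)
print(strict)
```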
