Need assistance scraping HTML table from basketball-reference

I'm very new to web scraping with Python (BeautifulSoup and urllib.request) and have spent a long time trying to figure out how to scrape this table. I found some code online, tried it out, and have been modifying it to understand how it works, but it always filters out the first column, which I need.

Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import numpy 

# Month of the schedule we will be scraping
month = "january"
# URL of the page we will scrape
url = "https://www.basketball-reference.com/leagues/NBA_2021_games-{}.html".format(month)
# this is the HTML for the given URL
html = urlopen(url)
# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, "html.parser")

# use findAll() to get the column headers from the first table row
# use getText() to extract the text we need into a list
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]
# exclude the first column as we will not need the ranking order from Basketball Reference for the analysis
headers=headers[1:]

# avoid the first header row
rows = soup.findAll('tr')[1:]

player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
df = pd.DataFrame(player_stats, columns = headers)

This is what the HTML table looks like

Can someone show me how to scrape the table from this page, including the first column? I can't figure it out: https://www.basketball-reference.com/leagues/NBA_2021_games-january.html
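One possible explanation, offered as an assumption rather than something confirmed in the post: if the first cell of each body row is a `<th>` instead of a `<td>`, then `findAll('td')` would skip it, which matches the symptom described. The sketch below collects `<th>` and `<td>` cells together so that the leading cell is kept. The function name `table_to_rows` and the sample HTML are hypothetical and only there for illustration; the sample is not real data from the site.

```python
from bs4 import BeautifulSoup


def table_to_rows(html):
    """Return (headers, rows) for the first <table> in html, keeping every column."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    all_rows = table.find_all("tr")

    # Header row: assumed to consist only of <th> cells.
    headers = [th.get_text(strip=True) for th in all_rows[0].find_all("th")]

    rows = []
    for tr in all_rows[1:]:
        # Ask for <th> and <td> together so a leading <th> cell is not dropped.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        # Skip rows that do not line up with the header, and repeated header rows.
        if len(cells) == len(headers) and cells != headers:
            rows.append(cells)
    return headers, rows


# Hypothetical sample that mimics the assumed layout (first body cell is a <th>).
sample = """
<table>
  <tr><th>Date</th><th>Visitor</th><th>PTS</th></tr>
  <tr><th>Fri, Jan 1, 2021</th><td>Team A</td><td>100</td></tr>
  <tr><th>Sat, Jan 2, 2021</th><td>Team B</td><td>95</td></tr>
</table>
"""

headers, rows = table_to_rows(sample)
print(headers)
print(rows[0])
```

With the real page, the same function could be given the HTML returned by `urlopen(url)`, and the result passed to `pd.DataFrame(rows, columns=headers)`; in that case the `headers = headers[1:]` step would no longer be needed. `pandas.read_html(url)` may also be worth trying, since it returns a list of DataFrames for the tables it finds, though it needs an extra parser library such as lxml installed.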



Read more here: https://stackoverflow.com/questions/65710921/need-assistance-scraping-html-able-from-basketball-reference

Content Attribution

This content was originally published by bmatt23 at Recent Questions - Stack Overflow, and is syndicated here via their RSS feed. You can read the original post over there.
