Improve download link handling

The previous method relied on the main "download link" on the list
page, but that link was broken roughly a quarter of the time, and far
more often for some artists.

Instead, during the DB build, fetch and parse each actual song page as
well, and collect from it every available download link. Use a
ThreadPoolExecutor to do this in a reasonable amount of time (default
of 10 workers, user configurable).

Then, when downloading, iterate over all download links, or let the
user filter them by ID or description.
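The concurrency approach described above follows the standard `ThreadPoolExecutor`/`as_completed` pattern from Python's `concurrent.futures`. A minimal sketch of that pattern (`fetch` is a trivial stand-in for the real per-song page scrape; all names here are illustrative, not from the patch):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(item):
    # Stand-in for fetching and parsing one song page.
    return item * 2

def fetch_all(items, workers=10):
    results = []
    # Submit every item up front, then collect results as each
    # future finishes; a failure in one worker is skipped rather
    # than aborting the whole run.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(fetch, item): item for item in items}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                continue
    return results

print(sorted(fetch_all([1, 2, 3])))  # → [2, 4, 6]
```

Results arrive in completion order, not submission order, which is why the commit collects per-song messages into a list and echoes them only once each future completes.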
Joshua Boniface 2023-04-06 02:06:55 -04:00
parent 3a0ef3dcc6
commit 6ec8923336
2 changed files with 203 additions and 106 deletions


@@ -13,11 +13,6 @@ standardized format.
 To use the tool, first use the "database" command to build or modify your local JSON database, then use the
 "download" command to download songs.
 
-To avoid overloading or abusing the C3DB website, this tool operates exclusively in sequential mode by design; at
-most one page is scraped (for "database build") or song downloaded (for "download") at once. Additionally, the tool
-design ensures that the JSON database of songs is stored locally, so it only needs to be built once and then is
-reused to perform actual downloads without putting further load on the website.
-
 ## Installation
 
 1. Install the Python3 requirements from `requirements.txt`.
@@ -39,8 +34,9 @@ fetch all avilable songs for all games, and either specify it with the `-u`/`--b
 environment variable `C3DBDL_BASE_URL`.
 
-1. Initialize your C3DB JSON database with `c3dbdl [options] database build`. This will take a fair amount
-of time to complete as all pages of the chosen base URL are scanned. Note that if you cancel this process, no
-data will be saved, so let it complete!
+1. Initialize your C3DB JSON database with `c3dbdl [options] database build`. This will take a fair amount
+of time to complete as all pages of the chosen base URL, and all song pages (30,000+), are scanned. Note that if
+you cancel this process, no data will be saved, so let it complete! The default concurrency setting should make
+this relatively quick but YMMV.
 
 1. Download any song(s) you want with `c3dbdl [options] download [options]`.
@@ -86,6 +82,9 @@ Downloading song "Rush - Sweet Miracle" by ejthedj...
 Downloading from https://dl.c3universe.com/s/ejthedj/sweetMiracle...
 ```
 
+In addition to the above filters, within each song there may be more than one download link. To filter these
+links, use the "-i"/"--download-id" and "-d"/"--download-descr" options (see the help for details).
+
 Feel free to experiment.
 
 ## Output Format

c3dbdl

@@ -10,11 +10,87 @@ from difflib import unified_diff
 from colorama import Fore
 from bs4 import BeautifulSoup
 from urllib.error import HTTPError
+from concurrent.futures import ThreadPoolExecutor, as_completed
 
 CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'], max_content_width=120)
 
+
+def fetchSongData(entry):
+    song_entry = dict()
+    messages = list()
+
+    for idx, td in enumerate(entry.find_all('td')):
+        if idx == 2:
+            # Artist
+            song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
+        elif idx == 3:
+            # Song
+            song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
+            song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
+            # Song page
+            tmp_links = td.find_all('a', href=True)
+            for link in tmp_links:
+                if link.get('href'):
+                    song_entry["song_link"] = link.get('href')
+                    break
+        elif idx == 4:
+            # Genre
+            song_entry["genre"] = td.find('a').get_text().strip()
+        elif idx == 5:
+            # Year
+            song_entry["year"] = td.find('a').get_text().strip()
+        elif idx == 6:
+            # Length
+            song_entry["length"] = td.find('a').get_text().strip()
+        elif idx == 8:
+            # Author (of chart)
+            song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
+
+    if song_entry and song_entry['author'] and song_entry['title'] and song_entry["song_link"]:
+        messages.append(f"> Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
+        for entry_type in ["artist", "album", "genre", "year", "length"]:
+            if not song_entry[entry_type]:
+                song_entry[entry_type] = "None"
+
+        # Get download links from the actual song page
+        attempts = 1
+        sp = None
+        while attempts <= 3:
+            try:
+                messages.append(f"Parsing song page {song_entry['song_link']} (attempt {attempts}/3)...")
+                sp = requests.get(song_entry["song_link"])
+                break
+            except Exception:
+                sleep(attempts)
+                attempts += 1
+        if sp is None or sp.status_code != 200:
+            messages.append("Failed to fetch song page, aborting")
+            return None
+
+        song_parsed_html = BeautifulSoup(sp.text, 'html.parser')
+        download_section = song_parsed_html.find('div', attrs={"class": "portlet light bg-inverse"})
+        download_links = download_section.find_all('a', href=True)
+        dl_links = list()
+        for link_entry in download_links:
+            link = link_entry.get('href')
+            description = link_entry.get_text().strip()
+            if not "c3universe.com" in link:
+                continue
+            messages.append(f"Found download link: {link} ({description})")
+            dl_links.append({
+                "link": link,
+                "description": description,
+            })
+        if not dl_links:
+            messages.append("Found no c3universe.com download links for song, not adding to database")
+            return None
+        song_entry["dl_links"] = dl_links
+
+        # Append to the database
+        return messages, song_entry
+
+
-def buildDatabase(pages=None):
+def buildDatabase(pages, concurrency):
     found_songs = []
 
     if pages is None:
@@ -46,106 +122,109 @@ def buildDatabase(pages=None):
     table_html = parsed_html.body.find('div', attrs={'class':'portlet-body'}).find('tbody')
 
+    entries = list()
     for entry in table_html.find_all('tr', attrs={'class':'odd'}):
         if len(entry) < 1:
             break
+        entries.append(entry)
-        song_entry = dict()
-
-        for idx, td in enumerate(entry.find_all('td')):
-            if idx == 1:
-                # Download link
-                song_entry["dl_link"] = td.find('a', attrs={'target':'_blank'}).get('href')
-            elif idx == 2:
-                # Artist
-                song_entry["artist"] = td.find('a').get_text().strip().replace('/', '+')
-            elif idx == 3:
-                # Song
-                song_entry["title"] = td.find('div', attrs={'class':'c3ttitlemargin'}).get_text().strip().replace('/', '+')
-                song_entry["album"] = td.find('div', attrs={'class':'c3tartist'}).get_text().strip().replace('/', '+')
-            elif idx == 4:
-                # Genre
-                song_entry["genre"] = td.find('a').get_text().strip()
-            elif idx == 5:
-                # Year
-                song_entry["year"] = td.find('a').get_text().strip()
-            elif idx == 6:
-                # Length
-                song_entry["length"] = td.find('a').get_text().strip()
-            elif idx == 8:
-                # Author (of chart)
-                song_entry["author"] = td.find('a').get_text().strip().replace('/', '+')
-
-        if song_entry and song_entry['author'] and song_entry['title']:
-            click.echo(f"Found song entry for {song_entry['artist']} - {song_entry['title']} by {song_entry['author']}")
-            for entry_type in ["artist", "album", "genre", "year", "length"]:
-                if not song_entry[entry_type]:
-                    song_entry[entry_type] = "None"
-            found_songs.append(song_entry)
 
+    click.echo("Fetching and parsing song pages...")
+    with ThreadPoolExecutor(max_workers=concurrency) as executor:
+        future_to_song = {executor.submit(fetchSongData, entry): entry for entry in entries}
+        for future in as_completed(future_to_song):
+            try:
+                messages, song = future.result()
+                click.echo('\n'.join(messages))
+                if song is None:
+                    continue
+                found_songs.append(song)
+            except Exception:
+                continue
 
     return found_songs
 
 
-def downloadSong(destination, filename, entry):
-    click.echo(f"""Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
-    try:
-        p = requests.get(entry['dl_link'])
-        if p.status_code != 200:
-            raise HTTPError(entry['dl_link'], p.status_code, "", None, None)
-
-        parsed_html = BeautifulSoup(p.text, 'html.parser')
-        download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
-    except Exception as e:
-        click.echo(f"Failed parsing or retrieving HTML link: {e}")
-        return None
-
-    download_filename = filename.format(
-        genre=entry['genre'],
-        artist=entry['artist'],
-        album=entry['album'],
-        title=entry['title'],
-        year=entry['year'],
-        author=entry['author'],
-        orig_name=download_url.split('/')[-1],
-    )
-    download_filename = f"{destination}/{download_filename}"
-    download_path = '/'.join(f"{download_filename}".split('/')[0:-1])
-
-    if os.path.exists(download_filename):
-        click.echo(f"File exists at {download_filename}")
-        return None
-
-    click.echo(f"""Downloading from {download_url}...""")
-    attempts = 1
-    p = None
-    try:
-        with requests.get(download_url, stream=True) as r:
-            while attempts <= 3:
-                try:
-                    r.raise_for_status()
-                    break
-                except Exception:
-                    click.echo(f"Download attempt failed: HTTP {r.status_code}; retrying {attempts}/3")
-                    sleep(attempts)
-                    attempts += 1
-            if r is None or r.status_code != 200:
-                if r:
-                    code = r.status_code
-                else:
-                    code = "-1"
-                raise HTTPError(download_url, code, "", None, None)
-            if not os.path.exists(download_path):
-                os.makedirs(download_path)
-            with open(download_filename, 'wb') as f:
-                for chunk in r.iter_content(chunk_size=8192):
-                    f.write(chunk)
-            click.echo(f"Successfully downloaded to {download_filename}")
-    except Exception as e:
-        click.echo(f"Download attempt failed: {e}")
-        return None
+def downloadSong(destination, filename, entry, dlid, dldesc):
+    click.echo(f"""> Downloading song "{entry['artist']} - {entry['title']}" by {entry['author']}...""")
+    if dlid is None:
+        dl_links = entry['dl_links']
+    else:
+        try:
+            dl_links = [entry['dl_links'][dlid - 1]]
+        except Exception:
+            click.echo(f"Invalid download link ID {dlid}.")
+            return
+
+    if dldesc is not None:
+        new_dl_links = list()
+        for link_entry in dl_links:
+            if dldesc in link_entry['description']:
+                new_dl_links.append(link_entry)
+        dl_links = new_dl_links
+        if not dl_links:
+            click.echo(f'No download link matching description "{dldesc}" found.')
+            return
+
+    for dl_link in dl_links:
+        try:
+            p = requests.get(dl_link['link'])
+            if p.status_code != 200:
+                raise HTTPError(dl_link['link'], p.status_code, "", None, None)
+
+            parsed_html = BeautifulSoup(p.text, 'html.parser')
+            download_url = parsed_html.body.find('div', attrs={'class':'lock-head'}).find('a').get('href')
+        except Exception as e:
+            click.echo(f"Failed parsing or retrieving HTML link: {e}")
+            continue
+
+        download_filename = filename.format(
+            genre=entry['genre'],
+            artist=entry['artist'],
+            album=entry['album'],
+            title=entry['title'],
+            year=entry['year'],
+            author=entry['author'],
+            orig_name=download_url.split('/')[-1],
+        )
+        download_filename = f"{destination}/{download_filename}"
+        download_path = '/'.join(f"{download_filename}".split('/')[0:-1])
+
+        click.echo(f"""Downloading file "{dl_link['description']}" from {download_url}...""")
+        if os.path.exists(download_filename):
+            click.echo(f"File exists at {download_filename}")
+            continue
+
+        attempts = 1
+        p = None
+        try:
+            with requests.get(download_url, stream=True) as r:
+                while attempts <= 3:
+                    try:
+                        r.raise_for_status()
+                        break
+                    except Exception:
+                        click.echo(f"Download attempt failed: HTTP {r.status_code}; retrying {attempts}/3")
+                        sleep(attempts)
+                        attempts += 1
+                if r is None or r.status_code != 200:
+                    if r:
+                        code = r.status_code
+                    else:
+                        code = "-1"
+                    raise HTTPError(download_url, code, "", None, None)
+                if not os.path.exists(download_path):
+                    os.makedirs(download_path)
+                with open(download_filename, 'wb') as f:
+                    for chunk in r.iter_content(chunk_size=8192):
+                        f.write(chunk)
+                click.echo(f"Successfully downloaded to {download_filename}")
+        except Exception as e:
+            click.echo(f"Download attempt failed: {e}")
+            continue
@@ -158,7 +237,11 @@ def downloadSong(destination, filename, entry):
     "-p", "--pages", "_pages", type=int, default=None, envvar='C3DBDL_BUILD_PAGES',
     help="Number of pages to scan (default is all)."
 )
-def build_database(_overwrite, _pages):
+@click.option(
+    "-c", "--concurrency", "_concurrency", type=int, default=10, envvar='C3DBDL_BUILD_CONCURRENCY',
+    help="Number of concurrent song page downloads to perform at once."
+)
+def build_database(_overwrite, _pages, _concurrency):
     """
     Initialize the local JSON database of C3DB songs from the website.
@@ -173,7 +256,7 @@ def build_database(_overwrite, _pages):
         exit(1)
 
     click.echo("Building JSON database; this will take a long time...")
-    songs_database = buildDatabase(_pages)
+    songs_database = buildDatabase(_pages, _concurrency)
     click.echo('')
     click.echo(f"Found {len(songs_database)} songs, dumping to database file '{config['database_filename']}'")
 
     if not os.path.exists(config['download_directory']):
@@ -267,7 +350,17 @@ def database():
     default=None, type=int,
     help='Limit to this many songs (first N matches).'
 )
-def download(_filters, _limit, _file_structure):
+@click.option(
+    "-i", "--download-id", "_id",
+    default=None, type=int,
+    help='Download only "dl_links" entry N (1 is first, etc.), or all if unspecified.'
+)
+@click.option(
+    "-d", "--download-descr", "_desc",
+    default=None,
+    help='Download only "dl_links" entries with this in their description (fuzzy).'
+)
+def download(_filters, _id, _desc, _limit, _file_structure):
     """
     Download song(s) from the C3DB webpage.
@@ -286,14 +379,12 @@ def download(_filters, _limit, _file_structure):
     The default output file structure is:
         "{genre}/{author}/{artist}/{album}/{title} [{year}].{orig_name}"
 
-    \b
     Filters allow granular selection of the song(s) to download. Multiple filters can be
     specified, and a song is selected only if ALL filters match (logical AND). Each filter
-    is in the form:
-        --filter [database_key] [value]
+    is in the form "--filter [database_key] [value]".
 
-    \b
-    The valid "database_key" values are identical to the output file fields above.
+    The valid "database_key" values are identical to the output file fields above, except
+    for "orig_name".
 
     \b
     For example, to download all songs in the genre "Rock":
@@ -303,6 +394,13 @@ def download(_filters, _limit, _file_structure):
     Or to download all songs by the artist "Rush" and the author "MyName":
         --filter artist Rush --filter author MyName
 
+    In addition to filters, each song may have more than one download link, to provide
+    multiple versions of the same song (for example, normal and multitracks, or alternate
+    charts). For each song, the "-i"/"--download-id" and "-d"/"--download-descr" options
+    can help filter these out, or both can be left blank to download all possible files
+    for a given song. Mostly useful when being extremely restrictive with filters, less
+    so when downloading many songs at once.
+
     \b
     The following environment variables can be used for scripting purposes:
       * C3DBDL_DL_FILE_STRUCTURE: equivalent to "--file-structure"
@@ -331,7 +429,7 @@ def download(_filters, _limit, _file_structure):
     click.echo(f"Downloading {len(pending_songs)} song files...")
 
     for song in pending_songs:
-        downloadSong(config['download_directory'], _file_structure, song)
+        downloadSong(config['download_directory'], _file_structure, song, _id, _desc)
 
 @click.group(context_settings=CONTEXT_SETTINGS)
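The "-i"/"--download-id" and "-d"/"--download-descr" options amount to a small link-selection rule: an ID picks exactly one "dl_links" entry (1-based), a description keeps substring matches, and leaving both unset keeps every link. A rough standalone sketch of that rule, where `select_links` and the sample entries are purely illustrative (the real logic lives inline in `downloadSong` and also prints error messages):

```python
def select_links(dl_links, dlid=None, dldesc=None):
    # An ID picks exactly one entry (1 is the first link).
    if dlid is not None:
        if not 1 <= dlid <= len(dl_links):
            return []
        dl_links = [dl_links[dlid - 1]]
    # A description keeps entries containing it as a substring (fuzzy).
    if dldesc is not None:
        dl_links = [l for l in dl_links if dldesc in l["description"]]
    return dl_links

links = [
    {"link": "https://example.com/a", "description": "Normal chart"},
    {"link": "https://example.com/b", "description": "Multitracks"},
]
print(select_links(links, dldesc="Multi"))
```

One difference from the tool itself: an out-of-range ID returns an empty selection here, whereas `downloadSong` prints an error and returns.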